There is an updated version about Nutch Solr integration available at https://lucidworks.com//2010/09/10/refresh-using-nutch-with-solr/
The last time I wrote about integrating Apache Nutch with Apache Solr (about two years ago), it was quite difficult to integrate the two components: you had to apply patches, hunt down required components from various places, and so on. Now there is an easier way. The soon-to-be-released Nutch 1.0 contains Solr integration “out of the box”. There are many different ways to take advantage of this new feature, but I am just going to go through one of them here. In this solution, Solr will be used as the only source for serving search results (including snippets). This way you can totally decouple your search application from Nutch and still use Nutch where it is at its best: crawling and extracting the content. Using Solr as the search backend, on the other hand, allows you to use all of the advanced features of a Solr server, such as query spell checking, “more like this” suggestions, data replication and easy query-time relevancy tuning, to mention just a few.
Why Nutch instead of a simpler Fetcher?
One possible way to implement something similar to what I present here would be to use a simpler crawler framework such as Apache Droids. But using Nutch gives you some pretty nice advantages. One of these is obviously the fact that Nutch provides a complete set of features you commonly need for a generic web search application. Another benefit of using Nutch is that it is a highly scalable and relatively feature-rich crawler (this does not mean that you cannot do the same with some other framework). Nutch offers features like politeness (it obeys robots.txt rules), robustness and scalability (Nutch runs on Hadoop, so you can run Nutch on a single machine or on a cluster of 100 machines), quality (you can bias the crawling to fetch “important” pages first) and extensibility (there are many APIs into which you can plug your own functionality). In my subjective opinion, one of the most important features Nutch provides out of the box is the link database. You might already know that Nutch tracks links between pages, so that the relevancy of search results within a collection of interlinked documents goes well beyond the naive case where you index documents without link information and anchor texts.
The first step to get started is to download the required software components, namely Apache Solr and Nutch.
1. Download Solr version 1.3.0 or Lucidworks for Solr from the download page
2. Extract Solr package
3. Download Nutch version 1.0 or later (alternatively, download the nightly version of Nutch that contains the required functionality)
4. Extract the Nutch package
tar xzf apache-nutch-1.0.tar.gz
5. Configure Solr
For the sake of simplicity we are going to use the example configuration of Solr as a base.
a. Copy the provided Nutch schema from directory apache-nutch-1.0/conf to directory apache-solr-1.3.0/example/solr/conf (overwrite the existing file)
We want to allow Solr to create the snippets for search results so we need to store the content in addition to indexing it:
b. Change schema.xml so that the stored attribute of field “content” is true.
<field name="content" type="text" stored="true" indexed="true"/>
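If you prefer to script the change in step 5b, the following is a minimal sketch rather than an official tool. The function name enable_content_storage is hypothetical, and it assumes that the "content" field is declared on a single line that currently carries stored="false".

```shell
#!/bin/sh
# Hypothetical helper (not part of Solr or Nutch): switch the "content" field
# in a schema.xml from stored="false" to stored="true".
# Assumption: the <field name="content" .../> element sits on one line.
enable_content_storage() {
    schema_file=$1
    tmp_file="${schema_file}.tmp"
    # Only lines that declare the field named "content" are touched.
    sed '/<field name="content"/ s/stored="false"/stored="true"/' \
        "$schema_file" > "$tmp_file" && mv "$tmp_file" "$schema_file"
}

# Example usage:
#   enable_content_storage apache-solr-1.3.0/example/solr/conf/schema.xml
```

It is worth opening the file afterwards to confirm the attribute really changed, since a differently formatted schema would be left untouched.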
We want to be able to tweak the relevancy of queries easily so we’ll create new dismax request handler configuration for our use case:
c. Open apache-solr-1.3.0/example/solr/conf/solrconfig.xml and paste the following fragment into it
<requestHandler name="/nutch" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">dismax</str>
    <str name="echoParams">explicit</str>
    <float name="tie">0.01</float>
    <str name="qf">content^0.5 anchor^1.0 title^1.2</str>
    <str name="pf">content^0.5 anchor^1.5 title^1.2 site^1.5</str>
    <str name="fl">url</str>
    <str name="mm">2&lt;-1 5&lt;-2 6&lt;90%</str>
    <int name="ps">100</int>
    <bool name="hl">true</bool>
    <str name="q.alt">*:*</str>
    <str name="hl.fl">title url content</str>
    <str name="f.title.hl.fragsize">0</str>
    <str name="f.title.hl.alternateField">title</str>
    <str name="f.url.hl.fragsize">0</str>
    <str name="f.url.hl.alternateField">url</str>
    <str name="f.content.hl.fragmenter">regex</str>
  </lst>
</requestHandler>
6. Start Solr
cd apache-solr-1.3.0/example
java -jar start.jar
7. Configure Nutch
a. Open nutch-site.xml in directory apache-nutch-1.0/conf and replace its contents with the following (we specify our crawler name, the active plugins, and limit the maximum URL count for a single host per run to 100):
<?xml version="1.0"?>
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>nutch-solr-integration</value>
  </property>
  <property>
    <name>generate.max.per.host</name>
    <value>100</value>
  </property>
  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-html|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>
</configuration>
b. Open regex-urlfilter.txt in directory apache-nutch-1.0/conf and replace its contents with the following:
-^(https|telnet|file|ftp|mailto):

# skip some suffixes
-\.(swf|SWF|doc|DOC|mp3|MP3|WMV|wmv|txt|TXT|rtf|RTF|avi|AVI|m3u|M3U|flv|FLV|WAV|wav|mp4|MP4|avi|AVI|rss|RSS|xml|XML|pdf|PDF|js|JS|gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# allow urls in lucidimagination.com domain
+^http://([a-z0-9-A-Z]*\.)*lucidimagination.com/

# deny anything else
-.
8. Create a seed list (the initial urls to fetch)
mkdir urls
echo "http://www.lucidimagination.com/" > urls/seed.txt
9. Inject seed url(s) to nutch crawldb (execute in nutch directory)
bin/nutch inject crawl/crawldb urls
10. Generate fetch list, fetch and parse content
bin/nutch generate crawl/crawldb crawl/segments
The above command will generate a new segment directory under crawl/segments that at this point contains files that store the url(s) to be fetched. In the following commands we need the latest segment dir as parameter so we’ll store it in an environment variable:
export SEGMENT=crawl/segments/`ls -tr crawl/segments|tail -1`
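The one-liner above picks the segment by modification time. As an alternative sketch, the hypothetical helper below picks it by name instead, assuming the segments keep their default timestamp-style directory names, so that the name that sorts last is also the newest.

```shell
#!/bin/sh
# Hypothetical helper: print the path of the newest segment directory.
# Assumption: segment directories have timestamp-style names, so the entry
# that sorts last by name is the most recent one.
latest_segment() {
    segments_dir=$1
    latest=""
    # Glob results are sorted by name; remember the last directory seen.
    for dir in "$segments_dir"/*/; do
        [ -d "$dir" ] || continue
        latest=${dir%/}
    done
    # Return a non-zero status when no segment exists.
    [ -n "$latest" ] && printf '%s\n' "$latest"
}

# Example usage:
#   export SEGMENT=$(latest_segment crawl/segments)
```

Because the function returns a non-zero status when the directory holds no segments, a calling script can stop early instead of passing an empty path to the fetcher.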
Now I launch the fetcher that actually goes to get the content:
bin/nutch fetch $SEGMENT -noParsing
Next I parse the content:
bin/nutch parse $SEGMENT
Then I update the Nutch crawldb. The updatedb command will store all new urls discovered during the fetch and parse of the previous segment into the Nutch database so they can be fetched later. Nutch also stores information about the pages that were fetched so the same urls won’t be fetched again and again.
bin/nutch updatedb crawl/crawldb $SEGMENT -filter -normalize
Now a full fetch cycle is complete. Next you can repeat step 10 a couple more times to get some more content.
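Repeating step 10 by hand gets tedious, so here is a minimal sketch of a loop around the same four commands. The function names crawl_round and crawl_rounds and the NUTCH variable are my own illustration, not part of Nutch. The sketch picks the newest segment by name, which assumes the default timestamp-style segment names.

```shell
#!/bin/sh
# NUTCH is an assumed variable naming the Nutch launcher; override it if needed.
NUTCH=${NUTCH:-bin/nutch}

# Run one generate/fetch/parse/updatedb round (step 10) for a crawl directory.
crawl_round() {
    crawl_dir=$1
    "$NUTCH" generate "$crawl_dir/crawldb" "$crawl_dir/segments" || return 1
    # Assumption: segment names are timestamps, so the last name is the newest.
    segment=$crawl_dir/segments/$(ls "$crawl_dir/segments" | sort | tail -1)
    "$NUTCH" fetch "$segment" -noParsing || return 1
    "$NUTCH" parse "$segment" || return 1
    "$NUTCH" updatedb "$crawl_dir/crawldb" "$segment" -filter -normalize
}

# Repeat the round a fixed number of times, stopping at the first failure.
crawl_rounds() {
    crawl_dir=$1
    rounds=$2
    i=1
    while [ "$i" -le "$rounds" ]; do
        echo "round $i of $rounds"
        crawl_round "$crawl_dir" || return 1
        i=$((i + 1))
    done
}

# Example usage (from the Nutch directory, after injecting the seed urls):
#   crawl_rounds crawl 3
```

Keeping the launcher path in a variable makes it easy to point the same functions at another Nutch installation.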
11. Create linkdb
bin/nutch invertlinks crawl/linkdb -dir crawl/segments
12. Finally index all content from all segments to Solr
bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb crawl/linkdb crawl/segments/*
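A quick way to check that documents actually arrived is to ask Solr how many it holds. The sketch below is only an illustration: num_found is a hypothetical helper that pulls the numFound attribute out of a Solr XML response, and the usage line assumes curl is installed and that Solr is running at the address used in step 12.

```shell
#!/bin/sh
# Hypothetical helper: read a Solr XML response on stdin and print the value
# of the first numFound attribute it contains.
num_found() {
    sed -n 's/.*numFound="\([0-9][0-9]*\)".*/\1/p' | head -n 1
}

# Example usage (assumes curl and a running Solr from step 6):
#   curl -s 'http://127.0.0.1:8983/solr/select?q=*:*&rows=0' | num_found
```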
Now the indexed content is available through Solr. You can execute searches from the Solr admin UI, or directly with a URL that targets the /nutch request handler configured in step 5.
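As a rough sketch of the direct approach, the hypothetical function below builds a query URL for the /nutch request handler from step 5. The SOLR_URL variable, the function name and the choice of parameters (rows, wt=json, indent=on) are assumptions for illustration, and only spaces in the query are encoded.

```shell
#!/bin/sh
# SOLR_URL is an assumption: the Solr base URL used in step 12.
SOLR_URL=${SOLR_URL:-http://127.0.0.1:8983/solr}

# Hypothetical helper: build a query URL for the /nutch request handler.
# Only spaces are encoded (as "+"); other reserved characters in the query
# would need proper URL encoding.
nutch_query_url() {
    query=$(printf '%s' "$1" | sed 's/ /+/g')
    rows=${2:-10}
    printf '%s/nutch?q=%s&rows=%s&wt=json&indent=on\n' "$SOLR_URL" "$query" "$rows"
}

# Example usage (assumes curl is installed):
#   curl -s "$(nutch_query_url 'enterprise search' 5)"
```

Since the handler defaults already enable highlighting and restrict the returned fields to url, the URL itself only needs to carry the query and paging parameters.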
Nutch in combination with Solr is quite a powerful base on which to build your search application. Even though the base is solid, there are a few things missing from the stack that you will soon notice if you start to index content on a larger scale. One of the missing features is duplicate content removal, but luckily there is an improvement issue for this in the Nutch Jira: https://issues.apache.org/jira/browse/NUTCH-684. Another missing piece on the Solr side is a feature called field collapsing (https://issues.apache.org/jira/browse/SOLR-236). Field collapsing could be used when displaying results so that, for example, at most two pages would be shown for a single host.
The setup explained here has one significant caveat you also need to keep in mind: scale. You cannot use this kind of setup with vertical scale (collection size) that goes beyond one Solr box. The horizontal scaling (query throughput) is still possible with the standard Solr replication tools.