In preparation for my upcoming talk on Apache Hadoop and Search, I thought I would try using Nutch (the genesis of Hadoop) to index some content into Solr.  I started off by referencing Sami Siren’s excellent post on Nutch and Solr (which worked flawlessly for me with 1.1 on OS X) to get up and running, but quickly hoped there was a much easier way to do this than typing in all of those commands.  And indeed there is, in Nutch 1.1 and later, so I thought I would provide a refresh.  Here are the steps I took to run Nutch 1.1 with Solr (trunk, but other versions should work too):

  1. Download Nutch 1.1 and extract it.
  2. Download Solr or check it out from SVN and extract it.
  3. Download trihug.tar.gz and unpack it.  This contains steps 5, 6, 7 and 8 from Sami’s article, with a few slight modifications to the /nutch RequestHandler for display purposes.  I did this in a separate trihug/ directory so as not to disturb the original Nutch and Solr configuration files.
  4. cd <PATH TO SOLR>/solr/example (change into Solr’s example directory)
  5. Start Solr with the new Solr Home: java -Dsolr.solr.home=<PATH>/trihug/solr -jar start.jar
  6. In a separate terminal, export NUTCH_CONF_DIR=<PATH>/trihug/nutch/conf
  7. Here’s where it gets MUCH simpler: you don’t need to run all the commands separately; instead do:
    1. cd <PATH>/trihug/nutch
    2. mkdir output
    3. <PATH>/apache-nutch-1.1-bin/bin/nutch crawl urls/ -dir output -solr http://localhost:8983/solr -depth 2
      1. You should see something like:
        crawl started in: output
        rootUrlDir = urls
        threads = 10
        depth = 2
        Injector: starting
        Injector: crawlDb: output/crawldb
        Injector: urlDir: urls
        Injector: Converting injected urls to crawl db entries.
        Injector: Merging injected urls into crawl db.
        Injector: done
        Generator: Selecting best-scoring urls due for fetch.
        Generator: starting
        Generator: filtering: true
        Generator: normalizing: true
        Generator: jobtracker is 'local', generating exactly one partition.
        Generator: Partitioning selected urls for politeness.
        Generator: segment: output/segments/20100910124448
        Generator: done.
  8. Do a Solr commit: curl http://localhost:8983/solr/update --data-binary '<commit/>' -H 'Content-type:text/xml; charset=utf-8'
  9. Browse to http://localhost:8983/solr/nutch/ (again, this assumes you are using my setup laid out above)
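If you prefer, steps 4 through 8 above can be collected into a single script.  This is just a sketch: the three path variables at the top are placeholders for wherever you unpacked each download (adjust them to your machine), and it assumes Solr is started separately in its own terminal as in step 5.

```sh
#!/bin/sh
# Placeholder paths -- change these to match your layout.
TRIHUG_DIR="$HOME/trihug"                # where trihug.tar.gz was unpacked
NUTCH_DIR="$HOME/apache-nutch-1.1-bin"   # extracted Nutch 1.1 download

# Step 5 (run this in a separate terminal, from <PATH TO SOLR>/solr/example):
#   java -Dsolr.solr.home="$TRIHUG_DIR/solr" -jar start.jar

# Step 6: point Nutch at the trihug configuration.
export NUTCH_CONF_DIR="$TRIHUG_DIR/nutch/conf"

# Step 7: crawl and index to Solr in one command.
cd "$TRIHUG_DIR/nutch"
mkdir -p output
"$NUTCH_DIR/bin/nutch" crawl urls/ -dir output \
    -solr http://localhost:8983/solr -depth 2

# Step 8: commit so the newly indexed documents become searchable.
curl http://localhost:8983/solr/update --data-binary '<commit/>' \
    -H 'Content-type:text/xml; charset=utf-8'
```

After it finishes, browse to http://localhost:8983/solr/nutch/ as in step 9.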

Assuming all went well, you should see something like the screenshot below after completing the last step:

[Screenshot: Nutch search results rendered in Solr]

There are a couple of things to note here:

  1. I ran the Nutch crawl on my machine.  It was not distributed.
  2. Likewise for Solr.
  3. The VelocityResponseWriter (i.e. the display you see above) is intended for prototyping purposes and is not meant for production use.
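For the curious, the Velocity-based display is wired up in solrconfig.xml along these lines.  This is only a sketch of the kind of configuration my trihug setup contains, modeled on Solr’s example /browse handler; treat the template names and default parameters as assumptions rather than the exact contents of the tarball:

```xml
<!-- Register the Velocity response writer (loaded lazily). -->
<queryResponseWriter name="velocity"
                     class="solr.VelocityResponseWriter"
                     startup="lazy"/>

<!-- A /nutch search handler that renders results with Velocity templates.
     Template names here are assumptions based on Solr's /browse example. -->
<requestHandler name="/nutch" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="wt">velocity</str>
    <str name="v.template">browse</str>
    <str name="v.layout">layout</str>
    <str name="q.alt">*:*</str>
  </lst>
</requestHandler>
```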

If you have any questions, please feel free to drop a comment.

Happy crawling!

PS Thanks to fellow Lucid Imagineer Andrzej Bialecki for the guidance on Nutch!