Indexing rich files into Solr, quickly and easily
This past weekend I gave yet another “Rapid Prototyping with Solr” presentation, this time back in the saddle at the No Fluff, Just Stuff symposium in Raleigh, NC. I intentionally waited until the last minute to hack together a quick script to index some data I hadn’t indexed before, to demonstrate the ease with which one can grab Solr and immediately put it to use. This time around I cobbled together a simple Ruby script to index a directory full of rich (PDF, HTML, Word, etc.) documents into a fresh Solr 3.3.0 install. Only a few seconds later I had my documents indexed, and even searchable through a user interface.
Here are the steps I took:
- Download and “install” (aka unzip) Apache Solr 3.3.0
- Launch Solr (cd example; java -jar start.jar)
- Index files
That’s it. Here’s the indexing script I used:
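require 'net/http'
require 'uri'

# Directory of rich documents to index, and the local Solr instance to send them to
@dir = Dir.new("/Users/erikhatcher/apache-solr-3.3.0/docs")
@url = URI.parse("http://localhost:8983/solr")
@connection = Net::HTTP.new(@url.host, @url.port)

# Have Solr read the file straight from disk (stream.file) and extract its text;
# the file path doubles as the document's unique id
def index(filename)
  # URL-encode the path so spaces and other special characters survive the query string
  params = URI.encode_www_form("stream.file" => filename, "literal.id" => filename)
  @connection.get(@url.path + "/update/extract?" + params)
end

# Commit so the newly indexed documents become searchable
def commit
  @connection.get(@url.path + "/update?commit=true")
end

# Index every regular file directly inside the directory (subdirectories are skipped)
@dir.each do |name|
  f = File.join(@dir.path, name)
  if File.file?(f)
    puts "Indexing #{f}..."
    index(f)
  end
end

puts "Committing..."
commit
puts "Done!"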
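One quick way to confirm the documents made it in, without opening a browser, is to query Solr from Ruby as well. The following is a minimal sketch and not part of the script I demoed: the helper names `select_uri` and `search` are made up for illustration, and it assumes the example configuration's standard /select handler and JSON response writer (wt=json).

```ruby
require 'net/http'
require 'uri'

# Build a search URL against the /select handler (assumed to be available,
# as in the example configuration). base is the Solr root URL.
def select_uri(base, query, rows = 5)
  uri = URI.parse(base + "/select")
  # Ask only for the id field (the file path) and a JSON response
  uri.query = URI.encode_www_form("q" => query, "rows" => rows, "fl" => "id", "wt" => "json")
  uri
end

# Run the search and return the raw response body as a string.
# Requires a running Solr instance at the given base URL.
def search(base, query)
  Net::HTTP.get(select_uri(base, query))
end
```

With Solr running, `puts search("http://localhost:8983/solr", "solr")` after the commit should print the ids of matching documents, assuming the extracted text ends up in the default search field as it does with the example configuration.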
To make the included search UI look prettier, only a little dabbling with the templates is needed: add your company logo, customize the colors. And changing the example configuration of the /browse handler to facet on content_type will let you easily search within documents of specific types through that UI. The script above indexed the docs that ship with Apache Solr 3.3.0; just change the path to a directory of yours to index your own content.
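Faceting on content_type can also be tried per request before touching the handler configuration, since facet parameters can be passed on the query string. A hedged sketch: `facet_params` is a hypothetical helper, and the content type in the usage note is only illustrative, because the values actually stored in content_type depend on what the extractor reports for each file.

```ruby
require 'uri'

# Build query-string parameters that ask Solr for facet counts on content_type.
# When type is given, a filter query (fq) restricts results to that content type.
def facet_params(query, type = nil)
  params = [["q", query], ["facet", "true"], ["facet.field", "content_type"]]
  params << ["fq", %(content_type:"#{type}")] if type
  URI.encode_www_form(params)
end
```

For example, `facet_params("solr", "application/pdf")` returns a string that can be appended to `http://localhost:8983/solr/select?` to search only within documents whose content_type is exactly that value, with per-type counts in the response.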