Indexing rich files into Solr, quickly and easily

This past weekend I gave yet another “Rapid Prototyping with Solr” presentation, this time back in the saddle with the No Fluff, Just Stuff symposium in Raleigh, NC. I intentionally waited until the last minute to hack together a quick script to index some data I hadn’t indexed before, to demonstrate the ease with which one can grab Solr and immediately put it to use. This time around I cobbled together a simple Ruby script to index a directory full of rich (PDF, HTML, Word, etc.) documents into a fresh Solr 3.3.0 install. Only a few seconds later I had my documents indexed, and even searchable through a user interface.

Here are the steps I took:

  1. Download and “install” (aka unzip) Apache Solr 3.3.0
  2. Launch Solr (cd example; java -jar start.jar)
  3. Index files

That’s it.  Here’s the indexing script I used:

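require 'net/http'
require 'uri'

# Directory of rich documents to index; point this at your own content.
@dir = Dir.new("/Users/erikhatcher/apache-solr-3.3.0/docs")

@url = URI.parse("http://localhost:8983/solr")
@connection = Net::HTTP.new(@url.host, @url.port)

# Ask Solr's extracting handler to read the file from local disk (stream.file),
# using the file path as the document id.
def index(filename)
  # Escape the path so spaces and other special characters survive in the query string.
  escaped = URI.encode_www_form_component(filename)
  @connection.get(@url.path + "/update/extract?stream.file=#{escaped}&literal.id=#{escaped}")
end

# Commit so the newly indexed documents become searchable.
def commit
  @connection.get(@url.path + "/update?commit=true")
end

# Dir#each also yields "." and ".." and any subdirectories; File.file? skips those.
@dir.each do |name|
  f = File.join(@dir.path, name)
  if File.file?(f)
    puts "Indexing #{f}..."
    index(f)
  end
end

puts "Committing..."
commit

puts "Done!"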
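
The script above only looks at the top level of the directory, since Dir#each does not descend into subfolders. If your own content is nested, a sketch along these lines could gather the files first. The helper name files_under is made up for illustration and is not part of the original script; Dir.glob and File.file? are standard Ruby.

```ruby
# Hypothetical helper (not in the original script): list every regular file
# under +root+, including files in nested subdirectories.
def files_under(root)
  # "**/*" makes Dir.glob descend into subdirectories. By default glob skips
  # dotfiles, and File.file? filters out the directories themselves.
  Dir.glob(File.join(root, "**", "*")).select { |path| File.file?(path) }.sort
end
```

With that in place, the loop in the script could become files_under(@dir.path).each { |f| index(f) }, leaving the index and commit methods unchanged.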


To make it look prettier, only a little dabbling with the templates is needed: add your company logo, customize the colors. A change to the example configuration (the /browse handler) to facet on content_type will let you easily search within documents of specific types through the included UI. The example code above indexed the docs that ship with Apache Solr 3.3.0; just change the path to a directory of yours to index your own content.
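
To see what faceting on content_type gives you before touching the /browse configuration, you can ask Solr for the facet counts directly. The following is a rough sketch, not something from the original demo: the method names are invented for illustration, it assumes the same http://localhost:8983/solr base URL as the indexing script, and it assumes the JSON response writer's default layout, where facet values come back as a flat [value, count, value, count, ...] array.

```ruby
require 'net/http'
require 'uri'
require 'json'

# Hypothetical helper: build a /select URL that requests facet counts on
# content_type and no document rows. The base URL is an assumption that
# matches the indexing script.
def facet_query_uri(base = "http://localhost:8983/solr", query = "*:*")
  uri = URI.parse(base + "/select")
  uri.query = URI.encode_www_form(
    "q" => query, "rows" => "0", "wt" => "json",
    "facet" => "true", "facet.field" => "content_type"
  )
  uri
end

# Hypothetical helper: turn the flat [value, count, ...] facet array from the
# JSON response into a { value => count } hash.
def facet_counts(response_body, field = "content_type")
  flat = JSON.parse(response_body)["facet_counts"]["facet_fields"][field]
  Hash[flat.each_slice(2).to_a]
end

# Usage (needs a running Solr with documents indexed):
#   body = Net::HTTP.get(facet_query_uri)
#   facet_counts(body).each { |type, count| puts "#{type}: #{count}" }
```

Each key in the resulting hash is a content type that could be offered as a filter in the UI.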
