Indexing rich files into Solr, quickly and easily

This past weekend I gave yet another “Rapid Prototyping with Solr” presentation, this time back in the saddle with the No Fluff, Just Stuff symposium in Raleigh, NC. I intentionally waited until the last minute to hack together a quick script to index some data I hadn’t indexed before, to demonstrate the ease with which one can grab Solr and immediately put it to use. This time around I cobbled together a simple Ruby script to index a directory full of rich (PDF, HTML, Word, etc.) documents into a fresh Solr 3.3.0 install. A few seconds later my documents were indexed and even searchable through a user interface.

Here are the steps I took:

  1. Download and “install” (aka unzip) Apache Solr 3.3.0
  2. Launch Solr (cd example; java -jar start.jar)
  3. Index files

That’s it.  Here’s the indexing script I used:

require 'net/http'

@dir = Dir.new("/Users/erikhatcher/apache-solr-3.3.0/docs")

@url = URI.parse("http://localhost:8983/solr")
@connection = Net::HTTP.new(@url.host, @url.port)

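def index(filename)
  # URL-encode the path so spaces and other special characters survive the query string
  params = URI.encode_www_form("stream.file" => filename, "literal.id" => filename)
  @connection.get("#{@url.path}/update/extract?#{params}")
end

def commit
  @connection.get(@url.path + "/update?commit=true")
end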

@dir.each do |name|
  f = File.join(@dir.path, name)
  if File.file?(f)
    puts "Indexing #{f}..."
    index(f)
  end
end

puts "Committing..."
commit

puts "Done!"
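
To confirm that the documents actually landed in the index, one option is to ask Solr how many documents match a query. The following is only a sketch, not part of the original demo: the helper names (select_uri, num_found, count_documents) are made up for illustration, and the default base URL assumes the stock example Solr running on localhost:8983.

```ruby
require 'net/http'
require 'uri'
require 'json'

# Build a /select URL for a query. rows=0 asks only for the hit count.
# The default base URL is an assumption (stock example Solr).
def select_uri(query, rows = 0, base = "http://localhost:8983/solr")
  params = URI.encode_www_form("q" => query, "rows" => rows, "wt" => "json")
  URI.parse("#{base}/select?#{params}")
end

# Pull the total hit count out of a Solr JSON response body.
def num_found(json_body)
  JSON.parse(json_body)["response"]["numFound"]
end

# Fetch the count from a running Solr (needs the server from step 2).
def count_documents(query = "*:*")
  num_found(Net::HTTP.get(select_uri(query)))
end

# Usage, with Solr running:
#   puts "#{count_documents} documents indexed"
```

Querying for *:* matches everything, so the number returned should line up with the number of "Indexing ..." lines the script printed, assuming the index was empty beforehand.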


To make it look prettier, only a little dabbling with the templates is needed: add your company logo, customize the colors. A small change to the example configuration (the /browse handler) to facet on content_type lets you easily search within documents of specific types through the included UI. The example code above indexes the docs that ship with Apache Solr 3.3.0; just change the path to a directory of yours to index your own content.
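
Before touching the /browse configuration, it can help to see what faceting on content_type would return. This rough sketch passes the standard facet parameters on a plain /select request rather than editing any config. The helper names and the default base URL are assumptions for illustration, and the sample values in the usage comment are invented, not output from a real index.

```ruby
require 'net/http'
require 'uri'
require 'json'

# Build a facet-only query URL; rows=0 skips the documents themselves.
# The default base URL is an assumption (stock example Solr).
def facet_uri(field, base = "http://localhost:8983/solr")
  params = URI.encode_www_form(
    "q" => "*:*", "rows" => 0, "wt" => "json",
    "facet" => "true", "facet.field" => field
  )
  URI.parse("#{base}/select?#{params}")
end

# Solr's JSON writer returns field facets as a flat
# [value, count, value, count, ...] array by default; pair them up.
def facet_counts(json_body, field)
  flat = JSON.parse(json_body)["facet_counts"]["facet_fields"][field]
  flat.each_slice(2).to_h
end

# Usage, with Solr running:
#   body = Net::HTTP.get(facet_uri("content_type"))
#   facet_counts(body, "content_type").each { |type, n| puts "#{type}: #{n}" }
# Made-up example of the resulting hash: {"text/html"=>12, "application/pdf"=>3}
```

If the counts look sensible here, the same facet.field setting is what would go into the /browse handler defaults so the UI can offer it as a filter.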
