Indexing rich files into Solr, quickly and easily

This past weekend I presented yet another “Rapid Prototyping with Solr” presentation, this time back in the saddle with the No Fluff, Just Stuff symposium in Raleigh, NC. I intentionally waited until the last minute to hack together a quick script to index some data I haven’t indexed before to demonstrate the ease at which one can grab Solr and immediately make some use out of it. This time around I cobbled together a simple Ruby script to index a directory full of rich (PDF, HTML, Word, etc) documents into a fresh Solr 3.3.0 install. Only a few seconds later I have my documents indexed, and even searchable through a user interface.

Here’s the steps I took:

Download and “install” (aka unzip) Apache Solr 3.3.0
Launch Solr (cd example; java -jar start.jar)
Index files

That’s it. Here’s the indexing script I used:

require 'net/http'

@dir = Dir.new("/Users/erikhatcher/apache-solr-3.3.0/docs")

@url = URI.parse("http://localhost:8983/solr")
@connection = Net::HTTP.new(@url.host, @url.port)

def index(filename)
@connection.get(@url.path + "/update/extract?stream.file=#{filename}&literal.id=#{filename}")
end

def commit
@connection.get(@url.path + "/update?commit=true")
end

@dir.each {|name|
  f = "#{@dir.path}/#{name}"
  if File.file?(f)
    puts "Indexing #{f}..."
    index(f)
  end
}

puts "Committing..."
commit

puts "Done!"

To make it look prettier, only a little dabbling with the templates is needed – add your company logo, customize the colors. And a change to the example (/browse handler) configuration to facet on content_type will allow you to easily search just within documents of specific types through the included UI. The example code above indexed the docs that ship with Apache Solr 3.3.0; just change the path to a directory of yours to index your own content.

How an electronics giant meets engineers where they are, with 44 million products in catalog

Meet Mohammad Mahboob: A search platform director navigating 44 million products across...

From Search to Solutions: How AI Agents Can Power Digital Commerce in 2025

Watch this on-demand webinar to discover the six smartest AI-driven DX strategies...

Build custom AI agents without writing a single line of code? Yep, we did that.

Finally, a low-code AI platform (really, no code) that lets the people...

Indexing rich files into Solr, quickly and easily

You Might Also Like

How an electronics giant meets engineers where they are, with 44 million products in catalog

From Search to Solutions: How AI Agents Can Power Digital Commerce in 2025

Build custom AI agents without writing a single line of code? Yep, we did that.