Indexing rich files into Solr, quickly and easily

This past weekend I gave yet another “Rapid Prototyping with Solr” presentation, this time back in the saddle at the No Fluff, Just Stuff symposium in Raleigh, NC. I intentionally waited until the last minute to hack together a quick script for data I hadn’t indexed before, to demonstrate the ease with which one can grab Solr and immediately put it to use. This time around I cobbled together a simple Ruby script to index a directory full of rich (PDF, HTML, Word, etc.) documents into a fresh Solr 3.3.0 install. A few seconds later I had my documents indexed, and even searchable through a user interface.

Here are the steps I took:

1. Download and “install” (aka unzip) Apache Solr 3.3.0
2. Launch Solr (cd example; java -jar start.jar)
3. Index files

That’s it. Here’s the indexing script I used:

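require 'net/http'
require 'uri'

# Directory of rich documents to index (here, the docs that ship with Solr)
@dir = Dir.new("/Users/erikhatcher/apache-solr-3.3.0/docs")

@url = URI.parse("http://localhost:8983/solr")
@connection = Net::HTTP.new(@url.host, @url.port)

# Ask Solr to extract and index a local file, using its path as the document id
def index(filename)
  # Escape the path so spaces and other special characters survive the query string
  path = URI.encode_www_form_component(filename)
  @connection.get(@url.path + "/update/extract?stream.file=#{path}&literal.id=#{path}")
end

# Make the newly indexed documents searchable
def commit
  @connection.get(@url.path + "/update?commit=true")
end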

@dir.each {|name|
  f = "#{@dir.path}/#{name}"
  if File.file?(f)
    puts "Indexing #{f}..."
    index(f)
  end
}

puts "Committing..."
commit

puts "Done!"
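
The script above sends its requests and never looks at what Solr answers, which is fine for a demo. As a rough sketch only (not part of the script I presented), the hypothetical helpers extract_path and index_checked below show one way to build the same extract request and report files that Solr did not accept. It assumes that any status other than 200 counts as a failure.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper: build the extract request path for one file.
# URI.encode_www_form escapes the file path in both parameters.
def extract_path(base_path, filename)
  query = URI.encode_www_form('stream.file' => filename, 'literal.id' => filename)
  "#{base_path}/update/extract?#{query}"
end

# Hypothetical helper: send the request and return true only on HTTP 200.
# connection is anything that responds to get(path), such as a Net::HTTP instance.
def index_checked(connection, base_path, filename)
  response = connection.get(extract_path(base_path, filename))
  ok = response.code.to_i == 200
  warn "Failed to index #{filename}: HTTP #{response.code}" unless ok
  ok
end
```

In the script, index_checked(@connection, @url.path, f) could stand in for index(f), with the return value used to count failures.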

To make the UI look prettier, only a little dabbling with the templates is needed: add your company logo, customize the colors. A change to the example configuration (the /browse handler) to facet on content_type will let you search within documents of specific types through the included UI. The example code above indexes the docs that ship with Apache Solr 3.3.0; just change the path to a directory of yours to index your own content.
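
For a quick look at what such a facet would contain before touching any configuration, the same counts can be requested at query time. The sketch below is an illustration, not the /browse configuration change itself: facet_query_path and facet_counts are hypothetical helper names, and it assumes the standard /select handler with wt=json, whose default JSON output lists facet values as a flat array of alternating value and count.

```ruby
require 'json'
require 'uri'

# Hypothetical helper: build a query path that returns no documents,
# only facet counts for the given field.
def facet_query_path(base_path, field, query = '*:*')
  params = URI.encode_www_form(
    'q' => query, 'rows' => 0, 'wt' => 'json',
    'facet' => 'true', 'facet.field' => field
  )
  "#{base_path}/select?#{params}"
end

# Hypothetical helper: turn Solr's flat [value, count, value, count, ...]
# facet array into a Hash of value => count.
def facet_counts(json_body, field)
  flat = JSON.parse(json_body)['facet_counts']['facet_fields'][field]
  Hash[flat.each_slice(2).to_a]
end
```

Against the running example server, the body would come from something like Net::HTTP.get(URI("http://localhost:8983" + facet_query_path("/solr", "content_type"))), and facet_counts(body, "content_type") would then give the number of indexed documents per content type.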
