acts_as_solr with Rich Document Indexing

A hearty thanks to the Central Virginia Ruby Enthusiasts’ Group, who invited me to speak on Solr+Ruby, giving me a good reason to delve deeply back into solr-ruby and acts_as_solr.

Let’s start a Rails project from scratch to illustrate how simple it is to get up and running with acts_as_solr. The example is a fairly realistic need: we’re going to index resumes, which could be in standard rich document formats such as PDF, Word, HTML, or plain text.

rails resume
cd resume
script/generate scaffold resume first_name:string last_name:string file_name:string
rake db:migrate

Thanks to the magic that is Rails, we now have a working application that allows standard CRUD operations on a resumes table in a relational database. (Not discussed further here, but you can start script/server and navigate to the usual http://localhost:3000/resumes. We’re going to stick closer to the metal and use script/console for direct ActiveRecord and Solr API tinkering.)

Next we add the acts_as_solr plugin to our application:

script/plugin install git://github.com/mattmatt/acts_as_solr.git

A note about the acts_as_solr codebase: it all started with an innocent hack that I posted to the solr-user list. Thiago Jackiw picked it up and turned it into a serious general-purpose ActiveRecord modeling plugin hosted at RubyForge, and it now exists as numerous git repository forks. [editor 3/18/09: respectfully added a special mention of Thiago Jackiw] The currently best-maintained version is Mathias Meyer’s branch.

And we start Solr:

rake solr:start

We now add Solr to the lifecycle of the Resume model, such that when a Resume is added or updated in the database it also gets indexed into Solr, and deleted from Solr when it is removed from the database. It really couldn’t be any easier:

class Resume < ActiveRecord::Base
  acts_as_solr
end

Plugging in acts_as_solr provides not only the lifecycle hooks to keep the database and Solr in sync, it also provides additional finder methods. Here’s an example of using Resume#find_by_solr:

$ script/console
>> Resume.create(:first_name=>'Joe', :last_name=>'Programmer')
>> Resume.find_by_solr("program*")
=> #<ActsAsSolr::SearchResults ... {:query_time=>0, :total=>1, :docs=>[#<Resume ...>]}>

The “program*” query matches any indexed word that begins with “program”. Note that the result from find_by_solr is an ActsAsSolr::SearchResults instance. This wrapper provides the docs that are normally returned from ActiveRecord finder methods, in addition to other Solr information, including the query_time and the total number of documents matched. The order of the docs array defaults to descending score (a measure of relevancy to the query).
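As a rough sketch of how those values might be read back, the snippet below uses a stand-in Struct (FakeResults, hypothetical) in place of a real ActsAsSolr::SearchResults so that it runs without Rails or Solr. The accessor names docs, total and query_time are assumed from the description above, and describe_results is an illustrative helper, not part of acts_as_solr.

```ruby
# Stand-in for ActsAsSolr::SearchResults so this runs outside Rails.
# Assumption: the real wrapper exposes docs, total and query_time.
FakeResults = Struct.new(:docs, :total, :query_time)

# Hypothetical helper: builds a one-line summary of a search result wrapper.
# Real docs would be Resume records; plain hashes stand in for them here.
def describe_results(results)
  names = results.docs.map { |r| "#{r[:first_name]} #{r[:last_name]}" }
  "#{results.total} match(es), query_time=#{results.query_time}: #{names.join(', ')}"
end

results = FakeResults.new([{ :first_name => 'Joe', :last_name => 'Programmer' }], 1, 0)
puts describe_results(results)
```

With a real result, the same idea would apply to Resume.find_by_solr("program*"), reading attributes from the records instead of hash keys.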

So far so good – we’ve got a Rails application with an ActiveRecord model tied to acts_as_solr. Now comes the trickier part of indexing the resume text.

Solr Cell
A content extraction library (aka Solr Cell) was added in Solr 1.4. However, at the time of writing acts_as_solr embeds Solr 1.3. So we need to do a little hacking to bring in a newer version of Solr with the Solr Cell dependencies and configuration. In the future, it is likely acts_as_solr will ship with Solr Cell built-in, so be sure to check your version.

First, stop Solr:

rake solr:stop

Grab a nightly build of Solr from http://people.apache.org/builds/lucene/solr/nightly/. Unarchive the distribution, and copy over the lib directory containing the Solr Cell plugin and dependencies, and also replace solr.war (the entire Solr web application).

cp -R apache-solr-nightly/example/solr/lib resume/vendor/plugins/acts_as_solr/solr/solr/
cp apache-solr-nightly/example/webapps/solr.war resume/vendor/plugins/acts_as_solr/solr/webapps/solr.war

And now add the Solr Cell request handler to vendor/plugins/acts_as_solr/solr/solr/conf/solrconfig.xml (add it anywhere as sibling to the other request handlers defined):

And enable remote streaming by setting enableRemoteStreaming="true" on the requestParsers element.

Enabling remote streaming comes with a stern warning “Make sure your system has some authentication before enabling remote streaming!”. Our best advice is to firewall Solr such that only the application server, or in this example simply localhost itself, can make requests to Solr. Having remote streaming enabled allows some request handlers, if configured, to pull content from a URL or from a local file path. This isn’t necessarily a bad thing, but restricting who or where requests can be made to Solr is a wise production deployment consideration. Even with remote streaming disabled, general /update is accessible and documents can be added or deleted easily. So do take this as a production deployment concern to address in your network architecture.

What this now gives us is the ability to index rich document content with simple requests to Solr. Thanks to Solr’s content streaming flexibility, Solr can get the file content from a local file path, a remotely accessible URL, or through the file actually being POSTed in the request. In this exercise, we’re going to send Solr a local file path, which assumes that Solr and the Ruby ActiveRecord application tier can see the same path. Here’s an example of the kind of lightweight request it takes to index a PDF file:

curl "http://localhost:8982/solr/update/extract?stream.file=/path/to/ErikHatcherResume.pdf&ext.idx.attr=false&ext.def.fl=text_t&ext.ignore.und.fl=true&ext.map.title=title_t&ext.literal.id=1&wt=ruby"

Those are some ugly parameters, but thankfully the Solr Cell wiki page spells them out in detail. In prose, the request says: the local path /path/to/ErikHatcherResume.pdf is sent to Solr, Solr reads the contents of that file, the extracted text goes into the text_t field, undefined fields are ignored, general extracted attributes are ignored except that the title is mapped to the title_t field, and the id field is set literally to the value 1. The general-purpose acts_as_solr schema has a convenient *_t field mapping for bringing in both the text content and metadata attributes as needed, and all *_t fields are internally copied to a single searchable “text” field.
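The same request can be assembled in Ruby, which makes each parameter easier to read. This is only a sketch using the standard library: extract_url is a hypothetical helper, and the base URL and parameter names are copied from the curl example rather than taken from any client library.

```ruby
require 'uri'

# Hypothetical helper: builds the Solr Cell URL used in the curl example.
def extract_url(file_path, id, base = 'http://localhost:8982/solr/update/extract')
  params = [
    ['stream.file',       file_path], # local path that Solr itself must be able to read
    ['ext.idx.attr',      'false'],   # do not index extracted attributes by default
    ['ext.def.fl',        'text_t'],  # extracted body text goes to text_t
    ['ext.ignore.und.fl', 'true'],    # skip fields the schema does not define
    ['ext.map.title',     'title_t'], # map the extracted title to title_t
    ['ext.literal.id',    id.to_s],   # literal id value for the document
    ['wt',                'ruby']     # ask for a Ruby-evaluable response
  ]
  "#{base}?#{URI.encode_www_form(params)}"
end

puts extract_url('/path/to/ErikHatcherResume.pdf', 1)
```

Unlike the hand-written curl line, the file path is percent-encoded here, which matters once paths contain spaces or other reserved characters.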

The solr-ruby library, at the time of this writing, does not have built-in support for Solr Cell style requests, though it easily allows custom request types to be used. Here’s our solr_cell_request.rb:

class SolrCellRequest < Solr::Request::Select
  def initialize(doc,file_name)
    params = {
      'ext.idx.attr' => false,        # don't index any attributes, unless explicitly mapped
      'ext.def.fl' => 'text_t',        # all text extracted goes to text_t (since it is a stored field, for highlighting)
      'ext.ignore.und.fl' => true,      # ignore all undefined fields
      'ext.map.title' => 'title_t',
      'ext.resource.name' => file_name, # TIKA-154 workaround
      'stream.file' => file_name,
    }
    doc.each do |f|
      params["ext.literal.#{f.name}"] = f.value
      if f.boost
        params["ext.boost.#{f.name}"] = f.boost
      end
    end
    super(nil,params)
  end

  def handler
    'update/extract'
  end
end

class SolrCellResponse < Solr::Response::Ruby
end

During the development of SolrCellRequest, I noticed that plain text files (surprisingly!) were not indexing. I asked about this and quickly received an explanation. This will be resolved when a newer version of Tika, including the fix for TIKA-154, is brought into Solr. In the meantime, setting ext.resource.name solves the issue.

The doc passed into the SolrCellRequest constructor is a Solr::Document. We’ve jumped ahead, knowing that we’ll be able to easily override an acts_as_solr method where the ActiveRecord is available as a Solr::Document. [note that solr-ruby does have the requirement that there be a parallel *Response class to the *Request. This is why the dummy SolrCellResponse is necessary]
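To make the field-to-parameter mapping concrete, here is a small standalone sketch of the loop inside SolrCellRequest#initialize. Field is a stand-in Struct (an assumption, used instead of the real solr-ruby field class so the snippet runs on its own), and literal_params is a hypothetical helper name.

```ruby
# Stand-in for a document field; assumption: fields expose name, value and boost.
Field = Struct.new(:name, :value, :boost)

# Hypothetical helper mirroring the loop in SolrCellRequest#initialize:
# every field becomes an ext.literal.* parameter, and boosted fields
# additionally get an ext.boost.* parameter.
def literal_params(fields)
  fields.each_with_object({}) do |f, params|
    params["ext.literal.#{f.name}"] = f.value
    params["ext.boost.#{f.name}"] = f.boost if f.boost
  end
end

fields = [Field.new('id', '1', nil), Field.new('last_name_t', 'Hatcher', 2.0)]
p literal_params(fields)
```

The field names and values above are illustrative only; in the real request they come from whatever Solr::Document acts_as_solr builds for the record.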

script/console is still our friend; let’s give it a try using the pure solr-ruby API:

$ script/console
Loading development environment (Rails 2.2.2)
>> solr = Solr::Connection.new("http://localhost:8982/solr")
>> req = SolrCellRequest.new(Solr::Document.new(:id=>1), '/path/to/ErikHatcherResume.pdf')
>> solr.send(req)
>> solr.commit

Checkpoint – we’ve now got Ruby able to index rich files into Solr through a very simple API. What’s left? We have to tie this indexing into the ActiveRecord lifecycle exposed by acts_as_solr. There’s a nice and easy method to override on a per-acts_as_solr-model basis to change how the indexing request to Solr works. It looks like this (in acts_as_solr’s common_methods.rb):

    def solr_add(add_xml)   # note, it is actually a Solr::Document passed in, not XML
      ActsAsSolr::Post.execute(Solr::Request::AddDocument.new(add_xml))
    end

We’ll override that in our Resume model:

class Resume < ActiveRecord::Base
  acts_as_solr

  def solr_add(doc)
    # puts doc.to_xml.to_s # handy view of the Solr doc acts_as_solr builds
    if file_name
      ActsAsSolr::Post.execute(SolrCellRequest.new(doc, file_name))
    else
      ActsAsSolr::Post.execute(Solr::Request::AddDocument.new(doc))
    end
  end
end
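One caveat with the `if file_name` test: in Ruby an empty string is truthy, so a blank file_name (as a scaffold form would likely submit) still takes the Solr Cell branch. A stricter guard could look like the sketch below. extractable? is a hypothetical helper, not part of acts_as_solr, and it checks the path from the Ruby side only, so it relies on the earlier assumption that Solr and the application see the same path.

```ruby
require 'tempfile'

# Hypothetical guard: true only when file_name names an existing, readable file.
def extractable?(file_name)
  return false if file_name.nil? || file_name.strip.empty?
  File.file?(file_name) && File.readable?(file_name)
end

puts extractable?(nil)
puts extractable?('')
Tempfile.create('resume') { |f| puts extractable?(f.path) }
```

Inside solr_add, the condition would then read `if extractable?(file_name)`, falling back to the plain AddDocument request otherwise.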

And now the grand finale, the code we Rubyists love to see, that one elegant line of code:

>> Resume.create(:first_name => "Erik", :last_name=>"Hatcher", :file_name=>"/path/to/ErikHatcherResume.pdf")

And a quick test that shows it works (“java” is in my resume):

>> Resume.find_by_solr("java")
=> #<ActsAsSolr::SearchResults ...=>1, :docs=>[#<Resume ...>]}>

See also Sami Siren’s Content Extraction with Tika article.

We encourage you to provide comments and feedback to us on this entry. I’m particularly interested in hearing from Solr-using Rubyists out there about what challenges you’ve faced in using Solr and how we can help fix bugs or educate further.
