Indexing with SolrJ


Tags: Data Import Handler, SolrJ, Tika

Two popular methods of indexing existing data are the Data Import Handler (DIH) and Tika (Solr Cell)/ExtractingRequestHandler. These can be used to index data from a database or from structured documents (say Word documents, PDFs, and so on). They are great tools for getting things up and running quickly, and I have seen production sites that work well with one or both of them.

OK, then why talk about SolrJ?

Well, somewhere in the architectural document are two boxes that have labels like this, connected by an arrow:

Solr Server ---miraculous connection---> datasource

Oh, all right. Rarely is the connector between the Solr Server/Indexer and the data it’s going to index labeled “miraculous connection”, but I sometimes wish people would be more honest about it. All sorts of things can get in the way here; I’ll mention 0.01% of them:

  • The security people WILL NOT “just open the database for the IP address of the Solr indexer, please”.
  • Actually, it’s not one datasource. It’s three. At least. And, by the way, you really need to cache data from database 1 or performance will die. Don’t forget that each of them requires different credentials.
  • Did I mention that the “miraculous connection” is where our business rules that have to be encoded into the Solr documents live?
  • Hey! I thought we could run the documents through categorization during this step!
  • Actually, the metadata lives in the DB; you take that information and then query the file server, which delivers the document to you.
  • We’re indexing movies. Right, you only want the metadata. What? Solr is going to throw 99.99999% of the data away after you’ve transmitted how much data over my network?
  • <insert your favorite problem here>

And this doesn’t even mention that DIH and Tika are run on the server. That is, the Solr indexer is doing all the work and the poor thing can only go so fast (alright, it blazes, but parsing a bazillion PDF/Word/Excel documents is quite a load for a single machine).

My point is that often there are sound reasons why using DIH and/or Tika is not optimal, from security to taking more control over how “bad” documents are handled to throughput to what tools your organization is most comfortable with. For those situations SolrJ may be the most appropriate choice. What follows is a skeletal program that:

  1. Connects from a Java program to a Solr server, the indexer in this case.
  2. Queries a database via JDBC and selects information from a table, putting it into a suitable form for indexing.
  3. Traverses a file system from a given directory and indexes all of the documents Tika can parse.
  4. Adds these documents to the Solr server.


Please note that the example is very simple; nothing done here couldn’t be done easily with DIH and Solr Cell. The intent is to provide a starting point for you to adapt to your particular situation, where DIH and Solr Cell won’t work right out of the box.

Lots of buildup, not much code

For all the above, the program itself is pretty short. I’ll outline some highlights, and the complete listing is at the end of this article.

Jars and where to get them

There are three sets of jar files you’ll need to run this example.

  1. The Solr jar files. There are two places to look for these: <solr_home>/dist and <solr_home>/dist/solrj-lib. The classes you need to make a SolrJ program do its tricks will be in these two directories.
  2. The Tika jar files. I’d recommend downloading Tika from the Apache Tika project and putting those jars in the classpath for your client.
  3. The appropriate JDBC driver. This will vary depending upon the database you’re connecting to. Often it is available somewhere in your database installation, but just search “jdbc <your database here>” and you should be able to find it.

Note that the only change on the server (and then, only if you’re running the SolrJ program on the server) is the JDBC driver (if necessary)! If you happen to be running the server and client on the same machine, you can simply add the appropriate paths to your CLASSPATH environment variable. Otherwise, you’ll have to copy any jars you need from the server to your client.
Here’s what the code looks like; the full source at the end handles traversing the filesystem, etc.

Set up the Solr connection

This is just the code to set up the connection to the Solr server. The URL passed in to this method is the same URL you’d use to query Solr, e.g. http://localhost:8983/solr. The only extra bit is at the end, where we set up the Tika parser.
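Here’s a minimal sketch of that setup (a sketch, not the exact listing; the method and field names are mine, and HttpSolrServer is the SolrJ class as of 4.x; in 3.x it was CommonsHttpSolrServer, so adjust to your version):

    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.tika.parser.AutoDetectParser;

    private HttpSolrServer server;       // connection to the Solr indexer
    private AutoDetectParser autoParser; // Tika parser that sniffs the file type

    private void setup(String url) {
      // The URL is the same one you'd use to query Solr,
      // e.g. "http://localhost:8983/solr".
      server = new HttpSolrServer(url);

      // The extra bit: a Tika parser that auto-detects document formats.
      autoParser = new AutoDetectParser();
    }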


Index a structured document with Tika from a SolrJ program
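The heart of the Tika bit looks something like this (again a sketch rather than the exact production listing; docList is a List<SolrInputDocument> field that the full listing flushes in batches):

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.InputStream;

    import org.apache.solr.common.SolrInputDocument;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.sax.BodyContentHandler;
    import org.xml.sax.ContentHandler;

    // Called for each file the (not shown) directory traversal finds.
    private void indexFileWithTika(File file) {
      Metadata metadata = new Metadata();
      ContentHandler textHandler = new BodyContentHandler(-1); // -1: no write limit
      ParseContext context = new ParseContext();

      InputStream input = null;
      try {
        input = new FileInputStream(file);
        // Let Tika detect the format and extract the text and metadata.
        autoParser.parse(input, textHandler, metadata, context);
      } catch (Exception e) {
        // Tika couldn't make sense of the file (a jar, an exe, a corrupt
        // document...). Log it and keep going rather than aborting the run.
        System.err.println("Failed to parse " + file + ": " + e.getMessage());
        return;
      } finally {
        if (input != null) {
          try { input.close(); } catch (Exception ignored) {}
        }
      }

      SolrInputDocument doc = new SolrInputDocument();
      doc.addField("id", file.getAbsolutePath());
      doc.addField("author", metadata.get("Author")); // see the caveat below
      doc.addField("text", textHandler.toString());
      docList.add(doc); // sent to Solr in batches of 1,000
    }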


There’s an outer loop in charge of traversing the filesystem that I haven’t shown. All that’s happening here is that Tika is allowed to do its thing. If Tika fails to parse the document (notice I haven’t taken any care to determine that the files are reasonable; a jar file or an exe or whatever could be handed to the parser), we log an error and continue.

If the document does parse, we extract a couple of fields and throw them into the Solr document. That document is then added to a Java List, and when there get to be 1,000 documents in the list, the whole batch is passed to Solr for indexing, as you can see in the full listing. By the way, the example index that comes with the Solr distribution will already have these fields defined.
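The batching itself is just a few lines; something along these lines (a sketch, with names of my own invention):

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.solr.client.solrj.SolrServerException;
    import org.apache.solr.common.SolrInputDocument;

    private List<SolrInputDocument> docList = new ArrayList<SolrInputDocument>();

    // Call after each document; call with force=true once at the very end.
    private void maybeFlush(boolean force) throws IOException, SolrServerException {
      if (docList.size() >= 1000 || (force && !docList.isEmpty())) {
        server.add(docList); // one round trip for the whole batch
        docList.clear();
      }
    }

    // ...and when everything has been queued:
    //   maybeFlush(true);
    //   server.commit();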

But note a subtlety here, even in the trivial case. We assume that the metadata field for author is “Author”. There are no cross-format standards for this; it might be called “document_author”, or maybe you want “last_editor”. You can control all of this either by judicious configuration of Tika or programmatically.

Onwards to the SQL bit

Next, we’ll look at the code that connects via JDBC to a MySQL database. Again, it’s the simplest of database tables and the simplest of extractions. You probably won’t be using the SolrJ solution unless your situation is more complex than this, but it gets you started.
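A sketch of that code (the JDBC URL, the credentials, and the “test” table with its id/title/text columns are all made up for illustration):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    import org.apache.solr.common.SolrInputDocument;

    private void indexDatabase(String jdbcUrl, String user, String password) throws Exception {
      // e.g. jdbcUrl = "jdbc:mysql://localhost:3306/test"
      Connection con = DriverManager.getConnection(jdbcUrl, user, password);
      try {
        Statement st = con.createStatement();
        ResultSet rs = st.executeQuery("SELECT id, title, text FROM test");
        while (rs.next()) {
          // Map the selected columns onto Solr fields, one document per row.
          SolrInputDocument doc = new SolrInputDocument();
          doc.addField("id", rs.getString("id"));
          doc.addField("title", rs.getString("title"));
          doc.addField("text", rs.getString("text"));
          docList.add(doc); // batched and flushed exactly as in the Tika case
        }
        rs.close();
        st.close();
      } finally {
        con.close();
      }
    }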


Again, this is the same sort of process as the last, but now instead of parsing the structured document, we fetch rows from a table and add selected values from those rows to each Solr document. And again we collect those documents in a list to be sent to Solr eventually.

Conclusion

As you can see, using Tika and/or SQL/JDBC from a SolrJ client is not very complicated. I suppose this blog is prompted by the number of requests on the Solr user’s list for examples of how to use SolrJ to index documents. It is rather daunting to be confronted with the whole of the Solr API documentation and not have a clue where to start; I hope this example de-mystifies the process a bit.

Environment

I compiled and tested this code against a current trunk (4.0) version of Solr, but it should work when compiled against a 3.x version. If not, the changes should be minimal.

Full source code and disclaimer

One of the delights about writing examples is that one can leave out all the ugly error-handling, logging, etc. This code needs to be beefed up considerably for production purposes, and your situation will almost certainly be much more complex or you wouldn’t have a need to worry about SolrJ in the first place! So feel free to use this as a basis for going forward with SolrJ but it’s only an example after all.
Also note that I’ve used SolrJ as an example, but there are other client implementations: C#, PHP, etc. The Java version is the one I’m most comfortable with, and it tends to be the most up-to-date. That said, the other clients are perfectly fine if they fit your comfort zone/environment better.

And I’ll have to ask you to pardon the ugly formatting, but cut-n-paste it into your favorite IDE and it’ll look much better!

