Indexing with SolrJ


Tags: #Data Import Handler #SolrJ #Tika

Two popular methods of indexing existing data are the Data Import Handler (DIH) and Tika (Solr Cell)/ExtractingRequestHandler. These can be used to index data from a database or from structured documents (say Word documents, PDFs, and so on). These are great tools for getting things up and running quickly, and I have seen production sites that work well with one or both of them.

OK, then why talk about SolrJ?

Well, somewhere in the architectural document are two boxes that have labels like this, connected by an arrow:

Solr Server  ---miraculous connection---> datasource

Oh, all right. Rarely is the connector between the Solr Server/Indexer and the data it’s going to index labeled “miraculous connection”, but I sometimes wish people would be more honest about it. All sorts of things can get in the way here; I’ll mention 0.01% of them:

  • The security people WILL NOT “just open the database for the IP address of the Solr indexer, please”.
  • Actually, it’s not one datasource. It’s three. At least. And, by the way, you really need to cache data from database 1 or performance will die. Don’t forget that each of them requires different credentials.
  • Did I mention that the “miraculous connection” is where our business rules that have to be encoded into the Solr documents live?
  • Hey! I thought we could run the documents through categorization during this step!
  • Actually, the meta-data lives in the DB, and you take that information then query the file server which will deliver the document to you.
  • We’re indexing movies. Right, you only want the metadata. What? Solr is going to throw 99.99999% of the data away after you’ve transmitted how much data over my network?
  • <insert your favorite problem here>

And this doesn’t even mention that DIH and Tika are run on the server. That is, the Solr indexer is doing all the work and the poor thing can only go so fast (alright, it blazes, but parsing a bazillion PDF/Word/Excel documents is quite a load for a single machine).

My point is that there are often sound reasons why using DIH and/or server-side Tika is not optimal, from security, to taking more control over how “bad” documents are handled, to throughput, to what tools your organization is most comfortable with. For those situations SolrJ may be the most appropriate choice. What follows is a skeletal program that:

  1. Connects from a Java program to a Solr server, the indexer in this case.
  2. Queries a database via JDBC and selects information from a table, putting it into a suitable form for indexing.
  3. Traverses a file system and indexes all of the documents Tika can parse given a directory.
  4. Adds these documents to the Solr server.

 

Please note that the example is very simple; nothing being done here couldn’t be done easily with DIH and Solr Cell. The intent is to provide a starting point for you to adapt to your particular situation, where DIH and Solr Cell won’t work right out of the box.

Lots of buildup, not much code

For all the above, the program itself is pretty short. I’ll outline some highlights, and the complete listing is at the end of this article.

Jars and where to get them.

There are three sets of jar files you’ll need to run this example.

  1. The Solr jar files. There are two places to look for these, <solr_home>/dist and <solr_home>/dist/solrj-lib. The classes a SolrJ program needs to do its tricks will be in these two directories.
  2. The Tika jar files. I’d recommend downloading Tika from the Apache project (see Apache Tika) and putting those jars in the classpath for your client.
  3. The appropriate JDBC driver. This will vary depending upon the database you’re connecting to. Often it is available somewhere in your database installation, but just search “jdbc <your database here>” and you should be able to find it.

Note that the only change on the server (and then, only if you’re running the SolrJ program on the server) is the JDBC driver (if necessary)! If you happen to be running the server and client on the same machine, you can simply add the appropriate paths to your CLASSPATH environment variable. Otherwise, you’ll have to copy any jars you need from the server to your client.
Here’s what that code looks like. The full source at the end handles traversing the filesystem etc.

Set up the Solr connection

This is just the code to set up the connection to the Solr server. The URL passed in to this method is the same URL you’d use to query Solr, e.g. http://localhost:8983/solr. The only extra bit is at the end, where we set up the Tika parser.

 

private SqlTikaExample(String url) throws IOException, SolrServerException {
    // Create a multi-threaded communications channel to the Solr server.
    // Could use CommonsHttpSolrServer instead.
    // For SolrCloud, use CloudSolrServer which takes the
    // ZooKeeper address.
  _server = new StreamingUpdateSolrServer(url, 10, 4);
  _server.setSoTimeout(1000); // socket read timeout
  _server.setConnectionTimeout(1000);
  _server.setMaxRetries(1); // defaults to 0. > 1 not recommended.
    // binary parser is used by default for responses
  _server.setParser(new XMLResponseParser());
    // One of the ways Tika can be used to attempt to parse arbitrary files.
  _autoParser = new AutoDetectParser();
}

 

Index a structured document with Tika from a SolrJ program

 

 ContentHandler textHandler = new BodyContentHandler();
 Metadata metadata = new Metadata();
 ParseContext context = new ParseContext();
 InputStream input = new FileInputStream(file);
   // Try parsing the file. Note we haven't checked at all to
   // see whether this file is a good candidate.
 try {
   _autoParser.parse(input, textHandler, metadata, context);
 } catch (Exception e) {
      // Needs better logging of what went wrong in
      // order to track down "bad" documents.
   log(String.format("File %s failed", file.getCanonicalPath()));
   e.printStackTrace();
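   continue;
 } finally {
      // The parser does not close the stream, so release the
      // file handle here whether or not parsing worked.
   input.close();
 }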
    // Just dump ALL the meta-data, remove this
    // in any production environment of course.
 dumpMetadata(file.getCanonicalPath(), metadata);
    // Index just a couple of the meta-data fields.
 SolrInputDocument doc = new SolrInputDocument();
 doc.addField("id", file.getCanonicalPath());
    // Crude way to get known meta-data fields. Also
    // possible to write a simple loop to examine all the
    // metadata returned and selectively index it and/or just
    // get a list of them.
    // One can also use the LucidWorks field mapping to
    // accomplish much the same thing.
 String author = metadata.get("Author");
 if (author != null) {
   doc.addField("author", author);
 }
 doc.addField("text", textHandler.toString());
 _docs.add(doc);

 

There’s an outer loop in charge of traversing the filesystem that I haven’t shown. All that’s happening here is that Tika is allowed to do its thing. Notice I haven’t taken any care to determine whether the files are reasonable candidates; a jar file, an exe file, or whatever else is in the directory will be handed to the parser. If Tika fails to parse the document, we log an error and continue.
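One cheap way to avoid handing obviously unsuitable files to Tika is to check the file extension first. The following is a minimal sketch, not part of the original program: the class name `CandidateFilter`, the method `isCandidate`, and the extension list are all assumptions to adapt to your own data.

```java
import java.io.File;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of the original listing.
class CandidateFilter {
  // Assumed whitelist; adjust to the formats you actually expect to index.
  private static final Set<String> EXTENSIONS = new HashSet<String>(
      Arrays.asList("pdf", "doc", "docx", "xls", "xlsx", "ppt", "pptx", "txt", "html"));

  // Returns true if the file name ends in one of the whitelisted extensions.
  static boolean isCandidate(File file) {
    String name = file.getName();
    int dot = name.lastIndexOf('.');
    if (dot < 0 || dot == name.length() - 1) {
      return false; // no extension at all, or a trailing dot
    }
    String ext = name.substring(dot + 1).toLowerCase(Locale.ROOT);
    return EXTENSIONS.contains(ext);
  }
}
```

In the traversal loop, a line such as `if (!CandidateFilter.isCandidate(file)) continue;` placed before the stream is opened would skip these files. An extension check is crude, of course; it only filters by name and says nothing about whether the content is actually parseable.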

If the document does parse, we extract a couple of fields and throw them into the Solr document. That document is then added to a Java List, and when there are 1,000 documents in the list, the whole thing is passed to Solr for indexing, as you can see in the full listing. By the way, the example index that comes with the Solr distribution will already have these fields defined.
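The batching pattern described here can be isolated into a small helper. This is only a sketch under stated assumptions: `BatchBuffer` and its `Sender` interface are hypothetical names, not SolrJ classes, and `Sender` stands in for the `_server.add(_docs, 300000)` call in the full listing.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper illustrating the batching pattern; not a SolrJ class.
class BatchBuffer<T> {
  // Stand-in for whatever actually ships a batch, e.g. a wrapper around
  // _server.add(...). A real implementation would have to deal with the
  // checked exceptions SolrJ throws; that is left out of this sketch.
  interface Sender<E> {
    void send(List<E> batch);
  }

  private final int batchSize;
  private final Sender<T> sender;
  private final List<T> pending = new ArrayList<T>();

  BatchBuffer(int batchSize, Sender<T> sender) {
    if (batchSize < 1) {
      throw new IllegalArgumentException("batchSize must be at least 1");
    }
    this.batchSize = batchSize;
    this.sender = sender;
  }

  // Queue one item and send the batch as soon as it is full.
  void add(T item) {
    pending.add(item);
    if (pending.size() >= batchSize) {
      flush();
    }
  }

  // Send whatever is queued. Call this once more at the very end
  // so the leftover items are not lost.
  void flush() {
    if (pending.isEmpty()) {
      return;
    }
    // Hand over a copy so the sender may keep the list after we clear ours.
    sender.send(new ArrayList<T>(pending));
    pending.clear();
  }
}
```

With a batch size of 3 and seven items added, the sender would see batches of 3, 3, and then 1 after the final `flush()`. The final flush plays the same role as the leftover check in `endIndexing()` in the full listing.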

But note a subtlety here, even in the trivial case. We assume that the meta-data field for author is “Author”. There are no cross-format standards for this; it might be called “document_author”, or maybe you want “last_editor”. You can control all of this here, either by judicious configuration of Tika or programmatically.
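One programmatic approach is to try several candidate metadata names in priority order and take the first one that has a value. In this sketch a plain `Map` stands in for Tika's `Metadata` object, and the class name, method name, and list of keys are illustrative assumptions rather than anything Tika defines.

```java
import java.util.Map;

// Hypothetical helper; a Map stands in for Tika's Metadata object here.
class AuthorLookup {
  // Candidate metadata names in priority order. These are the example
  // names from the text, not a list of what any given format emits.
  static final String[] AUTHOR_KEYS = {"Author", "document_author", "last_editor"};

  // Returns the first non-blank value found under any of the keys,
  // trimmed, or null if none of the keys has a usable value.
  static String firstValue(Map<String, String> metadata, String... keys) {
    for (String key : keys) {
      String value = metadata.get(key);
      if (value != null && !value.trim().isEmpty()) {
        return value.trim();
      }
    }
    return null;
  }
}
```

The result can then go through the same `if (author != null)` check as in the snippet above before being added to the Solr document. Dumping the metadata for a sample of your own files is probably the most reliable way to find out which names to put in the list.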

Onwards to the SQL bit

Next, we’ll look at the code that connects via JDBC to a MySQL database. Again, it’s the simplest of database tables and the simplest of extractions. You probably won’t be using the SolrJ solution unless your situation is more complex than this, but this gets you started.

 

    Class.forName("com.mysql.jdbc.Driver").newInstance();
    log("Driver Loaded......");

    con = DriverManager.getConnection("jdbc:mysql://localhost:3306/test?" + "user=test&password=test");

    Statement st = con.createStatement();
    ResultSet rs = st.executeQuery("select id,title,text from test");
    while (rs.next()) {
         // DO NOT move this outside the while loop
      SolrInputDocument doc = new SolrInputDocument();

      doc.addField("id", rs.getString("id"));
      doc.addField("title", rs.getString("title"));
      doc.addField("text", rs.getString("text"));

      _docs.add(doc);
   }

 

Again, this is the same sort of process as the last, but now instead of parsing the structured document, we fetch rows from a table and add selected values from those rows to each Solr document. And again we collect those documents in a list to be sent to Solr eventually.

Conclusion

As you can see, using Tika and/or SQL/JDBC from a SolrJ client is not very complicated. This post was prompted by the number of requests on the Solr users’ list for samples of how to use SolrJ to index documents. It is rather daunting to be confronted with the whole of the Solr API documentation and not have a clue where to start; I hope this example de-mystifies the process a bit.

Environment

I compiled and tested this code against a current trunk (4.0) version of Solr, but it should work when compiled against a 3.x version. If not, the changes should be minimal.

Full source code and disclaimer

One of the delights about writing examples is that one can leave out all the ugly error-handling, logging, etc. This code needs to be beefed up considerably for production purposes, and your situation will almost certainly be much more complex or you wouldn’t have a need to worry about SolrJ in the first place! So feel free to use this as a basis for going forward with SolrJ but it’s only an example after all.
Also note that I’ve used SolrJ as an example, but there are other implementations, C#, PHP, etc. The Java version is the one I’m most comfortable with, and tends to be the most up-to-date. That said, the other clients are perfectly fine if they fit into your comfort zone/environment better.

And I’ll have to ask you to pardon the ugly formatting, but cut-n-paste it into your favorite IDE and it’ll look much better!

package SolrJExample;

import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer;
import org.apache.solr.client.solrj.impl.XMLResponseParser;
import org.apache.solr.client.solrj.response.UpdateResponse;
import org.apache.solr.common.SolrInputDocument;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.sql.*;
import java.util.ArrayList;
import java.util.Collection;

/* Example class showing the skeleton of using Tika and
   Sql on the client to index documents from
   both structured documents and a SQL database.

   NOTE: The SQL example and the Tika example are entirely orthogonal.
   Both are included here to make a
   more interesting example, but you can omit either of them.

 */
public class SqlTikaExample {
  private StreamingUpdateSolrServer _server;
  private long _start = System.currentTimeMillis();
  private AutoDetectParser _autoParser;
  private int _totalTika = 0;
  private int _totalSql = 0;

  private Collection<SolrInputDocument> _docs = new ArrayList<SolrInputDocument>();

  public static void main(String[] args) {
    try {
      SqlTikaExample idxer = new SqlTikaExample("http://localhost:8983/solr");

      idxer.doTikaDocuments(new File("/Users/Erick/testdocs"));
      idxer.doSqlDocuments();

      idxer.endIndexing();
    } catch (Exception e) {
      e.printStackTrace();
    }
  }

  private SqlTikaExample(String url) throws IOException, SolrServerException {
      // Create a multi-threaded communications channel to the Solr server.
      // Could be CommonsHttpSolrServer as well.
      //
    _server = new StreamingUpdateSolrServer(url, 10, 4);

    _server.setSoTimeout(1000);  // socket read timeout
    _server.setConnectionTimeout(1000);
    _server.setMaxRetries(1); // defaults to 0.  > 1 not recommended.
         // binary parser is used by default for responses
    _server.setParser(new XMLResponseParser());

      // One of the ways Tika can be used to attempt to parse arbitrary files.
    _autoParser = new AutoDetectParser();
  }

    // Just a convenient place to wrap things up.
  private void endIndexing() throws IOException, SolrServerException {
    if (_docs.size() > 0) { // Are there any documents left over?
      _server.add(_docs, 300000); // Commit within 5 minutes
    }
    _server.commit(); // Only needs to be done at the end,
                      // commitWithin should do the rest.
                      // Could even be omitted
                      // assuming commitWithin was specified.
    long endTime = System.currentTimeMillis();
    log("Total Time Taken: " + (endTime - _start) +
         " milliseconds to index " + _totalSql +
        " SQL rows and " + _totalTika + " documents");
  }

  // I hate writing System.out.println() everyplace,
  // besides this gives a central place to convert to true logging
  // in a production system.
  private static void log(String msg) {
    System.out.println(msg);
  }

  /**
   * ***************************Tika processing here
   */
  // Recursively traverse the filesystem, parsing everything found.
  private void doTikaDocuments(File root) throws IOException, SolrServerException {

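    // listFiles() returns null if root is not a readable directory.
    File[] files = root.listFiles();
    if (files == null) {
      log("Cannot list directory: " + root);
      return;
    }
    // Simple loop for recursively indexing all the files
    // in the root directory passed in.
    for (File file : files) {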
      if (file.isDirectory()) {
        doTikaDocuments(file);
        continue;
      }
        // Get ready to parse the file.
      ContentHandler textHandler = new BodyContentHandler();
      Metadata metadata = new Metadata();
      ParseContext context = new ParseContext();

      InputStream input = new FileInputStream(file);

        // Try parsing the file. Note we haven't checked at all to
        // see whether this file is a good candidate.
      try {
        _autoParser.parse(input, textHandler, metadata, context);
      } catch (Exception e) {
          // Needs better logging of what went wrong in order to
          // track down "bad" documents.
        log(String.format("File %s failed", file.getCanonicalPath()));
        e.printStackTrace();
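        continue;
      } finally {
          // The parser does not close the stream, so release the
          // file handle here whether or not parsing worked.
        input.close();
      }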
      // Just to show how much meta-data and what form it's in.
      dumpMetadata(file.getCanonicalPath(), metadata);

      // Index just a couple of the meta-data fields.
      SolrInputDocument doc = new SolrInputDocument();

      doc.addField("id", file.getCanonicalPath());

      // Crude way to get known meta-data fields.
      // Also possible to write a simple loop to examine all the
      // metadata returned and selectively index it and/or
      // just get a list of them.
      // One can also use the LucidWorks field mapping to
      // accomplish much the same thing.
      String author = metadata.get("Author");

      if (author != null) {
        doc.addField("author", author);
      }

      doc.addField("text", textHandler.toString());

      _docs.add(doc);
      ++_totalTika;

      // Completely arbitrary, just batch up more than one document
      // for throughput!
      if (_docs.size() >= 1000) {
          // Commit within 5 minutes.
        UpdateResponse resp = _server.add(_docs, 300000);
        if (resp.getStatus() != 0) {
          log("Some horrible error has occurred, status is: " +
                  resp.getStatus());
        }
        _docs.clear();
      }
    }
  }

    // Just to show all the metadata that's available.
  private void dumpMetadata(String fileName, Metadata metadata) {
    log("Dumping metadata for file: " + fileName);
    for (String name : metadata.names()) {
      log(name + ":" + metadata.get(name));
    }
    log("\n\n");
  }

  /**
   * ***************************SQL processing here
   */
  private void doSqlDocuments() throws SQLException {
    Connection con = null;
    try {
      Class.forName("com.mysql.jdbc.Driver").newInstance();
      log("Driver Loaded......");

      con = DriverManager.getConnection("jdbc:mysql://192.168.1.103:3306/test?"
                + "user=testuser&password=test123");

      Statement st = con.createStatement();
      ResultSet rs = st.executeQuery("select id,title,text from test");

      while (rs.next()) {
        // DO NOT move this outside the while loop
        SolrInputDocument doc = new SolrInputDocument(); 
        String id = rs.getString("id");
        String title = rs.getString("title");
        String text = rs.getString("text");

        doc.addField("id", id);
        doc.addField("title", title);
        doc.addField("text", text);

        _docs.add(doc);
        ++_totalSql;

        // Completely arbitrary, just batch up more than one
        // document for throughput!
        if (_docs.size() >= 1000) {
             // Commit within 5 minutes.
          UpdateResponse resp = _server.add(_docs, 300000);
          if (resp.getStatus() != 0) {
            log("Some horrible error has occurred, status is: " +
                  resp.getStatus());
          }
          _docs.clear();
        }
      }
    } catch (Exception ex) {
      ex.printStackTrace();
    } finally {
      if (con != null) {
        con.close();
      }
    }
  }
}
