UpdateRequestProcessors, Transforming Data Entering Solr
Update processors have been around for a long time, but they don’t seem to have garnered much attention. This post is intended to give them a little more visibility and show a simple use-case that I ran across recently.
I’m assuming a reasonable familiarity with Solr schema elements here, especially analysis chains, stored/indexed data etc. so I’ll get right to the point.
The high-level problem
There are, as you probably already know, a ton of transformations you can apply to text being analyzed as part of analysis chains defined in your schema.xml. Here’s a partial list of Analyzers and Tokenizers. However, there are two caveats with these:
- All of these only work on data after it has made it through the “data/index fork” (more below) and is heading down the index (not the stored) side of that fork.
- These can’t be used with non-text types (e.g. date, numeric, etc).
The data/index fork
This is a term of my own invention. Let’s say you’ve defined a field with indexed="true" stored="true". At a high level, this looks like:

          input stream
               |
               |
        _______________
        |             |
   stored data        |
                analysis chain
                      |
                    index
The critical bit here is that the raw input stream is split and sent to the stored fork and the index fork independently. Let’s say you have some input like this “my dog’s fleas were bad on 24-May, 2010”. Let’s furthermore claim you want to parse out the date and put it in a tdate field as well as have some or all of it searchable in a text field. How would you go about this? One way, of course, is to write a custom update processor. While perfectly reasonable, that comes with some costs you may not want to incur in terms of making sure all of your nodes in a 100-node SolrCloud cluster have the right jar in the right place.
Fine, you say. I’ll just add a copyField from the text field to a tdate field, apply a regex to it to pull out 24-May, 2010 and transform it into a proper Solr date like 2010-05-24T00:00:00Z and I’ll be fine. So you work with your favorite regex tool, create a wonderful regular expression that’ll do exactly what you want and stick a PatternReplaceCharFilterFactory in your date field and… Oops. tdate (and the other ‘primitive’ types like int/tint, float/tfloat etc.) do not have analysis chains. By the way, Uwe Schindler was kind enough to clue me in on the idea that making primitive types have analysis chains was a non-trivial task.
And even if you could do this, the stored data would still be the original text, “my dog’s fleas were bad on 24-May, 2010”. That would be very confusing to a user who expected a date and instead saw the whole sentence come back from what is supposed to be a date field (remember, when you specify a field to be returned, you get the original input).
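The parsing step itself is not the hard part; the question is only where it gets to run. Here is a minimal sketch of that step in plain Java with no Solr classes involved. The class name DateExtractor, the method toSolrDate and the exact regex are my own assumptions for illustration, and the sketch only handles dates shaped like the sample above.

```java
import java.time.LocalDate;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.Locale;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Solr.
class DateExtractor {
    // Assumed input shape: day, three-letter English month, year, e.g. "24-May, 2010".
    private static final Pattern DATE =
            Pattern.compile("\\b\\d{1,2}-[A-Za-z]{3}, \\d{4}\\b");

    private static final DateTimeFormatter INPUT =
            DateTimeFormatter.ofPattern("d-MMM, uuuu", Locale.ENGLISH);

    // Returns the first date found in the text as a UTC timestamp at midnight,
    // or an empty Optional if nothing parseable is present.
    static Optional<String> toSolrDate(String text) {
        Matcher m = DATE.matcher(text);
        if (!m.find()) {
            return Optional.empty();
        }
        try {
            LocalDate date = LocalDate.parse(m.group(), INPUT);
            // Instant.toString() produces the ISO-8601 form, e.g. 2010-05-24T00:00:00Z.
            return Optional.of(date.atStartOfDay(ZoneOffset.UTC).toInstant().toString());
        } catch (DateTimeParseException e) {
            // Matched the regex but not a real month name, e.g. "24-Foo, 2010".
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        String input = "my dog's fleas were bad on 24-May, 2010";
        System.out.println(toSolrDate(input).orElse("no date found"));
        // prints 2010-05-24T00:00:00Z
    }
}
```

Logic like this has to run somewhere before the value reaches the tdate field, since the field type itself gives you no place to put it.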
Update Processors, where they fit in the scheme of things
Let’s look at how this picture changes with update processors:
          input stream
               |
               |
       update processor(s)
               |
        _______|_______
        |             |
   stored data        |
                analysis chain
                      |
                    index
Notice that the update processor is in the chain before the data/index fork. This means that any transformations you do on the input are reflected both in the stored data and in the indexed data.
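To make the ordering concrete, here is a toy model of that pipeline in plain Java. It is emphatically not Solr code: ForkModel, ingest and analyze are hypothetical names, a processor is reduced to a function from String to String, and the "analysis chain" is just lowercasing plus whitespace splitting. All it demonstrates is that whatever runs before the fork shows up on both sides of it.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.function.UnaryOperator;

// Toy stand-in for the picture above; none of these names exist in Solr.
class ForkModel {
    // What lands on each side of the fork for a single field value.
    static final class Result {
        final String stored;
        final List<String> indexed;

        Result(String stored, List<String> indexed) {
            this.stored = stored;
            this.indexed = indexed;
        }
    }

    // Stand-in analysis chain: lowercase, then split on whitespace.
    static List<String> analyze(String value) {
        return Arrays.asList(value.toLowerCase(Locale.ROOT).split("\\s+"));
    }

    // Run every processor first, then fork: the stored side keeps the
    // processed value verbatim, the index side analyzes that same value.
    @SafeVarargs
    static Result ingest(String input, UnaryOperator<String>... processors) {
        String value = input;
        for (UnaryOperator<String> processor : processors) {
            value = processor.apply(value);
        }
        return new Result(value, analyze(value));
    }

    public static void main(String[] args) {
        Result raw = ingest("Fleas On 24-May");
        Result cooked = ingest("Fleas On 24-May", s -> s.replace("-", " "));
        System.out.println(raw.stored + " -> " + raw.indexed);
        // prints Fleas On 24-May -> [fleas, on, 24-may]
        System.out.println(cooked.stored + " -> " + cooked.indexed);
        // prints Fleas On 24 May -> [fleas, on, 24, may]
    }
}
```

Without a processor the stored value is the raw input; with one, both the stored value and the indexed tokens reflect the change.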
What processors already exist?
As you might imagine, there are several that already exist. Well, more than several. I counted over 30 without even trying. Alexandre Rafalovitch is working on a comprehensive listing with links here. Meanwhile, let’s look at a use-case.
A simple problem:
You have textual input that could use either commas (,) or periods (.) as the thousands separator, with the other one as the decimal separator. 1,000,000.00 and 1.000.000,00 are equivalent and you want to:
- Put it into a tfloat field
- Convert it to a single input format
It ought to be easy, just use a copyField and… er… remember that you can’t put anything in the analysis chain for a primitive type. Look again at the diagram above: an update processor is just the ticket!
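The normalization itself might look something like the following plain-Java sketch. SeparatorNormalizer and its methods are hypothetical names of mine, and the big assumption is that the rightmost comma or period is always the decimal separator. That holds for both sample values above, but it means an input like 1,000 would be read as one, not one thousand, so real data may need a stricter rule.

```java
// Hypothetical helper, not part of Solr.
class SeparatorNormalizer {
    // Assumption: the rightmost ',' or '.' is the decimal separator and
    // every earlier ',' or '.' is a grouping separator to be dropped.
    static String normalize(String raw) {
        String s = raw.trim();
        int lastSep = Math.max(s.lastIndexOf(','), s.lastIndexOf('.'));
        if (lastSep < 0) {
            return s; // no separators at all, nothing to do
        }
        String whole = s.substring(0, lastSep).replaceAll("[.,]", "");
        String fraction = s.substring(lastSep + 1);
        return whole + "." + fraction;
    }

    // Convenience wrapper producing the value a float field would receive.
    static float toFloat(String raw) {
        return Float.parseFloat(normalize(raw));
    }

    public static void main(String[] args) {
        System.out.println(normalize("1,000,000.00")); // prints 1000000.00
        System.out.println(normalize("1.000.000,00")); // prints 1000000.00
    }
}
```

Both spellings collapse to the same string, which is the "single input format" the tfloat field needs.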
There are two kinds of update processors.
- UpdateRequestProcessor
- ScriptUpdateProcessor. Actually, this is a specialized UpdateRequestProcessor, but I want to call it out especially, see below.
Both require that you make some configuration changes, and both are very powerful. The work is done in a method (I’ll use Java) with a signature of
processAdd(AddUpdateCommand cmd)
The “cmd” variable gives access to the SolrInputDocument (via cmd.getSolrInputDocument()), which contains all the fields and values for the Solr document, and to the SolrQueryRequest (via cmd.getReq()), an interface that gives you access to everything about the request. So you have the information here to do pretty much whatever you please.
They both fit in the same place in the input pipeline as described above.
So what’s the difference?
The difference is that the script update processor does not require a jar file to work; it just requires a script file to be carried along in the conf directory. Hmmm, seems like a trivial difference. I mean, you still have to change your solrconfig.xml file, debug the code and all that.
True. But for the script update processor, you can let SolrCloud take care of distribution for you. If you have a jar file that your running Solr needs in order to function, you are responsible for ensuring that all the nodes in your cluster have the jar in the right place. And that it’s up to date. This can actually be quite painful unless you repackage your jar file with the war (or ear or whatever). Which can be painful itself.
Not so with the script version. Since it’s in the “conf” directory, it’s held by ZooKeeper (of course, you have to put it there). From there on, any time a Solr core reloads it asks ZooKeeper “has my configuration changed?”. If so, all changed files are brought down locally and “just work”. Especially as you add or move nodes around in your collection, you don’t have to remember to also move your jar.
You can see how this would simplify administration over the jar variant: you create and test your script in one place and then push it to ZK. Reload your collection and you’re done.
Conclusion
Update processors deserve more appreciation. They have advantages when processing data, especially when extracting data from one field and putting it into another, transforming it along the way. And super-especially when going from textual data to “primitive” types since primitive types do not have analysis chains.
Making a custom update processor in Java and carrying the jar along in your classpath for various web containers is certainly do-able. That said, using the script update processor allows one to use any of several scripting languages and have the code automatically distributed to all nodes in a SolrCloud cluster.
You can use whichever variant you’re most comfortable with. You could, of course, move the process upstream and have whatever process feeds the docs to Solr do the work. Which path you choose is a matter of your particular situation.
And don’t overlook the many canned update processors you can use out-of-the-box! Alexandre Rafalovitch is creating a list of them, a big shout-out to him! And I would be remiss if I didn’t mention Erik Hatcher’s work on the ScriptUpdateProcessors!