Recently, Uwe Schindler and others have added a new capability to Lucene and Solr to make working with numeric ranges a lot faster.  I haven’t tried out this new functionality yet, so I thought I would walk through it here and explore it’s capabilities.

Since Lucene treats most everything as Strings, encoding  numbers and dates and then utilizing them in ranges has always required a little extra work to make it perform well.  Previously, one would have to have either use less precision or slower running queries in order to work with ranges that had a lot of distinct values.  This is due to the need for Lucene to enumerate through a large number of terms.

Now, however, thanks to the new org.apache.lucene.search.trie package (currently located in the contrib/queries area of Lucene, but it may move to the Lucene core, see [1] below) and it’s addition to Solr via a FieldType, Lucene and Solr users can take advantage of much faster range searches.

To try it out, I gave it a spin Solr by doing the following:

First, I grabbed the FieldType definitions from the Solr example schema:


<fieldType name="tint" class="solr.TrieField" type="integer" omitNorms="true" positionIncrementGap="0" indexed="true" stored="false" />
<fieldType name="tint4" class="solr.TrieField" precisionStep="4" type="integer" omitNorms="true" positionIncrementGap="0" indexed="true" stored="false" />
<fieldType name="tfloat" class="solr.TrieField" type="float" omitNorms="true" positionIncrementGap="0" indexed="true" stored="false" />
<fieldType name="tlong" class="solr.TrieField" type="long" omitNorms="true" positionIncrementGap="0" indexed="true" stored="false" />
<fieldType name="tdouble" class="solr.TrieField" type="double" omitNorms="true" positionIncrementGap="0" indexed="true" stored="false" />
<fieldType name="tdouble4" class="solr.TrieField" type="double" precisionStep="4" omitNorms="true" positionIncrementGap="0" indexed="true" stored="false" />

Next, I declared a few Fields.  In my case, I just used a bunch of dynamic fields, because I’m not too worried about the actual field names:


<field name="id" type="string" indexed="true" stored="true" />
<dynamicField name="*_i" type="integer" indexed="true" stored="true"/>
<dynamicField name="*_f" type="float" indexed="true" stored="true"/>
<dynamicField name="*_d" type="double" indexed="true" stored="true"/>
<dynamicField name="*_l" type="long" indexed="true" stored="true"/>
<dynamicField name="*_si" type="sint" indexed="true" stored="true"/>
<dynamicField name="*_s" type="string" indexed="true" stored="true"/>
<dynamicField name="*_sl" type="slong" indexed="true" stored="true"/>
<dynamicField name="*_t" type="text" indexed="true" stored="true"/>
<dynamicField name="*_b" type="boolean" indexed="true" stored="true"/>
<dynamicField name="*_sf" type="sfloat" indexed="true" stored="true"/>
<dynamicField name="*_sd" type="sdouble" indexed="true" stored="true"/>
<dynamicField name="*_dt" type="date" indexed="true" stored="true"/>

<dynamicField name="*_tri" type="tint" indexed="true" stored="true"/>
<dynamicField name="*_tri4" type="tint4" indexed="true" stored="true"/>
<dynamicField name="*_trf" type="tfloat" indexed="true" stored="true"/>
<dynamicField name="*_trl" type="tlong" indexed="true" stored="true"/>
<dynamicField name="*_trd" type="tdouble" indexed="true" stored="true"/>
<dynamicField name="*_trd4" type="tdouble4" indexed="true" stored="true"/>

Now it’s time for the rubber to meet the road, as they say.  In other words, we need to put in some documents and search.  In my case, I’m going to create  three fields for every numeric value, one as a “regular” primitive, one as a “sortable” primitive, and one as a Trie* field, just to underscore some of the differences.  My SolrInputDocument looks like:

SolrInputDocument result = new SolrInputDocument();

result.addField(idField, id);

addField(result, "integer_i", "integer_si", "integer_tri", random.nextInt(10000));//get only positive values
addField(result, "float_f", "float_sf", "float_trf", random.nextFloat()*100);
addField(result, "long_l", "long_sl", "long_trl", Math.abs(random.nextLong()));
double dbl = random.nextDouble() * 100;
addField(result, "double_d", "double_sd", "double_trd", dbl);
result.addField("double_trd4", dbl);

(all the addField does is add the value to the document three times, one for each field name.  I could have used copyField in the schema.xml instead, but I wanted to explicitly declare them during document construction as opposed to editing the schema later when I change my mind).  I then created an index with 10K documents.

Let’s start off with some basics.  First up, I want to find all docs that have an integer value following between 0 and 5,000.

Naturally, the wrong way to do this is:

http://localhost:8983/solr/select/?&version=2.2&start=0&rows=10&indent=on&fl=id,integer_i&q=integer_i:[0%20TO%205000]&debugQuery=true

In this case, you get the lexicographic sorting and not the numeric sorting, which yields results, albeit bad ones.  Bad Grant.

So, then I move over to the sortable version, using the same query, but changing the field to be integer_si.  This now produces correct results.  Not much to say here other than it works.

Finally, I tried  out the Trie stuff by changing the query to use integer_tri.  The results are the same as the sortable stuff, but, in my totally informal testing, the Trie stuff is a lot faster (others have done more formal testing, so I feel comfortable with my results.)  Good news!

Next, I tried it with a lot more documents (1M) and saw pretty much the same speedups.  In this case, I actually tried getting a much bigger range (50K), which is slower than the smaller range, but still faster than the sortable approach.

Of course, this is only scratching the surface.  The take away, though, is the new Trie stuff in L/S holds a lot of promise for speeding up range based numeric queries and further blurs the line between search engines and databases (I’d argue it makes search all that more compelling, but…)  More importantly, it is not dependent on the index size, but instead the precision chosen.  Essentially, it formalizes what many people have done in practice through the years with various field values.  See [2] for more details.

Look for the Trie Range stuff (possibly renamed) to be officially released in Lucene 2.9 and Solr 1.4

References:

  1. http://hudson.zones.apache.org/hudson/job/Lucene-trunk/javadoc//contrib-queries/org/apache/lucene/search/trie/package-summary.html
  2. http://www.thetaphi.de/share/Schindler-TrieRange.ppt

About Grant Ingersoll

Read more from this author

LEARN MORE

Contact us today to learn how Lucidworks can help your team create powerful search and discovery applications for your customers and employees.