In the latest release of Apache Solr (4.2), support for DocValue field types was added. DocValues have been in the works in Lucene for a while (~5 years), also under the name “column stride fields”, but have only recently become stable enough to incorporate into Solr. For those of you not familiar with the Lucene lingo, DocValues are column-oriented fields. In other words, the values of a DocValue field are densely packed into columns instead of being sparsely stored, document by document, as they are with stored fields.

To illustrate with some JSON:

row-oriented (stored fields)

{
  'doc1': {'A':1, 'B':2, 'C':3},
  'doc2': {'A':2, 'B':3, 'C':4},
  'doc3': {'A':4, 'B':3, 'C':2}
}

column-oriented (docValues)

{
  'A': {'doc1':1, 'doc2':2, 'doc3':4},
  'B': {'doc1':2, 'doc2':3, 'doc3':3},
  'C': {'doc1':3, 'doc2':4, 'doc3':2}
}
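
To make the pivot concrete, here is a minimal Python sketch that turns the row-oriented view above into the column-oriented one. The function name to_columns is my own for illustration; it is a toy model of the layout, not how Lucene stores anything on disk.

```python
# Row-oriented view: one record per document (like stored fields).
rows = {
    "doc1": {"A": 1, "B": 2, "C": 3},
    "doc2": {"A": 2, "B": 3, "C": 4},
    "doc3": {"A": 4, "B": 3, "C": 2},
}


def to_columns(rows):
    """Pivot {doc: {field: value}} into {field: {doc: value}}."""
    columns = {}
    for doc_id, fields in rows.items():
        for name, value in fields.items():
            columns.setdefault(name, {})[doc_id] = value
    return columns


columns = to_columns(rows)

# Reading every value of field "A" now touches a single column
# instead of every document record.
values_of_a = list(columns["A"].values())
```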

When Solr/Lucene returns a set of document ids from a query, it then uses the row-oriented view (a.k.a. stored fields) of the documents to retrieve the actual field values. This requires very few seeks, since all of a document’s field data is stored close together in the fields data file.

However, for faceting/sorting/grouping, Lucene needs to iterate over every document to collect the field values. Traditionally, this is achieved by uninverting the term index. This actually performs very well, since the field values are already grouped (by nature of the index), but the uninverted structure is relatively slow to load and is held in memory. DocValues aim to alleviate both of these problems while keeping performance comparable.
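
As a rough illustration of what “uninverting” means, the sketch below flips a tiny term-to-documents mapping into a per-document array and then counts facets over it. The names uninvert and facet_counts and the sample data are made up for this example, and it assumes a single-valued field; Lucene’s real implementation is considerably more involved.

```python
def uninvert(term_index, num_docs):
    """Turn {term: [doc ids]} into a per-document list of terms.

    Assumes a single-valued field; documents without a value stay None.
    """
    per_doc = [None] * num_docs
    for term, doc_ids in term_index.items():
        for doc_id in doc_ids:
            per_doc[doc_id] = term
    return per_doc


def facet_counts(per_doc, matching_docs):
    """Count how often each value occurs among the matching documents."""
    counts = {}
    for doc_id in matching_docs:
        value = per_doc[doc_id]
        if value is not None:
            counts[value] = counts.get(value, 0) + 1
    return counts


# Toy inverted index: term -> ids of the documents containing it.
term_index = {"thread-1": [0, 3], "thread-2": [1, 2, 4]}

# This step is the cost paid on first use, and the result lives in memory.
per_doc = uninvert(term_index, num_docs=5)
counts = facet_counts(per_doc, matching_docs=range(5))
```

With DocValues, the per-document structure already exists on disk in column form, so there is nothing to rebuild from the term index at load time.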

Dataset and Schema

To demonstrate the characteristics of DocValues, I have constructed an index from 300,000 New York Times articles that mixes stored, indexed, and docValue fields. The data set is essentially an inverted index of (docId, wordId, count) entries. We will create one Solr document per entry and add a couple of other fields for the sake of the example.

We will be using the following fields to create our index:

  • id: auto-generated
  • docId: int
  • wordId: int
  • count: int
  • threadId: string (the name of the Java thread that sent this doc)
  • word: string (the dereferenced wordId)

For each of these fields (except “id”), we will create three Solr fields: one stored, one indexed, and one DocValue. We will also include a “text” field that is a copy field of word which is indexed as text_en (you know, for search). Here is part of the resulting Solr schema.xml:

[gist id=5265560]

There are 300,000 docId values and 102,660 wordId values, with a total of 69,679,427 combinations in the data set. threadId was added to show a field with very low cardinality (4), and word is simply wordId dereferenced against a dictionary supplied with the data set (so we have some text to search on).

Here is an example segment from the resulting index:

1445843 Mar 26 10:43 _89.fdt
    655 Mar 26 10:43 _89.fdx
   1239 Mar 26 10:43 _89.fnm
  62012 Mar 26 10:41 _89.nvd
     46 Mar 26 10:41 _89.nvm
    377 Mar 26 10:43 _89.si
 589009 Mar 26 10:41 _89_Lucene41_0.doc
1301657 Mar 26 10:41 _89_Lucene41_0.tim
  34516 Mar 26 10:41 _89_Lucene41_0.tip
 442588 Mar 26 10:41 _89_Lucene42_0.dvd
    122 Mar 26 10:41 _89_Lucene42_0.dvm

The “dvd” and “dvm” files are the DocValues, “tim” and “tip” are the term dictionary and term index, and “fdx” and “fdt” are the stored fields. You can look up what the rest of those files are in the Lucene documentation. Without going any further, we can see just from the file sizes that the DocValues are much more compact than the stored fields and the term index (recall that we are storing each of our fields as stored, indexed, and docValues separately). Since values for a single field are stored contiguously, very efficient packing algorithms can be used.
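
To give a feel for why contiguous values pack so well, here is a small Python sketch of dictionary encoding plus fixed-width bit packing. It is an illustration under my own assumptions (the helper names bits_per_value, pack, and unpack are hypothetical), not the actual encoding used by the Lucene 4.2 DocValues format.

```python
def bits_per_value(num_unique):
    """Bits needed to store an ordinal in the range [0, num_unique)."""
    return max(1, (num_unique - 1).bit_length())


def pack(ordinals, bits):
    """Pack small non-negative integers into one int, `bits` bits each."""
    packed = 0
    for i, ordinal in enumerate(ordinals):
        packed |= ordinal << (i * bits)
    return packed


def unpack(packed, bits, count):
    """Recover `count` ordinals from a value produced by pack()."""
    mask = (1 << bits) - 1
    return [(packed >> (i * bits)) & mask for i in range(count)]


# A field with 4 distinct values (like threadId) needs only 2 bits per doc.
values = ["t0", "t3", "t1", "t1", "t2", "t0"]
dictionary = sorted(set(values))                 # each unique value stored once
ordinals = [dictionary.index(v) for v in values]  # one small int per document
bits = bits_per_value(len(dictionary))
packed = pack(ordinals, bits)
```

By the same arithmetic, a field with 102,660 distinct values (like wordId) would fit in 17 bits per document, compared with 32 bits for a plain int.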

The whole thing ends up just under 4 GB with 69,639,425 documents (my indexing script lost a batch at the end). Here’s a Gist of my indexing code.

Procedure

My environment:

  • Mac OS X 10.7.5
  • 2.8 GHz Intel Core i7
  • 16 GB 1333 MHz DDR3
  • Apache Solr 4.2
  • java -Xmx4g -Xms4g -jar start.jar

Before each test, I ran the purge command to flush the disk cache. I then ran the query once to get the initial timing and measured the average of the next 10 requests. I am only testing faceting performance, using the following query:

http://localhost:8983/solr/nytimes/select?q=*:*&facet=true&facet.field=$field

I have also disabled all the caches configured in solrconfig.xml.
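
If you want to repeat the timing part of this procedure, something along these lines should work. It is a sketch rather than the exact tooling I used: facet_url, http_get, and time_requests are names invented here, and the purge step and the memory measurements are not covered.

```python
import time
import urllib.request

BASE_URL = "http://localhost:8983/solr/nytimes/select"


def facet_url(field):
    """Build the facet query for one field, e.g. 'threadId_dv'."""
    return f"{BASE_URL}?q=*:*&facet=true&facet.field={field}"


def http_get(url):
    """Fetch the URL and read the whole response body."""
    with urllib.request.urlopen(url) as response:
        response.read()


def time_requests(url, fetch=http_get, repeats=10, clock=time.perf_counter):
    """Return (first_ms, avg_ms): the first request, then the mean of `repeats` more."""

    def timed():
        start = clock()
        fetch(url)
        return (clock() - start) * 1000.0

    first_ms = timed()
    later = [timed() for _ in range(repeats)]
    return first_ms, sum(later) / len(later)
```

With Solr running, time_requests(facet_url("threadId_dv")) returns the “first” and “avg” numbers for that field. The fetch and clock parameters exist only so the function can be exercised without a live server.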

Results

field           first   avg     mem+    gc-     mem-gc
------------------------------------------------------
threadId_dv     2434    1798    30.5    19.7    10.8
threadId_idx    3421    1175    40.2    24.3    15.9
wordId_dv       6594    2882    157     14.9    142
wordId_idx      8311    2019    279     14.1    265
docId_dv        5502    2356    78.1    22.5    55.6
docId_idx       4915    1458    282     16.3    266
word_dv         8521    3266    209     60.0    149
word_idx        17322   1209    690     490     200

  • first: time in milliseconds for the first request
  • avg: average time in milliseconds for the subsequent 10 requests
  • mem+: increase of active memory during first request
  • gc-: memory recovered by manual garbage collection after first request
  • mem-gc: mem+ minus gc-

For initial loading time, the results are varied. DocValues do better in some cases, the term index does better in others. The performance differences here depend largely on the number of unique values for the field and the field type. The biggest difference in loading time is for the field “word_idx”, which takes twice as long to load from the inverted index and uses three times as much memory. For repeated access to the same field, the inverted index performs better due to internal Lucene caching (also the reason for higher memory consumption). In all cases, DocValues consume less memory during loading and after garbage collection.
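
For anyone who wants to check the “twice as long” and “three times as much memory” figures, the ratios fall straight out of the results table. The snippet below only re-derives them from the “first” and “mem+” columns; the dictionary layout is just a convenience for this example.

```python
# (first request in ms, mem+) per field, copied from the results table.
results = {
    "threadId": {"dv": (2434, 30.5), "idx": (3421, 40.2)},
    "wordId": {"dv": (6594, 157), "idx": (8311, 279)},
    "docId": {"dv": (5502, 78.1), "idx": (4915, 282)},
    "word": {"dv": (8521, 209), "idx": (17322, 690)},
}

# For each field: how many times slower / bigger the indexed variant is.
ratios = {}
for field, r in results.items():
    time_ratio = r["idx"][0] / r["dv"][0]
    mem_ratio = r["idx"][1] / r["dv"][1]
    ratios[field] = (time_ratio, mem_ratio)

for field, (time_ratio, mem_ratio) in ratios.items():
    print(f"{field}: first load x{time_ratio:.1f}, mem+ x{mem_ratio:.1f}")
```

For “word” this gives roughly 2.0 for load time and 3.3 for memory, and docId is the one field where the indexed variant loads faster (a ratio below 1).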

Discussion

DocValues have many potential uses. As we have seen from our little experiment, they are less memory hungry than indexed fields and typically faster to load. If you are in a low-memory environment, or you don’t need to index a field, DocValues are perfect for faceting/grouping/filtering/sorting. They also have the potential to increase the number of fields you can facet/group/filter/sort on without increasing your memory requirements. All in all, DocValues are an excellent tool to add to your schema-design toolbelt.

Looking Forward

Naturally, when thinking about column-oriented databases, analytical processing comes to mind. If someone put an analytical processing engine out in front of SolrCloud, you could build a high-performance, scalable OLAP database. The low memory overhead of DocValues and the scalability of SolrCloud are two necessary ingredients; all that is left is to create something like the StatsComponent, but more flexible and pluggable, to do aggregations across the whole index. Just some food for thought.

Links