Using DocValues in Solr 4.2
In the latest release of Apache Solr (4.2), support for DocValues field types was added. DocValues have been in the works in Lucene for a while (~5 years), also under the name “column stride fields”, but have only recently become stable enough to incorporate into Solr. For those of you not familiar with the Lucene lingo, DocValues are column-oriented fields: values of a DocValues field are densely packed into columns instead of sparsely stored like they are with stored fields.
To illustrate with some JSON:
row-oriented (stored fields)
{ 'doc1': {'A':1, 'B':2, 'C':3}, 'doc2': {'A':2, 'B':3, 'C':4}, 'doc3': {'A':4, 'B':3, 'C':2} }
column-oriented (docValues)
{ 'A': {'doc1':1, 'doc2':2, 'doc3':4}, 'B': {'doc1':2, 'doc2':3, 'doc3':3}, 'C': {'doc1':3, 'doc2':4, 'doc3':2} }
When Solr/Lucene returns a set of document ids from a query, it then uses the row-oriented (aka stored fields) view of the documents to retrieve the actual field values. This requires very few seeks, since all of a document’s field data is stored close together in the fields data file.
However, for faceting/sorting/grouping, Lucene needs to iterate over every document to collect the field values. Traditionally, this is achieved by uninverting the term index. This actually performs very well, since the field values are already grouped (by nature of the index), but it is relatively slow to load and must be maintained in memory. DocValues aim to alleviate both of these problems while keeping performance comparable.
Dataset and Schema
To demonstrate the characteristics of DocValues, I have constructed an index from a data set of 300,000 New York Times articles, using a mix of stored, indexed, and docValues fields. The data set is essentially an inverted index of (docId, wordId, count) tuples. We will create one Solr document per entry and add a couple of other fields for the sake of the example.
We will be using the following fields to create our index:
- id: auto-generated
- docId: int
- wordId: int
- count: int
- threadId: string (the name of the Java thread that sent this doc)
- word: string (the dereferenced wordId)
For each of these fields (except “id”), we will create three Solr fields: one stored, one indexed, and one DocValue. We will also include a “text” field that is a copy field of word which is indexed as text_en (you know, for search). Here is part of the resulting Solr schema.xml:
[gist id=5265560]
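If you can’t load the gist, the per-field pattern looks roughly like the following sketch. The exact type names, suffixes, and attributes below are my illustration based on the field names used later in the results, not the gist’s literal contents:

<!-- Sketch of the per-field pattern; the actual gist may differ in details.
     Each source field gets a stored-only, an indexed-only, and a docValues-only variant. -->
<field name="wordId_str" type="int"    indexed="false" stored="true"/>   <!-- stored suffix assumed -->
<field name="wordId_idx" type="int"    indexed="true"  stored="false"/>
<field name="wordId_dv"  type="int"    indexed="false" stored="false"
       docValues="true" default="0"/>  <!-- in Solr 4.2 a docValues field must be required or have a default -->

<field name="word_str"   type="string" indexed="false" stored="true"/>
<field name="word_idx"   type="string" indexed="true"  stored="false"/>
<field name="word_dv"    type="string" indexed="false" stored="false"
       docValues="true" default=""/>

<!-- Searchable copy of the word field (copyField source assumed). -->
<field name="text" type="text_en" indexed="true" stored="false"/>
<copyField source="word_str" dest="text"/>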
There are 300,000 docId values and 102,660 wordId values, with a total of 69,679,427 combinations in the data set. threadId was added to show a field with very low cardinality (4), and word is simply wordId dereferenced against a dictionary supplied with the data set (so we have some text to search on).
Here is an example segment from the resulting index:
1445843 Mar 26 10:43 _89.fdt
    655 Mar 26 10:43 _89.fdx
   1239 Mar 26 10:43 _89.fnm
  62012 Mar 26 10:41 _89.nvd
     46 Mar 26 10:41 _89.nvm
    377 Mar 26 10:43 _89.si
 589009 Mar 26 10:41 _89_Lucene41_0.doc
1301657 Mar 26 10:41 _89_Lucene41_0.tim
  34516 Mar 26 10:41 _89_Lucene41_0.tip
 442588 Mar 26 10:41 _89_Lucene42_0.dvd
    122 Mar 26 10:41 _89_Lucene42_0.dvm
The “dvd” and “dvm” files are the DocValues, “tim” and “tip” are the terms index and dictionary, and “fdx” and “fdt” are the stored fields. You can look up what the rest of those files are in the Lucene documentation. Without going any further, we can see that the DocValues are much more compact than the stored fields and the term index just by looking at the file sizes (recall that we are storing each of our fields as stored, indexed, and docValues separately). Since values for a single field are stored contiguously, very efficient packing algorithms can be used.
The whole thing ends up just under 4 GB with 69,639,425 documents (my indexing script lost a batch at the end). Here’s a Gist of my indexing code.
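The gist has the details, but conceptually each (docId, wordId, count) entry becomes one Solr document. In Solr’s XML update format, a single document might look something like this sketch (the values and threadId are illustrative, and the stored variants are omitted for brevity):

<add>
  <doc>
    <!-- id is auto-generated, so it is not sent explicitly -->
    <field name="docId_idx">12345</field>
    <field name="docId_dv">12345</field>
    <field name="wordId_idx">678</field>
    <field name="wordId_dv">678</field>
    <field name="count_idx">3</field>
    <field name="count_dv">3</field>
    <field name="word_idx">senate</field>
    <field name="word_dv">senate</field>
    <field name="threadId_idx">pool-1-thread-2</field>
    <field name="threadId_dv">pool-1-thread-2</field>
  </doc>
</add>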
Procedure
My environment:
- Mac OS X 10.7.5
- 2.8 GHz Intel Core i7
- 16 GB 1333 MHz DDR3
- Apache Solr 4.2
- java -Xmx4g -Xms4g -jar start.jar
Before each test, I would run the purge command to flush the disk cache. I would then run the query once to get the initial timing, and then measure the average of the next 10 requests. I am only testing faceting performance, with the following query:
http://localhost:8983/solr/nytimes/select?q=*:*&facet=true&facet.field=$field
I have also disabled all the caches configured in solrconfig.xml.
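Concretely, that means commenting out (or zero-sizing) the stock cache declarations in solrconfig.xml, along the lines of this sketch:

<!-- Caches disabled for the benchmark: the stock declarations are commented out. -->
<query>
  <!--
  <filterCache class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="0"/>
  <queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="0"/>
  <documentCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="0"/>
  -->
  <!-- remaining query settings left at their defaults -->
</query>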
Results
field          first    avg   mem+    gc-   mem-gc
---------------------------------------------------
threadId_dv     2434   1798   30.5   19.7     10.8
threadId_idx    3421   1175   40.2   24.3     15.9
wordId_dv       6594   2882    157   14.9      142
wordId_idx      8311   2019    279   14.1      265
docId_dv        5502   2356   78.1   22.5     55.6
docId_idx       4915   1458    282   16.3      266
word_dv         8521   3266    209   60.0      149
word_idx       17322   1209    690    490      200
- first: time in milliseconds for the first request
- avg: average time in milliseconds for the subsequent 10 requests
- mem+: increase of active memory during first request
- gc-: memory recovered by manual garbage collection after first request
- mem-gc: mem+ minus gc-
For initial loading time, the results are mixed: DocValues do better in some cases, the term index does better in others. The performance differences here depend largely on the number of unique values in the field and the field type. The biggest difference in loading time is the field “word_idx”, which takes twice as long to load from the inverted index and uses three times as much memory. For repeated access to the same field, the inverted index performs better due to internal Lucene caching (also the reason for higher memory consumption). In all cases, DocValues consume less memory during loading and after garbage collection.
Discussion
DocValues have many potential uses. As we have seen from our little experiment, they are less memory-hungry than indexed fields and typically faster to load. If you are in a low-memory environment, or you don’t need to index a field, DocValues are perfect for faceting/grouping/filtering/sorting. They also have the potential to increase the number of fields you can facet/group/filter/sort on without increasing your memory requirements. All in all, DocValues are an excellent tool to add to your schema-design toolbelt.
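For example, a field that is only ever faceted or sorted on can skip the inverted index and stored value entirely. A minimal sketch, using a hypothetical “category” field (and remembering the Solr 4.2 requirement that a docValues field be required or have a default):

<!-- Facet/sort-only field: no inverted index, no stored value, just docValues. -->
<field name="category" type="string" indexed="false" stored="false"
       docValues="true" default=""/>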
Looking Forward
Naturally, when thinking about column-oriented databases, analytical processing comes to mind. If someone put an analytical processing engine out in front of Solr Cloud, you could build a high performance, scalable OLAP database. The low memory overhead of DocValues and scalability of Solr Cloud are two necessary ingredients – all that is left is to create something like the StatsComponent, but more flexible and pluggable to do aggregations across the whole index. Just some food for thought.