I’m a web application developer. I am not going to pretend that I understood everything that Simon Willnauer was saying at his talk on Column Stride Fields, or DocValues, on day 1 of Lucene Revolution. That’s because, as a rule, I don’t have a need to climb into the guts of Lucene in order to get good results. And that’s exactly what I like about it. If you tell me, “under these circumstances, define your field as ‘fieldable’ to get the best performance” that’s good enough for me.
But for many people — particularly the types of people who attend Lucene Revolution — Lucene’s guts are where it’s at. And this talk was definitely for them. Fortunately, Simon’s talk was recorded on video, so you can get the uninterpreted details from him as soon as that’s available, but even with my limited grasp, I could tell this was a Good Thing.
It comes down to this, as I understand it. Lucene creates an inverted index, which maps each term to the documents that contain it. But once you have the matching documents, you have to score them for relevance, and to do that, you need access to data that belongs to each document.
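To make the distinction concrete, here is a minimal sketch in plain Python (not Lucene's API). The sample documents, the `build_inverted_index` helper, and the `popularity` values are all made up for illustration. It shows that the inverted index answers "which documents contain this term?", while ranking needs a second lookup that goes from document to value.

```python
from collections import defaultdict

# Hypothetical toy corpus: document id -> text.
docs = {
    0: "lucene makes search fast",
    1: "search relevance needs per document data",
    2: "lucene stores an inverted index",
}

def build_inverted_index(docs):
    """Map each term to the sorted list of document ids that contain it."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        for term in set(docs[doc_id].split()):
            index[term].append(doc_id)
    return index

index = build_inverted_index(docs)

# Term -> documents: this is what the inverted index is good at.
matches = index["lucene"]

# Document -> value: an assumed per-document score input, indexed by doc id.
popularity = [3.0, 1.0, 7.0]

# Ranking the matches needs the per-document value, not the term.
ranked = sorted(matches, key=lambda doc_id: popularity[doc_id], reverse=True)
print(matches, ranked)
```

The second lookup is the one the rest of this post is about: the faster you can go from a document id to its value, the cheaper scoring and sorting become.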
There are two ways to get at Lucene’s indexed data. One way is through stored fields, and the other is through the FieldCache. Stored fields can be slow because you have to do two seeks: one to find out where to look for the document, and then another to actually read the field. FieldCache is faster, because it is an “un-inverted” copy of the index (document to value, rather than term to document) that lives in memory. But it still has to be loaded, and once it is, it can take up a lot of memory, possibly more than you have available, especially if you’re on a mobile device or limited to a 2GB heap.
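The FieldCache gets its per-document array by walking the term-to-documents postings and flipping them around. A rough sketch of that idea, again in plain Python with a hypothetical `uninvert` helper and made-up postings rather than anything from Lucene itself:

```python
def uninvert(postings, max_doc):
    """Turn term -> [doc ids] into an array indexed by doc id.

    Assumes each document has at most one value for the field.
    """
    values = [None] * max_doc
    for term, doc_ids in postings.items():
        for doc_id in doc_ids:
            values[doc_id] = term
    return values

# Hypothetical postings for a single-valued "price" field.
price_postings = {"10": [0, 3], "25": [1], "40": [2]}

cache = uninvert(price_postings, max_doc=4)
print(cache)
```

Note that the whole postings structure has to be visited before the array is usable, and the array has one slot per document, which is where the load time and heap cost come from.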
Now, DocValues (not, as Simon pointed out, the existing DocValues class; this feature will likely be renamed before it’s released) are essentially a per-document array, similar to FieldCache. This array can be loaded into RAM, or it can be left on disk for sequential access, which makes essentially no demands on the heap. (He does recommend that you use a memory-mapped buffer for best performance.) Because each field needs its own file, however, you will want to watch for “too many open files” errors.
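As a rough illustration of the column-per-file idea, the sketch below writes one fixed-width value per document to its own file and then reads values back through a memory map. The file name, the 64-bit record layout, and the helper names are my own assumptions; this is not Lucene's actual on-disk format.

```python
import mmap
import os
import struct
import tempfile

# Assumed layout: one little-endian signed 64-bit integer per document.
RECORD = struct.Struct("<q")

def write_column(path, values):
    """Write the field's values in doc-id order, one fixed-width record each."""
    with open(path, "wb") as f:
        for value in values:
            f.write(RECORD.pack(value))

def read_value(buf, doc_id):
    """Fixed-width records mean the offset is just doc_id * record size."""
    return RECORD.unpack_from(buf, doc_id * RECORD.size)[0]

# One file per field, here a hypothetical "price" column.
path = os.path.join(tempfile.mkdtemp(), "price.col")
write_column(path, [10, 25, 40, 10])

with open(path, "rb") as f:
    # The operating system pages the file in on demand; nothing is copied
    # into an application-level array up front.
    buf = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    try:
        looked_up = [read_value(buf, doc_id) for doc_id in (2, 0, 3)]
    finally:
        buf.close()

print(looked_up)
```

Because every field gets its own file and each open file uses a file handle, you can see how an index with many such fields could run into the operating system's open-file limit.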
The performance benchmarks he showed were impressive; clearly this will be a huge step forward.
Next, Simon wants to make DocValues updateable, which will be great for scoring based on changing values, for distributed search, and of course for real-time search.
Sounds good to me.
Cross-posted with Lucene Revolution Blog; Nicholas Chase is a guest blogger. This is one of a series of presentation summaries from the conference.