You’ve been hearing me do a lot of talking about finding meaning in data, so it may not come as a surprise that of all the track sessions at Lucene Revolution, perhaps the one I was looking forward to the most was the one I attended last, “Lots of Facets, Fast”, from Anne Veling.
Here are the slides for this session.
Imagine you can see 160 years of history, all on one screen. You can zoom and pan, you can look at a particular day, you can even do a search. And when you do, the results come up not as a list, but as a heat map that shows where in history that topic appears, and how often.
The system covers 160 years of newspapers, with every single issue — more than 58,000! — appearing on a single canvas. When you do a search, the app sends an AJAX request that gets back a heat map specifying the color to overlay on each tile, with brighter colors representing more mentions in that particular issue.
And that’s where facets come in. After all, what is a facet anyway? It’s a count of how many results appear in a particular “bucket”. In a traditional system, you might do a search for “Indians” and get something along the lines of
... 1870  1871  1872  1873  1901  ...
and so on. That’s exactly the information Veling needed for his heat map, only instead of years, each facet would represent one day, or one issue of the newspaper.
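At its core, that facet listing is nothing more than a count per bucket. Here's a minimal, standalone sketch of the idea — plain Java with no Lucene involved, and the class and method names are my own, not from the talk:

```java
import java.util.Map;
import java.util.TreeMap;

public class FacetSketch {
    // Conceptually what a facet query returns: for each bucket (here, a
    // publication year), how many matching documents fall into it.
    public static Map<Integer, Integer> facetByYear(int[] docYears) {
        Map<Integer, Integer> counts = new TreeMap<>();
        for (int year : docYears) {
            counts.merge(year, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        int[] hits = {1870, 1870, 1871, 1873, 1873, 1873};
        // Prints each year bucket with its count, e.g. {1870=2, 1871=1, 1873=3}
        System.out.println(facetByYear(hits));
    }
}
```

Swap "year" for "day" (or "issue") and you have exactly the data behind the heat map.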
But getting those facets posed some interesting challenges, and presented some interesting opportunities.
For one thing, this may be one of the only times in your professional life you’ll ever see a Lucene index in which absolutely nothing is stored. Because nothing is ever returned except the facet counts, nothing needs to be. (Once users find the issue they want, the paper itself is delivered to them through a different process.)
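In Solr terms, that means every field can be indexed but unstored. A sketch of what such a schema.xml might look like — the field names here are illustrative, not taken from the talk:

```xml
<!-- Illustrative sketch: fields are indexed for searching and faceting,
     but stored="false" because only facet counts are ever returned. -->
<field name="date" type="tdate"        indexed="true" stored="false"/>
<field name="text" type="text_general" indexed="true" stored="false"/>
```

An index with no stored fields can be dramatically smaller, since it holds only the structures needed to match and count.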
From there, it was partly a matter of tuning: the filterCache, set by default to 512 entries, needed to hold at least 60,000 so that the entire set of facet filters could stay resident in memory, and he used DocSets to reduce memory requirements. It was also partly a matter of knowing the data.
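That cache size is set in solrconfig.xml. A hedged sketch of what such a configuration could look like (the exact attribute values beyond size are my guesses, not from the slides):

```xml
<!-- Illustrative solrconfig.xml sketch: raising the filterCache from the
     default 512 entries to 60,000 so every per-issue filter fits in memory. -->
<filterCache class="solr.FastLRUCache"
             size="60000"
             initialSize="60000"
             autowarmCount="0"/>
```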
Rather than having each document checked against all 58,000+ facets to see where it belongs, he did a hierarchical check: first by decade, then by year, then by month, then by day. And because he knew that a document could belong to only one facet, he created a custom collector that stops looking as soon as it finds a match. Together, those changes reduced the number of checks per document from 58,560 to an average of just 34.5. Considering that he was dealing with more than 28 million documents, it’s no surprise that he was able to increase performance by a factor of 30.
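The arithmetic behind those numbers is worth spelling out. Here's one plausible reading that reproduces both figures, assuming 160 years of daily buckets (160 × 366) and a hierarchy of 16 decades, 10 years per decade, 12 months per year, and up to 31 days per month — my reconstruction, not from the slides:

```java
public class FacetCheckCount {
    // Flat approach: every document is tested against every daily bucket.
    static int flatChecks() {
        return 160 * 366; // 58,560 buckets for 160 years of daily issues
    }

    // Hierarchical approach: decade -> year -> month -> day, stopping as
    // soon as a level matches. The worst case exhausts every level.
    static double averageHierarchicalChecks() {
        int worstCase = 16 + 10 + 12 + 31; // 69 checks if the match comes last
        return worstCase / 2.0;            // ~34.5 on average with early exit
    }

    public static void main(String[] args) {
        System.out.println(flatChecks());                // 58560
        System.out.println(averageHierarchicalChecks()); // 34.5
    }
}
```

Going from ~58,560 comparisons per document to ~34.5 is a reduction of over three orders of magnitude, which is why the collector change dominated the speedup.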
The result of all this is a beautiful system in which you can clearly see patterns such as the 1978 newspaper strike, leap years, or the fact that this renowned Sunday paper didn’t initially exist at all. You’ve probably seen those old science fiction movies where the time machine has a dial that selects years. Here, the very sleek iPad interface lets you zoom in, in an experience reminiscent of the flying function of Google Earth, only instead of traveling through space, you travel through time.
Next up, Veling will work on a similar Belgian system, representing over 1.2 million books, CDs, and DVDs.
The real beauty of this project isn’t so much the actual interface (though it is beautiful) but the fact that it exists at all. This kind of innovative thinking is the direction we need to go in order to give meaning to all this information we are beginning to see.
It’s a new day, and it is great to see Lucene and Solr at the forefront of the revolution.
Cross-posted with the Lucene Revolution Blog. Nicholas Chase is a guest blogger. This is one of a series of presentation summaries from the conference.