I have been in information retrieval for some time now, have done a stint in an academic research lab, and work on an open source search engine that has a huge commercial base but mixed coverage in academia (more on that later). So I was a little unsure of what to expect in heading to my first ever SIGIR conference in Portland, OR last week.

Nevertheless, I was honored to be invited by the organizers of the Open Source Information Retrieval workshop to give a keynote on “OpenSearchLab and the Lucene Ecosystem” (slides embedded below) and to present the excellent work of two colleagues of mine, Andrzej Bialecki and Robert Muir, in our paper, simply titled “Apache Lucene 4” (the full proceedings are available online). Please be sure to read the acknowledgment section in the paper: the three of us are listed as authors simply because we did the writing. The actual systems work is, of course, the labor of many, many more.

Despite my uncertainty, the conference was quite compelling, both as a learning event and as a networking opportunity.  As always, it’s challenging at any conference to predict which sessions will be worthwhile from a learning perspective.  I mostly went to sessions on architecture and performance, but also a few on collaborative filtering, federated search and user intent.  On Wednesday, I spent most of my time in the Industry track which, other than a fairly transparent sales pitch by Amazon on CloudSearch (full disclosure: we offer our own hosted search, as do many other companies), was pretty informative.  By my count, 5 of the companies represented in the Industry track use Lucene/Solr in production in some part of their system, including IBM Watson and Twitter.

As for the rest of the main conference, I really enjoyed the panel of IR gods on the use of proprietary data in experiments.  While I tend to see valid points on both sides of the debate, it would be good if we could all figure out a legal and technical framework (so we can avoid another AOL query log fiasco) for distributing truly open datasets, log data, evaluations and related annotations that anyone can access simply by having an Internet connection, bandwidth issues notwithstanding.  This is, of course, no small task, given the very real privacy issues involved.  As important, if not more important, I would really like to see the code and configuration of others’ experiments, so that I can see whether the results generalize to other domains and understand how they came to their conclusions.  Obviously, none of the search giants (Google, etc.) are going to do that, but I simply don’t buy any of the explanations for why academics don’t (it usually comes down to deadlines and the “ugliness” of the code).

For example, I’ve contacted a number of researchers who have done Lucene evaluations to better understand how they arrived at their results and why those results are so sharply different from my own unpublished results (and those of others in the Lucene community).  It quickly becomes clear that it is an apples and oranges kind of issue: there is often some notion of “out of the box”, when in reality Lucene is an IR toolbox meant to be wired together for a particular task.  Moreover, they usually don’t put in the legwork on preprocessing or analysis needed for good results, but who can really know, since the evaluation setup and code aren’t published.  Practically speaking, most people in Lucene focus on their real-world cases (almost always precision @ 10, biased by business needs like price, time, page rank, social network, margins, inventory and other internal scoring models) rather than on the model itself, and simply tune relevance through query representation and external features, which yields very good results.  (For those who care, Lucene 4 now has all of the “standard” relevance models referenced in the literature: BM25, DFR, language modeling, etc.)
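
To make the "precision @ 10, biased by business needs" idea concrete, here is a minimal sketch in Python. It is only an illustration, not how Lucene or any particular shop does it: the function names, the multiplicative boost and the toy data are all assumptions of mine.

```python
from typing import Dict, List, Set


def precision_at_k(ranked_ids: List[str], relevant_ids: Set[str], k: int = 10) -> float:
    """Fraction of the top-k positions that hold a document judged relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = ranked_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    # Divide by k (not by the number of results returned), the usual convention.
    return hits / k


def rerank_with_boost(scores: Dict[str, float], boosts: Dict[str, float]) -> List[str]:
    """Order documents by text score times a hypothetical business boost.

    Documents without an explicit boost keep their text score (boost of 1.0).
    """
    combined = {doc_id: s * boosts.get(doc_id, 1.0) for doc_id, s in scores.items()}
    # Highest combined score first; ties broken by id so the order is stable.
    return sorted(combined, key=lambda d: (-combined[d], d))


if __name__ == "__main__":
    text_scores = {"a": 2.0, "b": 1.5, "c": 1.0}   # toy text relevance scores
    margin_boost = {"c": 3.0}                      # toy business signal
    ranking = rerank_with_boost(text_scores, margin_boost)
    print(ranking, precision_at_k(ranking, {"a", "c"}, k=2))
```

The point of the sketch is that the quantity being tuned is the final ranking after business signals are mixed in, not the underlying text model on its own.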

Moving along, on Thursday, I gave the two talks I described earlier at the workshop and participated in the rest of the session on how to leverage more shared knowledge in open source.  My invited talk, as you can see in the slides, was mostly about how Lucene and the ASF work, as well as some high-level thoughts on how I would set up the OpenSearchLab if I were doing it.  I also made the case that Lucene 4 is much more pluggable now, so it is easier to plug in new ranking functions, etc., which may be of interest to those working on IR algorithms and data structures.  As far as I can tell, Lucene and Solr’s usage has roughly split along the lines of the type of research people are doing.  Those doing more user-focused things like eye tracking, UI and HCIR tasks love Lucene and Solr.  Those doing algorithms and data structures, not so much.  Not an unreasonable position, mind you: when you are doing research on those things, you need deep knowledge of all the bits of the system, which is much easier to have when you wrote it yourself.  On the other hand, you also spend a lot of time on scaffolding that likely doesn’t matter as much and that Lucene has already implemented.
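
For readers wondering what "plugging in a ranking function" means in practice, here is a toy sketch in Python. To be clear, this is not Lucene's API: the `search` function, the `Scorer` signature and the tiny in-memory "index" are all hypothetical, and the two scorers are just a classic TF-IDF variant and the textbook BM25 formula with commonly used default parameters.

```python
import math
from collections import Counter
from typing import Callable, Dict, List

# Hypothetical scorer signature: (term frequency, document frequency,
# number of documents, document length, average document length) -> weight.
Scorer = Callable[[int, int, int, int, float], float]


def tf_idf(tf: int, df: int, n_docs: int, doc_len: int, avg_len: float) -> float:
    """A classic TF-IDF variant: sqrt(tf) times a smoothed inverse document frequency."""
    return math.sqrt(tf) * (1.0 + math.log(n_docs / (df + 1.0)))


def bm25(tf: int, df: int, n_docs: int, doc_len: int, avg_len: float,
         k1: float = 1.2, b: float = 0.75) -> float:
    """Textbook BM25 term weight with length normalization."""
    idf = math.log(1.0 + (n_docs - df + 0.5) / (df + 0.5))
    norm = tf + k1 * (1.0 - b + b * doc_len / avg_len)
    return idf * tf * (k1 + 1.0) / norm


def search(docs: Dict[str, List[str]], query: List[str], scorer: Scorer) -> List[str]:
    """Rank documents for a query; the scorer is the pluggable part."""
    n_docs = len(docs)
    avg_len = sum(len(tokens) for tokens in docs.values()) / n_docs
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in docs.values() for term in set(tokens))
    scores: Dict[str, float] = {}
    for doc_id, tokens in docs.items():
        tf = Counter(tokens)
        score = sum(
            scorer(tf[t], df[t], n_docs, len(tokens), avg_len)
            for t in query if tf[t] > 0
        )
        if score > 0:
            scores[doc_id] = score
    # Highest score first; ties broken by id so the order is stable.
    return sorted(scores, key=lambda d: (-scores[d], d))


if __name__ == "__main__":
    corpus = {
        "d1": ["lucene", "search"],
        "d2": ["lucene", "lucene", "index", "segments", "merge"],
        "d3": ["solr", "faceting"],
    }
    for scorer in (tf_idf, bm25):
        print(scorer.__name__, search(corpus, ["lucene"], scorer))
```

Everything except the scorer (tokenized documents, statistics, sorting) is the scaffolding mentioned above; swapping the model is a one-argument change.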

My second talk was a 15-minute talk covering our paper, focused on the key things we do in Lucene 4.  I got a pretty strong sense that the work many of the committers have done on Finite State Automata/Transducers was compelling, as were some of our new speed numbers. There were some other good presentations and discussions and, I think, a few positive steps that will hopefully come to fruition and help with reproducibility of results:

  1. Setting up a mailing list where open source IR tools can collaborate
  2. Creating a common repository where people can publish their code

In fact, in many ways, both of these are already set up as part of Lucene’s little-known Open Relevance Project.  If any of the workshop organizers are so inclined, we could certainly use this as the basis for sharing.  If there are objections to it being a part of Lucene, we could certainly spin it out into a standalone Apache project.

Keynote slides:


Lucene 4 slides: