For those not familiar, SIGIR is the ACM’s Special Interest Group on Information Retrieval. It holds a conference each year in some nice location around the world (next year is in Australia!), where a large majority of the leaders in IR, academic and commercial alike, gather to present the latest research in information retrieval on topics ranging from low-level efficiency (anyone up for discussing posting lists?) to recommendation engines to high-level user behavior studies on why and how people search.
Why would an open source engineering type like me attend a conference with an academic bent? First and foremost, it’s a high quality conference packed with people doing some really interesting things with search and user analytics at scale. Second, there are a boatload of Lucene and Solr users at the conference, and it’s great to interact with them. Third, it refreshes my thinking and causes me to reconsider previously held assumptions about what is happening in search and about why users do the things they do. Fourth, there are a lot of great connections to be made, as well as opportunities to find amazing talent.
With that out of the way, I’d say there were several ideas that were reinforced during the conference as well as a few other takeaways for me this year at SIGIR:
- This industry is clearly split into haves and have-nots (maybe it’s better labelled as the empiricists vs. the theoreticians). The “haves” (Googs, MSFT, Y!, etc.) have a ton of feedback data, and they are using it to create some truly awesome things at very large scales. Everyone else has to be really creative about how to generate new and interesting ideas and prove them out, and they always have to answer the proverbial question of whether they did it at a large enough scale to really prove it out. All of this creates a chicken-and-egg problem: how do you learn from user data if you don’t have users? From a Lucene and Solr standpoint, our challenge is to make sure that we continue to have viable ways for people to leverage large quantities of user data in the search engine itself (doc values, function queries, etc.), as well as make it easier for people to leverage Lucene and Solr for their user studies, low-level research, etc. by having good documentation and easy-to-read tutorials. We also need to figure out a way to work with graph data more easily.
- In case you haven’t noticed, the days of a single monolithic search engine (esp. at web scale) that serves up the same type of results day in, day out to the same type of users are a thing of the past. A/B/multivariate testing of many (all?) facets of the search experience is a must (or soon will be) for most large scale applications. See Ronny Kohavi’s talk, for example, as an intro.
- I don’t know if it ever will matter in terms of public perception or market share, but Microsoft’s Bing group showed some cool demos and interesting work.
- Noticeably absent this year was much of a presence from Google other than the requisite sponsorship and a booth.
- I could listen to Stephen Robertson (the co-inventor of BM25, amongst other things) all day. He has a wealth of knowledge about IR and is engaging, genuine and a pleasure to be around.
- We all need to figure out a way to crack the open evaluation problem. We need a way of distributing large collections, judgments, click data, etc. to the broader open source community. The Open Relevance Project was a start on this, but it hasn’t gotten off the ground in terms of participation.
- Some interesting companies and people are working on true incremental field updates at scale in Lucene. I don’t know when it will hit or if they will succeed, but keep an eye out.
- Guinness direct from the brewery is far and away the best Guinness. If you’d like to sample it yourself, then come to Lucene Revolution in Dublin in November!
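To make the feedback-data point above a bit more concrete: one common way to fold user data into the engine (the kind of thing function queries over doc values enable in Lucene/Solr) is to blend the text-relevance score with a damped click or popularity signal. Here is a minimal sketch of that idea; the function name, the log damping, and the 0.3 weight are all my own illustrative choices, not anything prescribed by Lucene or Solr.

```python
import math

def blended_score(text_score: float, clicks: int, weight: float = 0.3) -> float:
    """Blend a text-relevance score with a click-feedback signal.

    log1p damps raw click counts so heavily clicked documents
    boost the score without completely swamping textual relevance.
    (Hypothetical helper for illustration only.)
    """
    return (1 - weight) * text_score + weight * math.log1p(clicks)

# A document with more clicks edges ahead of an equally relevant one:
# blended_score(1.0, 100) > blended_score(1.0, 10) > blended_score(1.0, 0)
```

In Solr terms you'd express something similar declaratively, e.g. boosting by a function of a popularity field, rather than in application code; the sketch just shows the shape of the computation.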
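On the A/B testing point above, the core mechanical piece of any such system is deterministic bucketing: the same user must land in the same variant every time, independently across experiments. A common way to get that is to hash the user id together with the experiment name. This is a generic sketch, not Kohavi's or anyone's specific implementation; the function and parameter names are mine.

```python
import hashlib

def assign_bucket(user_id: str, experiment: str, n_buckets: int = 2) -> int:
    """Deterministically assign a user to an experiment bucket.

    Hashing user_id together with the experiment name keeps each
    user's assignment stable within an experiment while keeping
    assignments independent across different experiments.
    """
    digest = hashlib.md5(f"{experiment}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % n_buckets
```

With enough users, the modulo over a uniform hash splits traffic roughly evenly across buckets, which is what makes the downstream metric comparisons meaningful.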
Last but not least, I’d be remiss if I didn’t give a shout out to my old boss, Paraic Sheridan, of CNGL, who along with Gareth Jones and team, put on a most excellent conference. Now, onto reading through the papers!