Last week I had the privilege and pleasure to travel to Prague for our Apache Lucene EuroCon event.
I want to share my notes on the sessions I attended…
The week began with me delivering the “Solr Application Development Workshop”, a two-day intensive introduction to pretty much all things Solr and a fair bit of Lucene internals too. I ran the workshop using our 3-day Solr training materials, an enormous amount of material with a hearty set of labs. I moved quickly through some of the material and allowed less lab time than in our 3-day course, but we were able to pack it all in. Feedback from the attendees confirmed it was fast-paced yet highly informative – just as planned. The workshop had over 30 students, filling the room to capacity! In parallel to the Solr training, Grant taught his infamous “Lucene Bootcamp” training.
Now on to the conference sessions; there were about 160 people in attendance. The conference opened with our personable and smiling CEO, Eric Gries, waking us up with the “Search Revolution” keynote on directions in open source and enterprise search. My favorite moment was Eric showing the “New User Interface”: simply a search box with a “Find” button instead of “Search” (internally, we call our http://search.lucidimagination.com system LucidFind). Stephen Dunn of the UK’s The Guardian followed with his epic “From Publisher To Platform: How The Guardian Used Content, Search, and Open Source To Build a Powerful New Business Model” talk. Points from his talk worth reiterating: to be part of the web you have to be permanent (cool URLs don’t change), the web is about addressable resources, and these resources are discoverable. The Guardian’s open strategy allows anyone to build on top of their services, making them an integral part of the web, not just an end-point for news stories. I had briefly spoken with The Guardian’s Web Platform Team Lead, Graham Tackley, the previous evening; he spoke very enthusiastically about Solr’s ease of use, the speed at which they were able to develop prototypes long before commercial vendors could even put together business proposals, and how quickly new developers were able to learn Solr and be productive with it.
Now came the tough part, deciding which of the tracks to attend. I’m just going to cover the sessions I attended:
Marc Sturlese’s “Use of Solr at Trovit, A Leading Search Engine For Classified Ads” – takeaways include: using multiple indexes for different types of data, including indexing previous users’ searches to present similar searches; good details on how they are scaling with replication and distributed search; and how Solr’s extensibility allowed Marc to hook in his own customizations and improvements.
Bo Raun’s “Implementing Solr in Online Media As An Alternative to Commercial Search Products” – I was impressed by Bo’s delivery: excellent humor, with a straight face. The basic message conveyed was that implementing Solr in a non-Java environment wasn’t so bad; in fact, he mentioned busting out a prototype one evening while his son was practicing jujitsu. I borrowed from his message for my talk the next day. He also made the point that even though the technology was free and easy to implement, commercial support was still a necessity for deployment acceptance. You can hear him give his presentation, delivered since the conference as a webcast via theserverside.com, here.
Andrzej Bialecki’s “Munching and Crunching: Lucene Index Post-processing” – whoa. I love being around Andrzej. He’s one of the smartest guys I’ve ever met; he explains highly complex topics in an understandable manner. He presented several interesting low-level performance and relevancy related techniques, getting us to think outside the box and use our (Lucid) imagination.
Joan Codina-Filbà’s “Integration of Natural Language Processing tools with Solr” – stemming, lemmatization, entity extraction, oh my. Joan excels at presenting, lots of passion. UIMA and payloads, a rich combination. Solr + SolrJS (now ajax-solr) means “no need to program” and “fast prototyping”, hmm, a recurring theme!
Max and Karl’s “Modular Document Processing for Solr/Lucene” – these are great guys, very knowledgeable about enterprise search. They proposed an architecture for a document processing pipeline for Solr, something desperately needed. Unfortunately, there are so many ways to do this sort of thing that there wasn’t general consensus on the technical details. They proposed using Apache Commons Pipeline.
Shai Erera’s “Social and Network Discovery (SaND) over Lucene” – Shai is one of Lucene’s newest committers (but a seasoned Lucene veteran). Being connected and findable in a large organization is difficult. Tools like SaND dramatically improve productivity and cut down on reinvented wheels in these very large, distributed companies.
Dusan Omercevic’s “Query by Document: When “More Like This” Is Insufficient” – Dusan speaks clearly and effectively on smarter terms, disambiguation, and entity extraction. The takeaway quote for me was “the time is better spent curating the index than developing smart algorithms”; I’m interpreting this to mean we should focus on the quality and structure of what we index, and Lucene and Solr’s power will be able to really shine.
Zack Urlocker enthusiastically keynoted day two with “Software Disruption: How Open Source, Search, Big Data and Cloud Technology are Disrupting IT”. He effectively made the case for why we’re here in business: to serve the under-served, proven market. Lucene/Solr is disruptive: dramatically lower TCO, and far more customizable than the other enterprise search vendors’ offerings. It made me smile when Zack used The Motley Fool as an example, as I was one of the initial technologists on-site and used “rapid prototyping” techniques to get them up and running as quickly as possible. Lucene’s disruptive score: B+. “A few areas need improvement to disrupt the market.” We’re working on it!
Next up was Yonik’s all-conference “Solr 1.5 and Beyond” session. Pretty techie stuff outlining the big features we can expect to see, including improved relevancy techniques (extended dismax), scalability with Solr Cloud (distributed management), geo-spatial integration, “near” real-time indexing, and field collapsing. It’s good to get the word straight from Solr’s creator about these things.
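To give a taste of the extended dismax parser Yonik previewed, here is a minimal sketch of how such a query might be parameterized against a local Solr instance; the host, field names, and boosts are hypothetical, not from the talk:

```python
from urllib.parse import urlencode

# Hypothetical local Solr instance
SOLR_SELECT = "http://localhost:8983/solr/select"

params = {
    "defType": "edismax",       # select the extended dismax query parser
    "q": "open source search",  # free-form user query, typos and all
    "qf": "title^2 body",       # query fields, with a boost on title
    "mm": "2",                  # require at least two query terms to match
}

query_url = SOLR_SELECT + "?" + urlencode(params)
print(query_url)
```

The appeal of edismax is exactly this: the application passes the user's raw query through untouched, and relevancy tuning lives in request parameters rather than in query-parsing code.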
Basis Technology’s Steve Kearns talked about “Building Multilingual Search Based Applications”. Steve expertly demonstrated a number of issues that occur when handling multilingual content, and how these problems are being addressed with their Rosette Linguistics Platform, including entity extraction, language identification, stemming vs. lemmatization, translation, fuzzy search, n-gram tokenization vs. morphological analysis, decompounding, and sentiment analysis. Finally, Steve demonstrated their Odyssey Information Navigator, built on Solr in approximately one month’s time! It just goes to show the power of the Solr platform.
Karl Wright next presented “Lucene Connectors Framework: An Introduction”. Lucene Connectors Framework (LCF) is a sophisticated, pluggable system for extracting content out of heavy duty enterprise content repositories including SharePoint, Documentum, LiveLink, and more. LCF addresses security, incremental crawling, monitoring, and resiliency. An LCF output connector smoothly indexes content into Solr. Expect great things from LCF with much attention now being focused on it. The bulk of this codebase (prior to MetaCarta’s open-source donation to Apache) has been running successfully, and securely, in a number of large corporate and government organizations.
*sniff sniff*, what’s that I smell? Ahh, the scent of information! Tyler Tate and H. Stefan Olafsson spoke right to my UI sensibilities with “The Path to Discovery: Facets and the Scent of Information”. They used some of my favorite terms: information foraging and serendipity. “Search is an evolutionary process.” We search, see the results, get a better idea of what’s available, try some new searches, discover tangentially interesting things, and follow the “scent” to the information we sought originally or to fruitful avenues we hadn’t anticipated in advance. This presentation spoke to the need of good autosuggest, search results display best practices, and of course the many ways facets are being effectively leveraged. Really elegant stuff. See their slides for many examples of these topics.
And appropriately placed in the schedule, my “Rapid Prototyping with Solr” followed. In the true spirit of rapid prototyping, I took the conference attendee data (a CSV file of first name, last name, country, title, and company) and iteratively built an attractive search engine including faceting (on country), highlighting, spell checking, and query/score debugging. I even tossed in a cool tree map visualization of the country facet counts. Beyond that, I also built a simple Solr-powered application to run the conference’s prize give-aways. And the winner is… click a button and, via an Ajax’d call to Solr and the wonderful VelocityResponseWriter, a random attendee is selected and presented. All this was built by tinkering between sessions.
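To show just how little is involved in such a prototype, here is a hedged sketch of the kinds of Solr requests it boils down to; the host, file path, field names, and seed are assumptions for illustration, not the actual demo code:

```python
from urllib.parse import urlencode

SOLR = "http://localhost:8983/solr"  # assumed local Solr instance

# Load the attendee CSV via Solr's CSV update handler
index_url = SOLR + "/update/csv?" + urlencode({
    "stream.file": "/tmp/attendees.csv",  # hypothetical path to the CSV
    "fieldnames": "first_name,last_name,country,title,company",
    "skipLines": "1",   # skip the CSV header row
    "commit": "true",
})

# Search with faceting on country, highlighting, spell checking, and debugging
search_url = SOLR + "/select?" + urlencode({
    "q": "engineer",
    "facet": "true",
    "facet.field": "country",
    "hl": "true",
    "spellcheck": "true",
    "debugQuery": "true",
})

# Pick a prize winner: one random document, rendered by VelocityResponseWriter
winner_url = SOLR + "/select?" + urlencode({
    "q": "*:*",
    "rows": "1",
    "sort": "random_1234 asc",  # dynamic RandomSortField, arbitrary seed
    "wt": "velocity",
})

print(index_url, search_url, winner_url, sep="\n")
```

Every feature in the demo is switched on by request parameters alone, which is what makes iterating between conference sessions plausible.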
And finally, Chris Male presented “European Language Analysis with Hunspell”. Hunspell is a spell checker for OpenOffice, but even more interestingly the core language rules have been distilled into Lucene token filters to allow broader stemming capabilities than the traditional Snowball algorithm, on a larger number of languages too. From the hunspell-lucene project description, it aims “to provide features such as stemming, decompounding, spellchecking, normalization, term expansion, etc. that take advantage of the existing lexical resources already created and widely-used in projects like OpenOffice”. Sweet! Lucene’s language handling has improved dramatically over the last year, and continues to do so at a rapid pace.
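As these filters were packaged up for Solr, wiring them into an analysis chain looked roughly like the following schema.xml fragment; the factory name, dictionary files, and field type here are assumptions based on how the hunspell-lucene work was later integrated, not taken from the talk:

```xml
<fieldType name="text_hunspell" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- Dictionary and affix files come from the OpenOffice language packs -->
    <filter class="solr.HunspellStemFilterFactory"
            dictionary="en_GB.dic" affix="en_GB.aff" ignoreCase="true"/>
  </analyzer>
</fieldType>
```

The point of the approach is reuse: the same .dic and .aff files that power OpenOffice spell checking drive the stemming, so supporting a new language is largely a matter of dropping in its existing dictionary.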
Kudos to Lucid’s marketing team and Stone Circle Production for a classy, well-run event.