With just a week left before the doors open for this year’s Lucene Revolution, it’s time to start thinking about which sessions you want to attend. (It’s also time to sign in to the community and set up your calendar …) It’s always great to see open source in big companies, and it’s good to see companies like Intuit and Travelocity sending speakers to Lucene Revolution this year. It’s also intriguing to see how these companies are putting the technology to work. To look at just a few:
Trey Grainger of Careerbuilder.com will be talking about their move from Microsoft FAST ESP to Solr, which looks interesting partly on non-technical grounds: when a 1% drop in search result quality will result in literally millions of missed opportunities, how do you get the corporate bosses to take a chance on moving to open source? (And get them to spend money on the transition to boot?) You make sure to build a system that maintains (or improves) your search quality, and you embrace the additional benefits, that’s how.
The number one benefit [of the migration] by far is an increase in our agility. CareerBuilder sees our ability to rapidly respond to market needs as a key competitive advantage. We are now able to do things in hours or days with our Solr implementation that used to take us weeks (or sometimes months if we could do them at all). Some of this speed improvement is related to the underlying technologies in play, but I think most of it is related to the increased focus and expertise in search that has come from us taking our search platform fully into our own hands and being able to customize and dig deeply into the underlying code stack.
Trey will be talking about how Careerbuilder maintained their search quality through the migration, and about how they built a cloud-like API that lets their engineers build search applications and integrate them into their system without having to know anything about Solr.
I’m particularly looking forward to Alberto Mijares’ presentation on Canoo’s Software as a Service. It’s not so much the SaaS aspect (although that’s interesting as well) but Canoo has built a service in which they take articles from multiple publications and use Lucene’s analysis pipeline to add semantic information to articles so they can recommend “related” articles. “It’s basically Lucene’s MoreLikeThis but on steroids,” he told DZone. Canoo takes information from external sources, such as Wikipedia, to add even more semantic richness.
Semantics is a really complex topic and, above all, it is subjective. What “makes sense” for me can be completely wrong for another person (different knowledge, different experience, different context). What most people don’t know is that Semantic Web technologies perform very well when applied in the field of data integration.
I’ll definitely be interested to hear how this all comes together in the context of Lucene.
Olaf Zschiedrich, who heads up eBay Germany’s classifieds, will be talking about how they were able to build the site from scratch in just four months, partly because search was already handled by Solr. He’ll also be sharing some best practices that they’ve learned as the number one classifieds site in Germany.
Information on Yammer should be indexed and available for users to search in real time, virtually in less than a second. This makes the Yammer indexing system similar to Twitter, where tweets are indexed in real time. Search results likewise are available in reverse chronological order, which is based on the assumption that for certain types of events, timeliness is the most pertinent characteristic. This maps really well onto types of content like news, where relevancy declines fairly rapidly as time passes, or types of content which are more transient in nature, like events and meetings.
Considering the volume of material Yammer deals with — their system has 100,000 networks, 2 million users, and scales up to 1 billion messages — it’ll be good to get a handle on how they’ve structured their architecture, especially when you consider that Yammer is a complex knowledge base, and not just a simple query response system.
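To make the "reverse chronological order" idea concrete, here is a minimal sketch in Python. It is not Yammer's implementation: the `Message` class and the `search_newest_first` function are hypothetical names invented for illustration, and a plain substring match stands in for a real Lucene query. It only shows what it means to rank hits by recency rather than by a relevance score.

```python
from dataclasses import dataclass


@dataclass
class Message:
    """Hypothetical message record (not Yammer's actual data model)."""
    text: str
    timestamp: float  # seconds since the epoch


def search_newest_first(messages, term):
    """Return the messages containing `term`, newest first.

    A case-insensitive substring match stands in for a real query;
    the ordering depends only on the timestamp.
    """
    needle = term.lower()
    hits = [m for m in messages if needle in m.text.lower()]
    return sorted(hits, key=lambda m: m.timestamp, reverse=True)
```

In a real Lucene or Solr deployment the same effect would come from sorting on a date field instead of on the relevance score.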
Our current indexing approach is to treat all logs as simple text, which is fast and flexible – we’ll accept events in any format at all. We create one “index” per customer, sharding by time, and this allows us to grow a user’s index as large as they need.
Of course, that means their system requires multiple servers all working in concert. This is the kind of thing that you’d think would be complicated beyond the reach of most developers, but Solr Cloud really levels the playing field here, and it will be interesting to hear Jon explain how that works for Loggly.
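The "one index per customer, sharded by time" layout from the quote can be pictured with a short Python sketch. This is an illustration under stated assumptions, not Loggly's code: the daily shard granularity, the `customer_YYYYMMDD` naming scheme, and the `shard_name` and `route_events` functions are all hypothetical.

```python
from datetime import datetime, timezone


def shard_name(customer_id, event_time):
    """Build a hypothetical shard name: one index per customer, one shard per UTC day."""
    day = event_time.astimezone(timezone.utc).strftime("%Y%m%d")
    return f"{customer_id}_{day}"


def route_events(events):
    """Group (customer_id, event_time, raw_text) tuples by their target shard.

    The log line is kept as plain text, matching the idea of accepting
    events in any format.
    """
    shards = {}
    for customer_id, event_time, raw_text in events:
        shards.setdefault(shard_name(customer_id, event_time), []).append(raw_text)
    return shards
```

Because each shard covers a bounded time window, a customer's index can keep growing by adding shards, and those shards can be spread across servers.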
Finally, I found a video interview with Erik Hatcher about what he’s going to be doing at Lucene Revolution. Once again he’ll be giving his talk on Rapid Prototyping with Solr, and what’s fun about that is that every time he gives it he does something different. But more than that, Erik will be doing some of the Lucene and Solr training on Monday and Tuesday, and how often do you get the opportunity to be trained by one of the foremost committers on a project like this?
Of course these are just a few of the sessions on offer; Lucene Revolution will have four tracks running simultaneously.
Still haven’t registered? Some of the training sessions are sold out, but there are still some seats available for the conference itself.