If you hang around the topic of search long enough, you’ll eventually come to the conclusion that “real time search” is the Holy Grail, or at least something close to it. And there’s a good reason for that: data that can’t be found is as good as dead.
Today, on Day 1 of Lucene Revolution 2011, Boris Aleksandrovsky talked about how Yammer solves the problem with an interesting architecture: updates go to a dependency manager and then to a cache that’s integrated with a write-ahead logging system, so that nothing is lost before everything is indexed by Lucene.
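To make the write-ahead idea concrete, here is a minimal sketch in Python. It is not Yammer's code: the class name `WriteAheadCache`, the JSON-lines log format, and the plain dict standing in for the Lucene index are all assumptions made for illustration. The point it shows is the ordering: an update is appended to a durable log before it enters the cache, so a restart before indexing can replay it.

```python
import json
import os
import tempfile


class WriteAheadCache:
    """Hypothetical cache that logs every update to disk before holding it in memory."""

    def __init__(self, log_path):
        self.log_path = log_path
        self.pending = {}  # doc_id -> body, not yet indexed
        self._replay()

    def _replay(self):
        # On startup, rebuild the pending updates from whatever the log still holds.
        if not os.path.exists(self.log_path):
            return
        with open(self.log_path) as log:
            for line in log:
                entry = json.loads(line)
                self.pending[entry["id"]] = entry["body"]

    def put(self, doc_id, body):
        # Write to the log first, and force it to disk, before touching the cache.
        with open(self.log_path, "a") as log:
            log.write(json.dumps({"id": doc_id, "body": body}) + "\n")
            log.flush()
            os.fsync(log.fileno())
        self.pending[doc_id] = body

    def flush_to(self, index):
        # Hand the pending updates to the index (a dict here), then truncate the log.
        index.update(self.pending)
        self.pending.clear()
        open(self.log_path, "w").close()


log_path = os.path.join(tempfile.mkdtemp(), "updates.log")
cache = WriteAheadCache(log_path)
cache.put("msg-1", "Lunch?")

# Simulate a restart before anything was indexed: the update is still recoverable.
recovered = WriteAheadCache(log_path)
index = {}
recovered.flush_to(index)
```

A real system would also have to handle a crash between indexing and truncating the log, which this sketch ignores.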
Here are the slides for this session.
All of this works well for them, but creates a few challenges. Some are inherent in the fact that they’re basically searching conversations, which by nature consist of multiple documents that may not make any sense out of context from the rest of the conversation. Other problems stem from the fact that people in conversation typically use short, to-the-point statements that may not stay relevant for long. For example, “Lunch?” could be a significant part of a conversation, but probably doesn’t mean much outside of it.
To solve some of these problems, Yammer augments data as it comes into their system, and uses additional information such as the social graph. For example, since Yammer is aimed at the enterprise, it makes sense that if you have a question, an answer that comes from your boss might rank higher than one that comes from a co-worker, or from the boss of another division.
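As a rough illustration of ranking with the social graph, the sketch below multiplies a text-relevance score by a boost when the answer's author is the asker's direct manager. The function name `rank_answers`, the `manager_of` mapping, the score values, and the boost of 2.0 are all hypothetical; the talk summary does not say how Yammer actually weights these signals.

```python
def rank_answers(asker, answers, manager_of, boss_boost=2.0):
    """Sort answers by text score, boosted when the author is the asker's direct manager."""

    def score(answer):
        is_boss = manager_of.get(asker) == answer["author"]
        return answer["text_score"] * (boss_boost if is_boss else 1.0)

    return sorted(answers, key=score, reverse=True)


# Hypothetical reporting lines: alice and carol report to bob, dave reports to erin.
manager_of = {"alice": "bob", "carol": "bob", "dave": "erin"}

answers = [
    {"author": "carol", "text_score": 1.0},  # co-worker
    {"author": "erin", "text_score": 1.2},   # boss of another division
    {"author": "bob", "text_score": 0.8},    # alice's own boss
]

ranked = rank_answers("alice", answers, manager_of)
```

With these made-up numbers, the answer from alice's own boss comes out first even though its raw text score is the lowest of the three.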
Other challenges are posed by the architecture itself, such as keeping the index consistent and free of corruption, or handling out-of-order updates. For example, you may get an update and a delete for the same document at the same time; how do you handle that?
Yammer does it by monitoring its indexes against a single point of truth and by managing update events intelligently. For example, once an item is “deleted”, it doesn’t actually go away; it’s marked with a tombstone. Once that happens, the system can easily ignore subsequent updates for that document.
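The tombstone idea can be sketched in a few lines of Python. This is a toy in-memory model, not Yammer's implementation and not Lucene's API; the class `TombstoneIndex` and its methods are names invented for the example.

```python
class TombstoneIndex:
    """Toy in-memory index that remembers deleted ids so late updates can be ignored."""

    def __init__(self):
        self.docs = {}           # doc_id -> body
        self.tombstones = set()  # ids of documents that have been deleted

    def update(self, doc_id, body):
        # An update that arrives after the delete is dropped instead of resurrecting the doc.
        if doc_id in self.tombstones:
            return False
        self.docs[doc_id] = body
        return True

    def delete(self, doc_id):
        # Record the tombstone first, then remove the document if it is present.
        self.tombstones.add(doc_id)
        self.docs.pop(doc_id, None)


index = TombstoneIndex()
index.update("msg-1", "Lunch?")
index.delete("msg-1")

# A stale, out-of-order update for the deleted message is ignored.
applied = index.update("msg-1", "Lunch? (edited)")
```

Because the tombstone is recorded even when the document is not yet present, a delete that arrives before its matching update still wins.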
Eventually, real time search won’t require developers to jump through so many hoops, but for now, with intelligent management, Yammer seems to have the issue pretty well in hand.
Cross-posted with Lucene Revolution Blog; Nicholas Chase is a guest blogger. This is one of a series of presentation summaries from the conference.