Real-time Search With Lucene
Real-time search is a fuzzy concept, but it basically means shrinking the time between a modification to an index and the moment users can see it to a near-negligible quantity, or at least to a delay small enough to be acceptable for a given real-time application. Not all applications need real-time search, but one type that does is very popular these days: social networking sites. The average social networking site would like user changes to be searchable almost immediately. With Lucene, this kind of rapid-update application has required you to jump through quite a few hoops and accept more than a few compromises. The future looks a bit rosier, though.
The longer-term hope for real-time search in Lucene has been to create an IndexReader that can read the unflushed state that IndexWriter holds in RAM. That is easier said than done. What is actually materializing at this time is a slightly different approach: as soon as Lucene 2.9, you will be able to ask a live IndexWriter for an IndexReader. One of the people working on this (Lucene guru Mike McCandless) calls it ‘near real-time’ search. Briefly, it works like this (note: I am not working on this issue and do not know it in depth; I am just following along):
When you ask the IndexWriter for an IndexReader, the IndexWriter is flushed (docs accumulated in RAM are written to disk) but not committed (a commit would also fsync the files, write a new segments file, etc.). The returned IndexReader searches over the previously committed segments as well as the new, flushed-but-not-committed segment. Because flushing will likely be processor bound rather than IO bound, it should be a process that can be attacked with more processor power if found to be too slow. Also, deletes are carried in RAM rather than flushed to disk, which may help in eking out a bit more speed. The result is that you can add and remove documents from a Lucene index in ‘near’ real time by continuously asking the IndexWriter for a new reader every second or every couple of seconds. I haven’t seen a non-synthetic test yet, but it looks like it’s been tested at around 50 document updates per second without heavy slowdown (that is, with the results becoming visible every second). The patch takes advantage of LUCENE-1483, which keys FieldCaches and Filters at the individual segment level rather than at the index level. This lets you reload caches only for the segments that are new rather than for the whole index, which is essential for real-time search with filter/cache use.
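To make the per-segment caching idea a bit more concrete, here is a toy model in plain Java. It is only a sketch of the concept, not Lucene code: `ToyWriter`, `Segment`, and `SegmentCache` are hypothetical names invented for illustration, and the "cached value" is just a character count standing in for something expensive like a FieldCache entry. The point it tries to show is that when each reopen adds one small flushed segment, a cache keyed by segment only has to load that new segment.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model only: none of these types are Lucene classes. */
public class PerSegmentCacheSketch {

    /** An immutable batch of documents, standing in for one index segment. */
    static final class Segment {
        final String name;
        final List<String> docs;

        Segment(String name, List<String> docs) {
            this.name = name;
            // Defensive copy so the segment never changes after it is flushed.
            this.docs = Collections.unmodifiableList(new ArrayList<>(docs));
        }
    }

    /** Buffers documents in RAM and turns them into a new segment on reopen. */
    static final class ToyWriter {
        private final List<Segment> segments = new ArrayList<>();
        private final List<String> buffered = new ArrayList<>();
        private int nextId = 0;

        void addDocument(String doc) {
            buffered.add(doc);
        }

        /** Flushes buffered docs into a new segment and returns a point-in-time view. */
        List<Segment> getReader() {
            if (!buffered.isEmpty()) {
                segments.add(new Segment("_" + nextId++, buffered));
                buffered.clear();
            }
            return Collections.unmodifiableList(new ArrayList<>(segments));
        }
    }

    /** Caches one derived value per segment, so known segments are never reloaded. */
    static final class SegmentCache {
        private final Map<String, Integer> lengthBySegment = new HashMap<>();
        int loads = 0; // number of segments that actually had to be loaded

        int totalLength(List<Segment> reader) {
            int total = 0;
            for (Segment s : reader) {
                total += lengthBySegment.computeIfAbsent(s.name, key -> {
                    loads++; // only reached for segments not seen before
                    return s.docs.stream().mapToInt(String::length).sum();
                });
            }
            return total;
        }
    }

    public static void main(String[] args) {
        ToyWriter writer = new ToyWriter();
        SegmentCache cache = new SegmentCache();

        writer.addDocument("hello");
        writer.addDocument("world");
        int first = cache.totalLength(writer.getReader());
        System.out.println(first + " chars, segments loaded so far: " + cache.loads);

        writer.addDocument("lucene");
        int second = cache.totalLength(writer.getReader());
        System.out.println(second + " chars, segments loaded so far: " + cache.loads);
    }
}
```

In this sketch the second reopen loads only the one new segment; the first segment's cached value is reused. A cache keyed at the whole-index level would have had to recompute everything on every reopen, which is the cost the per-segment keying is meant to avoid.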
I can’t wait to see this work start creeping into Solr.