As we count down to the annual Lucene/Solr Revolution conference in Austin this October, we’re highlighting talks and sessions from past conferences. Today, we’re featuring Etsy’s Shikhar Bhushan and his experiments with search-time parallelism.
Is it possible to gain the parallelism benefit of sharding your data into multiple indexes, without actually sharding? Isn’t your Lucene index already composed of shards, i.e., segments? This talk will present an experiment in parallelizing Lucene’s guts: the collection protocol. An express goal was to do this in a lock-free manner using divide-and-conquer. Changes to the Collector API were necessary, such as orienting it to work at the level of child “leaf” collectors so that segment-level state could be accumulated in parallel. I will present technical details that were learned along the way, such as how Lucene’s TopDocs collectors are implemented using priority queues and custom comparators. I will then turn to the parallelizability of collectors: how some collectors, like hit counting, are embarrassingly parallelizable; how some, like DocSet collection, were a delightful challenge; and how others involve space-time tradeoffs that need more consideration. Performance testing results, which currently range from worse to exciting, will be discussed.
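To make the divide-and-conquer idea in the abstract concrete, here is a minimal sketch in plain Java. It is not the code from the talk and it does not use Lucene's Collector API. Every name in it (`ParallelCollectSketch`, `Hit`, `leafTopK`, `keepBest`) is hypothetical, and a "segment" is assumed to be just an array of scores in which a value of 0 means the document did not match. Each segment is collected independently with its own bounded priority queue, and the per-segment results are merged afterwards, so no state is shared between segments and no locks are taken in this code.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class ParallelCollectSketch {

    /** Hypothetical stand-in for a scored hit: global doc id plus score. */
    static final class Hit {
        final int doc;
        final float score;
        Hit(int doc, float score) { this.doc = doc; this.score = score; }
    }

    // Custom comparator: higher score first, ties broken by lower doc id.
    static final Comparator<Hit> BEST_FIRST = (a, b) -> {
        int c = Float.compare(b.score, a.score);
        return c != 0 ? c : Integer.compare(a.doc, b.doc);
    };

    /** Hit counting: each segment is counted on its own and the counts are summed. */
    static int countHits(float[][] segments) {
        return IntStream.range(0, segments.length).parallel()
            .map(i -> {
                int n = 0;
                for (float s : segments[i]) if (s > 0f) n++;
                return n;
            })
            .sum();
    }

    /** Keeps the k best hits using a bounded heap whose head is the worst kept hit. */
    static List<Hit> keepBest(Iterable<Hit> hits, int k) {
        PriorityQueue<Hit> worstOnTop = new PriorityQueue<>(BEST_FIRST.reversed());
        for (Hit h : hits) {
            worstOnTop.add(h);
            if (worstOnTop.size() > k) worstOnTop.poll(); // evict the current worst
        }
        List<Hit> best = new ArrayList<>(worstOnTop);
        best.sort(BEST_FIRST);
        return best;
    }

    /** "Leaf" collection: the top k of one segment, with local ids mapped to global ids. */
    static List<Hit> leafTopK(float[] segment, int docBase, int k) {
        List<Hit> matches = new ArrayList<>();
        for (int local = 0; local < segment.length; local++) {
            if (segment[local] > 0f) matches.add(new Hit(docBase + local, segment[local]));
        }
        return keepBest(matches, k);
    }

    /** Divide: collect each segment in parallel. Conquer: merge the per-segment top k. */
    static List<Hit> topK(float[][] segments, int k) {
        int[] docBase = new int[segments.length];
        for (int i = 1; i < segments.length; i++) {
            docBase[i] = docBase[i - 1] + segments[i - 1].length;
        }
        List<Hit> candidates = IntStream.range(0, segments.length).parallel()
            .mapToObj(i -> leafTopK(segments[i], docBase[i], k))
            .flatMap(List::stream)
            .collect(Collectors.toList());
        return keepBest(candidates, k);
    }

    public static void main(String[] args) {
        float[][] segments = { {0.5f, 0f, 2.0f}, {1.5f, 0f}, {0f, 3.0f, 1.0f, 0f} };
        System.out.println("hits: " + countHits(segments));
        for (Hit h : topK(segments, 3)) {
            System.out.println("doc " + h.doc + " score " + h.score);
        }
    }
}
```

Hit counting merges trivially because addition is associative. The top-k merge works because any hit in the global top k must also be in the top k of its own segment, so keeping only k candidates per segment loses nothing.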
Shikhar works on Search Infrastructure at Etsy, the global handmade and vintage marketplace. He has contributed patches to Solr/Lucene, and maintains several open-source projects such as a Java SSH library and a discovery plugin for elasticsearch. He previously worked at Bloomberg where he delivered talks introducing developers to Python and internal Python tooling. He has a special interest in JVM technology and distributed systems.
Join us at Lucene/Solr Revolution 2015, the biggest open source conference dedicated to Apache Lucene/Solr, on October 13-16, 2015 in Austin, Texas. Come meet and network with the thought leaders building and deploying Lucene/Solr open source search technology. Full details and registration…