From a quiet start as a pet project to a giant in the industry, Apache Lucene is definitely the little (search) engine that could. On September 18th, 2001 (at 16:29:48 UTC), Jason Van Zyl made the first official import of Doug Cutting’s Lucene project (which started in 1997 and was hosted on SourceForge) into Apache’s Jakarta project (check out the Wayback Machine).
And while I wasn’t around in the beginning, I thought I would offer up some (little) known tidbits, links, etc. about Lucene as an ode to the search library that has significantly changed the search world, as well as my own career:
- Lucene was Doug’s way of learning Java! How’s that for a start? It took him 3 months, working 2 days a week, to crank out the first version.
- At the time, some commercial search engines could not do incremental updates of the index, meaning you had to re-index all your documents anytime you had an update. Lucene has always had an incremental model, all the way through to today’s Near Real Time features that power the likes of Twitter at 1 billion+ searches and 100M+ new documents per day.
- Field myField = Field.Text("foo", "bar"); anyone? Or how about Field myField = Field.UnIndexed("foo", "bar");
- Back then, Lucene had its own PorterStemmer; now we just use Snowball.
- Only 1 of the original committers still remains somewhat active.
- Read the old FAQ! True as it ever was. (Mostly)
- Lucene 2.3 drastically improved indexing performance thanks to a thorough overhaul of the innards while barely affecting the API. 4.0 will blow the doors off of previous versions in terms of speed and efficiency.
- Lucene is Doug’s wife’s middle name.
- Lucene has evolved from offering a single vector space scoring model to offering plug-n-play ranking (BM25, anyone?).
- Lucene is ubiquitous. It powers search on everything from mobile devices to web-scale engines. I’ve seen indexes as small as 15% of the original content. I’ve also seen indexes grow to several billion documents in size. Lucene has been used as a caching store, an ORM, a cross-language search engine, the guts of the popular Solr search server, the retrieval engine for IBM’s Watson, as well as several commercial search engines and pretty much everything in between.
- Did you know Apache Hadoop started as a subproject of Lucene? Doug Cutting and Mike Cafarella first built out Hadoop in order to scale out indexing for the Apache Nutch project. From there it was spun out to be a top-level ASF project and has gone on to be the de facto choice for large-scale distributed processing, much like Lucene is the de facto choice for search! Lucene has also spun out Mahout, Tika, Lucene.NET and Lucy!
As for how Lucene’s impacted me? In 2004, I took a job at the Center for Natural Language Processing at Syracuse University working for Dr. Liz Liddy. My job was to build an Arabic-English cross-language search engine. Within a day or two of starting, Ozgur Yilmazel (my boss at the time) said something to the effect of “we’ll be using Lucene for the implementation. Go learn it.” Digging in, I quickly needed a couple of features, the biggest one being Term Vectors, so I updated a patch from an earlier version of Lucene and managed to convince the committers at the time to commit it. From there, I kept supplying patches. Eventually, I was asked to be a committer. Some time after that, Yonik Seeley and Marc Krellenstein approached a bunch of the committers about starting a company, and here I am today at the company we (Erik, Yonik, Marc and I) founded back in 2007, Lucid Imagination. I feel fortunate to have the opportunity to work on hard problems in an interesting field and for that, Lucene, in no small part, I thank you.
But enough of my self-indulgence. How has Lucene impacted you? When did you first start using it? What’s your biggest index or fastest QPS? In what ways have you used Lucene beyond that of a search engine? Leave a comment and let us know.
Happy 10th Anniversary, Lucene!