After a week off to enjoy time with my family, I thought I would kick off the last week of 2010 with a look back at the year as it relates to the Apache Lucene ecosystem.  If you follow the amalgamation of projects that I like to call the Lucene Ecosystem (the Apache projects: Lucene, Solr, Nutch, Mahout, Tika, PyLucene, Lucy, Lucene.NET, Droids, ManifoldCF [Lucene Connector Framework], OpenNLP and UIMA), you know it has been an amazingly busy and fruitful year.  Instead of going through each project as I did in last year’s review, I’m going to be a bit less formal and hit on the highlights as I see them.

Before I dig in too much, though, a special thanks to all our customers at Lucid Imagination as well as to my coworkers.  I’m coming up on 15 years out in the “real world” and I can honestly say I’ve never enjoyed what I do as much as I do here, and that even accounts for the normal rough patches one goes through in any job.  As an engineer, there are few things as cool as getting to work with customers who are not only using, but pushing, your work/project/product on a daily basis to do new and interesting things (I think this is a direct result of the project being Open Source, which I believe has an inherently lower cost of experimentation).  I’ve been fortunate enough to meet and talk with many people doing all kinds of things with Lucene and Solr, ranging from the “mundane” of basic keyword search to building next generation search capabilities at incredible scale.  Through it all, I’m constantly amazed at the flexibility and efficiency of Lucene and Solr.  For instance, I’ve been working with one customer whose Solr-based solution (for the exact same content) will use ~50% less hardware and will have an index that is 1/6 the size of their FAST index, all while saving them major dinero.

Speaking of Lucid, one of the highlights of the year for us that relates directly to Lucene and Solr is the launch of our enterprise version: Lucidworks Enterprise.   I like to think of it as Apache Solr with a whole lot of Lucid expertise on how to use Solr baked in and topped off with other features and functionality to make building search applications easier.

OK, time to move on to the open source projects…

  1. Without a doubt, the biggest news of the year is the merging of the Lucene and Solr code bases as well as the “graduation” of several subprojects to Apache Software Foundation Top Level Projects (TLPs).  The graduating projects are Tika, Nutch, and Mahout.  We also spun Lucy (a C port) off to the Incubator, where it is building its own community.  These moves were primarily done to focus the project management on a single code base, but they also demonstrate that the projects have reached a level of maturity at the ASF.  The moves also have the side benefit of bringing each project higher visibility.
  2. I’m particularly excited about the addition of OpenNLP to the Apache umbrella.  OpenNLP is a nice open source Java project for natural language processing that has lived at SourceForge for quite some time.  I would expect development to grow quite a bit under the ASF’s community-based model.  Also, integrating OpenNLP with Solr and Lucene is pretty easy to do.  I would be remiss if I didn’t also give a nod to the addition of the ManifoldCF project to the ASF.  ManifoldCF will help unlock content in SharePoint, Documentum and other repositories for users of Lucene and Solr.
  3. Lucene’s trunk code base now implements our “Flex APIs”, which should allow users to have near total control over what goes in the index as well as alternate compression techniques, different scoring models, etc.  See Michael McCandless’ excellent talk at Lucene Revolution for more details.
  4. With all the location-aware devices and capabilities on the market, geo-spatial search is a hot topic, and Lucene and Solr have been adding quite a few capabilities in this regard, including the ability to filter, boost and sort results based on location information in documents.  See Solr’s Spatial Search Wiki page for more info as well as several of my past blog posts.
  5. Of course, everyone was abuzz about the cloud this year.  For Solr, this translates into greater efforts to make Solr easier to scale to very large installations (100s to 1000s of nodes and billions and billions of documents) via the Solr Cloud project that Yonik Seeley and Mark Miller have been spearheading.
  6. On the user side, one of the biggest pieces of buzz this year related to Lucene was the migration of Twitter search to Lucene.  At 1 billion queries per day and 50 million posts per day (all indexed and searchable in near real time), Twitter’s search system certainly has its work cut out for it.  However, as Michael Busch outlined at Lucene Revolution, Apache Lucene was up to the task!  Naturally, there were lots of other companies that migrated to Solr and Lucene as well.  Have you shared your use case?
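
For readers new to the spatial search item above, here is a rough, library-free sketch of what "filter and sort by location" means conceptually.  It is not how Lucene or Solr implement it; the function names (`haversine_km`, `filter_and_sort`), the document layout (`id`, `lat`, `lon`) and the sample data are all made up for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two lat/lon points (degrees)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def filter_and_sort(docs, lat, lon, radius_km):
    """Keep docs within radius_km of (lat, lon), nearest first.

    Each doc is a dict with hypothetical keys "id", "lat" and "lon".
    Returns a list of (id, distance_km) pairs.
    """
    hits = []
    for doc in docs:
        d = haversine_km(lat, lon, doc["lat"], doc["lon"])
        if d <= radius_km:  # the "filter" step
            hits.append((d, doc["id"]))
    hits.sort(key=lambda pair: pair[0])  # the "sort by distance" step
    return [(doc_id, d) for d, doc_id in hits]


# Hypothetical sample documents placed along the equator.
SAMPLE_DOCS = [
    {"id": "far", "lat": 0.0, "lon": 10.0},
    {"id": "near", "lat": 0.0, "lon": 1.0},
    {"id": "here", "lat": 0.0, "lon": 0.0},
]

if __name__ == "__main__":
    for doc_id, dist in filter_and_sort(SAMPLE_DOCS, 0.0, 0.0, 200.0):
        print(f"{doc_id}: {dist:.1f} km")
```

A real engine avoids scanning every document like this by using index structures to narrow the candidates first, but the filter-then-rank shape is the same idea.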

Well, I’ve no doubt missed a bunch of other things, but those items, to me, are some of the bigger highlights.  Looking forward, there are some other exciting things coming to Lucene and Solr.  In particular, I’m working on adding language identification, related searches and point-in-polygon filtering to Solr.  I would also expect we will release Lucene/Solr 3.1 fairly soon, but you can’t pin me down on a date just yet.

Here’s hoping you all have Happy Holidays and a Happy New Year!