It’s that time of year, so I thought I would take a look back at the year that was for the Lucene Ecosystem and maybe look ahead just a little bit too.
First and foremost, it should be obvious to even the most casual observer that the Apache Lucene communities are thriving. Not only is it a great time to be involved in open source, it’s a great time to be involved in Lucene. Both as a committer and as an employee of Lucid Imagination, I’m continuously amazed at the vibe produced by the people using the Lucene suite of libraries, tools and applications. People are routinely solving both large-scale and really hard problems using the Lucene ecosystem, and they are doing it on time and on budget. For instance, this year alone, I’ve seen companies and individuals using Lucene and Solr to provide search in production environments with document counts ranging from a few tens of thousands all the way up to 5-10 billion plus, and query rates ranging from barely a blip to 1000+ QPS. I’ve also seen many people using Lucene to power recommendation engines, content management systems, machine learning/NLP applications and log analysis tools.
Much to my initial surprise, the number one reason I hear for why people chose Lucene is flexibility. (I thought it would be the fact that it is free to use, but that is just icing on the proverbial cake, I guess.) Namely, Lucene gives them the flexibility to build what they want or simply to use it out of the box. It gives them the flexibility to bring in other tools from other open source projects or other commercial vendors, all without compromising speed or scale.
With that in mind, I thought I would give some highlights of both the top level project that is Lucene (TLP, http://lucene.apache.org, the ASF project that “houses” all of the Lucene related subprojects) as well as the individual projects. (I’m not involved in all of them, so please correct me if I’m wrong!)
Whew! It’s been a busy year for the Lucene TLP. We started the Open Relevance Project (ORP), added PyLucene (a Python port of Lucene) and successfully graduated a .NET version of Lucene from ASF incubation, not to mention the fact that the Lucene PMC is responsible for overseeing the release of all the various bits and bytes for each and every subproject (which is a lot of releases!). We also, for the first time ever, organized two days of Lucene related talks at ApacheCon US plus two days of training and meetups. (In the past, organization was always handled by the ASF Conference Committee.)
In looking ahead for the TLP, I see a continued focus on providing quality software across all the projects. Additionally, keep your ears open, as there is a new subproject brewing that I think will make it even easier for people to deploy Lucene based solutions. Finally, just as Lucene gave birth to Apache Hadoop and is happy to see it doing so well, there is growing talk that Lucene will look to see Apache Mahout off as its own TLP. Of course, none of that is set in stone yet!
Lucene Java (i.e. what everyone knows as “Lucene”) not only continues to provide a rock solid indexing and search API, it also continues to push forward with new capabilities. In 2009, Lucene had four releases (2.4.1, 2.9.0, 2.9.1 and 3.0.0). 2.9.0 was probably the most interesting, as it significantly improved performance in a number of areas, while 3.0.0 removed all of the deprecated APIs and finally, officially, dropped support for JDK 1.4. I’ll leave it to the reader to go look up all the features and changes, as they are numerous.
Looking ahead, the phrase of the year appears to be “flexible indexing”. Flex Indexing looks to make it even easier for people to custom craft what is in their index, whether that is rich token attributes (aka “typed” payloads), alternative scoring models (like Okapi BM25) or a bare bones index designed for blazing fast speed.
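To make the “alternative scoring models” idea a bit more concrete, here is a minimal Python sketch of the textbook Okapi BM25 formula. It is not Lucene code and does not reflect any Lucene API; the function name `bm25_score`, the toy corpus, the defaults `k1=1.2` and `b=0.75`, and the non-negative IDF variant are all assumptions made for illustration.

```python
import math
from collections import Counter


def bm25_score(query_terms, doc_terms, corpus, k1=1.2, b=0.75):
    """Score one tokenized document against a query with Okapi BM25.

    corpus is a list of tokenized documents (lists of strings); it is
    only used for document frequencies and the average document length.
    """
    n_docs = len(corpus)
    avg_len = sum(len(d) for d in corpus) / n_docs
    term_freqs = Counter(doc_terms)
    score = 0.0
    for term in query_terms:
        # Number of documents that contain the term at least once.
        doc_freq = sum(1 for d in corpus if term in d)
        if doc_freq == 0:
            continue
        # Non-negative IDF variant (an assumption; other variants exist).
        idf = math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
        freq = term_freqs[term]
        # Length normalization: longer-than-average documents are damped.
        norm = freq + k1 * (1 - b + b * len(doc_terms) / avg_len)
        score += idf * freq * (k1 + 1) / norm
    return score


# Hypothetical toy corpus, purely for illustration.
corpus = [
    ["lucene", "search", "index"],
    ["solr", "search", "server"],
    ["mahout", "machine", "learning"],
]
matching = bm25_score(["lucene"], corpus[0], corpus)
non_matching = bm25_score(["lucene"], corpus[1], corpus)
```

With this toy data, the document that contains the query term gets a positive score and the one that does not gets zero. The appeal of a flexible index is that a formula like this could be swapped in without rebuilding everything around it.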
With Lucene as the engine, Solr has evolved into quite the car. Building on all of the goodness that is Lucene, Solr, in 2009, released version 1.4 with a whole slew of new features, faster implementations and bug fixes. Highlights for 1.4 include: improved filtering and faceting performance, support for clustering, rich document indexing via Apache Tika, multi-select faceting (see Lucid’s very own http://find.searchhub.org/ for a demo), many new Query capabilities and a whole bevy of new Components (Terms, Term Vectors, Auto-suggest, deduplication and Statistics on result sets) that truly make Solr an incredible search platform.
Looking ahead, Solr 1.5 (2.0?) is already in the works and looks to have even more functionality. For instance, a lot of work is underway to integrate Apache ZooKeeper and other distributed capabilities, which will help make deploying Solr at scale even easier. Meanwhile, many are hard at work adding “field collapsing” (search result grouping/deduplication) and spatial (local/geo) search.
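For readers who have not run into “field collapsing” before, the idea can be sketched in a few lines of Python. This is only a conceptual illustration, not how Solr implements or will implement the feature; the function name `collapse_by_field`, the `site` field and the sample hits are hypothetical.

```python
def collapse_by_field(hits, field):
    """Keep only the best-ranked hit for each distinct value of `field`.

    `hits` is assumed to be already sorted by descending score, so the
    first hit seen for a given field value is the one that is kept.
    """
    seen = set()
    collapsed = []
    for hit in hits:
        key = hit.get(field)
        if key not in seen:
            seen.add(key)
            collapsed.append(hit)
    return collapsed


# Hypothetical ranked result list, purely for illustration.
hits = [
    {"id": 1, "site": "a.example", "score": 9.1},
    {"id": 2, "site": "a.example", "score": 8.7},
    {"id": 3, "site": "b.example", "score": 7.5},
    {"id": 4, "site": "a.example", "score": 6.0},
]
collapsed = collapse_by_field(hits, "site")
```

Here the four hits collapse to two, one per site, which is the behavior people usually want when a single source would otherwise dominate the first page of results.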
It’s been a very exciting year (in my completely biased opinion!) for Mahout, the scalable machine learning project under Lucene. In 2009, Mahout shepherded through its very first release (0.1), built on the strength of a few dedicated volunteers working to add capabilities for clustering, categorization and collaborative filtering. Next came 0.2, with many new features (frequent pattern set mining, Latent Dirichlet Allocation, Random Decision Forests, new recommendation capabilities), API and performance improvements, and a growing list of people who stopped lurking and stepped up to help out. Towards the end of the year, Mahout is already reaching a list volume that I find difficult to keep up with if I miss a day or two. For starters, we have taken on the task of integrating/transforming the Colt matrix library for our needs. We are also working on adding truly large scale recommendation capabilities, plus a Support Vector Machine implementation and Logistic Regression. Not only that, but the Mahout mailing list continues to be a valuable resource for people seeking practical advice on deploying machine learning in production environments, whether or not they choose Mahout.
In 2010, I suspect Mahout will become its own TLP, with several subprojects roughly divided as: core/utilities, recommendations (Taste) and NLP. Of course, until it happens, this is just speculation. I also think Mahout will look to finalize its APIs for a 1.0 release.
In 2009, Apache Nutch released the long awaited version 1.0. This release contained many new indexing and scoring capabilities, as well as integration with Solr. The community continues to be focused on providing large scale crawling and search capabilities by leveraging Apache Hadoop and Lucene/Solr. Currently, the community is actively looking at modularizing Nutch to allow it to more easily plug in other ecosystem components like Tika and Solr while focusing on the primary task of obtaining and managing content via crawling.
Apache Tika is a content extraction framework for “rich” documents like Adobe PDF and Microsoft Office. In 2009, Tika released versions 0.3, 0.4 and 0.5, all with incremental improvements designed to make it more stable and easier to use. Each release also seemed to carry with it a new list of supported file formats as more and more people join the project to lend a hand.
Coming up, I suspect Tika will look to finalize a 1.0 release at some point in 2010, as well as focus on standardizing, if such a thing is possible, the metadata artifacts produced by Tika.
Open Relevance Project
The ORP is a project that has been in my brain for several years now and finally got off the ground in 2009. The goal of ORP is to provide corpora, queries, judgments and other tools to help search and machine learning projects discuss relevance in a completely open way. While the project is really young, it is slowly but surely building up steam by adding some basic tools and collections thanks to the hard work of several individuals. In 2010, look for ORP to build out a more complete toolset while attracting more users and contributors. It will also be vital for the ORP to create its own versioned corpora for download (free!) so that all experiments can be reliably reproduced.
Droids is a standalone crawler framework currently in incubation at the ASF. Development was active in 2009, but the project has not yet had a release. For now, it is a Spring based framework that allows one to quickly build out agents that can go and crawl and process content.
In 2009, Lucene.NET graduated (some infrastructure changes still need to happen) from ASF incubation and became a full-fledged member of the Lucene ecosystem. While I’m not closely involved with Lucene.NET, the community continues to provide value to those looking for a solid search library in .NET. Since the project is mostly autogenerated from the Java sources, the .NET version has tracked the Lucene Java releases fairly closely.
Looking forward, I expect the .NET version will strive to maintain a lockstep march with Lucene releases.
Similar to .NET, PyLucene produces a Python port of Lucene Java. In 2009, PyLucene was formally welcomed into the Lucene fold via a software donation by Andi Vajda. It continues to produce releases of PyLucene in lockstep with Lucene Java.
Lucy is a “loose” ‘C’ port of Lucene. Lucy finally got off the ground in 2009 and is steadily working on building out a core search library that provides fast search capabilities for languages like Perl, C and Ruby.
For 2010, look for Lucy to continue to grow its community while adding capabilities.
While the past is, of course, no prediction of the future, I think it’s safe to say Lucene is looking to continue to provide significant capabilities and value to both well established and new communities alike. With open source, you never know where the next good idea is coming from, so make sure to stay tuned both here and on the mailing lists for more insight and more cool new capabilities.
Happy Holidays and here’s to an Open Source 2010!