eWeek.com recently posted a nice article by Dr. Yves Schabes, founder of Teragram, on how to make enterprise search better through higher-order processing techniques such as metadata generation and applying taxonomies, along with relevance testing on a regular basis.  Naturally, this got me thinking about all the different ways this relates to the Apache Lucene ecosystem (Lucene, Solr, Mahout, Tika, etc.) and Lucid Imagination.

First, by choosing an open backbone like Lucene and Solr, you are free to plug in the best tool for the job; proprietary solutions often limit you to their own tools and their implementation.  Let’s face it, we can’t be good at everything, so it makes sense to be able to plug in the best of breed for something that isn’t a core competency.  For example, one could choose the open source OpenNLP, or Teragram, or any other commercial vendor for these capabilities.  Solr, especially, makes it simple to plug in these capabilities through its well-defined plugin architecture.  (By the way, for almost every capability out there in this realm, there is an open source alternative that warrants investigation.)

Second, intelligent search (in other words, search that goes beyond simple keyword capabilities) is the leading edge of the field and is being adopted in more and more products, just as Dr. Schabes recommends.  Whether it is intelligent query parsing, better faceting and discovery capabilities, or integration with natural language processing (NLP) tools for named entity recognition (NER), sentiment analysis and relationship discovery, the companies making a difference in search are those that intelligently bring together a variety of approaches to solve the problem at hand.  I believe Lucene, Solr and open source are uniquely positioned to fuel intelligent search because they drive down the cost of experimentation.  That matters because it takes effort to get this stuff right, much of it spent understanding your domain and how to translate it into a good user experience.

Furthermore, open source lets you cost-effectively fill in your infrastructure and conserve your precious resources for your core competencies.  Why would you pay millions of dollars for a search engine that implements a vector space retrieval model (which most of the commercial vendors do) when you can get the same thing from Lucene for free?  If you suspect Lucene isn’t as good, think again; there’s a reason it is used at the likes of Apple, AOL, Comcast, CNET, Viacom and thousands of others.  If you like bells and whistles and knowing there is a company behind your chosen solution, I’ll do you three better.  With Lucene and Solr you get 1) a company that offers support, training, professional services, and bells and whistles, 2) the very large Apache community of users who constantly use, test, fix and improve the software, and 3) all of the source code, completely unencumbered, so you are free to change it as you see fit.
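For readers who haven’t met the term, here is a rough sketch of what a vector space retrieval model does: documents and queries become weighted term vectors, and results are ranked by cosine similarity.  This is a toy illustration only, not Lucene’s actual scoring formula; the function names, the whitespace tokenizer and the particular TF-IDF weighting are all my own assumptions for the example.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: naive lowercase/whitespace tokenizer; real engines use analyzers.
    return text.lower().split()


def build_vectors(docs):
    """Return (list of TF-IDF vectors, idf table) for a list of document strings."""
    tokenized = [tokenize(d) for d in docs]
    n = len(docs)
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))  # document frequency: count each term once per doc
    # Assumption: a smoothed idf of log(N / df) + 1; other variants are common.
    idf = {term: math.log(n / count) + 1.0 for term, count in df.items()}
    vectors = [
        {term: tf * idf[term] for term, tf in Counter(tokens).items()}
        for tokens in tokenized
    ]
    return vectors, idf


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query, docs):
    """Return (score, doc_index) pairs with a positive score, best match first."""
    vectors, idf = build_vectors(docs)
    # Query terms that never occur in the collection are ignored.
    q = {t: tf * idf[t] for t, tf in Counter(tokenize(query)).items() if t in idf}
    scored = [(cosine(q, vec), i) for i, vec in enumerate(vectors)]
    hits = [pair for pair in scored if pair[0] > 0.0]
    return sorted(hits, key=lambda pair: pair[0], reverse=True)


if __name__ == "__main__":
    docs = [
        "apache lucene search library",
        "solr search server built on lucene",
        "mahout machine learning",
    ]
    for score, idx in search("lucene search", docs):
        print(round(score, 3), docs[idx])
```

The shorter document ranks first for the sample query because its vector is dominated less by terms the query doesn’t mention, which is the kind of length normalization a production engine handles far more carefully.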

Finally, you get to choose whether you even need a particular capability.  On more than one occasion, I have been involved in replacing a proprietary search package so bloated with unused add-ons that the Solr installation, containing only the required functions, needed an index that was a mere fraction of the size of the proprietary one, resulting in:

  • Less hardware to achieve the same throughput
  • Lower operations costs (more hardware means more hardware failures)
  • Faster indexing, faster queries, etc.

In short, Lucene and Solr offer a cost-effective and fully capable mechanism for improving the efficiency of search along the lines of the approach Dr. Schabes recommends, giving you the freedom to choose based on your idea of what works, not someone else’s.