Lucene, Open Source and the Cost of Experimentation
I was at a company a while back that was considering replacing a commercial vendor with a Lucene/Solr-based solution. The company paid a good chunk of money for the commercial solution, and we were discussing the (de)merits of the system. Way back when, their primary motivation for the purchase was that they wanted a “shrink wrap” product with the peace of mind of having a company behind it (they bought their system pre-Lucid), and money wasn’t much of an issue for them, at least not then. Given their criteria at the time, the purchase made sense on many levels. Since purchasing, however, several things have changed. First, the economy “ain’t what she used to be” and that license fee is roughly equivalent to a good chunk of salaries. Second, almost ironically given they thought they wanted “shrink wrap”, the vendor’s wrappings weren’t able to satisfy all of their needs, even though it was clear the information they needed was in the index under the hood. So, one of the primary things they asked the vendor to do was to open up part of the product so that they could have lower-level access to the index, which would then allow them to experiment more. In their case, they were successful in convincing the vendor to do so because, let’s just say, they were a rather large company.
This got me thinking about one of the truly great things about Lucene (and open source, for that matter), something that most people don’t focus on because they are more interested (rightfully so, for the most part) in front-and-center costs like licenses, support, and training, which are easy to attach dollar signs to and budget for in advance.
The idea I am talking about is what I will call the Cost of Experimentation (COE). It is something I’ve seen come up for a good number of people when implementing search, and it is mainly due, in my opinion (please feel free to add yours), to at least two things:
1. Search is, by definition, subjective. Language is ambiguous. Queries are subjective. Indexing is subjective. Thus, results are subjective.
2. Thinking about data from a search point of view often frees your mind from the rigidity of normalized data. In other words, all this free-form, loosey-goosey unstructured text spurs innovative thinking.
The basic idea behind the COE is that you don’t know what you don’t know. Furthermore, figuring out what you don’t know requires you to try some things out.
Say, for instance, you bought a license from Vendor X for your project. You, plus their consultants, do a bang-up job and the application is a huge success. You’re anointed the company search guru and all is good. In fact, the application is such a success that it spurs a whole new round of innovation within the company. People have all kinds of ideas for how to improve things and a bunch of new ideas for accessing the information (I’ve seen it happen in more than one place). For example, they might start asking if Vendor X can do location-based search or faceting or named entity recognition or some other capability. So, you look on the vendor’s product sheet, see they offer one or more of these, and then dutifully call up your sales rep and say “I need X, Y and Z”. His eyes light up and he tells you that will be $, $$, and $$$.
Bam! You’ve just been hit by the high cost of experimentation. Because you are locked into a license fee model, the only way you can experiment and innovate is by paying more license fees to get add-ons that (may) come close to solving your problem. Or, even worse, the vendor doesn’t even have a solution and you either have to wait for them to create it and make it available or pay their consultants to develop it. Now, maybe you could develop it yourself, but what happens when you need access to low-level APIs that the vendor doesn’t supply?
Contrast that with an open source solution like Lucene and Solr. In most cases, the only cost of experimentation is your direct labor cost, which you had with the commercial vendor anyway. In many cases, Lucene and Solr will have an add-on component that will work right out of the box. Finally, if there isn’t a solution, you could develop it yourself, and you wouldn’t have to worry about needing low-level access to undocumented APIs because you have the source and can use it as you see fit.
Personally speaking, I’ve experimented with Lucene and Solr in a number of ways throughout my career. Whether it was a research-oriented Question Answering system, an Arabic-English cross-language search system, playing around with classifiers, or simply looking at co-occurrence information for sentiment analysis, Lucene and Solr have allowed me to take untested ideas and make them reality. I’d encourage you to take your data, put it into Lucene/Solr and see what it can do. You just might be surprised what you can find!