Lucene and Solr are state of the art search technologies available for free as open source from The Apache Software Foundation. Lucene is the underlying search library, and Solr is a platform built on top of Lucene that makes it easy to build Lucene-based applications. Both are full-featured and have excellent performance, relevancy ranking and scalability. These technologies are used today by thousands of organizations and power substantial search applications at AOL, Comcast Interactive Media, IBM, Netflix, LinkedIn and MySpace.
By Marc Krellenstein, CTO, Lucid Imagination
In the last decade a single search engine technology has sometimes been the dominant choice for enterprises interested in building their own search capability for a web site, a product, or internal or extranet use. No one product can meet all needs, but a single technology was recognized as the default choice, and users could most easily start their search evaluation by asking if there were reasons not to use it. Today, I believe Apache Lucene and Solr are the default full text search technology for organizations.
Lucene is a Java-based search library. It was initially written over 10 years ago by Doug Cutting, who had worked on two search engines before that, including the once popular Excite Internet service. Lucene was one of the first 3rd generation search capabilities. Like Google and Microsoft’s recently acquired Fast, Lucene has an architecture that employs best practice relevancy ranking and querying, as well as state of the art text compression and a partitioned index strategy to optimize both query performance and indexing flexibility.
Unlike those other products, however, Lucene is available for free as open source under the liberal Apache Software license. This license allows users to modify or embed the technology as they see fit, and to keep proprietary, sell and/or re-distribute any resulting product. Lucene is written entirely in Java, though there are today .NET and other versions available. The source code is not merely freely available but actually practical and relatively simple to use or modify. Finally, and perhaps most importantly for an open source project, Lucene has stood the test of time. Today, Lucene has a large number of active contributors and thousands of installations, including production applications at AOL, Apple, CNET, Comcast Interactive Media, IBM, LinkedIn, Monster, MySpace, Netflix, Technorati and Wikipedia. And while there are other open source search projects, none have more than a fraction of Lucene’s installed base and contributors.
Lucene is full-featured and provides:
- Speed: sub-second query performance for most queries
- Strong out-of-the-box relevancy ranking: as good as or better than the best commercial competitors
- Complete query capabilities: keyword, Boolean and +/- queries, proximity operators, wildcards, fielded searching, term/field/document weights, find-similar, spell-checking, multi-lingual search and more
- Full results processing, including sorting by relevancy, date or any field, dynamic summaries and hit highlighting
- Portability: runs on any platform supporting Java, and indexes are portable across platforms; you can build an index on Linux, copy it to a Microsoft Windows machine and search it there
- Scalability: there are production applications with hundreds of millions, and even billions, of documents or records
- Low-overhead indexes and rapid incremental indexing, especially with versions 2.3 and later
Solr is a layer of code on top of Lucene that transforms Lucene into a search platform for building search applications. Solr was created by Yonik Seeley while at CNET and contributed to Apache by CNET. Solr provides the following capabilities:
- Web service: Solr places Lucene over HTTP, allowing programs written in any language to invoke Lucene
- XML-based schema for managing indexed fields and their characteristics
- System administration tools for configuration, data loading, index replication, statistics, logging and cache management
- Large scale distributed search
- Fixed/paid result list placement
- Faceting — the dynamic clustering of items or search results into categories that lets users drill into search results (or even skip searching entirely) by any value in any field, as seen on popular ecommerce sites such as Amazon
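To make the web service and faceting capabilities more concrete, here is a minimal sketch of how a client in any language might ask Solr for search results plus facet counts over HTTP. It only builds the request URL and reshapes a facet response; it does not contact a server. The host, port, core name ("products") and field name ("category") are assumptions made up for illustration, and the flat value/count list reflects Solr's default JSON facet layout as I understand it, so check it against your own installation.

```python
from urllib.parse import urlencode


def build_facet_query_url(base_url, query, facet_field, rows=10):
    """Build a Solr-style select URL asking for results and facet counts."""
    params = [
        ("q", query),                  # the user's search terms
        ("rows", rows),                # number of results to return
        ("wt", "json"),                # ask for a JSON response
        ("facet", "true"),             # turn faceting on
        ("facet.field", facet_field),  # field whose values become categories
    ]
    return base_url.rstrip("/") + "/select?" + urlencode(params)


def pair_up_facet_counts(flat_counts):
    """Turn [value, count, value, count, ...] into a {value: count} dict."""
    return dict(zip(flat_counts[0::2], flat_counts[1::2]))


# Hypothetical server location and field name, for illustration only.
url = build_facet_query_url(
    "http://localhost:8983/solr/products", "digital camera", "category"
)
print(url)

# Made-up facet counts in the alternating value/count layout.
print(pair_up_facet_counts(["cameras", 12, "lenses", 5]))
```

Because the request is just a URL, the same call could be issued from a browser, a shell script or any programming language, which is the point of placing Lucene behind HTTP.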
Most users building Lucene-based search applications will find they can do so more quickly if they start with Solr since it contains many of the capabilities needed to turn a core search capability into a full-fledged search application. Most of the more recent large Lucene-based installations mentioned above use Solr, including AOL, Comcast Interactive Media and Netflix, and of course CNET. However, as in any open layered environment, users can still choose to work directly with the underlying Lucene library, perhaps to manipulate or exploit lower level Lucene capabilities.
The fact that Lucene/Solr are Apache open source software provides some significant advantages:
- Free to use — no license fees whatsoever
- Complete source code, providing the independence and control one normally gets only by writing one's own software. The Lucene/Solr Apache license allows users to produce or distribute derivative or proprietary works without restrictions.
- Code developed by programmers who are themselves end-users trying to solve pressing end-user needs
- Community — A large, active and helpful community of developers and end-users, with forums and mailing lists for discussion and resolving problems and independent consultants offering more specialized assistance
Open source software — Apache-licensed or otherwise — also has some limitations as compared to the best commercial software:
- No formal support contracts
- No assured availability of training or other professional services to fulfill specific software needs or assist with building an application
- No formalized release testing program, release schedule or assurance of upgrade compatibility, though Lucene/Solr contributions must have unit testing before they are committed to the code, and releases receive integrated testing
Building good full text search is a demanding undertaking, and having the best technology is only part of the solution. Search engines such as Lucene/Solr have good default settings and tools to help make applications not only work but be effective. But the best search applications require understanding both the data and the users. Information must be aggregated and indexed from file systems, databases or web sites and normalized for search. For example, one set of documents may refer to a document name as a title and another as a heading; a search for ‘fox’ should probably find items containing ‘foxes’ as well. Potential users’ level of expertise and familiarity with the data must also be considered in the design, and the use of synonyms may be needed (e.g., heart attack = myocardial infarction). Relevancy ranking will generally require tuning, based on what users are actually doing, to improve an initial application’s effectiveness. More advanced features such as ‘automatic feedback’ may be useful (and, on the other hand, many oft-attempted efforts at improving search can be ignored in favor of current best practices).
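The normalization steps mentioned above (mapping 'heading' to 'title', matching 'fox' with 'foxes', and treating 'heart attack' and 'myocardial infarction' as equivalent) can be sketched in a few lines. This is a toy illustration only: the lookup tables and function names are hypothetical, the plural folding is deliberately crude, and it is not how Lucene or Solr analyzers are implemented.

```python
# Hypothetical lookup tables; a real application would derive them from its data.
FIELD_ALIASES = {"heading": "title"}
SYNONYMS = {"heart attack": "myocardial infarction"}


def normalize_record(record):
    """Rename aliased fields so every source uses the same field names."""
    return {FIELD_ALIASES.get(name, name): value for name, value in record.items()}


def fold_plural(token):
    """Crudely strip a plural ending so 'foxes' and 'fox' index alike."""
    if len(token) > 4 and token.endswith("es") and token[:-2].endswith(("x", "s", "ch", "sh")):
        return token[:-2]
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def index_terms(text):
    """Lowercase, split on whitespace and fold plurals."""
    return {fold_plural(token) for token in text.lower().split()}


def expand_query(query):
    """Return the query plus any variant produced by a known synonym."""
    lowered = query.lower()
    variants = [lowered]
    for phrase, synonym in SYNONYMS.items():
        if phrase in lowered:
            variants.append(lowered.replace(phrase, synonym))
        elif synonym in lowered:
            variants.append(lowered.replace(synonym, phrase))
    return variants


doc = normalize_record({"heading": "Foxes in winter"})
print(doc)
print(fold_plural("fox") in index_terms(doc["title"]))
print(expand_query("Heart attack symptoms"))
```

In practice, a search engine's own analysis and synonym facilities would do this work; the sketch only shows why the same normalization has to be applied to both the indexed data and the incoming query.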
A great search application such as Google is only partly a success of raw technology. It also reflects an expert appreciation of the data and users of that particular application. With more than enough good answers for a search on the Internet and even more bad answers, a popularity-weighted ranking will screen out the bad data and find more than enough good data for Google’s typical users. But any particular search application may have very different data and users. Bad data usually does not exceed good data for most search applications, and finding the best results might be more important than finding good enough results. The security and privacy requirements of a typical application may also be very different from those of a public Internet service (or those of an intelligence agency). The art of good search is to be able to transform good generic technology into a good specific application.
The skills for building a great search application come mostly from having built other ones, but for most users, building a search application is an infrequent occurrence. For that reason it can be useful to seek out expert and experienced resources to assist with application design, development and/or deployment, just as it may be valuable to secure expert support resources for ongoing maintenance. Commercial companies such as Lucid Imagination are based on open source but can provide such formal support and assistance for people using those open source tools.