Many users of databases often wonder what a full text search engine can do that a database cannot do. After all, most databases offer some semblance of text-based search, even if it often seems like an afterthought. At the same time, most search engines offer things like storage and some set manipulation logic. How’s a user to decide what to do? In this article, Marc Krellenstein, explores the benefits of a full text search engine in comparison to a database.
By Marc Krellenstein
Full text search engines and relational databases each have unique strengths as development tools but also have overlapping capabilities. Both can provide for storage and update of data and both support search of the data. Full text systems are better for quickly searching high volumes of unstructured text for the presence of any word or combination of words. They provide rich text search capabilities and sophisticated relevancy ranking tools for ordering results based on how well they match a potentially fuzzy search request. Relational databases, on the other hand, excel at storing and manipulating structured data — records of fields of specific types (text, integer, currency, etc.). They can do so with little or no redundancy. They support flexible search of multiple record types for specific values of fields, as well strong tools for quickly and securely updating individual records. Some of the fields in a table’s records may in fact be free form text, such as a product description, and most relational databases today provide support for doing full text searching on the unstructured data. However, the relevancy ranking of results for unstructured text search for most relational databases is not on par with that of the best full text search systems.
Some applications will naturally be served best by one or the other technology. Others may start in one of the widely installed relational databases but be better served by a full text search engine, a deficiency that may only become noticeable or problematic once requirements change or the volume of data grows. Many applications will rely on both technologies in tandem, either for different parts of the data or by duplicating some of the data in the two systems to get the advantages of both.
The rest of this article explains some of the strengths and weaknesses of full text search engines in more detail and provides some guidelines for choosing the right development model.
Full text search engines excel at quickly and effectively searching large volumes of unstructured text — documents or other ‘records’ containing free form text — and returning these documents based on how well they match the user’s query. They may also have the ability to quickly facet, or categorize, data or search results based on specific values of specific fields. The text search capabilities of the best systems are rich and flexible, and include support for basic keyword searching, Internet-style +/- syntax, use of Boolean operators, limited real or pseudo-natural language processing, proximity operations, find-similar, etc. Relevancy ranking capabilities that determine the best match for a query include using the frequency of query terms in the document, their frequency in the database as a whole (the presence of rarer query terms in the document are usually more significant in indicating a good match), proximity of query terms near each other in the document, special weightings for particular terms, fields or documents and more.
These documents are typically of one type or structure. This structure often consists of a primary free form text field (e.g., the main body of a document, or the main description of a product), additional secondary free form text fields (e.g., a title or abstract) and some non-text or more constrained text fields (e.g., date of publication, size, price, product code, etc.). It’s common to refer to the main text field as the primary data or text and to the other fields (title, price, etc.) pertaining to it as metadata. However, these documents or records — the terms can be used interchangeably to refer to a single indexed ‘item’ in a full text search system — can also be viewed as simply a concatenation of fields. Any number of these fields may be free form text, and any of them may be non-text or more constrained textual data (e.g., one of some number of product codes).
Full text search systems generally adopt this view of the data they are indexing and searching: each document/record is simply a collection of fields. A given search is always run against a single field or some combination of fields, though the end-user may not be aware of this, especially if the default is to search all fields together.
Full text search systems generally depend on some type of index in order to perform queries. Most common is an inverted index, which effectively lists every term — every word, number, etc. — in every document together with an indication of which documents contain that term (and where, if searching on phrases or other proximity operations are supported, as they usually are). There may be a separate index for each field, or all fields may be contained in a single index. In a given document there may be no value for one or more fields. However, the structure is the same for all documents in that the set of possible fields is the same for all documents for a given full text index. The full text search system usually contains some capabilities for handling non-text fields, such as range searching, the ability to sort results by any field, etc. But these capabilities are not as strong as they are in a relational database.
Search results will come from that one set of indexed documents with one set of fields. That set may be an aggregation of documents from many sources and of many types. The set of fields defined for the index will need to include all the fields to be searched on from any of the document sources..This can mean that certain information needs to be repeated for certain fields, e.g., if the city field is Boston and there’s a need to also store or search state information, the state will need to be Massachusetts for every record for which the city is Boston. In a ‘normalized’ relational database, the fact that Boston is in Massachusetts would only be stored once. Some full text systems may provide some capability for effectively ‘joining’ data of different types, minimizing such redundancy. This might be done by additional special index structures, by pre-defined ‘filter’ queries that retain information about common query constraints (effectively joining queries with stored queries) or by multi-pass search techniques. These multi-record type capabilities are not as rich or as straightforward as they are in a relational database.
In addition to indexing the data, most modern full text search systems, including Lucene/Solr, let you actually store and retrieve the data in its original form. One reason they do so is to be able to easily populate a search result list with actual data — e.g. a document’s title or summary — to make it easier to see which documents are most relevant and worth opening for full review. Selected documents/records are often then opened from their original location, but one can also store the entire document/record in the search system and view it from within the system.
Modern full text search systems also support incremental indexing, including the ability to add, delete or update records. Still, full text systems are somewhat limited in their ability to rapidly and securely process transactional updates. This is partly because the speed and scale advantage of full text systems for text search are due in good measure to sophisticated index compression to represent that a given word might occur in millions of specific documents. This compression limits the ability for selective index update. Some full text search engines nevertheless support near real-time updating, often in a memory-based index partition that is rolled into a fuller disk-based index at some point in the background.
These full-text search capabilities of the best systems can be summarized as follows:
- Sub-second search results indicating which documents out of possibly millions or billions contain one or more terms (a word, number, etc.) in the user’s search. This includes good search of all text fields, and somewhat more limited capabilities for searching non-text data. This may also include efficient faceting or categorizing of content or search results based on specific values of specific fields.
- Rich and flexible text query tools and sophisticated ranking capabilities to find the best documents/records.
- Basic capabilities for adding, deleting or updating documents/records.
- Basic capabilities for storing the data (and not simply indexing and searching it). Not all full text search systems support this capability but most do, including Lucene/Solr.
- Limited capabilities for searching and manipulating data that actually represents different record types.
The application requirements that might suggest choosing a full text search system over a relational database are those corresponding to the above strengths and limitations of a full text search system:
- High volume of free form text data (or records containing such data) to be searched or faceted/categorized — hundreds of thousand or millions of documents/records (or more).
- High volume of interactive text-based queries to be supported.
- Demand for very flexible full text search querying.
- Demand for highly relevant search results not met by an available relational database.
- Relatively fewer demands for different record types, non-text data manipulation or secure transaction processing.
This list provides some reasons to choose a full text search system in the first place, or to migrate part or all of a relational database application. Such a database application may have been adequate when the application started out but now has one or another performance or effectiveness problem because of growth in data, number of users or type of search requests. Many people moving to full text search systems in fact find themselves in just that situation regarding the search of their unstructured content. In some cases this is because the original choice of a relational database was made less because of the real ‘relational’ (multi-table) or ‘database management’ (transaction processing) requirements of the application but just because a persistent store was needed for the data, and a database seemed a natural choice and was available. The availability of database skills and lack of full text search skills may also be a reason. But although the database may have been adequate for a while for the full text search needs it was supporting, a change in the environment or simply a demand for better text searching now motivates the users to look for a better solution.
Even when there is more than one record type in the database, a full text search system may still be a more effective tool once the data is flattened, i.e., records from different tables are combined into a single longer record format suitable for a full text search engine. This will generally work as long as there isn’t a very large number of tables making up the data to be searched, and provided that multi-table record manipulation (and sophisticated transaction processing) are not critical components of the application. Many relational database applications fall into this category and have only one or a handful of tables and limited needs for rich transaction processing or recovery. (In fact, if true DBMS needs are minor enough then a search engine may be a faster yet functionally adequate technology even if full text search needs are not significant.)
In some cases data will need to remain in the relational database because of application requirements, but the full text searching provided by the database is not adequate in some way. In those situations the data may be exported into the text system for full text indexing and search and the two systems used together. Such export might be relatively static (e.g., a nightly process) or more dynamic — with data extracted on the fly and/or indexed more rapidly, perhaps in real-time. Most good full text search systems provide a mechanism for easily mapping tabular data in relational databases into the indexed fields of the full text system.
Full text search systems excel at high speed search and faceting of large volumes of data. They are not as strong as relational databases at handling multi-record types or transaction processing, but they may be adequate for these needs in many cases. Some DBMS applications are really there because of convenience, and not because they demand the capabilities of a relational DBMS. Such applications can be successfully migrated to full text search systems if the DBMS no longer meets the application’s needs. Other applications can be partially migrated to full text search systems to support the full text search needs, or data can be indexed in both to provide the advantages of each.