by Amit Nithianandan
The Relational Databases (RDBMS) is the cornerstone of data persistence in software development. While modern data workloads have the RDBMS under fire recently due to some of its scalability and speed constraints, its longevity, portability, abundance of well-written GUI management tools, and ease of querying still makes it the popular application data storage mechanism of choice. And tabular data representation–rows of records and columns of fields – are an intuitive way to organize many transactional data types.
As a result, it’s natural to think about the relational model when trying to organize the data for your search application. At the same time, there are some real advantages to using an inverse-index based system such as Solr/Lucene to design the search service for your application, so users can quickly wade through mountains of data. So what does the RDBMS do best, and what should you rely on Solr for?
Two models for Dealing with Data
The key thing to keep in mind when designing your search service is that Solr is not meant to entirely replace your RDBMS but rather complement it. One of the things that Solr does best is to answer a question such as: “What is a list of the most relevant documents or fields, that possibly match query ‘XYZ’? ” In this case, your “XYZs” might match a bunch of documents or records with fields that Solr may have tokenized, stored, queried and/or ranked to produce a list of result documents.
Solr also provides the ability to quickly filter results by facet fields enriching the search experience so that your users can narrow the list to find the right item or set of items based on faceted fields. The underlying inverse index structure lends itself well to this approach, especially where you cannot know a-priori the precise structure of queries or of the data.
By contrast, the RDBMS in its classic implementation is meant to answer questions such as: exact match queries, e.g., “give me all records in my users table with a creation date after Oct 1, 2009”; or, reporting-related queries, like “what is the average file size of images uploaded to my photo site grouped by user and date”. Such precise queries, designed to fit the layout of the data in the classic tabular RDBMS structure, work quite well in this fashion. And of course, there is one other thing the RDBMS does quite well: efficiently executing a series of inserts and updates for a transaction, rolling back if one of those operations failed (also known as ACID properties: (Atomicity, Consistency, Isolation, Durability).
The best way to think about the Lucene index is that it’s a quickly searchable view of your data. A well-designed application can use the best of both these approaches, utilizing Solr to help users find the most relevant documents and then use your RDBMS to query for more precise additional information to better present the results to the end user. To do this, you can design your Solr query to return additional fields containing the necessary keys/ids that in turn can be used to rapidly and precisely query your database for this extra information. Let’s explore how this might work.
The Classic RDBMS Approach
When designing your data driven application, the first question you must answer is: How do I want my users to find data? The simplicity and relative lack of restriction on data structure can make Solr a tempting choice but there are definitely cases in which the application would be better off simply using an RDBMS . For example, if you are building an e-commerce site that lets users find a compatible windshield wiper blade for their car with a simple a drill-down wizard, then an RDBMS-based solution could easily solve the problem, as the desired results are presented to the user by executing a series of fixed, structured queries on the database. First, the top level query could be to first pick the model year; then “show me all car brands” followed by “Show models for <car brand selected earlier>” and finally “show all wipers where the year is A, car brand is B and car model is C.” The user finally can pick whichever style wiper blade is available for his/her car and purchase it online or check if a nearby store has that particular part number representing the desired blade in stock. In fact, in this case the queries could proceed in somewhat arbitrary order; the user could select the brand first, then the year, and then the model – and only then proceed to selecting wipers. It can work without full text searching (i.e. “cheap wipers Hyundai sonata”) or faceting (by attributes such as wiper brand, year or price) to facilitate ad-hoc search result navigation help the user find the just the right wiper blade. At the same time, there are some sequential constraints; you can’t ask if a wiper blade is compatible and in stock before working out the preceding conditions (brand, model, year).
Comparing the Solr Approach with the RDBMS
Now, here’s where we need to draw an important contrast between the pure RDBMS approach and the Solr search approach. Relational modeling teaches us to normalize our data into cohesive entities to avoid redundancy and use multiple tables and foreign keys to establish relationships between these entities. As a result, many developers tend to think the same when dealing with Solr, perhaps by indexing one table in one index and another table in another index and then ask “how can I link the two together in Solr?” or, “Does it make sense to take the results from one index and query the other?” However, it’s important to remember that Solr is unbound by such restrictions and that it is a document store where a each document is a collection of fields. Certain cardinal sins in relational modeling are allowed, and in fact encouraged. Solr readily accommodates having a single field containing multiple values; by contrast, in the RDBMS, this would be modeled using at two or more tables and foreign keys.
As mentioned earlier, Solr supports the concept of multi-valued fields, which are very useful when multiple distinct values need to be stored in a single field (as opposed to a comma delimited string of values stored as a single value). For example, storing user entered tags would certainly be defined as a multi-valued field, one value for each tag, hence allowing each of these tag to be matched when performing a text search. In a document store, a single document represents a single concept with all necessary data stored to represent that concept (compared to that same concept being spread across multiple tables in an RDBMS requiring several joins to re-create). The more non-uniform the data, the more Solr’s freedom from tabular/foreign key constraints can simplify development: documents need not all have the same fields, or even the same values within fields, in order for Solr to find the right documents. A database would require a-priori modeling of all the fields and their abstraction into tables. Flaws in the modeling can impact RDBMS efficiency. As the data changes, the relational model you designed at the outset may or may not work for new fields, entities, etc. which for Solr is not an issue.
Let’s consider the e-commerce example again. If I wanted to allow my users to search for wiper blades available in a store nearby, I might create an index with multiple documents or records for the same exact wiper blade, each document having different location data (lat/long, address, etc.) to represent an individual store. Solr has a de-duplication component to help show unique documents in case that particular wiper blade is available in multiple stores near me. In the RDBMS data model, you may have products, descriptions, product locations, product image locations each stored in a respective table — which is good for data maintenance since each piece of data is logically separated. However, the complexity emerges in writing an SQL query to represent a specific model of a wiper blade and its location.
However — and here’s the main catch — from a search perspective, running such SQL queries as “What RainX wiper blades can I find nearby” on an ad-hoc basis can be resource intensive. You’d end up bombarding the RDBMS with many such queries, across diverse tables, products, locations, at query time with little to no caching which may adversely affect other parts of the system such as purchasing (which may be very transaction heavy) or other back-end inventory management processing. What you might gain in efficient data management, you lose at search time. Solr, on the other hand, can easily handle multiple diverse queries without any of the resource contention you’d find by trying to do it in an RDBMS.
Now, there is one important advantage that an RDBMS still brings to the equation. When you want to add new merchandise, or change information about certain fields across all records (if you want all wipers to list a new part number, for example), it can be handy to have them all in one place, in well-specified, normalized field. Add the rollback and persistence management properties, and you can see why the RDBMS is not going to disappear any time soon.
Making best of both
Fortunately, you can get the best of both approaches by leveraging Solr’s DataImportHandler (DIH) component to help index your relational data. Alternatively, you can also use 3rd party components such as LuSQL to help make this process painless. The DIH can help make indexing your relational data easier because the SQL statements necessary to construct documents are stored in a configuration file along with mappings of result set fields to Solr document fields. The DIH supports a full import and a delta import configuration so that incremental indexing has a different configuration from full indexing since incremental most likely has a time component and a join on some change log.
Essentially, each row in the result set generated by the SQL statement executed by the DIH now maps to a single Solr document. Once that set of documents – or records, formerly known as rows of data – is indexed, Solr is nominally ready to run queries against those records. Of course, to optimize the query results for relevancy, Solr’s configuration files need to be set with best to weight different aspects of a query, such as boosting particular documents, text weights, field weights etc. (the application programming model for tuning relevancy is one of Solr’s most powerful features, but it’s beyond the scope of this introductory article). During query time, the resulting document returned to the calling component can contain the document identifiers necessary to be able to do a secondary very tightly focused query for extra data from the RDBMS such as quantity-in-stock; images; special offer links, etc.
From a system design perspective, two main concerns are clearly partitioned. The inventory management is done by the RDBMS and the inventory searching is done by Solr. Since the RDBMS has ACID properties, functions such as customer purchase or new stock entry are handled through the database while flexible, full-text faceted searches are handled by Solr. The communication link between Solr and the RDBMS from an indexing perspective can be handled by Solr’s DIH delta-import feature and a UNIX cron job to periodically invoke a Solr + DIH URL to index any changes or by pushing changes directly into Solr by POSTing updates.
To summarize, Solr is not meant to be a replacement for your RDBMS. Rather, Solr should be used to develop the search service aspect of your application by storing only enough information to efficiently query your data source and providing enough information to the calling component to query your RDBMS for additional information. The data stored in the underlying Lucene index is essentially a fully searchable view of your data that resides as a decoupled component in your system. If you find yourself about to ask the question “does Solr allow you to perform a query similar to
SELECT ___ in SQL?” stop and think about your data and the design of your application.
Solr is a search engine meant to efficiently return relevant documents given a user query; it is at its best in tackling diverse data, simplifying logic behind making query results relevant. It is not a full featured reporting platform nor a (transactional) data store with referential integrity constraints. For a simple, yet great, comparison of features of Solr vs. RDBMS, check out http://wiki.apache.org/solr/WhyUseSolr and for a complete rundown of Solr’s DIH component, check out http://wiki.apache.org/solr/DataImportHandler.
Amit Nithianandan is a Senior Search Engineer at Zvents.