In a previous blog post, I introduced the concept of “query autofiltering”, which is the process of using the meta information (information about information) that has been indexed by a search engine to infer what the user is attempting to find.  A lot of the information used to do faceted search can also be used in this way, but by employing this knowledge up front or at “query time”, we can answer questions right away and much more precisely than we could without techniques like this.  A word about “precision” here – precision means having fewer “false positives” – unintended responses that creep in to a result set because they share some words with the best answers. Search applications with well tuned relevancy will bring the best results to the top of the result list, but it is common for other responses, which we call “noise hits”, to come back as well. In the previous post, I explained why the search engine will often “do the wrong thing” when multiple terms are used and why this is frustrating to users – they add more information to their query to make it less ambiguous and the responses often do not reward that extra effort – in many cases, the response has more noise hits simply because the query has more words. The solution that I discussed involves adding some semantic awareness to the search process, because how words are used together in phrases is meaningful and we need ways to detect user intent from these patterns.  The traditional way to do this is to use Natural Language Processing or NLP to parse the user query.  This can work well if the queries are spoken or written as if the person were asking another person, as in “Where can I find restaurants in Cleveland that serve Sushi?” Of course, this scenario –which goes back to the early AI days – has become much more important now that we can talk to our cell phones. For search applications like Google with a “box and a button” paradigm, user queries are usually one word or short phrases like “Sushi Restaurants in Cleveland”. These are often what linguists call “noun phrases” consisting of a word that means a person, place or thing (what of who they want to find or where) – e.g. “restaurant” and “Cleveland” and some words that add precision to their query by constraining the properties of the thing they want to find – in this case “sushi”.  In other words, it is clear from this query that the user is not interested in just any restaurant – they want to find those that serve raw fish on a ball of rice or vegetable and seafood thingies wrapped in seaweed.  The search engine often does the wrong thing because it doesn’t know how to combine these terms – and typically will use the wrong logical or boolean operator – OR when the users intent should be interpreted as AND. It turns out that in many cases now, our search indexes know the difference between Mexican Restaurants (which typically don’t serve Sushi) and Japanese Restaurants (which usually do) because of the metadata that we put into them to do faceted search (Funny story: after posting this, I ran across a Mexican Restaurant in Toms River, New Jersey that does serve sushi – but still, most of them don’t!). The goal of query autofiltering is to use that built in knowledge to answer the question right away and not wait for the user to “drill in” using the facets. If users don’t give us a precise query (like simply “restaurants”), we can still use faceting, but if they do, it would be cool if we could cut to the chase. As you’ll see, it turns out that we can do this. The previous post contained a solution which I called a “Simple” Category Extraction component. It works by seeing if single tokens in the query matched field values in the search index (using a cool Lucene feature that enable us to mine the “uninverted” index for all of the values that were indexed in a field). For example, if it sees the token “red” and discovers that “red” is one of the values of a “color” field, it would infer that the user was looking for things that are “red” in “color” and will constrain the query this way.  The solution works well in a limited set of cases, but there are a number of problems with it that make it less useful in a production setting.  It does a nice job in cases where the term “red” is used to qualify or more precisely specify a thing – such as “red sofa”.  It does not do so well in cases where the term “red” is not used as a qualifier – such as when it is part of a brand or product name such as “Red Baron Pizza” or “Johnny Walker Red Label” (great Scotch, but “Black Label” is even better, maybe I’ll be rich enough to afford “Blue Label” some day – but I digress …). It is interesting to note that the simple extractor’s main shortcomings are due to the fact that it looks at single tokens at a time in isolation from the tokens around it.  This turns out to be the same problem that the core search engine algorithms have – i.e., it’s a “bag of words” approach that doesn’t consider – wait for it – semantic context.  The solution is to look for patterns of words that match patterns of content attributes. This does a much better job of disambiguation. We can use the same coding trick as before (upgraded for API changes introduced in Solr 5.0), but we need to account for context and usage – as much as we can without having to introduce full-blown NLP which needs lots of text to crunch. In contrast, this approach can work when we just have structured metadata.

Searching vs Navigating

A little historical background here.  With modern search applications, there are basically two types of user activities that are intermingled: searching and navigating. The former involves typing into a box and the latter, clicking on facet links.  In the old days, there was a third user interface called an “advanced” search form where users could pick from a set of metadata fields, put in a value and select their logical combination operators– an interface that would be ideally suited for precise searching given rich metadata.  The problem is that nobody wants to use it. Not that people ever liked this interface anyway (except those with Master of Library Science degrees), but Google has also done much to demote this interface to a historical reference.  Google still has the same problem of noise hits but they have built a reputation for getting the best results to the top (and usually, they do) – and they also eschew facets (they kinda have them at the bottom of the search page now as related searches). Users can also “markup” their query with quotation marks or boolean expressions or ‘+/-’ signs but trust me – they won’t do that either (typically that is). What this means is that the little search box – love it or hate it – is our main entry point – i.e. we have to deal with it, because that is what users want – to just type stuff and then get the “right” answer back. (If poor ease-of-use or the simple joy of Google didn’t kill the advanced search form completely, the migration to mobile devices absolutely will).

A Little Solr/Lucene Technology – String fields, Text fields and “free-text” searching:

In Solr, when talking about textual data, these two user activities are normally handled by two different types of index field: string and text. String fields are not analyzed (tokenized) and searching them requires an exact match on a value indexed within a field. This value can be a word or a phrase. In other words, you need to use  <field>:<value> syntax in the query (and quoted “value here” syntax if the query is multi-term) – something that power users will be OK with but not something that we can expect of the average user.  However, string fields are very good for faceted navigation. Text fields on the other hand are analyzed (tokenized and filtered) and can be searched with “freetext” queries – our little box in other words.  The problem here is that tokenization turns a stream of text into a stream of tokens (words) and while we do preserve positional information so we can search on phrases, we don’t know a priori where those phrases are. Text fields can also be faceted (in fact, any field can be a facet field in Solr), but in this case, the facets are based on individual tokens which don’t tend to be too useful.  So we have two basic field types for text data, one good for searching and one for navigating. In the harder-to-search type, we know exactly where the phrases are but we typically don’t in the easier-to-search type. A classic trade-off scenario. Since string fields are harder to search (at least within the Google paradigm that users love), we make them searchable by copying their data (using the Solr “copyField” directive) into a catchall text field called “text” by default. This works, but in the process we throw away information about which values are meant to be phrases and which are not. Not only that, we’ve lost the context of what these values represent (the string fields that they came from). So although we’ve made these string fields more searchable, we’ve had to do that by putting them into a “bag of words” blender.  But the information is still somewhere in the search index, we just need a way to get it back at at “query time”.  Then, we can both have our cake AND eat it!

Noun Phrases and the Hierarchy of meta information

When we talk about things, there are certain attributes that describe what the thing is (type attributes) and others that describe the properties or characteristics of the thing.  In a structured database or search index, both of these kinds of attributes are stored the same way – as field/value pairs. There are however, natural or semantic relationships between these fields that the database or search engine can’t understand, but we do.  That is, noun phrases that describe more specific sets of things are buried in the relationships between our metadata fields. All we have to do is dig them out. For example, if I have a database of local businesses, I can have a “what” field like business type that has values like “restaurant”, “hardware store”, “drug store”, “filling station” and so forth.  Within some of these business types like restaurant, there may be refining information like restaurant type (“Mexican”, “Chinese”, “Italian”, etc) or brand/franchise (“Exxon”, “Sunoco”, “Hess”, “Rite-Aid”, “CVS”, “Walgreens”, etc.) for gas stations and drug stores. These fields form a natural hierarchy of metadata in which some attributes refine or narrow the set of things that are labeled by broader field types.

Rebuilding Context: Identifying field name patterns to find relevant phrase patterns

So now its time to put Humpty Dumpty back together again.  With Solr/Lucene – it is likely that the information that we need to give precise answers to precise questions is available in the search index. If we can identify sub-phrases within a query that refer or map to a metadata field in the index, we can then add the appropriate  metadata mapping on behalf of the user.  We are then able to answer questions like “Where is the nearest Tru Value hardware store?” because we can identify the phrase “Tru Value” as a business name and “hardware store” as a specific type of store.  Assuming that this information is in the index in the form of metadata fields, parsing the query is a matter of detecting these metadata values and associating them with their source fields. Some additional NLP magic can be used to infer other aspects of the question such as “where is the nearest”, which should trigger the addition of a spatial proximity query filter for example.

The Query AutoFiltering Search Component

To implement the idea set out above, I developed a Solr Search Component called QueryAutoFilteringComponent. Search components are executed as part of the search request handling process. Besides executing a search, they can also do other things like spell checking or query suggestion, return the set of terms that are indexed in a field or the term vectors (term frequency statistics) among other things.  The SearchComponent interface defines a number of methods one of which – prepare( ) – is executed by all of the components in a search handler chain before the request is processed. By specifying that a non-standard component is in the “first-components” list – it will be executed before the query is sent to the index by the downstream QueryComponent. This gives these early components a chance to modify the query before it is executed by the Lucene engine (or distributed to other shards in SolrCloud). The QueryAutoFilteringComponent works by creating a mapping of term values to the index fields that contain them. It uses the Lucene UnivertedIndex and the Solr TermsComponent (in SolrCloud mode) to build this map.  This “inverse” map of term value -> index field is then used to discover if any sub-phrase within a query maps to a particular index field.  If so, a filter query (fq) or boost query (bq) – depending on the configuration – is created from that field:value pair and if the result is to be a filter query, the value is removed from the original query.  The result is a series of query expressions for the phrases that were identified in the original query. An example will help to make this clearer.  Assuming that we have indexed the following records:

Field:         color       product_type         brand

Record 1:      red         shoes
Record 2:      red         socks
Record 3:      brown       socks
Record 4:      green       socks                red lion
Record 5:      blue        socks                red lion
Record 6:      blue        socks                red dragon
Record 7:                  pizza                red baron
Record 8:                  whiskey              red label
Record 9:                  smoke detector       red light
Record 10:                 yeast                red star
Record 11:                 red wine             gallo
Record 12:                 red wine vinegar     heinz
Record 13:                 red grapes           dole
Record 14:                 red brick            acme
Record 15:                 red pepper           dole
Record 16:                 red pepper flakes    mccormick

This example is admittedly a bit contrived in that the term “red” is deliberately ambiguous – it can occur as a color value or as part of a brand or product_type phrase. So, with the OOTB Solr /select handler, a search for “red lion socks” brings back all 16 records.  However, with the QueryAutoFilterComponent, only 2 results are returned (4 and 5) for this query.  Furthermore, searching for “red wine” will only bring back one record (11) whereas searching for “red wine vinegar” brings back just record 12. What the filter does is to match terms with fields, trying to find the longest contiguous phrases that match mapped field values.  So for the query “red lion socks” – it will first discover that “red” is a color, but then it will discover that “red lion” is a brand and this will supercede the shorter match that starts with “red”.  Likewise, with “red wine vinegar”, it will first find “red” == color, then “red wine” == product_type then “red wine vinegar” == product_type and the final match will win because it is the longest contiguous match. It will work across fields too.  If the query is “blue red lion socks” – it will discover that “blue” is a color, then that “blue red” is nothing so it will move on to the next unmatched token – “red”.  It will then, as before, discover that “red lion” is a brand, reject “red lion socks” which doesn’t map to anything and finally find that “socks” is a product_type.  From these three field matches it will construct a filter (or boost) query with the appropriate mapping of field name to field value. The result of all of this is a translation of the Solr query:

q=blue red lion socks

to a filter query:

fq=color:blue&fq=brand:”red lion”&fq=product_type:socks

This final query brings back just 1 result as opposed to 16 for the unfiltered case. In other words, we have increased precision from 6.25% to 100%!

Adding case sensitivity and synonym support:

One of the problems with using string fields as the source of metadata for noun phrases is that they are not analyzed (as discussed above). This limits the set of user inputs that can match – without any changes, the user must type in exactly what is indexed, including case and plurality.  To address this problem, support for basic text analysis such as case insensitivity and stemming (singular/plural) as well as support for synonyms was added to the QueryAutoFilteringComponent. This adds to the code complexity somewhat but it makes it possible for the filter to detect synonymous phrases in the query like “couch” or “lounge chair” when “Sofa” or “Chaise Lounge” were indexed.  Another thing that can help at an application level is to develop a suggester for typeahead or autocomplete interfaces that uses the Solr terms component and facet maps to build a multi-field suggester that will guide users towards precise and actionable queries. I hope to have a post on this in the near future.

Source Code

For those that are interested in how the autofiltering component works or would like to use it in your search application, source code and design documentation are available on github. The component has also been submitted to Solr (SOLR-7539 if you want to track it).  The source code on github is in two versions, one that compiles and runs with Solr 4.x and the other that uses the new UninvertingReader API that must be used in Solr 5.0 and above.

Conclusions

The QueryAutoFilteringComponent does a lot more than the simple implementation introduced in the previous post.  Like the previous example, it turns a free form queries into a set of Solr filter queries (fq) – if it can.  This will eliminate results that do not match the metadata field values (or their synonyms) and is a way to achieve high precision. Another way to go is to use the “boost query” or bq rather than fq to push the precise hits to the top but allow other hits to persist in the result set. Once contextual phrases are identified, we can boost documents that contain these phrases in the identified fields (one of the chicken-and-egg problems with query-time boosting is knowing what field/value pairs to boost).  The boosting approach may make more sense for traditional search applications viewed on laptop or workstation computers whereas the filter query approach probably makes more sense for mobile applications.  The component contains a configurable parameter “boostFactor” which when set, will cause it to operate in boost mode so that records with exact matches in identified fields will be boosted over records with random or partial token hits.