Query Autofiltering is autotagging of the incoming query where the knowledge source is the search index itself.  What does this mean and why should we care?

Content tagging processes are traditionally done at index time, either manually or automatically by machine learning or knowledge-based (taxonomy/ontology) approaches. To ‘tag’ a piece of content means to attach a piece of metadata that defines some attribute of that content (such as product type, color, price, date and so on). We use this now for faceted search – if I search for ‘shirts’, the search engine will bring back all records that have the token ‘shirts’ or the singular form ‘shirt’ (using a technique called stemming). At the same time, it will display all of the values of the various tags that we added to the content at index time under the field name or “category” of these tags. We call these things facets. When the user clicks on a facet link, say color = red, we then generate a Solr filter query with the name / value pair of <field name> = <facet value> and add that to the original query. What this does is narrow the search result set to all records that have ‘shirt’ or ‘shirts’ and the ‘color’ facet value of ‘red’.
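To make the facet click concrete, here is a minimal sketch of the request parameters it produces. The `q`, `fq`, `facet` and `facet.field` parameters are standard Solr; the helper name `facet_click_params` and the field name `color` are just assumptions for illustration.

```python
from urllib.parse import urlencode


def facet_click_params(user_query, field, value):
    """Build Solr request parameters for a query narrowed by one facet value.

    Hypothetical helper: the original query stays in `q`, and the clicked
    facet becomes a filter query (`fq`) that narrows the result set.
    """
    return [
        ("q", user_query),
        ("fq", f'{field}:"{value}"'),
        ("facet", "true"),
        ("facet.field", field),
    ]


# The user searched for 'shirts' and then clicked color = red.
params = facet_click_params("shirts", "color", "red")
print(urlencode(params))
```

The point to notice is that the original query is left alone. The facet value travels separately as a filter, which is exactly the shape we want to reproduce automatically later on.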

Another benefit of faceting is that the user can see all of the colors that shirts come in, so they can also find blue shirts in the same way. But what if they are impatient and type ‘blue shirts’ into the search box? The way things work now, the search engine will return records that contain the word ‘shirt’ or ‘shirts’ OR the word ‘blue’. This will be partially successful in that blue shirts will be part of the result set, but so will red, green, orange and yellow shirts because they all have the term ‘shirt’ in common. (This will happen if the product description is like – “This is a really nice shirt. It comes in red, blue, green, orange and yellow.”) Worse, we will also get other blue things like sweaters, pants, socks, hats, etc. because they all have the word ‘blue’ in their description.

Ah, you say, but you can then use faceting to get what you really want: click on the shirt product type facet and the color facet for blue. But why should we make the user do this? It’s annoying to first see a bunch of stuff that they didn’t ask for and then have to click things to get what they want. They wanted to make things easier for us by specifying what color they want up front and we responded by making things worse for them. But we don’t have to – like Dorothy in the Wizard of Oz, we already have the information that we need to “do the right thing”; we just don’t use it. Hence query autofiltering.

There is another twist here that we should consider. Traditionally, tagging operations, whether manual or automated, are applied at index time for “content enrichment”. What we are effectively doing is adding knowledge to the content – a tag tells the search engine that this content has a particular attribute or property. But what if we do this to the incoming query? We can, because like content, queries are just text and there is nothing stopping us from applying the same techniques to them (here it must be autotagging though, if we want the results in under a second). However, we generally don’t do this because we don’t want to change the query in such a way as to misdirect the search engine so that it returns the ‘wrong’ results – i.e., we don’t want to screw it up so we leave it alone. We know that if we use OR by default, then the correct results should be in there “somewhere” and the user can then use facets to “drill-in” to find what they are looking for.

We also use OR by default because we are afraid to use AND – which can lead to the dreaded ZERO results. This seems like failure to us: we haven’t returned ANYTHING, ZIP, NADA – what’s wrong with us? But is this correct? What if the thing that the user actually wants to find doesn’t exist in the search collection? Should we return ZERO results or not? Right now we don’t do that. We return a bunch of results, and when the user tries to drill in, they keep hitting a dead end. Suppose that we don’t carry purple socks. If a user searches for them, we will return socks that are not purple and purple things that are not socks. If having to drill in to find what they want after telling us that in the query is frustrating, having to drill in and coming up empty-handed again and again is VERY frustrating. Why can’t we just admit up front that we don’t sell that thing? We are not going to get their business on that item anyway because we don’t sell it, but why piss them off in the process of not finding it?

We don’t because we are not confident in the accuracy of our search results, i.e. we don’t know either, so we don’t want to lose business by telling them that we don’t carry something when in fact we do. So we use faceting as a net to catch the stray fish. But sometimes we are confident enough that we know what the user wants that we can override the search engine and give them the right answer. One way is best bets, also known as landing pages or spotlighting. This requires that a human determine what to bring back for a given set of query phrases – a very labor-intensive way of doing it, but it’s the only way we know. We trust this because a person has looked at the queries and figured out what is being searched for.

Another way cool example of this is getting the local weather forecast on Google or Bing. I call this technique “inferential search” – the process of inferring what the user is looking for and then returning just that (or putting that at the top). This works by introspecting the query and then redirecting or augmenting the search output based on that up-front analysis. Getting back to autotagging, this is what I was talking about earlier – why don’t we do this at query time? We have already built up this knowledge base that we call our search index by using tagging (manual and automated) processes at index time – why don’t we just use that knowledge at query time? We already know how to use it with faceting to get the user to the right answer but why make the user wait or do more work when they don’t have to? Now do you see what query autofiltering is good for? We can use it to short-circuit this frustrating process so that when the user searches for “blue shirts” all that we show them are blue shirts! We can do this because we told the search engine that ‘blue’ is a ‘color’ by adding this as one of the values of the color facet – i.e. the search index already “knows” this linguistic truth because we told it that when we tagged stuff. So the confidence that we built into our faceting engine can be used at query time to do the same thing – we see ‘blue’ in the query, we pull it out and make it a filter query as if the user had first searched for ‘shirts’ and then clicked on the blue color facet.
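Here is a minimal sketch of that idea, not the actual implementation. It assumes we already have the set of values indexed in the color field (the `colors` set is made-up sample data) and the function name `autofilter` is hypothetical.

```python
def autofilter(query, field, known_values):
    """Split a query into free-text terms and Solr-style filter queries.

    `known_values` stands in for the values that were indexed in `field`.
    Any query token that matches one of them is pulled out of the query
    and turned into a filter, as if the user had clicked that facet.
    """
    remaining, filters = [], []
    for token in query.split():
        if token.lower() in known_values:
            filters.append(f'{field}:"{token.lower()}"')
        else:
            remaining.append(token)
    # If every token became a filter, fall back to Solr's match-all query.
    return (" ".join(remaining) or "*:*"), filters


# Sample data (an assumption): values we tagged in the color field.
colors = {"red", "blue", "green", "orange", "yellow"}

q, fq = autofilter("blue shirts", "color", colors)
print(q, fq)
```

With this, “blue shirts” becomes a search for “shirts” filtered on the color facet value “blue”, which is what the user would have ended up with after searching and then clicking.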

It turns out that Lucene-Solr makes this really easy to do. The Lucene index contains a data structure called the FieldCache that contains all of the values that were indexed for a particular field. This will be renamed to “UninvertedIndex” or something in Lucene 5.0 – a kind of search-wonkish way of saying that it’s a forward index – not the normal “inverted index” that search engines use. An inverted index allows us to ask – give me all of the documents that contain this term value in this field. An uninverted index allows us to ask – what are all the term values that were indexed in this field? So once we have tagged things, we can then at query time determine if any of these tag values are in the query, and thanks to Lucene, we can do this very quickly (with other search engines, we may not be able to do this at all because faceting is calculated at index time and there may be no equivalent of the FieldCache – but you can still use white lists). But getting back to ZERO results, we can now confidently say that “No, we don’t carry purple socks” or orange rugs because we now know that there is no sock or rug that we ever tagged that way. We should also make the “No Results” message friendlier rather than suggesting that the user did something wrong, as we do now. (Shamefully, we never admitted that we might have been the real culprit – i.e. we take the Fifth.)
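The two questions can be illustrated with a toy example. This is only a sketch using plain Python dictionaries; it is not how Lucene stores either structure, and the documents and field names are made up.

```python
from collections import defaultdict

# Made-up catalog: document id -> tagged fields.
docs = {
    1: {"product_type": "shirt", "color": "blue"},
    2: {"product_type": "shirt", "color": "red"},
    3: {"product_type": "socks", "color": "blue"},
}

inverted = defaultdict(set)      # (field, value) -> ids of documents that have it
field_values = defaultdict(set)  # field -> every value indexed in that field

for doc_id, fields in docs.items():
    for field, value in fields.items():
        inverted[(field, value)].add(doc_id)
        field_values[field].add(value)

# Inverted question: which documents have color = blue?
blue_docs = inverted[("color", "blue")]

# Uninverted question: which values were ever indexed in the color field?
all_colors = field_values["color"]

# Honest ZERO results: nothing was ever tagged as both red and socks.
red_socks = inverted[("color", "red")] & inverted[("product_type", "socks")]

print(blue_docs, all_colors, red_socks)
```

The second lookup is the one autofiltering needs, because it tells us which query tokens are known tag values. The last line shows why an empty result can be trusted: it comes from the tags themselves rather than from a guess.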

Query autofiltering is therefore a very powerful technique because it leverages all of the knowledge that we put into our search index and allows us to have much more confidence that we are “doing the right thing” when we mess around with the original query. A simple implementation of this idea is available for download on github. One of the advantages of this approach is that it uses the same index that the query is targeted at via the FieldCache – and it’s very fast. A disadvantage is that it requires ‘exact-match’ semantics so it can’t deal with things like synonyms – i.e. what if the user asked to see “red couches” rather than “red sofas” and we wanted to autofilter on product type – or “navy shirts”? (The solution here would be to have a multi-value facet field that would be used for autofiltering.) Because of the exact-match semantics, we also have to ensure that things like case or stemming don’t affect us. Because of this, we may have to do a bit of preprocessing of the query so that it matches what we indexed, while being careful not to distort the pieces of the original query that we will pass through as-is to Solr.
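A rough sketch of that preprocessing might look like the following. Everything here is an assumption made for illustration: the crude plural stripping is a stand-in for whatever analysis the field really uses, and `product_type_auto` is a hypothetical multi-value field whose indexed values include synonyms such as “couch” alongside “sofa”.

```python
def normalize(token):
    """Lowercase and strip a simple plural ending.

    A crude stand-in for real stemming, good enough for the examples here.
    """
    t = token.lower()
    if t.endswith(("ches", "shes", "xes", "sses")):
        return t[:-2]
    if t.endswith("s") and not t.endswith("ss"):
        return t[:-1]
    return t


def autofilter_normalized(query, field, indexed_values):
    """Match on the normalized token, but pass unmatched tokens through as typed."""
    remaining, filters = [], []
    for token in query.split():
        key = normalize(token)
        if key in indexed_values:
            filters.append(f'{field}:"{key}"')
        else:
            remaining.append(token)  # untouched, so Solr sees the original text
    return " ".join(remaining), filters


# Sample values (an assumption) of the hypothetical multi-value field.
product_types = {"sofa", "couch", "chair"}

result = autofilter_normalized("Red Couches", "product_type_auto", product_types)
print(result)
```

Only the copy of the token used for the lookup is normalized. The tokens that are not recognized go to Solr exactly as the user typed them, so its own analysis chain still gets to do its job.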

Another disadvantage is that the query autofiltering Search Component works with a single field at a time, so to do multiple fields, we need multiple component stages. A good use case is in classic cars where you want to detect year, make and model as in “1967 Ford Mustang convertible”.
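Chaining one stage per field could be sketched like this. It only illustrates the idea and is not the Search Component itself; the `stage` helper and the sample values for each field are assumptions.

```python
def stage(field, values):
    """Return a single-field autofilter stage (hypothetical helper)."""
    def run(query, filters):
        remaining = []
        for token in query.split():
            if token.lower() in values:
                filters = filters + [f'{field}:"{token.lower()}"']
            else:
                remaining.append(token)
        return " ".join(remaining), filters
    return run


# One stage per field, run in order; each sees what the previous one left over.
pipeline = [
    stage("year", {"1966", "1967"}),
    stage("make", {"ford", "chevrolet"}),
    stage("model", {"mustang", "camaro"}),
]

query, filters = "1967 Ford Mustang convertible", []
for run in pipeline:
    query, filters = run(query, filters)

print(query, filters)
```

Each stage removes the tokens it recognizes, so by the end only “convertible” is left as free text and the year, make and model have each become a filter.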

Another approach that was suggested by Erik Hatcher, is to have a separate collection that is specialized as a knowledge store and query it to get the categories with which to autofilter on the content collection. This is a less “brute-force” (knee-jerk?) method which makes use of a collection that contains “facet knowledge”.  We can pre-query this collection to learn about synonyms and categories, etc. using all of the cool tricks that we have built into search – fuzzy search to handle misspellings, proximity, multi-term fixes such as autophrasing and more (pivot facets and stats – Oh My!) – the possibilities are literally endless here. The results of this pre-query phase can then drive autofiltering or a less aggressive strategy such as spotlighting or query suggestion (like you do with spell correction). Unlike the FieldCache approach, it can be externalized into a Query Pipeline stage such as featured by our Lucidworks Fusion product. The key is that in both cases, we are using the search index itself as a knowledge source that we can use for intelligent query introspection and thus powerful inferential search!!
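To give a feel for the pre-query idea, here is a toy sketch in which a small dictionary plays the role of the knowledge collection and Python’s `difflib` plays the role of fuzzy search. The entries, the cutoff and the function name `prequery` are all assumptions; a real version would query a separate Solr collection instead.

```python
from difflib import get_close_matches

# Stand-in for the "facet knowledge" collection: term -> (field, canonical value).
knowledge = {
    "sofa": ("product_type", "sofa"),
    "couch": ("product_type", "sofa"),
    "blue": ("color", "blue"),
    "navy": ("color", "blue"),
    "red": ("color", "red"),
}


def prequery(query, cutoff=0.8):
    """Look each token up in the knowledge store, tolerating small misspellings."""
    remaining, categories = [], []
    for token in query.split():
        match = get_close_matches(token.lower(), list(knowledge), n=1, cutoff=cutoff)
        if match:
            categories.append(knowledge[match[0]])
        else:
            remaining.append(token)
    return " ".join(remaining), categories


rest, found = prequery("navy couchs")
print(rest, found)
```

Because the lookup returns a field and a canonical value rather than the raw token, the caller is free to decide what to do with it: autofilter, spotlight, or just suggest a better query.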

Thanks again to Erik Hatcher for suggesting and collaborating with me on this idea. The “red sofa” problem was discussed in my post “The Well Tempered Search Application – Fugue”.  I had originally thought to use a white list to solve this problem. Erik suggested that we use the field values that are in the index and this suggestion “closed the loop” for me – i.e. major lightbulb moment. Erik is a genius. He is well known for his books on Ant and Lucene and is also an extremely nice guy. One of my greatest joys in coming to Lucidworks was being able to work with people like Erik, Grant Ingersoll, Tim Potter and Chris Hostetter – better known as Hossman or simply Hoss.  These guys are luminaries in the Lucene-Solr world but the list of my extraordinary colleagues at Lucidworks goes on and on.