Query Autofiltering Extended – On Language and Logic in Search
Spock: The logical thing for you to have done was to have left me behind. McCoy: Mr. Spock, remind me to tell you that I’m sick and tired of your logic.
This is the third in a series of blog posts on a technique that I call Query Autofiltering – using the knowledge built into the search index itself to do a better job of parsing queries and therefore giving better answers to user questions. The first installment set the stage by arguing that a better understanding of language and how it is used when users formulate queries, can help us to craft better search applications – especially how adjectives and adverbs – which can be thought of as attributes or properties of subject or action words (nouns and verbs) – should be made to refine rather than to expand search results – and why the OOTB search engine doesn’t do this correctly. Solving this at the language level is a hard problem. A more tractable solution involves leveraging the descriptive information that we may have already put into our search indexes for the purposes of navigation and display, to parse or analyze the incoming query. Doing this enables us to produce results that more closely match the user’s intent. The second post describes an implementation approach using the Lucene FieldCache that can automatically detect when terms or phrases in a query are contained in metadata fields and to then use that information to construct more precise queries. So rather than searching and then navigating, the user just searches and finds (even if they are not feeling lucky). An interesting problem developed from this work – what to do when more than one term in a query matches the same metadata field? It turns out that the answer is one of the favorite phrases of software consultants – “It Depends”. It depends on whether the field is single or multi valued. Understanding why this is so leads to a very interesting insight – logic in language is not ambiguous, it is contextual, and part of the context is knowing what type of field we are talking about. Solving this enables us to respond correctly to boolean terms (“and” and “or”) in user queries, rather than simply ignoring them (by treating them as stop words) as we typically do now.
Logic in Mathematics vs Logic in Language
Logic is of course fundamental to both mathematics and language. It is especially important in computer engineering as it forms the operational basis of the digital computer. Another area where logic reigns is Set Theory – the mathematics of groups – and it is in this arena where language and search collide because search is all about finding a set of documents that match a given query (sets can have zero or more elements in them). When we focus on the mathematical aspects of sets, we need to define precise operators to manipulate them – intersection, union, exclusion – AND, OR, NOT, etc. Logic in software needs to be explicit or handled with global defaults. Logic in language is contextual – it can be implicit or explicit. An example of implied logic is in the use of adjectives as refiners such as the “red sofa” example that I have been using. Here, the user is clearly looking for things that are sofas AND are red in color. If the user asks for “red or blue sofas”, there are two logical operators, one implied and one explicit – they want to see sofas that are either red or blue. But what if the user asks for “red and blue sofas”? They could be asking to see sofas in both colors if referring to sets, or to individual sofas that have both red and blue in them. So this is somewhat ambiguous because the refinement field “color” is not clearly defined yet – can a single sofa have more than one color or just one? Lets choose something that is definitely single-valued – size. If I say “show me large or extra-large t-shirts” the language use of logic is the same as the mathematical one but if I say “show me large and extra-large t-shirts” it is not. Both of these phrases in language mean the same thing because we instinctively understand that a single shirt has only one size so if we use “and” we mean “show me shirts in both sizes” and for “or” we mean “show me shirts of either size” which in terms of set theory translates to the same operation – UNION or OR. In other words, “and” and “or” are synonyms in this context! For search, only OR can be supported for single-value fields because using AND gives the non result – zero records. The situation is not the same when dealing with attributes for which a single entity can have more than one value. If I say, “show me shirts that are soft, warm and machine-washable” then I mean intersection of these attributes – I only want to see shirts that have all of these qualities. But if I say “show me shirts that are comfortable or lightweight” I expect to see shirts with at least one of these attributes, or in other words the union of comfortable and lightweight shirts. “And” and “or” are now antonyms as they are in mathematics and computer science. It also makes sense from a search perspective because we can use either AND or OR in the context of a multi-value field and still get results. Getting back to implied vs. explicit, it is AND that is implied in this case because I can say “show me soft, warm, machine-washable shirts” which means the same as “soft, warm and machine-washable shirts”. So we conclude that how the vernacular use of “and” and “or” should be interpreted depends on whether the values for that attribute are exclusive or not (i.e. single or multi-valued). That is, “and” means “both” (or “all” if more than two values are given) and “or” means “either” (or “any”, respectively). For “and” if the attribute is single valued we mean “show me both things”, if it is multi-valued we mean “show me things with both values”. For “or”, single valued attributes translate to “show me either thing” and if multi-valued “show me things with either value”. As Mr. Spock would say, its totally logical (RIP Leonard – we’ll miss you!)
Implementing Contextual Logic in Search
Armed with a better understanding of how logic works in language and how that relates to the mathematics of search operations, we can do a better job of responding to implied or explicit logic embedded in search queries – IF – we know how terms map to fields and what type of fields they map to. It turns out that the Query Autofiltering component can give us this context – it uses the Lucene FieldCache to create a map of field values to field names – and once it knows what field a part of the query maps to, it knows whether that field is single or multi-valued. So given this, if there is more than one value for a given field in a query, and that field is single valued, we always use OR. If the field is multi-valued then we use AND if no operator is specified and OR if “or” is used within the positional context of the set of field values. In other words, we see if the term “or” occurs somewhere between the first and last instances of a particular field value such as in “lightweight or comfortable”. This also allows us to handle phrases that have multiple logical operators such as “soft, warm, machine-washable shirts that come in red or blue”. Here the “or” does not override the multi-valued attribute list’s implied “and” because it is outside of the list. It instead refers to values of color – which if a single value field in the index is ignored and defaults to OR. Here is the code that does this contextual interpretation. As the sub-phrases in the query are mapped to index fields, the first and last positions of the phrase set are captured. Then if the field is multi-valued, AND is used unless the term “or” has been interspersed:
SolrIndexSearcher searcher = rb.req.getSearcher(); IndexSchema schema = searcher.getSchema(); SchemaField field = schema.getField( fieldName ); boolean useAnd = field.multiValued() && useAndForMultiValuedFields; // if query has 'or' in it and or is at a position // 'within' the values for this field ... if (useAnd) { for (int i = termPosRange[0] + 1; i < termPosRange[1]; i++ ) { String qToken = queryTokens.get( i ); if (qToken.equalsIgnoreCase( "or" )) { useAnd = false; break; } } } StringBuilder qbldr = new StringBuilder( ); for (String val : valList ) { if (qbldr.length() > 0) qbldr.append( (useAnd ? " AND " : " OR ") ); qbldr.append( val ); } return fieldName + ":(" + qbldr.toString() + ")" + suffix;
The full source code for the QueryAutofilteringComponent is available on github for both Solr 4.x and Solr 5.x. (Due to API changes introduced in Solr 5.0, two versions of this code are needed.)
Demo
To show these concepts in action, I created a sample data set for a hypothetical department store (available on the github site). The input data contains a number of fields, product_type, product_category, color, material, brand, style, consumer_type and so on. Here are a few sample records:
<doc> <field name="id">17</field> <field name="product_type">boxer shorts</field> <field name="product_category">underwear</field> <field name="color">white</field> <field name="brand">Fruit of the Loom</field> <field name="consumer_type">mens</field> </doc> . . . <doc> <field name="id">95</field> <field name="product_type">sweatshirt</field> <field name="product_category">shirt</field> <field name="style">V neck</field> <field name="style">short-sleeve</field> <field name="brand">J Crew Factory</field> <field name="color">grey</field> <field name="material">cotton</field> <field name="consumer_type">womens</field> </doc> . . . <doc> <field name="id">154</field> <field name="product_type">crew socks</field> <field name="product_category">socks</field> <field name="color">white</field> <field name="brand">Joe Boxer</field> <field name="consumer_type">mens</field> </doc> . . . <doc> <field name="id">135</field> <field name="product_type">designer jeans</field> <field name="product_category">pants</field> <field name="brand">Calvin Klein</field> <field name="color">blue</field> <field name="style">pre-washed</field> <field name="style">boot-cut</field> <field name="consumer_type">womens</field> </doc>
The dataset contains built-in ambiguities in which a single token can occur as part of a product type, brand name, color or style. Color names are good examples of this but there are others (boxer shorts the product vs Joe Boxer the brand). The ‘style’ field is multi-valued. Here is the schema.xml definitions of the fields:
<field name="brand" type="string" indexed="true" stored="true" multiValued="false" /> <field name="color" type="string" indexed="true" stored="true" multiValued="false" /> <field name="colors" type="string" indexed="true" stored="true" multiValued="true" /> <field name="material" type="string" indexed="true" stored="true" multiValued="false" /> <field name="product_type" type="string" indexed="true" stored="true" multiValued="false" /> <field name="product_category" type="string" indexed="true" stored="true" multiValued="false" /> <field name="consumer_type" type="string" indexed="true" stored="true" multiValued="false" /> <field name="style" type="string" indexed="true" stored="true" multiValued="true" /> <field name="made_in" type="string" indexed="true" stored="true" multiValued="false" />
To make these string fields searchable from a “freetext” – box-and-a-button query (e.g. q=red socks ), the data is copied to the catch-all text field ‘text’:
<copyField source="color" dest="text" /> <copyField source="colors" dest="text" /> <copyField source="brand" dest="text" /> <copyField source="material" dest="text" /> <copyField source="product_type" dest="text" /> <copyField source="product_category" dest="text" /> <copyField source="consumer_type" dest="text" /> <copyField source="style" dest="text" /> <copyField source="made_in" dest="text" />
The solrconfig file has these additions for the QueryAutoFilteringComponent and a request handler that uses it:
<requestHandler name="/autofilter" class="solr.SearchHandler"> <lst name="defaults"> <str name="echoParams">explicit</str> <int name="rows">10</int> <str name="df">description</str> </lst> <arr name="first-components"> <str>queryAutofiltering</str> </arr> </requestHandler> <searchComponent name="queryAutofiltering" class="org.apache.solr.handler.component.QueryAutoFilteringComponent" />
Example 1: “White Linen perfume” There are many examples of this query problem in the data set where a term such as “white” is ambiguous because it can occur in a brand name and as a color, but this one has two ambiguous terms “white” and “linen” so it is a good example of how the autofiltering parser works. The phrase “White Linen” is known from the dataset to be a brand and “perfume” maps to a product type, so the basic autofiltering algorithm would match “White” as a color, then reject that for “White Linen” as a brand – since it is a longer match. It will then correctly find the item “White Linen perfume”. However, what if I search for “white linen shirts”? In this case, the simple algorithm won’t match because it will fail to provide the alternative parsing “color:white AND material:linen”. That is, now the phrase “White Linen” is ambiguous. In this case, an additional bit of logic is applied to see if there is more than one possible parsing of this phrase, so in this case, the parser produces the following query:
((brand:"White Linen" OR (color:white AND material:linen)) AND product_category:shirts)
Since there are no instances of shirts made by White Linen (and if there were, the result would still be correct), we just get shirts back. Similarly for the perfume, since perfume made from linen doesn’t exist, we only get the one product. That is, some of the filtering here is done in the collection. The parser doesn’t know what makes “sense” at the global level and what doesn’t, but the dataset does – so between the two, we get the right answer. Example 2: “white and grey dress shirts” In this case, I have created two color fields, “color” which is used for solid-color items and is single valued and “colors” which is used for multicolored items (like striped or patterned shirts) and is multi valued. So if I have dress shirts in the data set that are solid-color white and solid-color grey and also striped shirts that are grey and white stripes and I search for “white and grey dress shirts”, my intent is interpreted by the autofiltering parser as “show me solid-color shirts in both white and grey or multi-colored shirts that have both white and grey in them”. This is the boolean query that it generates:
((product_type:"dress shirt" OR ((product_type:dress OR product_category:dress) AND (product_type:shirt OR product_category:shirt))) AND (color:(white OR grey) OR colors:(white AND grey)))
(Note that it also creates a redundant query for dress and shirt since “dress” is also a product type – but this query segment returns no results since no item is both a “dress” and a “shirt” – so it is just a slight performance waster). If I don’t want the solid colors, I can search for “striped white and grey dress shirts” and get just those items ( or use the facets). (We could also have a style like “multicolor” vs “solid color” to disambiguate but that may not be too intuitive.) In this case, the query that the autofilter generates looks like this:
((product_type:"dress shirt" OR ((product_type:dress OR product_category:dress) AND (product_type:shirt OR product_category:shirt))) AND (color:(white OR grey) OR colors:(white AND grey)) AND style:striped)
Suffice it to say that the out-of-the-box /select request handler doesn’t do any of this. To be fair, it does a good job of relevance ranking for these examples, but its precision (percentage of true or correct positives) is very poor. You can see this by comparing the number of results that you get with the /select handler vs. the /autofilter handler – in terms of precision, its night and day. But is this dataset “too contrived” to be of real significance? For eCommerce data, I don’t think so, many of these examples are real-world products and marketing data is rife with ambiguities that standard relevance algorithms operating at the single token level simply can’t address. The autofilter deals with ambiguity by noting that phrases tend to be less ambiguous than terms, but goes further by providing alternate parsing when the phrase is ambiguous. We want to remove ambiguities that stem from the tokenization that we do on the documents and queries – we cannot remove real ambiguities, rather we need to respond to them appropriately.
Performance considerations:
On the surface it would appear that the Query Autofiltering component adds some overhead to the query parsing process – how much is a matter for further research on my part – but lets look at the potential cost-benefit analysis. Increasing the precision of search results helps both in terms of result quality and performance, especially with very large datasets because two of the most expensive things that a search engine has to do are sorting and faceting. Both of these require access to the full result set, so fewer false positives means fewer things to sort (i.e. demote) and facet – and overall faster responses. And while relevance ranking can push false positives off of the first page (or shove them under the rug so to speak), the faceting engine cannot – it shows all. In some examples shown here, the precision gains are massive – in some cases an order of magnitude better. On very large datasets, one would expect that to have a significant positive impact on performance.
Autofiltering vs. Autophrasing
Awhile back, I introduced another solution for dealing with phrases and handling multi-term synonyms called “autophrasing” (1,2). What is the difference between these two things? They basically do the same thing – handle noun phrases as single things, but use different methods and different resources. Both can solve the multi-term synonym problem. The autophrasing solution requires an additional metadata file “autophrases.txt” that contains a list of multi-word terms that are used to represent singular entities. The autofiltering solution gets this same information from collection fields so that it doesn’t need this extra file. It can also work across fields and can solve other problems such as converting colloquial logic to search engine logic as discussed in this post. In contrast, the autophrasing solution lacks this “relational context” – it knows that a phrase represents a single thing, but it doesn’t know what type of thing it is and how it relates to other things. Therefore, it can’t know what to do when user queries contain logical semantic constructs that cross field boundries. So, if there already is structured information in your index, which is typically the case for eCommerce data, use autofiltering. Autophrasing is more appropriate when you don’t have structured information – as with record sets that have lots of text in them (bibliographic) – and you simply want phrase disambiguation. Or, you can generate the structured data needed for autofiltering by categorizing your content using NLP or taxonomic approaches. The choice of categorization method may be informed by the need to have this “relational context” that I spoke about above. Taxonomic tagging can give this context – a taxonomy can “know” things about terms and phrases like what type of entities that they represent. This gives it an advantage over machine learning classification techniques where term relationships and interrelationships are statistically rather than semantically defined. For example, if I am crawling documents on software technology and encounter the terms “Cassandra”, “Couchbase”, “MongoDB”, “Neo4J”, “OrientDB” and NoSQL DB”, both machine learning and taxonomic approaches can determine/know that these terms are related. However, the taxonomy understands the difference between a term that represents a class or type of thing (“NoSQL DB”) and an instance of a class/type (“MongoDB”) where as an ML classifier would not – it learns that they are related but not how those relationships are structured, semantically speaking. The taxonomy would also know that “Mongo” is a synonym for “MongoDB”. It is doubtful that an ML algorithm would get that. This is a critical aspect for autofiltering because it needs to know both what sets of tokens constitute a phrase and also what those phrases represent. Entity extraction techniques can also be used – regular expression, person, company, location extractors – that associate a lexical pattern with a metadata field value. Gazetteer or white-list entity extractors can do the same thing for common phrases that need to be tagged in a specific way. Once this is done, autofiltering can apply all of this effort to the query, to bring that discovered context to the sharp tip of the spear – search-wise. Just as we traditionally apply the same token analysis in Lucene/Solr to both the query and the indexed documents, we can do the same with classification technologies. So it is not that autofiltering can replace traditional classification techniques – these typically work on the document where there is a lot of text to process. Autofiltering can leverage this effort because it works on the query where there is not much text to work with. Time is also of the essence here. We can’t afford to use expensive text crunching algorithms as we do when we index data (well … sort of), because in this case we are dealing with what I call HTT – Human Tolerable Time. Expect a blog post on this in the near future.
Best of the Month. Straight to Your Inbox!
Dive into the best content with our monthly Roundup Newsletter!
Each month, we handpick the top stories, insights, and updates to keep you in the know.