Spock: The logical thing for you to have done was to have left me behind. McCoy: Mr. Spock, remind me to tell you that I’m sick and tired of your logic.

## Logic in Mathematics vs Logic in Language

Logic is of course fundamental to both mathematics and language. It is especially important in computer engineering as it forms the operational basis of the digital computer. Another area where logic reigns is Set Theory – the mathematics of groups – and it is in this arena where language and search collide because search is all about finding a set of documents that match a given query (sets can have zero or more elements in them). When we focus on the mathematical aspects of sets, we need to define precise operators to manipulate them – intersection, union, exclusion – AND, OR, NOT, etc. Logic in software needs to be explicit or handled with global defaults. Logic in language is contextual – it can be implicit or explicit. An example of implied logic is in the use of adjectives as refiners such as the “red sofa” example that I have been using. Here, the user is clearly looking for things that are sofas AND are red in color. If the user asks for “red or blue sofas”, there are two logical operators, one implied and one explicit – they want to see sofas that are either red or blue. But what if the user asks for “red and blue sofas”? They could be asking to see sofas in both colors if referring to sets, or to individual sofas that have both red and blue in them. So this is somewhat ambiguous because the refinement field “color” is not clearly defined yet – can a single sofa have more than one color or just one? Lets choose something that is definitely single-valued – size. If I say “show me large or extra-large t-shirts” the language use of logic is the same as the mathematical one but if I say “show me large and extra-large t-shirts” it is not. Both of these phrases in language mean the same thing because we instinctively understand that a single shirt has only one size so if we use “and” we mean “show me shirts in both sizes” and for “or” we mean “show me shirts of either size” which in terms of set theory translates to the same operation – UNION or OR. In other words, “and” and “or” are synonyms in this context! For search, only OR can be supported for single-value fields because using AND gives the non result – zero records. The situation is not the same when dealing with attributes for which a single entity can have more than one value. If I say, “show me shirts that are soft, warm and machine-washable” then I mean intersection of these attributes – I only want to see shirts that have all of these qualities. But if I say “show me shirts that are comfortable or lightweight” I expect to see shirts with at least one of these attributes, or in other words the union of comfortable and lightweight shirts. “And” and “or” are now antonyms as they are in mathematics and computer science. It also makes sense from a search perspective because we can use either AND or OR in the context of a multi-value field and still get results. Getting back to implied vs. explicit, it is AND that is implied in this case because I can say “show me soft, warm, machine-washable shirts” which means the same as “soft, warm and machine-washable shirts”. So we conclude that how the vernacular use of “and” and “or” should be interpreted depends on whether the values for that attribute are exclusive or not (i.e. single or multi-valued). That is, “and” means “both” (or “all” if more than two values are given) and “or” means “either” (or “any”, respectively). For “and” if the attribute is single valued we mean “show me both things”, if it is multi-valued we mean “show me things with both values”. For “or”, single valued attributes translate to “show me either thing” and if multi-valued “show me things with either value”.  As Mr. Spock would say, its totally logical (RIP Leonard – we’ll miss you!)

## Implementing Contextual Logic in Search

Armed with a better understanding of how logic works in language and how that relates to the mathematics of search operations, we can do a better job of responding to implied or explicit logic embedded in search queries – IF – we know how terms map to fields and what type of fields they map to. It turns out that the Query Autofiltering component can give us this context – it uses the Lucene FieldCache to create a map of field values to field names – and once it knows what field a part of the query maps to, it knows whether that field is single or multi-valued.  So given this, if there is more than one value for a given field in a query, and that field is single valued, we always use OR. If the field is multi-valued then we use AND if no operator is specified and OR if “or” is used within the positional context of the set of field values.  In other words, we see if the term “or” occurs somewhere between the first and last instances of a particular field value such as in “lightweight or comfortable”. This also allows us to handle phrases that have multiple logical operators such as “soft, warm, machine-washable shirts that come in red or blue”.  Here the “or” does not override the multi-valued attribute list’s implied “and” because it is outside of the list. It instead refers to values of color – which if a single value field in the index is ignored and defaults to OR. Here is the code that does this contextual interpretation. As the sub-phrases in the query are mapped to index fields, the first and last positions of the phrase set are captured. Then if the field is multi-valued, AND is used unless the term “or” has been interspersed:

```  SolrIndexSearcher searcher = rb.req.getSearcher();
IndexSchema schema = searcher.getSchema();
SchemaField field = schema.getField( fieldName );

boolean useAnd = field.multiValued() && useAndForMultiValuedFields;
// if query has 'or' in it and or is at a position
// 'within' the values for this field ...
if (useAnd) {
for (int i = termPosRange[0] + 1; i < termPosRange[1]; i++ ) {
String qToken = queryTokens.get( i );
if (qToken.equalsIgnoreCase( "or" )) {
useAnd = false;
break;
}
}
}

StringBuilder qbldr = new StringBuilder( );
for (String val : valList ) {
if (qbldr.length() > 0) qbldr.append( (useAnd ? " AND " : " OR ") );
qbldr.append( val );
}

return fieldName + ":(" + qbldr.toString() + ")" + suffix;
```

The full source code for the QueryAutofilteringComponent is available on github for both Solr 4.x and Solr 5.x. (Due to API changes introduced in Solr 5.0, two versions of this code are needed.)

## Demo

To show these concepts in action, I created a sample data set for a hypothetical department store (available on the github site). The input data contains a number of fields, product_type, product_category, color, material, brand, style, consumer_type and so on. Here are a few sample records:

```  <doc>
<field name="id">17</field>
<field name="product_type">boxer shorts</field>
<field name="product_category">underwear</field>
<field name="color">white</field>
<field name="brand">Fruit of the Loom</field>
<field name="consumer_type">mens</field>
</doc>
. . .
<doc>
<field name="id">95</field>
<field name="product_type">sweatshirt</field>
<field name="product_category">shirt</field>
<field name="style">V neck</field>
<field name="style">short-sleeve</field>
<field name="brand">J Crew Factory</field>
<field name="color">grey</field>
<field name="material">cotton</field>
<field name="consumer_type">womens</field>
</doc>
. . .
<doc>
<field name="id">154</field>
<field name="product_type">crew socks</field>
<field name="product_category">socks</field>
<field name="color">white</field>
<field name="brand">Joe Boxer</field>
<field name="consumer_type">mens</field>
</doc>
. . .
<doc>
<field name="id">135</field>
<field name="product_type">designer jeans</field>
<field name="product_category">pants</field>
<field name="brand">Calvin Klein</field>
<field name="color">blue</field>
<field name="style">pre-washed</field>
<field name="style">boot-cut</field>
<field name="consumer_type">womens</field>
</doc>
```

The dataset contains built-in ambiguities in which a single token can occur as part of a product type, brand name, color or style. Color names are good examples of this but there are others (boxer shorts the product vs Joe Boxer the brand).  The ‘style’ field is multi-valued. Here is the schema.xml definitions of the fields:

```
<field name="brand"        type="string" indexed="true" stored="true" multiValued="false" />
<field name="color"        type="string" indexed="true" stored="true" multiValued="false" />
<field name="colors"       type="string" indexed="true" stored="true" multiValued="true" />
<field name="material"     type="string" indexed="true" stored="true" multiValued="false" />
<field name="product_type" type="string" indexed="true" stored="true" multiValued="false" />
<field name="product_category" type="string" indexed="true" stored="true" multiValued="false" />
<field name="consumer_type" type="string" indexed="true" stored="true" multiValued="false" />
<field name="style"         type="string" indexed="true" stored="true" multiValued="true" />
<field name="made_in"       type="string" indexed="true" stored="true" multiValued="false" />
```

To make these string fields searchable from a “freetext” – box-and-a-button query (e.g. q=red socks ), the data is copied to the catch-all text field ‘text’:

```
<copyField source="color" dest="text" />
<copyField source="colors" dest="text" />
<copyField source="brand" dest="text" />
<copyField source="material" dest="text" />
<copyField source="product_type" dest="text" />
<copyField source="product_category" dest="text" />
<copyField source="consumer_type"  dest="text" />
<copyField source="style"  dest="text" />
```

The solrconfig file has these additions for the QueryAutoFilteringComponent and a request handler that uses it:

```
<requestHandler name="/autofilter" class="solr.SearchHandler">
<lst name="defaults">
<str name="echoParams">explicit</str>
<int name="rows">10</int>
<str name="df">description</str>
</lst>

<arr name="first-components">
<str>queryAutofiltering</str>
</arr>
</requestHandler>

<searchComponent name="queryAutofiltering" class="org.apache.solr.handler.component.QueryAutoFilteringComponent" />
```

Example 1: “White Linen perfume” There are many examples of this query problem in the data set where a term such as “white” is ambiguous because it can occur in a brand name and as a color, but this one has two ambiguous terms “white” and “linen” so it is a good example of how the autofiltering parser works. The phrase “White Linen” is known from the dataset to be a brand and “perfume” maps to a product type, so the basic autofiltering algorithm would match “White” as a color, then reject that for “White Linen” as a brand – since it is a longer match. It will then correctly find the item “White Linen perfume”.  However, what if I search for “white linen shirts”? In this case, the simple algorithm won’t match because it will fail to provide the alternative parsing “color:white AND material:linen”. That is, now the phrase “White Linen” is ambiguous. In this case, an additional bit of logic is applied to see if there is more than one possible parsing of this phrase, so in this case, the parser produces the following query:

```((brand:"White Linen" OR (color:white AND material:linen)) AND product_category:shirts)
```

Since there are no instances of shirts made by White Linen (and if there were, the result would still be correct), we just get shirts back. Similarly for the perfume, since perfume made from linen doesn’t exist, we only get the one product. That is, some of the filtering here is done in the collection. The parser doesn’t know what makes “sense” at the global level and what doesn’t, but the dataset does – so between the two, we get the right answer. Example 2: “white and grey dress shirts” In this case, I have created two color fields, “color” which is used for solid-color items and is single valued and “colors” which is used for multicolored items (like striped or patterned shirts) and is multi valued.  So if I have dress shirts in the data set that are solid-color white and solid-color grey and also striped shirts that are grey and white stripes and I search for “white and grey dress shirts”, my intent is interpreted by the autofiltering parser as “show me solid-color shirts in both white and grey or multi-colored shirts that have both white and grey in them”. This is the boolean query that it generates:

```((product_type:"dress shirt" OR ((product_type:dress OR product_category:dress)
AND (product_type:shirt OR product_category:shirt)))
AND (color:(white OR grey) OR colors:(white AND grey)))
```

(Note that it also creates a redundant query for dress and shirt since “dress” is also a product type – but this query segment returns no results since no item is both a “dress” and a “shirt” – so it is just a slight performance waster). If I don’t want the solid colors, I can search for “striped white and grey dress shirts” and get just those items ( or use the facets). (We could also have a style like “multicolor” vs “solid color” to disambiguate but that may not be too intuitive.) In this case, the query that the autofilter generates looks like this:

```((product_type:"dress shirt" OR ((product_type:dress OR product_category:dress)
AND (product_type:shirt OR product_category:shirt)))
AND (color:(white OR grey) OR colors:(white AND grey))
AND style:striped)
```

Suffice it to say that the out-of-the-box /select request handler doesn’t do any of this. To be fair, it does a good job of relevance ranking for these examples, but its precision (percentage of true or correct positives) is very poor. You can see this by comparing the number of results that you get with the /select handler vs. the /autofilter handler – in terms of precision, its night and day. But is this dataset “too contrived” to be of real significance? For eCommerce data, I don’t think so, many of these examples are real-world products and marketing data is rife with ambiguities that standard relevance algorithms operating at the single token level simply can’t address. The autofilter deals with ambiguity by noting that phrases tend to be less ambiguous than terms, but goes further by providing alternate parsing when the phrase is ambiguous. We want to remove ambiguities that stem from the tokenization that we do on the documents and queries – we cannot remove real ambiguities, rather we need to respond to them appropriately.

## Performance considerations:

On the surface it would appear that the Query Autofiltering component adds some overhead to the query parsing process – how much is a matter for further research on my part – but lets look at the potential cost-benefit analysis. Increasing the precision of search results helps both in terms of result quality and performance, especially with very large datasets because two of the most expensive things that a search engine has to do are sorting and faceting. Both of these require access to the full result set, so fewer false positives means fewer things to sort (i.e. demote) and facet – and overall faster responses. And while relevance ranking can push false positives off of the first page (or shove them under the rug so to speak), the faceting engine cannot – it shows all.  In some examples shown here, the precision gains are massive – in some cases an order of magnitude better. On very large datasets, one would expect that to have a significant positive impact on performance.