Wildcard query terms aren’t analyzed, why is that?

Prior to the current 3x branch (which will be released as 3.6) and the trunk (4.0) Solr code, users have frequently been perplexed by wildcard searches being un-analyzed, which most often shows up as unexpected case sensitivity. Say you have an analysis chain in your schema.xml file defined as follows and a field named field_lc of this type:

<fieldType name="lowercase" class="solr.TextField" >
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory" />
  </analyzer>
</fieldType>

Now, you index the text “My Dog Has Fleas”. So far, so good. Searching on this field as
field_lc:fleas returns the document, as does field_lc:flea*.

But now you search on field_lc:Flea* and you don’t get any results. What?!?!?! Nearly everyone scratches their heads about this, and it’s a question that often comes up on the Solr users’ list. Users wonder why the analysis chain above isn’t applied to wildcard queries. It turns out that it’s trickier than you might think at first. What happens when a single input term gets split up into multiple parts? For instance, those of you familiar with WordDelimiterFilterFactory (WDFF) know that it can split on case change. What does it mean to parse ‘fleA*’? Applying WDFF might well give the two tokens ‘fle’ and ‘A’ and possibly ‘fleA’. If a wildcard is present, what tokens should be emitted?

    1. ‘fleA*’
    2. ‘fle*’, ‘A*’, ‘fleA*’
    3. ‘fle*’, ‘A*’
    4. <insert your solution here>

You can, I daresay, create any rule that suits your fancy. And it’ll be wrong in some situations. Of particular horror is anything that produces ‘A*’ as above: conceptually, you’d then have an enormous OR clause consisting of all the terms in your index that start with ‘A’. Unless you had a rule like “only do this if the preceding fragment was 2 characters or more”. But then someone would say “I need three characters”, so can WDFF provide a “wildCardMin=#” parameter? I already have trouble keeping all the WDFF parameters and how they interact in my mind; going down this path would be a nightmare. And I haven’t even considered some of the really interesting issues, like how proximity would be incorporated in all this.

Wildcards aren’t the only issue

The same issue occurs with accent folding, normalizations, and, really, any other component of an analysis chain that somehow changes the query terms. This behavior has mostly been ignored in past releases; it’s been up to the application programmer to manually “do the right thing” before sending the query to Solr. This often involves operations such as lower-casing and accent folding on the application side when a wildcard is encountered.

The new way of handling these cases

As of SOLR-2438 this is no longer the case for a number of the most common components. A query analysis chain that contains any of the following components will automatically “do the right thing” and apply them to multi-term queries. If your analysis chain contains any of these elements, and you want them applied to “multi-term” queries, you don’t have to do anything at all; it will “just work”. At query time, the indicated transformations are applied to the query terms and everyone is happy. Or should be. Do note that it’s an all-or-nothing operation: all of the elements below that are found in the query analysis chain are applied to the multi-term terms (Solr 3.6+):

    • ASCIIFoldingFilterFactory
    • LowerCaseFilterFactory
    • LowerCaseTokenizerFactory
    • MappingCharFilterFactory
    • PersianCharFilterFactory

In addition, more filters have been added in the 4.0 release of Solr; the list is below. NOTE: this is current as of October, 2012. More may be added! If in doubt, take a look at the class definition for your favorite FilterFactory and see if it’s declared as “.. implements MultiTermAwareComponent”.

    • ASCIIFoldingFilterFactory
    • ArabicNormalizationFilterFactory
    • CJKWidthFilterFactory
    • CollationKeyFilterFactory
    • ElisionFilterFactory
    • GermanNormalizationFilterFactory
    • GreekLowerCaseFilterFactory
    • HindiNormalizationFilterFactory
    • ICUCollationKeyFilterFactory
    • ICUFoldingFilterFactory
    • ICUNormalizer2FilterFactory
    • ICUTransformFilterFactory
    • IndicNormalizationFilterFactory
    • IrishLowerCaseFilterFactory
    • JapaneseIterationMarkCharFilterFactory
    • LowerCaseFilterFactory
    • MappingCharFilterFactory
    • PersianCharFilterFactory
    • PersianNormalizationFilterFactory
    • TurkishLowerCaseFilterFactory

Again, this effectively means you don’t need to care about these transformations any more. One note of explanation, though. I’ve talked about the “query analysis chain”. But what if you don’t have one? Remember that your <analyzer> tag can have a ‘type’ attribute of “index” or “query”, or no ‘type’ at all. Well, if an analyzer with type="query" is found, that analysis chain is inspected and any of the above components are recorded to be used on multi-term queries. If no type="query" analyzer is found, then the type="index" one is used. And if no type="index" analyzer is found, then the one with no ‘type’ attribute is used.
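To make the lookup order concrete, here is a sketch of a field type with separate index and query analyzers. The field type name “text_folded” and the synonyms file name are made up for illustration; they are assumptions, not something from a real schema.

```xml
<!-- Hypothetical field type; "text_folded" and "synonyms.txt" are illustrative names only. -->
<fieldType name="text_folded" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <!-- A type="query" analyzer is present, so this is the chain that gets
       inspected for multi-term-aware components. -->
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- Not in the lists above, so it should be left out for multi-term queries. -->
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <!-- Both of these are in the lists above, so both should be applied. -->
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

With a definition along these lines, a wildcard query such as Flea* against a field of this type should be folded and lower-cased before it is matched against the index, while ordinary (non-wildcard) queries still go through the whole query chain, synonyms included.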

What does “multi-term” mean anyway?

I’ve also sprinkled the phrase “multi-term” around, and sometimes “wildcard”. It turns out that the simple wildcard case is a specialization of a broader category of queries, including:

    • wildcard
    • range
    • prefix

All of these are now handled as above.

Expert level schema possibilities

All of the above is automatic, but there are three immediate questions:

    • what about some of the other components?
    • what if I need the old behavior?
    • what if I want something completely different?

It turns out that all three of these questions have the same answer. But before I outline it, I want to emphasize that you very probably don’t need to care about what follows! You might need to know about this in special cases, so I’ll mention it here.

In the above explanations, I wrote that “analysis chain is inspected and any of the above components are recorded to be used on multi-term queries”. Well, what actually happens is that there’s a new analysis chain in town that can be specified in the schema.xml file called, you guessed it, “multiterm”. You specify it like this as part of a <fieldType>:

<analyzer type="multiterm" >
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.ASCIIFoldingFilterFactory" />
  <filter class="solr.YourFavoriteFilterFactoryHere" />
</analyzer>

You can put in any component that’s legal in a type="index" or type="query" analysis chain. If you wanted, for instance, to enforce the old-style behavior, you could specify

  <tokenizer class="solr.KeywordTokenizerFactory" />

as the entire “multiterm” analysis chain. It seems a bit odd to use KeywordTokenizerFactory here, but this applies to the individual terms, not the entire input. So it’s in effect saying “don’t analyze the terms at all”. Sound familiar? This is just what happened historically.
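Put together, a field type that keeps the old un-analyzed wildcard behavior might look something like the sketch below. The field type name is hypothetical, and I’m assuming an untyped <analyzer> can sit alongside the “multiterm” one in the same <fieldType>.

```xml
<!-- Hypothetical field type name; the rest reuses components already shown in this post. -->
<fieldType name="lowercase_old_wildcards" class="solr.TextField">
  <!-- No 'type' attribute: used for both indexing and ordinary queries. -->
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <!-- Explicit "multiterm" chain: wildcard, prefix and range terms pass through unchanged. -->
  <analyzer type="multiterm">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
  </analyzer>
</fieldType>
```

With this in place, field_lc:Flea* would again fail to match “My Dog Has Fleas”, exactly as it did before the change.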

How does this relate to the automatic behavior?

Well, what really happens under the covers is that if you don’t define your own “multiterm” analysis chain, Solr constructs one for you from the analyzers you have defined, as outlined above: query, index or default, in that order.
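For the simple “lowercase” field type at the top of this post, the chain Solr builds for you should be roughly equivalent to writing the following by hand. Treat the tokenizer line as an assumption on my part: my understanding is that the tokenizer slot is filled with a keyword tokenizer so that each term is handled as a single unit, but check the code for your version.

```xml
<!-- Sketch of the implicit multi-term chain for the "lowercase" field type. -->
<analyzer type="multiterm">
  <!-- Assumed: a keyword tokenizer stands in for WhitespaceTokenizerFactory. -->
  <tokenizer class="solr.KeywordTokenizerFactory"/>
  <!-- Carried over because LowerCaseFilterFactory is multi-term aware. -->
  <filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
```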

Waaaaay under the covers, down in the code

All this is accomplished by making components “multi-term aware”. This means implementing the “MultiTermAwareComponent” interface. Currently, the components listed above are the only ones that implement this interface, but others may be good candidates, and some of these are listed in JIRA SOLR-2921. By and large, implementing the interface in the code may be trivial. What’s not trivial is understanding what “the right thing” is. Some examples of candidates:

    • stemmers
    • various language-specific normalization filters
    • various language-specific lowercase filters
    • various ICU filters

The reason these haven’t been made “multi-term aware” is the usual open-source reason: “What we have is a good step forward, we shouldn’t delay everything in order to get the last use cases taken care of”. In other words, the implementors (me in this case, with lots of help from others) are tired ;).

Anyone who really understands what the right thing to do is for components that do not yet implement “MultiTermAwareComponent”, and who could provide use cases for them, would be giving us a great help, especially by providing examples illustrating the correct inputs and outputs for wildcard cases. And some examples of what should not come out as well. Or even better, a draft JUnit test that would show the expected behavior. Or even better yet, a full patch!

Any modification that potentially produces more than one token needs to be handled with care; see the code for LowerCaseTokenizerFactory for a case in point. Note that Solr will now throw an exception if the transformation produces more than one token, so tread cautiously!

This change should remove a long-standing point of confusion for Solr users. We’d be very interested in any feedback from the community, and especially any problems that crop up. SOLR-2438 has patches for both the 3x and 4x code lines, but it’s probably easier just to get a current 3x or 4x branch (or nightly build) if you want to test this “in the wild”; the code has been committed and built. There remains some work to be done to incorporate this change for more analysis components. Anyone want to volunteer?

Resources:

This page on the Solr Wiki has the Wiki documentation: Multi Term Query Analysis

Main JIRA (already in 3.6 and 4.0 code lines): SOLR-2438

JIRA for other components not yet “multi-term aware” that are possibilities in the future: SOLR-2921

About Erick Erickson
