Recently, I did some minor work on improving the usability of the Lucene spell checker (see LUCENE-2479, LUCENE-2608 and the associated Solr work) and it got me thinking that a post on spell checking in Solr would be useful.
For those who aren’t familiar, the notion of spell checking in search (often called Did You Mean?) is slightly different from the notion of simply correcting spelling errors. It’s not that we don’t want to correct misspelled words; it’s more that we want to suggest words that will lead to better results. Those suggestions are based on the way things are spelled in the index, as well as other factors like past user behavior, the “correct” spelling of the word and any other a priori information, such as business goals, we might have. For instance, it may be the case that a word is so often misspelled by writers in your corpus that the best suggestion just might be an incorrectly spelled word, even if the user’s original query was properly spelled! For some background on building the foundation of a spell checker, see Peter Norvig’s excellent post.
To understand spell checking in Solr, it is helpful to know a bit more about what is going on underneath the hood. There are several working parts to the spell checker, some in Solr and some in Lucene.
Starting with Solr, the primary mechanism for delivering spelling corrections is through a Search Component called the SpellCheckComponent. (Taming Text co-author Tom Morton has written a full chapter on fuzzy string matching, including spell checking, for our book. It is chapter 4 and is currently available in MEAP.) Once the candidates are scored, they are added to a Priority Queue and then the top X results are returned, where X is an input parameter to the method call.
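To make that last step concrete, here is a minimal sketch of scoring candidates and keeping only the top X in a priority queue. It is not Lucene's code: the normalized Levenshtein score and the function names (`levenshtein`, `similarity`, `suggest`) are assumptions chosen for illustration, and the candidate list is passed in directly rather than looked up in a spelling index.

```python
import heapq


def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a, b):
    """Turn edit distance into a score between 0 and 1 (1.0 means identical)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def suggest(word, candidates, top_x):
    """Return the top_x highest-scoring candidates, best first.

    A min-heap capped at top_x entries acts as the priority queue: the
    weakest suggestion kept so far sits at the root and is evicted first.
    """
    heap = []
    for candidate in candidates:
        entry = (similarity(word, candidate), candidate)
        if len(heap) < top_x:
            heapq.heappush(heap, entry)
        elif entry > heap[0]:
            heapq.heapreplace(heap, entry)
    return [candidate for _, candidate in sorted(heap, reverse=True)]


print(suggest("solr", ["solar", "sold", "lucene"], 2))
```

Capping the heap at X entries keeps memory bounded no matter how many candidates are scored, which is the usual reason to reach for a priority queue here instead of sorting everything.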
With the background out of the way, let’s take a look at it in action.
Setup of the SpellCheckComponent is pretty easy. In the solrconfig.xml, we need to declare a <searchComponent> and then configure it. The Solr tutorial, for instance, has:
<searchComponent name="spellcheck" class="solr.SpellCheckComponent">

  <str name="queryAnalyzerFieldType">textSpell</str>

  <lst name="spellchecker">
    <str name="name">default</str>
    <str name="field">name</str>
    <str name="spellcheckIndexDir">./spellchecker</str>
  </lst>

  <!-- a spellchecker that uses a different distance measure
  <lst name="spellchecker">
    <str name="name">jarowinkler</str>
    <str name="field">spell</str>
    <str name="distanceMeasure">org.apache.lucene.search.spell.JaroWinklerDistance</str>
    <str name="spellcheckIndexDir">./spellchecker2</str>
  </lst>
  -->

  <!-- Use an alternate comparator -->
  <!--<lst name="spellchecker">
    <str name="name">freq</str>
    <str name="field">lowerfilt</str>
    <str name="spellcheckIndexDir">spellcheckerFreq</str>
    <!– comparatorClass be one of:
      1. score (default)
      2. freq (Frequency first, then score)
      3. A fully qualified class name
    –>
    <str name="comparatorClass">freq</str>
    <str name="buildOnCommit">true</str>
  </lst>
  -->

  <!-- a file based spell checker
  <lst name="spellchecker">
    <str name="classname">solr.FileBasedSpellChecker</str>
    <str name="name">file</str>
    <str name="sourceLocation">spellings.txt</str>
    <str name="characterEncoding">UTF-8</str>
    <str name="spellcheckIndexDir">./spellcheckerFile</str>
  </lst>
  -->

</searchComponent>
While this setup shows a number of different ways to set up the spell checker, I’m going to focus on the key moving parts. The first thing to notice is the queryAnalyzerFieldType. This tells the spell checker how to tokenize and otherwise analyze the incoming query to prep it for spell checking. Generally speaking, it should be a FieldType that can produce tokens that match the analysis used to create tokens in your spelling index/dictionary. If you are using the Lucene spell checker, it should match the analysis of the source Field used to generate the spelling index (in this case the “name” field). The other thing to notice is the declarations of the spell checkers (the <lst> elements). In this case, we have one declared spell checker. It is a Lucene-based one (which is the default) and it is being built from the “name” field in the schema. The other spell checkers are all commented out, but showcase the various configuration options available.
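To see why the query analysis needs to match the analysis behind the spelling dictionary, consider this toy sketch. It is not Solr code: both analyzers, the sample “name” values and the `unknown_tokens` helper are hypothetical, and real FieldType analysis chains do far more than split and lowercase.

```python
def index_analyzer(text):
    """Hypothetical index-time analysis: split on whitespace, then lowercase."""
    return [token.lower() for token in text.split()]


def mismatched_query_analyzer(text):
    """Hypothetical query-time analysis that forgets the lowercase step."""
    return text.split()


# Build a toy spelling dictionary from made-up values of the "name" field.
names = ["Apache Solr", "Apache Lucene"]
dictionary = {token for name in names for token in index_analyzer(name)}


def unknown_tokens(query, analyzer):
    """Return query tokens missing from the dictionary, i.e. the ones a
    spell checker would treat as possibly misspelled."""
    return [token for token in analyzer(query) if token not in dictionary]


# With matching analysis, a correctly spelled query raises no flags.
print(unknown_tokens("Apache Solr", index_analyzer))
# With mismatched analysis, the same query looks misspelled.
print(unknown_tokens("Apache Solr", mismatched_query_analyzer))
```

The dictionary only ever contains what the index-side analysis produced, so any difference on the query side shows up as spurious "misspellings".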
The second piece of configuration, and the one that most commonly trips people up, is the addition of the Search Component to a Request Handler. The reason why it commonly trips people up is that they add the SpellCheckComponent to a different Request Handler than their primary search request handler, thus requiring them to make two separate requests to Solr, one for the search results and one for the spelling suggestions. Instead, the SpellCheckComponent should be hooked directly into the main Request Handler, thus saving one round trip to Solr. The configuration should look something like:
<requestHandler name="/myMainRequestHandler" class="solr.SearchHandler" lazy="true">
  <lst name="defaults">
    <str name="spellcheck.onlyMorePopular">false</str>
    <str name="spellcheck.extendedResults">false</str>
    <str name="spellcheck.count">1</str>
  </lst>
  <arr name="last-components">
    <str>spellcheck</str>
  </arr>
</requestHandler>
Again, I can’t stress this enough: the SpellCheckComponent should not be placed in a separate Request Handler, which forces two calls to Solr, even though the Solr tutorial does this for demonstration purposes (see the very large comment right above it).
Once the spell checkers are set up and Solr is up and running, you can issue queries to it. If you are using the Lucene spell checker or others, you may first need to build the underlying index. See http://wiki.apache.org
Once built, usage of the spell checker is pretty straightforward. In your Solr request or as part of your Request Handler configuration, you need to turn on the component (&spellcheck=true) and specify various other parameters to tell it how you want your results.
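As a rough illustration, the snippet below assembles such a request URL. The host, port and `spellcheck_url` helper are assumptions made up for this sketch, and the handler path simply reuses the name from the configuration above; `spellcheck`, `spellcheck.count` and `spellcheck.build` are the request parameters themselves.

```python
from urllib.parse import urlencode


def spellcheck_url(base_url, query, count=1, build=False):
    """Build a search URL that also asks for spelling suggestions.

    base_url is assumed to point at the main search request handler, so
    results and suggestions come back in a single round trip.
    """
    params = {
        "q": query,
        "spellcheck": "true",            # turn the component on
        "spellcheck.count": str(count),  # number of suggestions per term
    }
    if build:
        # Ask Solr to (re)build the spelling index as part of this request.
        params["spellcheck.build"] = "true"
    return f"{base_url}?{urlencode(params)}"


# Hypothetical local Solr instance; adjust to your own deployment.
handler = "http://localhost:8983/solr/myMainRequestHandler"
print(spellcheck_url(handler, "lucen"))
print(spellcheck_url(handler, "lucen", build=True))
```

Anything already set under the handler's defaults, such as spellcheck.count above, can be left off the URL; passing it on the request overrides the default.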
Based on my experience, the spell checker does a decent job out of the box, but not great, so you should be prepared to spend some time tuning it. First off, make sure you are doing effective analysis of the source content. See http://wiki.apache.org
Also note that the current collate functionality in the SpellCheckComponent has some warts that may prevent its effective use. However, the community is working through a fix for this right now, so keep an eye on SOLR-2010.
Finally, I still have a lot to learn about spell checking in search, so I’d appreciate your feedback on what worked and didn’t work for you in your applications. Please provide your tips below so we can all learn.