Solr 1.3 and 1.4 moved away from BooleanQuery expansion for MultiTerm queries and toward a ConstantScoreQuery approach. In Lucene, a MultiTerm query is a query that expands to match multiple terms based on a given input. Common MultiTerm queries are wildcard, fuzzy, prefix, and range queries. Originally, Lucene supported these MultiTerm queries with an implementation that enumerated the matched terms and then added each as a clause to a BooleanQuery. This is a common approach, but it has some problems. A BooleanQuery with thousands of clauses is not likely to perform well. As a precaution against this performance trap, Boolean expansion queries throw a TooManyClauses exception once they exceed a default limit of 1024 clauses. The limit is configurable, but the underlying performance issue remains if you raise it.
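To make the expansion problem concrete, here is a minimal, self-contained sketch of prefix expansion against a sorted term dictionary. It is a toy model, not Lucene's code: the class name, the method expandPrefix, and the exception type are all hypothetical stand-ins, and only the 1024 default comes from the description above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

class BooleanExpansionSketch {
    // Mirrors the default clause limit mentioned above (hypothetical constant).
    static final int MAX_CLAUSES = 1024;

    // Hypothetical stand-in for Lucene's TooManyClauses exception.
    static class TooManyClausesException extends RuntimeException {
        TooManyClausesException(int limit) {
            super("more than " + limit + " clauses");
        }
    }

    // Enumerate every term with the given prefix and add it as one "clause".
    static List<String> expandPrefix(SortedSet<String> termDictionary, String prefix, int maxClauses) {
        List<String> clauses = new ArrayList<>();
        // Terms are sorted, so all matches form one contiguous run starting at the prefix.
        for (String term : termDictionary.tailSet(prefix)) {
            if (!term.startsWith(prefix)) {
                break;
            }
            if (clauses.size() >= maxClauses) {
                throw new TooManyClausesException(maxClauses);
            }
            clauses.add(term);
        }
        return clauses;
    }

    public static void main(String[] args) {
        SortedSet<String> terms = new TreeSet<>();
        for (int i = 0; i < 2000; i++) {
            terms.add("doc" + i);
        }
        terms.add("solr");
        terms.add("solar");

        // A selective prefix expands to a handful of clauses.
        System.out.println(expandPrefix(terms, "sol", MAX_CLAUSES));

        // A broad prefix matches 2000 terms and trips the limit.
        try {
            expandPrefix(terms, "doc", MAX_CLAUSES);
        } catch (TooManyClausesException e) {
            System.out.println("expansion failed: " + e.getMessage());
        }
    }
}
```

The point of the sketch is that the cost and the failure mode both depend on how many terms in the whole index match the pattern, which the person typing the query usually cannot predict.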
As a response to this performance pitfall on very large indices (and the infamous TooManyClauses exception), new queries were developed that relied on a new Query class called ConstantScoreQuery. A ConstantScoreQuery accepts a filter of matching documents and then scores each of them with a constant value equal to the boost. Depending on the qualities of your index, this method can be faster than the Boolean expansion method and, more importantly, it does not suffer from TooManyClauses exceptions. Rather than matching and scoring n BooleanQuery clauses (potentially thousands of them), a single filter is built and then traversed for scoring. On the other hand, constructing and scoring with a BooleanQuery containing only a few clauses is likely to be much faster than constructing and traversing a Filter.
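The constant-score idea can be sketched the same way. The following is a toy model under stated assumptions, not Lucene's implementation: the postings map, buildFilter, and score are hypothetical names, and a java.util.BitSet stands in for the Filter.

```java
import java.util.BitSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

class ConstantScoreSketch {
    // Collect every document containing any term with the prefix into one bit set (the "filter").
    static BitSet buildFilter(TreeMap<String, int[]> postings, String prefix) {
        BitSet matches = new BitSet();
        for (Map.Entry<String, int[]> entry : postings.tailMap(prefix, true).entrySet()) {
            if (!entry.getKey().startsWith(prefix)) {
                break;
            }
            for (int docId : entry.getValue()) {
                matches.set(docId);
            }
        }
        return matches;
    }

    // Walk the filter once and give every matching document the same score: the boost.
    static Map<Integer, Float> score(BitSet filter, float boost) {
        Map<Integer, Float> scores = new LinkedHashMap<>();
        for (int doc = filter.nextSetBit(0); doc >= 0; doc = filter.nextSetBit(doc + 1)) {
            scores.put(doc, boost);
        }
        return scores;
    }

    public static void main(String[] args) {
        // term -> ids of the documents that contain it (illustrative data)
        TreeMap<String, int[]> postings = new TreeMap<>();
        postings.put("lucene", new int[] {1});
        postings.put("solar", new int[] {0, 2});
        postings.put("solr", new int[] {2, 5});

        BitSet filter = buildFilter(postings, "sol");
        System.out.println(filter);              // {0, 2, 5}
        System.out.println(score(filter, 2.0f)); // {0=2.0, 2=2.0, 5=2.0}
    }
}
```

However many terms match, the result is one bit set, so there is no clause count to overflow. The matched terms themselves are thrown away along the way, and they are exactly what a highlighter needs.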
Depending on the characteristics of your index, either method might be faster than the other, and in the longer term, we hope to have MultiTerm queries that use a heuristic to decide which method to use. The first likely step in this will be to produce an efficient ConstantScore BooleanQuery, as you don’t want scoring to magically change as you meet different heuristics.
In any case, by Solr 1.3, Solr had moved to using ConstantScoreQueries for range, wildcard, and prefix queries. This made things a bit nicer in regard to the large index installations out there (no more TooManyClauses exceptions to worry about), but there was a sacrifice – whereas all of these queries were highlight-able before, after this change, none of them were. The Lucene Highlighter has never been able to highlight ConstantScore queries, as it relies on the query being able to provide the terms it matches, or to rewrite to a query that can – ConstantScore queries do neither. Solr, of course, uses the Lucene Highlighter under the covers.
To get proper highlighting back into Solr, I had to go to the Lucene level and make all of the MultiTerm queries switchable between Boolean expansion and constant score mode. Until that time, a handful of the MultiTerm queries had a separate ConstantScoreQuery implementation (e.g., PrefixQuery and ConstantScorePrefixQuery), and a couple had no ConstantScore implementation at all. After changing things so that every MultiTerm query could be switched between Boolean and ConstantScore mode in a single Query object, I was able to modify the Lucene SpanScorer Highlighter so that it looks for a MultiTerm query instance as it walks the Query object, makes a copy of it, sets the copy to Boolean expansion mode, and rewrites it against a MemoryIndex containing just the doc to be highlighted. So while the query is applied to the index as a ConstantScore query, the Highlighter flips a copy of it to Boolean expansion mode and gets the terms from the rewritten query (which will be a highlight-able BooleanQuery). While the two modes score differently, they match the same terms. You will still have a possible TooManyClauses issue to contend with when using the Highlighter, but it is heavily mitigated because the query is rewritten only against the single doc being highlighted, not the entire index. It should not present the same performance issues either, as the query is never applied to the full index, just to a single document (and then only getSpans is called).
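The single-document rewrite described above can be illustrated with another toy sketch. It assumes a simple lowercase, split-on-non-word analysis and a prefix pattern, and it does not use the real Highlighter or MemoryIndex classes; expandAgainstDoc and highlight are hypothetical names.

```java
import java.util.TreeSet;

class SingleDocRewriteSketch {
    // "Rewrite" the prefix against the terms of ONE document only,
    // so the expansion can never be larger than that document's vocabulary.
    static TreeSet<String> expandAgainstDoc(String docText, String prefix) {
        TreeSet<String> matched = new TreeSet<>();
        for (String token : docText.toLowerCase().split("\\W+")) {
            if (!token.isEmpty() && token.startsWith(prefix)) {
                matched.add(token);
            }
        }
        return matched;
    }

    // Wrap every word whose normalized form is one of the expanded terms.
    static String highlight(String docText, String prefix) {
        TreeSet<String> terms = expandAgainstDoc(docText, prefix);
        StringBuilder out = new StringBuilder();
        for (String word : docText.split(" ")) {
            if (out.length() > 0) {
                out.append(' ');
            }
            String normalized = word.toLowerCase().replaceAll("\\W", "");
            out.append(terms.contains(normalized) ? "<em>" + word + "</em>" : word);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String doc = "Solr uses the Lucene Highlighter";
        System.out.println(expandAgainstDoc(doc, "luc")); // [lucene]
        System.out.println(highlight(doc, "luc"));        // Solr uses the <em>Lucene</em> Highlighter
    }
}
```

Because the expansion only ever sees the terms of the document being highlighted, the number of clauses is bounded by that document's vocabulary rather than by the index's.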
After making these changes in Lucene, I was able to push the fix into Solr. The fix requires that you use the PhraseHighlighter mode (hl.usePhraseHighlighter, which uses the Lucene SpanScorer) and, to ease backward-compatibility issues, that you explicitly tell the Highlighter to rewrite intelligently with the SpanScorer rather than once on the full index before highlighting. I called this option hl.highlightMultiTerm, and it only works in conjunction with hl.usePhraseHighlighter.
With both of these parameters set to true, you will be able to highlight wildcard queries once again in Solr 1.4, along with all of the other MultiTerm queries.
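As a rough illustration, the request parameters for such a query could be assembled like this. The query text and the field name "text" are made up for the example, and the class and method names are hypothetical. The parameters q, hl, and hl.fl are the usual Solr query and highlighting parameters, and the last two are the options discussed above.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

class HighlightParamsSketch {
    // URL-encode one value; UTF-8 is always available, so the checked exception is wrapped.
    static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    // Build the query string for a highlighted MultiTerm (here: wildcard) search.
    static String buildQueryString(String userQuery, String highlightField) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("q", userQuery);
        params.put("hl", "true");
        params.put("hl.fl", highlightField);
        params.put("hl.usePhraseHighlighter", "true"); // required by hl.highlightMultiTerm
        params.put("hl.highlightMultiTerm", "true");

        StringBuilder query = new StringBuilder();
        for (Map.Entry<String, String> param : params.entrySet()) {
            if (query.length() > 0) {
                query.append('&');
            }
            query.append(encode(param.getKey())).append('=').append(encode(param.getValue()));
        }
        return query.toString();
    }

    public static void main(String[] args) {
        // Illustrative query and field name; append the result to your own select URL.
        System.out.println(buildQueryString("text:sol*", "text"));
    }
}
```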