I have some Highlighter work that I keep meaning to finish up (basic support for highlighting ConstantScoreQuerys), and so I have the Highlighter on my mind…

The History…

The first Lucene Highlighter was written and contributed to Lucene by Mark Harwood, a longtime Lucene contrib Committer and PMC member. Mark came up with a nice, robust, extensible API and a handful of default implementations for the API. It was a very solid Highlighter implementation that has held up nicely in the face of a lot of complicated Analyzers and Filters. A variety of contributors have enriched the code over the years since then (squashing bugs and making improvements), and the Highlighter is currently fairly capable and heavily used.

The Lucene contrib Highlighter was created with a focus on generating text fragments. This allows you to easily generate ‘keywords in context’ type views (i.e., the results list from your favorite search engine). Eventually, the NullFragmenter was added, allowing you to highlight a full document as well (you could have used the API to write your own NullFragmenter before Lucene added it – one of the nice things about the Highlighter’s fairly pluggable API).

Scoring and Highlighting…

The Highlighter works with a TokenStream and a Query. A TokenStream is just what it sounds like: a stream of tokens (terms, if that's easier), possibly with additional metadata attached (position, offsets in the original text, etc). An Analyzer and a document create a TokenStream – apply the Analyzer to the document's text, and out pop the tokens. By comparing the tokens from the query with the tokens from the document, the Highlighter can identify which tokens should be highlighted (termFromDoc==termFromQuery? Highlight!). The Highlighter works by feeding tokens from the document one at a time to a Scorer, which assigns a score to each token. The QueryScorer assigns the score based on whether the token matches a token in the query. Fragments are then generated and scored based on the underlying token scores. Generally the token score might just be 0 or 1, but you can do gradient highlighting by expanding the range of the scores (if you pass an IndexReader to QueryScorer, it will use term index stats to modify the score). Finally, a pluggable Formatter implementation actually inserts the highlight text (using the score to decide what, if any, text to insert).
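
To make the token-scoring loop concrete, here is a minimal sketch in plain Java. It is a toy model and not the Lucene API: ToyHighlighter, scoreToken and format are hypothetical names standing in for the Scorer and Formatter roles, the "analyzer" is just a word regex, and fragment generation is left out entirely.

```java
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy model of the scoring loop; none of these names are Lucene classes.
class ToyHighlighter {

    // Stand-in for a Scorer: 1 if the lowercased token is a query term, else 0.
    static float scoreToken(String token, Set<String> queryTerms) {
        return queryTerms.contains(token.toLowerCase(Locale.ROOT)) ? 1.0f : 0.0f;
    }

    // Stand-in for a Formatter: wrap the token only when its score is positive.
    static String format(String token, float score) {
        return score > 0 ? "<B>" + token + "</B>" : token;
    }

    // Feed tokens one at a time to the scorer, using their offsets to copy
    // the untouched text that sits between tokens.
    static String highlight(String text, Set<String> queryTerms) {
        Matcher m = Pattern.compile("\\w+").matcher(text); // assumed tokenizer
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(text, last, m.start());
            out.append(format(m.group(), scoreToken(m.group(), queryTerms)));
            last = m.end();
        }
        out.append(text.substring(last));
        return out.toString();
    }

    public static void main(String[] args) {
        // prints: The <B>quick</B> brown <B>fox</B>
        System.out.println(highlight("The quick brown fox", Set.of("quick", "fox")));
    }
}
```

A gradient-style formatter would only need to pick a different tag or color depending on the score instead of testing for score > 0.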

Obtaining a TokenStream for a Document…

Unfortunately, the index does not store the TokenStream for a document, so when it's time to highlight, it's up to the user to get a valid TokenStream for the document text. Generally this means shoving the original text for the document through the Analyzer you used for indexing the document. However, if you stored term vectors in your index, the position and/or offset information can be used to reconstruct the TokenStream from info in the index. Especially for large documents, this can be much faster. The TokenSources class in the Highlighter package will build a TokenStream for you, using the best method based on whether term vectors are available or not.
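
The idea of rebuilding a token stream from term vector data can be sketched roughly as follows. This is a simplified model and not what TokenSources actually does internally: the "term vector" here is assumed to be a plain map from each term to its {position, startOffset, endOffset} occurrences, and ToyTermVector and Token are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Toy model: rebuild an ordered token list from per-term occurrence data.
class ToyTermVector {

    // One occurrence of a term in the document.
    static final class Token {
        final String term;
        final int position, startOffset, endOffset;

        Token(String term, int position, int startOffset, int endOffset) {
            this.term = term;
            this.position = position;
            this.startOffset = startOffset;
            this.endOffset = endOffset;
        }
    }

    // The vector is grouped by term, so flatten every occurrence
    // and sort back into document order.
    static List<Token> rebuild(Map<String, int[][]> termVector) {
        List<Token> tokens = new ArrayList<>();
        for (Map.Entry<String, int[][]> e : termVector.entrySet()) {
            for (int[] occ : e.getValue()) {
                tokens.add(new Token(e.getKey(), occ[0], occ[1], occ[2]));
            }
        }
        tokens.sort(Comparator.comparingInt((Token t) -> t.position)
                .thenComparingInt(t -> t.startOffset));
        return tokens;
    }

    // Convenience for inspection: the terms in stream order, space separated.
    static String terms(List<Token> tokens) {
        List<String> out = new ArrayList<>();
        for (Token t : tokens) out.add(t.term);
        return String.join(" ", out);
    }

    public static void main(String[] args) {
        // Occurrence data for the text "to be or not to be".
        Map<String, int[][]> vector = Map.of(
                "to", new int[][] {{0, 0, 2}, {4, 13, 15}},
                "be", new int[][] {{1, 3, 5}, {5, 16, 18}},
                "or", new int[][] {{2, 6, 8}},
                "not", new int[][] {{3, 9, 12}});
        System.out.println(terms(rebuild(vector))); // prints: to be or not to be
    }
}
```

No analysis of the original text is needed here, which is where the speedup on large documents comes from.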

SpanScorer – adding position sensitive highlighting…

A couple of years ago I became interested in adding positional support to the Highlighter. The QueryScorer implementation just checks that tokens from the query match tokens from the document, and it doesn’t take the position of the tokens into account. The result is that if you use a PhraseQuery, rather than just highlighting the phrase, each term in the phrase will be highlighted everywhere it occurs. Attempts had been made to support PhraseQuery highlighting in the past, but not in a way that took advantage of the current Highlighter framework, and not in a way that supported the other positional queries (SpanQuery, MultiPhraseQuery, etc). I wanted pretty much full Highlighting support, as well as all of the goodness that had been squeezed into the current Highlighter. The result of this desire was a new Scorer implementation called SpanScorer.

The new SpanScorer would put the TokenStream into a fast single doc MemoryIndex, convert the query to a SpanQuery approximation, and call getSpans on the MemoryIndex to get all of the position hits for the document. This info is then used in scoring to filter out query terms that match doc terms, but are not in the correct position. The SpanScorer now supports almost the entire range of Lucene queries, and is just as fast as the QueryScorer for Query clauses that are not position sensitive.
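
The difference between term-only and position-sensitive scoring can be illustrated with a small plain-Java sketch. It does not use MemoryIndex or getSpans; the contiguous phrase scan below is a deliberately naive stand-in for the span positions those would return, and ToyPhraseHighlighter and its methods are hypothetical names.

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Toy model: filter term matches by the positions where the phrase really occurs.
class ToyPhraseHighlighter {

    // Positions covered by an exact, contiguous occurrence of the phrase
    // (a naive stand-in for the position hits a span query would report).
    static Set<Integer> phrasePositions(List<String> docTokens, List<String> phrase) {
        Set<Integer> hits = new TreeSet<>();
        for (int start = 0; start + phrase.size() <= docTokens.size(); start++) {
            if (docTokens.subList(start, start + phrase.size()).equals(phrase)) {
                for (int i = 0; i < phrase.size(); i++) hits.add(start + i);
            }
        }
        return hits;
    }

    // positionSensitive=false highlights every matching term;
    // positionSensitive=true keeps only matches inside a phrase occurrence.
    static String highlight(List<String> docTokens, List<String> phrase, boolean positionSensitive) {
        Set<Integer> spans = phrasePositions(docTokens, phrase);
        StringBuilder out = new StringBuilder();
        for (int pos = 0; pos < docTokens.size(); pos++) {
            String tok = docTokens.get(pos);
            boolean termMatch = phrase.contains(tok);
            boolean hit = positionSensitive ? termMatch && spans.contains(pos) : termMatch;
            if (pos > 0) out.append(' ');
            out.append(hit ? "<B>" + tok + "</B>" : tok);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<String> doc = List.of("the", "quick", "fox", "met", "a", "quick", "brown", "fox");
        List<String> phrase = List.of("quick", "fox");
        // prints: the <B>quick</B> <B>fox</B> met a <B>quick</B> brown <B>fox</B>
        System.out.println(highlight(doc, phrase, false));
        // prints: the <B>quick</B> <B>fox</B> met a quick brown fox
        System.out.println(highlight(doc, phrase, true));
    }
}
```

The first call shows the QueryScorer-style behavior described earlier, where each phrase term lights up everywhere; the second shows the filtered result.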

Lucene added the SpanScorer in release 2.4 and Solr has also added support for the SpanScorer in 1.3. To take advantage of the SpanScorer in Lucene, just use the SpanScorer rather than the QueryScorer. You can enable the SpanScorer in Solr by passing hl.usePhraseHighlighter=true with your request.
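
For the Solr side, a request with phrase highlighting turned on might be assembled like this. The host, port, handler path and the field name "content" are assumptions for illustration; hl, hl.fl and hl.usePhraseHighlighter are the request parameters, and SolrHighlightUrl is a hypothetical helper.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Builds a Solr select URL with highlighting parameters (illustrative helper).
class SolrHighlightUrl {

    static String build(String baseUrl, String query, String field) {
        return baseUrl
                + "?q=" + URLEncoder.encode(query, StandardCharsets.UTF_8)
                + "&hl=true"                          // turn highlighting on
                + "&hl.fl=" + URLEncoder.encode(field, StandardCharsets.UTF_8)
                + "&hl.usePhraseHighlighter=true";    // position-sensitive highlighting
    }

    public static void main(String[] args) {
        // Assumed local Solr instance and field name.
        System.out.println(build("http://localhost:8983/solr/select", "\"quick fox\"", "content"));
    }
}
```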

Other Highlighter Implementations…

In Lucene JIRA there are a couple of other Highlighter implementations as well. The most interesting ideas you will find in them come from the two implementations that require term vectors to be stored for the documents you want to highlight. If you can enforce that requirement (something we don’t yet want to do with the default Highlighter), you can look at just the terms in the query, rather than at each of the terms in the document. This can be a very large win on large documents. The downsides are that it’s not easy (and it hasn’t been done that I know of) to highlight based on position (phrase/span queries), and the exposed API for custom hooks is less rich. And of course, you have to store term vectors to use these Highlighters.
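
A rough sketch of the query-term-driven approach, in plain Java: instead of walking every token in the document, look up only the query terms in the stored offsets and insert tags at those offsets. This is not the code from any of the JIRA implementations; the map of term to {startOffset, endOffset} pairs is an assumed stand-in for stored term vectors, and ToyVectorHighlighter is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy model: highlight by looking up query terms in stored offsets.
class ToyVectorHighlighter {

    static String highlight(String text, Map<String, int[][]> offsetsByTerm, Set<String> queryTerms) {
        // Collect offsets for the query terms only; the rest of the
        // document's terms are never visited.
        List<int[]> hits = new ArrayList<>();
        for (String term : queryTerms) {
            int[][] offsets = offsetsByTerm.get(term);
            if (offsets != null) {
                for (int[] o : offsets) hits.add(o);
            }
        }
        hits.sort(Comparator.comparingInt((int[] o) -> o[0]));

        // Splice tags into the original text at the collected offsets.
        StringBuilder out = new StringBuilder();
        int last = 0;
        for (int[] o : hits) {
            out.append(text, last, o[0]).append("<B>").append(text, o[0], o[1]).append("</B>");
            last = o[1];
        }
        return out.append(text.substring(last)).toString();
    }

    public static void main(String[] args) {
        String text = "to be or not to be";
        Map<String, int[][]> offsets = Map.of(
                "to", new int[][] {{0, 2}, {13, 15}},
                "be", new int[][] {{3, 5}, {16, 18}},
                "or", new int[][] {{6, 8}},
                "not", new int[][] {{9, 12}});
        // prints: to <B>be</B> or not to <B>be</B>
        System.out.println(highlight(text, offsets, Set.of("be")));
    }
}
```

The work done is proportional to the number of query-term occurrences rather than to the document length, which is why the win grows with document size.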

About Mark Miller
