From time to time, users on the Lucene mailing list ask a variant of the following question:

Given a term match in a document, what’s the best way to get a window of words around that match?

Getting a window of words around a match can be useful for a lot of things, including, to name a few:

  1. Highlighting (although I’d recommend using Lucene’s Highlighter package for that)
  2. Co-occurrence analysis
  3. Sentiment analysis
  4. Question Answering

Unfortunately, given how inverted indexes are structured, retrieving content around a match isn’t efficient without doing some extra work during indexing.  In Lucene, this “extra work” involves creating and storing Term Vectors with position and offset information.

Storing Term Vector info can be done by adding in the appropriate code during Field construction, as in the following indexing example where I create an index from a few dummy documents (complete code is at the bottom of this post):

    RAMDirectory ramDir = new RAMDirectory();
    //Index some made up content
    IndexWriter writer = new IndexWriter(ramDir, new StandardAnalyzer(), true, IndexWriter.MaxFieldLength.UNLIMITED);
    for (int i = 0; i < DOCS.length; i++){
      Document doc = new Document();
      Field id = new Field("id", "doc_" + i, Field.Store.YES, Field.Index.NOT_ANALYZED_NO_NORMS);
      doc.add(id);
      //Store both position and offset information
      Field text = new Field("content", DOCS[i], Field.Store.NO, Field.Index.ANALYZED, Field.TermVector.WITH_POSITIONS_OFFSETS);
      doc.add(text);
      writer.addDocument(doc);
    }
    writer.close();

Notice the use of the Field.TermVector.WITH_POSITIONS_OFFSETS when constructing the text Field.  This tells Lucene to store term vector information on a per document basis (in other words, not inverted) with both Position and Offset information.  (Due note, other storage options are available, see Field.TermVector.  Also note, storing Term Vectors will cost you in disk space.)

For completeness, the DOCS array looks like:

public static String [] DOCS = {
        "The quick red fox jumped over the lazy brown dogs.",
        "Mary had a little lamb whose fleece was white as snow.",
        "Moby Dick is a story of a whale and a man obsessed.",
        "The robber wore a black fleece jacket and a baseball cap.",
        "The English Springer Spaniel is the best of all dogs."
    };

Now that we have an index created, we need to do a search.  In our case, we need to do a position-based search as opposed to the more traditional document-based search.  In other words, it is not good enough to simply know whether a term is in a document or not (think TermQuery), we need to know where in the document the match occurred.  Lucene enables position-based search through a series of Query classes collectively known as Span Queries.  (See SpanQuery and its derivitaves in the org.apache.lucene.search.spans package.)

Again, an example is warranted.  Assume we wanted to find where the term “fleece” occurs.  In this case, let’s start by doing a “normal” search, wherein we submit a query to the index and print out the Dcoument id and Score:

    IndexSearcher searcher = new IndexSearcher(ramDir);
    // Do a search using SpanQuery
    SpanTermQuery fleeceQ = new SpanTermQuery(new Term("content", "fleece"));
    TopDocs results = searcher.search(fleeceQ, 10);
    for (int i = 0; i < results.scoreDocs.length; i++) {
      ScoreDoc scoreDoc = results.scoreDocs[i];
      System.out.println("Score Doc: " + scoreDoc);
    }

That code looks pretty much like any basic search code with the exception that I substituted in a SpanTermQuery for what is often a TermQuery.  In fact, so far this isn’t all that interesting and it is likely to be slower than the comparable TermQuery too.

What does make it interesting?  If you look at the SpanQuery API, you will notice a method called getSpans().  The getSpans() method provides positional information about where a match occurred.  Thus, to print out the positional information, one might do:

    Spans spans = fleeceQ.getSpans(searcher.getIndexReader());
    while (spans.next() == true){
      System.out.println("Doc: " + spans.doc() + " Start: " + spans.start() + " End: " + spans.end());
    }

First off, notice getting the Spans is completely independent of running the actual query. In fact, you need not run the query first. Second, the start and end values are the positions of the tokens, not the offsets.

Now, given the position information, the question becomes how to get only those tokens around the match. To answer that, we need a few things:

  1. The specification of a window in terms of positions.  For instance, I want the terms within two positions of the start and end of the span.
  2. A TermVectorMapper implementation that is aware of both the window and the position.  Think of a TermVectorMapper as the equivalent of a SAX parser for Lucene’s Term Vectors.  Basically, instead of assuming the data structure (like DOM) it provides call backs and let’s you, the programmer, decide on the data structures.  See the PositionBasedTermVectorMapper for a useful implementation.

As a quick hack (and it is by no means production quality), I created the following code that modifies the printing code above:

    WindowTermVectorMapper tvm = new WindowTermVectorMapper();
    int window = 2;//get the words within two of the match, inclusive of the boundaries
    while (spans.next() == true) {
      System.out.println("Doc: " + spans.doc() + " Start: " + spans.start() + " End: " + spans.end());
      //build up the window
      tvm.start = spans.start() - window;
      tvm.end = spans.end() + window;
      reader.getTermFreqVector(spans.doc(), "content", tvm);
      for (WindowEntry entry : tvm.entries.values()) {
        System.out.println("Entry: " + entry);
      }
      //clear out the entries for the next round
      tvm.entries.clear();
    }

Now, in this chunk of code, I first create a WindowTermVectorMapper (WTVM, beautiful name, right?) and then in the Spans loop, I tell the WTVM what my window looks like. Next up, I ask Lucene’s IndexReader for the TermVector and pass in my TermVectorMapper.  Finally, I print out the entries.

Of course, the last bit of useful info is what does the WTVM look like.  Here’s the most useful snippet of code:

public void map(String term, int frequency, TermVectorOffsetInfo[] offsets, int[] positions) {
    for (int i = 0; i < positions.length; i++) {//unfortunately, we still have to loop over the positions //we'll make this inclusive of the boundaries if (positions[i] >= start && positions[i] < end){
        WindowEntry entry = entries.get(term);
        if (entry == null) {
          entry = new WindowEntry(term);
          entries.put(term, entry);
        }
        entry.positions.add(positions[i]);
      }
    }
  }

As you can see, I just look at the positions and check to see if the current term has an entry that is inside the start and end. Obviously, you can do more interesting things here, but I’ll leave that up to you. Also know that there are a few TermVectorMapper implementations in the Lucene distribution that you can use as examples.

That about wraps it up.  From here, one can easily imagine different ways to utilize the information returned from the Term Vector Mapper to process information about the terms in a window.

The full code is below. It is intended for demonstration purposes only. Please note the disclaimers, etc.

package com.lucidimagination.noodles;
/**
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.TermVectorMapper;
import org.apache.lucene.index.TermVectorOffsetInfo;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.spans.SpanTermQuery;
import org.apache.lucene.search.spans.Spans;

import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.ArrayList;

/**
 *  This class is for demonstration purposes only.  No warranty, guarantee, etc. is implied.
 *
 * This is not production quality code!
 *
 *
 **/
public class TermVectorFun {
  public static String[] DOCS = {
          "The quick red fox jumped over the lazy brown dogs.",
          "Mary had a little lamb whose fleece was white as snow.",
          "Moby Dick is a story of a whale and a man obsessed.",
          "The robber wore a black fleece jacket and a baseball cap.",
          "The English Springer Spaniel is the best of all dogs."
  };

  public static void main(String[] args) throws IOException {
    RAMDirectory ramDir = new RAMDirectory();
    //Index some made up content
    IndexWriter writer = new IndexWriter(ramDir, new StandardAnalyzer(), true, IndexWriter.MaxFieldLength.UNLIMITED);
    for (int i = 0; i < DOCS.length; i++) {
      Document doc = new Document();
      Field id = new Field("id", "doc_" + i, Field.Store.YES, Field.Index.NOT_ANALYZED_NO_NORMS);
      doc.add(id);
      //Store both position and offset information
      Field text = new Field("content", DOCS[i], Field.Store.NO, Field.Index.ANALYZED, Field.TermVector.WITH_POSITIONS_OFFSETS);
      doc.add(text);
      writer.addDocument(doc);
    }
    writer.close();
    //Get a searcher
    IndexSearcher searcher = new IndexSearcher(ramDir);
    // Do a search using SpanQuery
    SpanTermQuery fleeceQ = new SpanTermQuery(new Term("content", "fleece"));
    TopDocs results = searcher.search(fleeceQ, 10);
    for (int i = 0; i < results.scoreDocs.length; i++) {
      ScoreDoc scoreDoc = results.scoreDocs[i];
      System.out.println("Score Doc: " + scoreDoc);
    }
    IndexReader reader = searcher.getIndexReader();
    Spans spans = fleeceQ.getSpans(reader);
    WindowTermVectorMapper tvm = new WindowTermVectorMapper();
    int window = 2;//get the words within two of the match
    while (spans.next() == true) {
      System.out.println("Doc: " + spans.doc() + " Start: " + spans.start() + " End: " + spans.end());
      //build up the window
      tvm.start = spans.start() - window;
      tvm.end = spans.end() + window;
      reader.getTermFreqVector(spans.doc(), "content", tvm);
      for (WindowEntry entry : tvm.entries.values()) {
        System.out.println("Entry: " + entry);
      }
      //clear out the entries for the next round
      tvm.entries.clear();
    }
  }

}

//Not thread-safe
class WindowTermVectorMapper extends TermVectorMapper {

  int start;
  int end;
  LinkedHashMap entries = new LinkedHashMap();

  public void map(String term, int frequency, TermVectorOffsetInfo[] offsets, int[] positions) {
    for (int i = 0; i < positions.length; i++) {//unfortunately, we still have to loop over the positions //we'll make this inclusive of the boundaries if (positions[i] >= start && positions[i] < end){
        WindowEntry entry = entries.get(term);
        if (entry == null) {
          entry = new WindowEntry(term);
          entries.put(term, entry);
        }
        entry.positions.add(positions[i]);
      }
    }
  }

  public void setExpectations(String field, int numTerms, boolean storeOffsets, boolean storePositions) {
    // do nothing for this example
    //See also the PositionBasedTermVectorMapper.
  }

}

class WindowEntry{
  String term;
  List positions = new ArrayList();//a term could appear more than once w/in a position

  WindowEntry(String term) {
    this.term = term;
  }

  @Override
  public String toString() {
    return "WindowEntry{" +
            "term='" + term + ''' +
            ", positions=" + positions +
            '}';
  }
}