Mark Miller recently posted a brief intro to Span Queries, so I thought I would piggyback on top of his work and show how to get started with Payloads (see also [1]).

Introduction

Like Spans, payloads involve the position of terms, but go one step further.  Namely, a Payload in Apache Lucene is an arbitrary byte array stored at a specific position (i.e. a specific token/term) in the index.  A payload can be used to store weights for specific terms, or things like part of speech tags or other semantic information.  If you read Brin and Page’s (you know, the Google guys) original paper, The Anatomy of a Large-Scale Hypertextual Web Search Engine, you’ll see they describe what is essentially payload functionality, whereby they store information about font, etc. at a specific position in the index (remember when you could get your pages ranked number one by using really big fonts?) and then utilize it during search.

There are three parts to taking advantage of payloads in Lucene.  Solr requires an additional step, which I will explain in a moment.

  1. Add a Payload to one or more Tokens during indexing.
  2. Override the Similarity class to handle scoring payloads.
  3. Use a payload-aware Query during your search.

For Solr, step 3 requires you to have your own Query Parser, as none of the existing Solr Query Parsers support the BoostingTermQuery.  Thus, the additional step for Solr is to add a Query Parser that supports payloads (and Spans would be nice, too!  Please donate if you do this!)
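
To make that concrete, here is a rough, untested sketch of what such a plugin might look like.  It assumes the Solr 1.4 QParserPlugin API; the class name and the hard-coded "body" field are mine:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.payloads.BoostingTermQuery;
import org.apache.solr.common.params.SolrParams;
import org.apache.solr.common.util.NamedList;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.search.QParser;
import org.apache.solr.search.QParserPlugin;

public class PayloadQParserPlugin extends QParserPlugin {
  public void init(NamedList args) {
  }

  public QParser createParser(String qstr, SolrParams localParams,
                              SolrParams params, SolrQueryRequest req) {
    return new QParser(qstr, localParams, params, req) {
      public Query parse() {
        //Simplification: treat the whole query string as a single term in the
        //hard-coded "body" field; a real parser would honor the field params
        return new BoostingTermQuery(new Term("body", qstr));
      }
    };
  }
}

You would then register it in solrconfig.xml (something like <queryParser name="payload" class="PayloadQParserPlugin"/>) and invoke it via the {!payload} local params syntax.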

Adding Payloads During Indexing

(I’m using Lucene 2.9-dev)

I’m going to use the same indexing code I did for my post on co-occurrence analysis, but with a few modifications.

First off, I’m going to change Analyzers to one of my own creation:

class PayloadAnalyzer extends Analyzer {
  private PayloadEncoder encoder;

  PayloadAnalyzer(PayloadEncoder encoder) {
    this.encoder = encoder;
  }

  public TokenStream tokenStream(String fieldName, Reader reader) {
    TokenStream result = new WhitespaceTokenizer(reader);
    result = new LowerCaseFilter(result);
    //Parse "token|payload" markup and attach the encoded payload to each token
    result = new DelimitedPayloadTokenFilter(result, '|', encoder);
    return result;
  }
}
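
Since the payloads in this example are floats, constructing the analyzer is just a matter of handing it the FloatEncoder that ships with Lucene, exactly as the full listing in [3] does:

PayloadEncoder encoder = new FloatEncoder();
IndexWriter writer = new IndexWriter(dir, new PayloadAnalyzer(encoder), true, IndexWriter.MaxFieldLength.UNLIMITED);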

In this Analyzer, I have the basic whitespace tokenizer and a lower case filter, but then I add in the recently added DelimitedPayloadTokenFilter (DPTF). The DPTF allows you to add payloads to tokens simply by marking up the tokens with a special character followed by the payload value. For instance, I changed my sample docs from the co-occurrence example to include payload information.  Specifically, I weighted all nouns by 10, all verbs by 5 and all adjectives by 2 (I used http://l2r.cs.uiuc.edu/~cogcomp/pos_demo.php to tag the sentences; any errors are likely mine.)  Everything else has no payload.   I also stripped all punctuation. My DOCS array now looks like:

public static String[] DOCS = {
          "The quick|2.0 red|2.0 fox|10.0 jumped|5.0 over the lazy|2.0 brown|2.0 dogs|10.0",
          "The quick red fox jumped over the lazy brown dogs",//no boosts
          "The quick|2.0 red|2.0 fox|10.0 jumped|5.0 over the old|2.0 brown|2.0 box|10.0",
          "Mary|10.0 had a little|2.0 lamb|10.0 whose fleece|10.0 was|5.0 white|2.0 as snow|10.0",
          "Mary had a little lamb whose fleece was white as snow",
          "Mary|10.0 takes on Wolf|10.0 Restoration|10.0 project|10.0 despite ties|10.0 to sheep|10.0 farming|10.0",
          "Mary|10.0 who lives|5.0 on a farm|10.0 is|5.0 happy|2.0 that she|10.0 takes|5.0 a walk|10.0 every day|10.0",
          "Moby|10.0 Dick|10.0 is|5.0 a story|10.0 of a whale|10.0 and a man|10.0 obsessed|10.0",
          "The robber|10.0 wore|5.0 a black|2.0 fleece|10.0 jacket|10.0 and a baseball|10.0 cap|10.0",
          "The English|10.0 Springer|10.0 Spaniel|10.0 is|5.0 the best|2.0 of all dogs|10.0"
  };

The DOCS array simply marks each noun, verb and adjective with a | (pipe symbol) and then a float indicating the boost. I also added some docs that have no boosts at all to demonstrate the differences at query time. The DPTF will then use this markup to encode the payloads using the PayloadEncoder. A PayloadEncoder is an interface that tells the DPTF how to convert the payload text to a byte array. Also note that Lucene’s org.apache.lucene.analysis.payloads package contains several other TokenFilters for adding payloads to a Token and, of course, you can write your own as well.  Furthermore, the PayloadHelper class can help encode/decode payloads for common types.
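
To give a feel for the PayloadEncoder interface, here is a minimal sketch of an encoder that stores integers instead of floats.  The class name is mine, and it assumes the two-method interface and the PayloadHelper utilities from Lucene 2.9; in practice the encoders that ship with Lucene cover the common cases:

import org.apache.lucene.analysis.payloads.PayloadEncoder;
import org.apache.lucene.analysis.payloads.PayloadHelper;
import org.apache.lucene.index.Payload;

class IntPayloadEncoder implements PayloadEncoder {
  public Payload encode(char[] buffer) {
    return encode(buffer, 0, buffer.length);
  }

  public Payload encode(char[] buffer, int offset, int length) {
    //Parse the text after the delimiter (e.g. the "3" in "fox|3") as an int
    //and encode it as 4 bytes
    int value = Integer.parseInt(new String(buffer, offset, length));
    return new Payload(PayloadHelper.encodeInt(value));
  }
}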

Overriding the Similarity Class

The next step, which should happen before indexing, is to override the Similarity class to handle payloads.  While it isn’t strictly required that this happen before indexing in THIS case, it is a good habit to get into in case you have made other changes to the Similarity class that are required during indexing (such as overriding how norms are encoded.)

Overriding the Similarity is done on both the IndexWriter and the IndexSearcher.  See [3] below for the full code, including the calls to set the similarity. My Similarity implementation simply decodes the byte array back into a float and returns it, as in:

class PayloadSimilarity extends DefaultSimilarity {
  @Override
  public float scorePayload(String fieldName, byte[] bytes, int offset, int length) {
    return PayloadHelper.decodeFloat(bytes, offset);//we can ignore length here, because we know it is encoded as 4 bytes
  }
}
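
The similarity has to be set in both places; the relevant calls, taken from the full listing in [3], are:

writer.setSimilarity(payloadSimilarity);   //on the IndexWriter, before indexing
searcher.setSimilarity(payloadSimilarity); //on the IndexSearcher, before searching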

Executing the Query

Currently, Lucene has one payload-aware Query, the BoostingTermQuery (BTQ for short; see [2] for another payload-aware query that may make it into Lucene 2.9), which can be used just like any other query.  For instance:

IndexSearcher searcher = new IndexSearcher(dir, true);
searcher.setSimilarity(payloadSimilarity);//without this, payloads are ignored at scoring time
BoostingTermQuery btq = new BoostingTermQuery(new Term("body", "fox"));
TopDocs topDocs = searcher.search(btq, 10);
for (int i = 0; i < topDocs.scoreDocs.length; i++) {
  ScoreDoc doc = topDocs.scoreDocs[i];
  System.out.println("Doc: " + doc.toString());
  System.out.println("Explain: " + searcher.explain(btq, doc.doc));
}

In this example, I create the BTQ and hand it to the searcher and then print out the results.  Easy peasy, yet so powerful.

Running this yields:

-----------
Results for body:fox of type: org.apache.lucene.search.payloads.BoostingTermQuery
Doc: doc=0 score=4.2344446
Explain: 4.234444 = (MATCH) fieldWeight(body:fox in 0), product of:
  7.071068 = (MATCH) btq, product of:
    0.70710677 = tf(phraseFreq=0.5)
    10.0 = scorePayload(...)
  1.9162908 = idf(body: fox=3)
  0.3125 = fieldNorm(field=body, doc=0)

Doc: doc=2 score=4.2344446
Explain: 4.234444 = (MATCH) fieldWeight(body:fox in 2), product of:
  7.071068 = (MATCH) btq, product of:
    0.70710677 = tf(phraseFreq=0.5)
    10.0 = scorePayload(...)
  1.9162908 = idf(body: fox=3)
  0.3125 = fieldNorm(field=body, doc=2)

Doc: doc=1 score=0.42344445
Explain: 0.42344445 = (MATCH) fieldWeight(body:fox in 1), product of:
  0.70710677 = (MATCH) btq, product of:
    0.70710677 = tf(phraseFreq=0.5)
    1.0 = scorePayload(...)
  1.9162908 = idf(body: fox=3)
  0.3125 = fieldNorm(field=body, doc=1)

Notice how Doc 0 and Doc 2, which both contain the word “fox” in the body, come before Doc 1 even though all three have the same term frequency and field length.  The only difference is the payload score: 0.70710677 (tf) * 10.0 (scorePayload) = 7.071068 for the boosted docs, versus 0.70710677 * 1.0 = 0.70710677 for the unboosted one.

Running a simple TermQuery (which ignores payloads) with the exact same Term, on the other hand, yields:

-----------
Results for body:fox of type: org.apache.lucene.search.TermQuery
Doc: doc=0 score=0.59884083
Explain: 0.59884083 = (MATCH) fieldWeight(body:fox in 0), product of:
  1.0 = tf(termFreq(body:fox)=1)
  1.9162908 = idf(docFreq=3, numDocs=10)
  0.3125 = fieldNorm(field=body, doc=0)

Doc: doc=1 score=0.59884083
Explain: 0.59884083 = (MATCH) fieldWeight(body:fox in 1), product of:
  1.0 = tf(termFreq(body:fox)=1)
  1.9162908 = idf(docFreq=3, numDocs=10)
  0.3125 = fieldNorm(field=body, doc=1)

Doc: doc=2 score=0.59884083
Explain: 0.59884083 = (MATCH) fieldWeight(body:fox in 2), product of:
  1.0 = tf(termFreq(body:fox)=1)
  1.9162908 = idf(docFreq=3, numDocs=10)
  0.3125 = fieldNorm(field=body, doc=2)

As you can see, in the TermQuery case, all the docs are scored exactly the same.

Next Steps

As you can see from above, getting started with Payloads is pretty easy.  In reality, the only hard part is determining what exactly to put in your payload and how it should factor into your score.  Lucene takes care of the rest.  Tools like UIMA and OpenNLP, as well as various proprietary offerings, can often provide higher level lexical, syntactic and semantic information about tokens, giving you the power to create very expressive payloads and richer search applications.  The sketch below shows one way such information might be hooked in.
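
As a purely hypothetical sketch of that idea (the class name and the isNoun() helper are mine, and it assumes the attribute-based TokenStream API in Lucene 2.9), a TokenFilter that consults a part of speech tagger might look like:

import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.payloads.PayloadHelper;
import org.apache.lucene.analysis.tokenattributes.PayloadAttribute;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import org.apache.lucene.index.Payload;

class PosBoostTokenFilter extends TokenFilter {
  private final TermAttribute termAtt;
  private final PayloadAttribute payloadAtt;

  PosBoostTokenFilter(TokenStream input) {
    super(input);
    termAtt = addAttribute(TermAttribute.class);
    payloadAtt = addAttribute(PayloadAttribute.class);
  }

  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
      return false;
    }
    if (isNoun(termAtt.term())) {//boost nouns by 10, mirroring the DOCS markup above
      payloadAtt.setPayload(new Payload(PayloadHelper.encodeFloat(10.0f)));
    }
    return true;
  }

  private boolean isNoun(String term) {
    return false;//placeholder: wire in OpenNLP, UIMA or your tagger of choice here
  }
}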

[1] See Michael Busch’s talk at the last SF Meetup for more details on payloads: http://www.meetup.com/SFBay-Lucene-Solr-Meetup/files/

[2] https://issues.apache.org/jira/browse/LUCENE-1341

[3] Full class:

package com.lucidimagination.noodles;

import junit.framework.TestCase;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.payloads.DelimitedPayloadTokenFilter;
import org.apache.lucene.analysis.payloads.PayloadEncoder;
import org.apache.lucene.analysis.payloads.FloatEncoder;
import org.apache.lucene.analysis.payloads.PayloadHelper;
import org.apache.lucene.search.DefaultSimilarity;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.payloads.BoostingTermQuery;

import java.io.Reader;
import java.io.IOException;

/**
 * Demonstrates adding payloads at index time and scoring them
 * with a payload-aware query.
 **/
public class PayloadTest extends TestCase {
  Directory dir;

  public static String[] DOCS = {
          "The quick|2.0 red|2.0 fox|10.0 jumped|5.0 over the lazy|2.0 brown|2.0 dogs|10.0",
          "The quick red fox jumped over the lazy brown dogs",//no boosts
          "The quick|2.0 red|2.0 fox|10.0 jumped|5.0 over the old|2.0 brown|2.0 box|10.0",
          "Mary|10.0 had a little|2.0 lamb|10.0 whose fleece|10.0 was|5.0 white|2.0 as snow|10.0",
          "Mary had a little lamb whose fleece was white as snow",
          "Mary|10.0 takes on Wolf|10.0 Restoration|10.0 project|10.0 despite ties|10.0 to sheep|10.0 farming|10.0",
          "Mary|10.0 who lives|5.0 on a farm|10.0 is|5.0 happy|2.0 that she|10.0 takes|5.0 a walk|10.0 every day|10.0",
          "Moby|10.0 Dick|10.0 is|5.0 a story|10.0 of a whale|10.0 and a man|10.0 obsessed|10.0",
          "The robber|10.0 wore|5.0 a black|2.0 fleece|10.0 jacket|10.0 and a baseball|10.0 cap|10.0",
          "The English|10.0 Springer|10.0 Spaniel|10.0 is|5.0 the best|2.0 of all dogs|10.0"
  };
  protected PayloadSimilarity payloadSimilarity;

  @Override
  protected void setUp() throws Exception {
    dir = new RAMDirectory();

    PayloadEncoder encoder = new FloatEncoder();
    IndexWriter writer = new IndexWriter(dir, new PayloadAnalyzer(encoder), true, IndexWriter.MaxFieldLength.UNLIMITED);
    payloadSimilarity = new PayloadSimilarity();
    writer.setSimilarity(payloadSimilarity);
    for (int i = 0; i < DOCS.length; i++) {
      Document doc = new Document();
      Field id = new Field("id", "doc_" + i, Field.Store.YES, Field.Index.NOT_ANALYZED_NO_NORMS);
      doc.add(id);
      //Analyze the body field so the PayloadAnalyzer can attach a payload at each position
      Field text = new Field("body", DOCS[i], Field.Store.NO, Field.Index.ANALYZED);
      doc.add(text);
      writer.addDocument(doc);
    }
    writer.close();
  }

  public void testPayloads() throws Exception {
    IndexSearcher searcher = new IndexSearcher(dir, true);
    searcher.setSimilarity(payloadSimilarity);//set the similarity.  Very important
    BoostingTermQuery btq = new BoostingTermQuery(new Term("body", "fox"));
    TopDocs topDocs = searcher.search(btq, 10);
    printResults(searcher, btq, topDocs);

    TermQuery tq = new TermQuery(new Term("body", "fox"));
    topDocs = searcher.search(tq, 10);
    printResults(searcher, tq, topDocs);
  }

  private void printResults(IndexSearcher searcher, Query query, TopDocs topDocs) throws IOException {
    System.out.println("-----------");
    System.out.println("Results for " + query + " of type: " + query.getClass().getName());
    for (int i = 0; i < topDocs.scoreDocs.length; i++) {
      ScoreDoc doc = topDocs.scoreDocs[i];
      System.out.println("Doc: " + doc.toString());
      System.out.println("Explain: " + searcher.explain(query, doc.doc));
    }
  }

  class PayloadSimilarity extends DefaultSimilarity {
    @Override
    public float scorePayload(String fieldName, byte[] bytes, int offset, int length) {
      return PayloadHelper.decodeFloat(bytes, offset);//we can ignore length here, because we know it is encoded as 4 bytes
    }
  }

  class PayloadAnalyzer extends Analyzer {
    private PayloadEncoder encoder;

    PayloadAnalyzer(PayloadEncoder encoder) {
      this.encoder = encoder;
    }

    public TokenStream tokenStream(String fieldName, Reader reader) {
      TokenStream result = new WhitespaceTokenizer(reader);
      result = new LowerCaseFilter(result);
      //Parse "token|payload" markup and attach the encoded payload to each token
      result = new DelimitedPayloadTokenFilter(result, '|', encoder);
      return result;
    }
  }
}