What Is the SpanQuery?
SpanQuerys allow for nested, positional restrictions when matching documents in Lucene. SpanQuerys are much like PhraseQuerys or MultiPhraseQuerys in that they all restrict term matches by position, but SpanQuerys can be much more expressive.
The basic SpanQuery units are the SpanTermQuery and the SpanNearQuery. A SpanTermQuery is the most basic SpanQuery: it simply lets you specify a field and term by passing in a Term (plus a boost), just like a TermQuery. SpanTermQuery is the basic building block for the combining SpanQuery classes, like SpanNearQuery.
A SpanNearQuery looks for a number of SpanQuerys within a given distance of each other. You can specify that the spans must come in the order given, or that order should not be considered. The clauses can be any number of SpanTermQuerys, other SpanNearQuerys, or one of the other SpanQuerys mentioned below. You can nest arbitrarily, e.g. SpanNearQuerys can contain other SpanNearQuerys that contain still other SpanNearQuerys, and so on.
Say we want to find lucene within 5 positions of doug, with doug following lucene (order matters). You could use the following SpanQuery:
new SpanNearQuery(
    new SpanQuery[] {
        new SpanTermQuery(new Term(FIELD, "lucene")),
        new SpanTermQuery(new Term(FIELD, "doug"))},
    5,     // distance allowed between the spans
    true); // order matters: doug must follow lucene
The SpanNearQuery constructor takes an array of SpanQuerys, the distance allowed between spans, and a boolean indicating whether order (as given by the order of the SpanQuery array) is required.
You can specify a similar query with a PhraseQuery, in that you can ask for lucene within 5 of doug, but you cannot precisely control order unless the terms are right next to each other (i.e. you specify a slop of zero). To find lucene within 5 of doug, you would have to raise the slop above 1, and when you do that you are allowing a larger edit distance to match. Edit distance does not limit by order and distance but by ‘term moves’ (note: Lucene does not use a classic edit distance, but an edit-distance-like algorithm). Edit distance is a bit less intuitive than a straight positional difference, but it’s enough to know that, with edit distance, you cannot restrict by order except when allowing 0 or 1 ‘term moves’ (slop, in Lucene lingo). Once you set a slop of 2, that’s enough to allow terms to start matching out of order.
Below is another SpanNearQuery example. This time we are looking for doug within 5 after lucene, and then hadoop within 4 after the lucene -> doug span.
SpanNearQuery spanNear = new SpanNearQuery(
    new SpanQuery[] {
        new SpanTermQuery(new Term(FIELD, "lucene")),
        new SpanTermQuery(new Term(FIELD, "doug"))},
    5,
    true);
new SpanNearQuery(
    new SpanQuery[] {
        spanNear,
        new SpanTermQuery(new Term(FIELD, "hadoop"))},
    4,
    true);
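To make "within n positions, in order" concrete for the simple two-term case, here is a rough plain-Java sketch that works on a list of tokens. It is only an illustration of the positional rule, not how Lucene implements SpanNearQuery, and the class and method names (OrderedNearSketch, orderedWithin) are made up for this example. It assumes the distance counts the tokens sitting between the two terms, so adjacent terms have a distance of 0.

```java
import java.util.Arrays;
import java.util.List;

/** Plain-Java illustration only; this is not how Lucene implements SpanNearQuery. */
final class OrderedNearSketch {

    /**
     * Returns true if some occurrence of `second` follows some occurrence of
     * `first` with at most `slop` other tokens between them.
     */
    static boolean orderedWithin(List<String> tokens, String first, String second, int slop) {
        for (int i = 0; i < tokens.size(); i++) {
            if (!tokens.get(i).equals(first)) {
                continue;
            }
            // Only look to the right of `first`: this is the in-order restriction.
            // The long arithmetic avoids overflow for very large slop values.
            int limit = (int) Math.min((long) tokens.size() - 1, (long) i + slop + 1);
            for (int j = i + 1; j <= limit; j++) {
                if (tokens.get(j).equals(second)) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> inOrder = Arrays.asList("lucene", "was", "started", "by", "doug");
        List<String> reversed = Arrays.asList("doug", "started", "lucene");
        System.out.println(orderedWithin(inOrder, "lucene", "doug", 5));  // true
        System.out.println(orderedWithin(reversed, "lucene", "doug", 5)); // false
    }
}
```

The second call returns false even though the terms are close together, which is exactly the order control that a sloppy PhraseQuery cannot give you.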
Not only can you nest SpanTermQuerys and SpanNearQuerys within SpanNearQuerys, but there are also a few other Span classes that can be used to combine and nest SpanQuerys:
SpanOrQuery: takes an array of SpanQuerys and will match if any of the underlying SpanQuerys match.
SpanNotQuery: takes two SpanQuerys as parameters, a SpanQuery to search for and a SpanQuery that prevents a match if the matching SpanQuery overlaps with it. This lets you do things like search for george within 10 of bush without spanning w.
SpanFirstQuery: lets you specify that a matching Span’s end position must come before a given position passed to the SpanFirstQuery. In other words, it allows you to search for Spans that start and end within the first n positions of the document.
In certain situations, it can be convenient to have a SpanAndQuery. You can easily simulate this using a SpanNearQuery with a distance of Integer.MAX_VALUE.
The standard Lucene QueryParser has never had a syntax to specify SpanQuerys. A new parser that extends the old one has been added that uses SpanQuerys to allow limited Lucene syntax within phrase queries, but while this parser generates SpanQuerys to allow this functionality, it doesn’t let you control or specify their creation. To actually harness the power of SpanQuerys, you either have to construct them manually in code, or check out one of the alternate QueryParsers in Lucene contrib. Check out a previous post on query parsers: Exploring Query Parsers. Surround, Xml-Query-Parser and Qsol all support Spans (note: Qsol cannot express the full range of SpanQuerys).
The SpanScorer (available in Solr as hl.usePhraseHighlighter) can be used to highlight Span queries, allowing for position-correct highlighting (the standard Highlighter just matches terms regardless of position). It even correctly highlights phrases by converting a PhraseQuery to a very similar SpanQuery.
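As a rough model of the SpanNotQuery and SpanFirstQuery descriptions above, the following plain-Java sketch treats a span as a pair of positions and applies the two checks directly. Everything here is illustrative: SpanFilterSketch, Span, endsWithinFirst and survivesExclusion are hypothetical names, not Lucene API, and the sketch assumes a span's end is one past its last position.

```java
/** Plain-Java model of span filtering; the names are illustrative, not Lucene API. */
final class SpanFilterSketch {

    /** A span covering positions [start, end): end is one past the last position (assumed). */
    static final class Span {
        final int start;
        final int end;

        Span(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    /** SpanFirstQuery-style check: the span must end within the first n positions. */
    static boolean endsWithinFirst(Span span, int n) {
        return span.end <= n;
    }

    /** SpanNotQuery-style check: keep `include` only if it does not overlap `exclude`. */
    static boolean survivesExclusion(Span include, Span exclude) {
        boolean overlaps = include.start < exclude.end && exclude.start < include.end;
        return !overlaps;
    }

    public static void main(String[] args) {
        // Tokens: george(0) w(1) bush(2); the george->bush span covers [0, 3).
        Span georgeBush = new Span(0, 3);
        Span w = new Span(1, 2);
        System.out.println(survivesExclusion(georgeBush, w)); // false: the span crosses "w"
        System.out.println(endsWithinFirst(georgeBush, 3));   // true
    }
}
```

In real code you would build these with the SpanNotQuery and SpanFirstQuery classes themselves; the sketch only shows what their restrictions amount to in terms of positions.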
Once people start seeing what terms get highlighted, they often have questions about how Spans match. Specifically: if I search for worda within n of wordb, why doesn’t every occurrence of worda within n of wordb get highlighted? It has to do with how SpanQuerys match. What does it mean to require that Spans come in order, and how do SpanQuerys actually match?
Consider the following query: (lucene within 3 of doug) within 0 of (was within 3 of cutting) [in order]
SpanNearQuery spanNear1 = new SpanNearQuery(
    new SpanQuery[] {
        new SpanTermQuery(new Term(FIELD, "lucene")),
        new SpanTermQuery(new Term(FIELD, "doug"))},
    3,
    true);
SpanNearQuery spanNear2 = new SpanNearQuery(
    new SpanQuery[] {
        new SpanTermQuery(new Term(FIELD, "was")),
        new SpanTermQuery(new Term(FIELD, "cutting"))},
    3,
    true);
new SpanNearQuery(
    new SpanQuery[] {
        spanNear1,
        spanNear2},
    0,
    true);
The above diagram shows how this query might match. You can see that, even though we asked for the spans to come in order and for the distance between spans to be 0, the match was->cutting overlaps with lucene->doug. This is because an in-order SpanNearQuery can match with a distance of 0 if the second Span starts anywhere from one position after the start of the first Span up to one position after its end. So using made instead of was would match, as would using by instead of was. If the second span started with cutting, it could still match, because doug and cutting have a distance of 0 between them. So any span starting anywhere from was to cutting is within 0 of the first lucene->doug span. Distance is measured from the end of span1 to the start of span2, but the order restriction only means that the start of span2 must come after the start of span1.
Another example: consider the text [cats and dogs and cats and cats]. You might first assume that if you used the Span Highlighter to highlight this for a query of cats within 10 of dogs (order doesn’t matter), every instance of cats and dogs would be highlighted. After all, each instance is within 10 of the others. Let’s see what actually happens, though:
You can see that the final cats is not highlighted. This is because Lucene defines Spans as non-overlapping, meaning every Span must start at least one position after the previous Span started. In the above example, there are two matching spans, cats->dogs and dogs->cats. For the final cats to be included in a Span, that Span would have to start at dogs. But a Span already goes from dogs to the cats that follows it, and there cannot be another span that begins at dogs. SpanQuerys do not do exhaustive matching, but if there is at least one match, they will find it.
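Both matching rules can be sketched in a few lines of plain Java. This is a simplified model that reproduces the behaviour described above, not Lucene's actual matching code, and the names (SpanMatchSketch, orderedNear, unorderedSpans) are hypothetical. The first example assumes a token layout of lucene(0) was(1) made(2) by(3) doug(4) cutting(5), which is a guess based on the words mentioned, and it treats a span's end as one past its last position.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Simplified model of the two matching rules described above; not Lucene's code. */
final class SpanMatchSketch {

    /**
     * In-order rule: span2 must start after span1 starts, and the gap from the end of
     * span1 to the start of span2 must not exceed slop. Ends are exclusive, so an
     * overlapping span2 has a gap of 0.
     */
    static boolean orderedNear(int start1, int end1, int start2, int slop) {
        int gap = Math.max(0, start2 - end1);
        return start2 > start1 && gap <= slop;
    }

    /**
     * One-span-per-start rule for an unordered query on two different terms. Each
     * occurrence of either term is paired with the nearest later occurrence of the
     * other term. Returns {start, lastPosition} pairs.
     */
    static List<int[]> unorderedSpans(List<String> tokens, String a, String b, int slop) {
        List<int[]> spans = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            String here = tokens.get(i);
            if (!here.equals(a) && !here.equals(b)) {
                continue;
            }
            String other = here.equals(a) ? b : a;
            for (int j = i + 1; j < tokens.size() && j - i - 1 <= slop; j++) {
                if (tokens.get(j).equals(other)) {
                    spans.add(new int[] {i, j});
                    break; // only one span may begin at position i
                }
            }
        }
        return spans;
    }

    public static void main(String[] args) {
        // lucene->doug covers [0, 5); a second span starting at "was" begins at 1.
        System.out.println(orderedNear(0, 5, 1, 0)); // true: overlap counts as a gap of 0
        System.out.println(orderedNear(0, 5, 6, 0)); // false: one position too far

        List<String> text = Arrays.asList("cats", "and", "dogs", "and", "cats", "and", "cats");
        for (int[] span : unorderedSpans(text, "cats", "dogs", 10)) {
            System.out.println(span[0] + "->" + span[1]); // 0->2, then 2->4
        }
    }
}
```

The final cats sits at position 6 and falls inside neither span, which is why it is left unhighlighted even though it is well within 10 positions of dogs.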