Announcements, Blog, SearchHub, Technical Article

Boolean Operators for Solr Users

by Hoss
December 28, 2011

The following is written with Solr users in mind, but the principles apply to Lucene users as well.

EDITED: 2017-05-19 — Code samples and examples were updated to reflect Lucene 6.5 APIs and new default Similarity.

I really dislike the so called “Boolean Operators” (“AND”, “OR”, and “NOT”) and generally discourage people from using them. It’s understandable that novice users may tend to think about the queries they want to run in those terms, but as you become more familiar with IR concepts in general, and what Solr specifically is capable of, I think it’s a good idea to try to “set aside childish things” and start thinking (and encouraging your users to think) in terms of the superior “Prefix Operators” (“+”, “-“).

Background: Boolean Logic Makes For Terrible Scores

Boolean Algebra is (as my father would put it) “pretty neat stuff” and the world as we know it most certainly wouldn’t exist with out it. But when it comes to building a search engine, boolean logic tends to not be very helpful. Depending on how you look at it, boolean logic is all about truth values and/or set intersections. In either case, there is no concept of “relevancy” — either something is true or it’s false; either it is in a set, or it is not in the set.

When a user is looking for “all documents that contain the word ‘Alligator'” they aren’t going to very be happy if a search system applied simple boolean logic to just identify the unordered set of all matching documents. Instead algorithms like TF/IDF are used to try and identify the ordered list of matching documents, such that the “best” matches come first. Likewise, if a user is looking for “all documents that contain the words ‘Alligator’ or ‘Crocodile'”, a simple boolean logic union of the sets of documents from the individual queries would not generate results as good as a query that took into account the term & document statistics for the individual queries, as well as considering which documents matches both queries. (The user is probably more interested in a document that discusses the similarities and differences between Alligators to Crocodiles then in documents that only mention one or the other a great many times).

This brings us to the crux of why I think it’s a bad idea to use the “Boolean Operators” in query strings: because it’s not how the underlying query structures actually work, and it’s not as expressive as the alternative for describing what you want.

BooleanQuery: Great Class, Bad Name

To really understand why the boolean operators are inferior to the prefix operators, you have to start by considering the underlying implementation. The BooleanQuery class is probably one of the most misleading class names in the entire Lucene code base because it doesn’t model simple boolean logic query operations at all. The basic function of a BooleanQuery is:

A BooleanQuery consists of one or more BooleanClauses, each of which contains two pieces of information:
- A nested Query
- An Occur flag, which has one of four values
  - MUST / FILTER – indicating that documents must match this nested Query in order for the document to match the BooleanQuery. (The difference between them is that the score from MUST subqueries contributes to the final score for the BooleanQuery, but no scores from the FILTER subqueries are used)
  - MUST_NOT – indicating that documents which match this nested Query are prohibited from matching the BooleanQuery
  - SHOULD – indicating that documents which match this nested Query should have their score from the nested query contribute to the score from the BooleanQuery, but documents can be a match for the BooleanQuery even if they do not match the nested query
If a BooleanQuery contains no MUST BooleanClauses, then a document is only considered a match against the BooleanQuery if one or more of the SHOULD BooleanClauses is a match.
The final score of a document which matches a BooleanQuery is calculated primarily from the sum of the scores from all the matching MUST and SHOULD BooleanClauses.

These rules are not exactly simple to understand. They are certainly more complex then boolean logic truth tables, but that’s because they are more powerful. The examples below show how easy it is to implement “pure” boolean logic with BooleanQuery objects, but they only scratch the surface of what is possible with the BooleanQuery class:

Negation: (X ¬ Z)

BooleanQuery q = new BooleanQuery.Builder()
  .add(X, Occur.SHOULD)
  .add(Z, Occur.MUST_NOT)
  .build();

Disjunction: (X ∨ Y)

BooleanQuery q = new BooleanQuery.Builder()
  .add(X, Occur.SHOULD)
  .add(Y, Occur.SHOULD)
  .build();

Conjunction: (X ∧ Y)

BooleanQuery q = new BooleanQuery.Builder()
  .add(X, Occur.MUST)
  .add(Y, Occur.MUST)
  .build();

Query Parser: Prefix Operators

In the Lucene QueryParser (and all of the other parsers that are based on it, like DisMax and EDisMax) the “prefix” operators “+” and “-” map directly to the Occur.MUST and Occur.MUST_NOT flags, while the absence of a prefix maps to the Occur.SHOULD flag by default. (If you have any suggestions for a one character prefix syntax that could be used to explicitly indicate Occur.SHOULD, please comment with your suggestions, I’ve been trying to think of a good one for years.) So using the prefix syntax, you can express all of the permutations that the BooleanQuery class supports — not just simple boolean logic:

(+X -Z) … Negation, ie: (X ¬ Z)

BooleanQuery q = new BooleanQuery.Builder()
  .add(X, Occur.MUST)
  .add(Z, Occur.MUST_NOT)
  .build()

(+X +Y) … Conjunction, ie: (X ∧ Y)

BooleanQuery q = new BooleanQuery().Builder()
  .add(X, Occur.MUST)
  .add(Y, Occur.MUST)
  .build();

(X Y) … Disjunction, ie: (X ∨ Y)

BooleanQuery q = new BooleanQuery.Builder()
  .add(X, Occur.SHOULD)
  .add(Y, Occur.SHOULD)
  .build()

(X +Y) … Not expressible in simple boolean logic

BooleanQuery q = new BooleanQuery.Builder()
  .add(X, Occur.SHOULD)
  .add(Y, Occur.MUST)
  .build()

Note in particular the differences between the last three examples. (X +Y) requires that documents match Y, and if they also match X then their scores are increased. This is not the same as a simple Conjunction (which would require both subqueries match) or a Disjunction (which would match any document matching either subquery). The closest equivalent query in terms of pure boolean logic is simply Y — which matches the same unordered set of documents, but the scores will be very different.

Consider the table below, showing how some hypothetical documents would score against some hypothetical X and Y queries (for simplicity, we’re assuming X and Y are “constant scoring” queries unaffected by any term or document statistics), as well as the final scores produced by various BooleanQuery structures composed of X and Y. As you can see, there is no pure boolean logic equivalent to (X +Y) that also produces the same scores for each document….

Document	`X`	`Y`	`+X +Y` X ∧ Y	`X Y` X ∨ Y	`X +Y`
doc1: { X }	1.0	n/a	n/a	1.0	n/a
doc2: { Y }	n/a	1.0	n/a	1.0	1.0
doc3: { X Y }	1.0	1.0	2.0	2.0	2.0

Query Parser: “Boolean Operators”

The query parser also supports the so called “boolean operators” which can also be used to express boolean logic, as demonstrated in these examples:

(X AND Y) … Conjunction, ie: (X ∧ Y)

BooleanQuery q = new BooleanQuery.Builder()
  .add(X, Occur.MUST)
  .add(Y, Occur.MUST)
  .build();

(X OR Y) … Disjunction, ie: (X ∨ Y)

BooleanQuery q = new BooleanQuery.Builder()
  .add(X, Occur.SHOULD)
  .add(Y, Occur.SHOULD)
  .build();

(X NOT Z) … Negation, ie: (X ¬ Z)

BooleanQuery q = new BooleanQuery.Builder()
  .add(X, Occur.SHOULD)
  .add(Z, Occur.MUST_NOT)
  .build();

((X AND Y) OR Z) … ((X ∧ Y) ∨ Z)

BooleanQuery inner = new BooleanQuery.Builder()
  .add(X, Occur.MUST)
  .add(Y, Occur.MUST)
  .build();
BooleanQuery q = new BooleanQuery().Builder()
  .add(inner, Occur.SHOULD)
  .add(Z, Occur.SHOULD)
  .build();

((X OR Y) AND Z) … ((X ∨ Y) ∧ Z)

BooleanQuery inner = new BooleanQuery.Builder()
  .add(X, Occur.SHOULD)
  .add(Y, Occur.SHOULD)
  .build();
BooleanQuery q = new BooleanQuery.Builder()
  .add(inner, Occur.MUST)
  .add(Z, Occur.MUST)
  .build();

(X AND (Y NOT Z)) … (X ∧ (Y ¬ Z))

BooleanQuery inner = new BooleanQuery.Builder()
  .add(Y, Occur.MUST)
  .add(Z, Occur.MUST_NOT)
  .build();
BooleanQuery q = new BooleanQuery.Builder()
  .add(X, Occur.MUST)
  .add(inner, Occur.MUST)
  .build();

Please note that it is important to use parentheses to combine multiple operators in order in order to generate queries that correctly model boolean logic. As mentioned before, the BooleanQuery class supports one or more clauses, meaning that (X OR Y OR Z) will create a single BooleanQuery with three clauses — strictly speaking it is not equivalent to either ((X OR Y) OR Z) or (X OR (Y OR Z)) because those result in a BooleanQuery with two clauses, one of which is a nested BooleanQuery. While the scores of all three of those queries will typically be the same using Lucene’s default Similarity class, those queries are structurally different, and other usages (or other Similarity functions) may produce subtly different results.

Things definitely get very confusing when these “boolean operators” are used in ways other then those described above. In some cases this is because the query parser is trying to be forgiving about “natural language” style usage of operators that many boolean logic systems would consider a parse error. In other cases, the behavior is bizarrely esoteric:

Queries are parsed left to right
NOT sets the Occurs flag of the clause to it’s right to MUST_NOT
AND will change the Occurs flag of the clause to it’s left to MUST unless it has already been set to MUST_NOT
AND sets the Occurs flag of the clause to it’s right to MUST
If the default operator of the query parser has been set to “And”: OR will change the Occurs flag of the clause to it’s left to SHOULD unless it has already been set to MUST_NOT
OR sets the Occurs flag of the clause to it’s right to SHOULD

Practically speaking this means that NOT takes precedence over AND which takes precedence over OR — but only if the default operator for the query parser has not been changed from the default (“Or”). If the default operator is set to “And” then the behavior is just plain weird.

In Conclusion

I won’t try to defend or justify the way the query parser behaves when it encounters these “boolean operators”, because in many cases I don’t understand or agree with the behavior myself — but that’s not really the point of this article. My goal isn’t to convince you that the behavior of these operators makes sense, quite the contrary my goal today is to point out that regardless of how these operators are parsed, they aren’t a good representation of the underlying functionality available in the BooleanQuery class.

Do yourself a favor and start thinking about BooleanQuery as a container of arbitrary nested queries annotated with MUST, MUST_NOT, or SHOULD and discover the power that is available to you beyond simple boolean logic.

Many thanks to Bill Dueber whose recent related blog post reminded me that I had some draft notes on this subject floating around my laptop waiting to finished up and posted online.

About Hoss

LEARN MORE

Contact us today to learn how Lucidworks can help your team create powerful search and discovery applications for your customers and employees.

Fusion Platform Overview

Fusion Platform Pricing

AI Hub

Lucidworks Features and capabilities (all Included)

Product Discovery

Searchandising

Site Search

Workplace Search

Ingest Data and Capture Signals

Employee Search Experience

Customer Service and Case Resolution

AI and Large Language Models

Solutions

Commerce

Customer Service

Knowledge Management

Industries

Retail

Government and Public Sector

Healthcare

B2B Commerce and Distribution

B2B Manufacturing

Financial Services

EXPLORE OUR CONTENT

Ebooks & Reports

Blog

Videos

Press

Resources

About Lucidworks

Documentation

Careers

LucidAcademy

Contact Us

Technical Support