# Why Not AND, OR, And NOT?

| By

Tags: #Query Parser

The following is written with Solr users in mind, but the principles apply to Lucene users as well.

I really dislike the so called “Boolean Operators” (“AND”, “OR”, and “NOT”) and generally discourage people from using them. It’s understandable that novice users may tend to think about the queries they want to run in those terms, but as you become more familiar with IR concepts in general, and what Solr specifically is capable of, I think it’s a good idea to try to “set aside childish things” and start thinking (and encouraging your users to think) in terms of the superior “Prefix Operators” (“+”, “-“).

## Background: Boolean Logic Makes For Terrible Scores

Boolean Algebra is (as my father would put it) “pretty neat stuff” and the world as we know it most certainly wouldn’t exist with out it. But when it comes to building a search engine, boolean logic tends to not be very helpful. Depending on how you look at it, boolean logic is all about truth values and/or set intersections. In either case, there is no concept of “relevancy” — either something is true or it’s false; either it is in a set, or it is not in the set.

When a user is looking for “all documents that contain the word ‘Alligator'” they aren’t going to very be happy if a search system applied simple boolean logic to just identify the unordered set of all matching documents. Instead algorithms like TF/IDF are used to try and identify the ordered list of matching documents, such that the “best” matches come first. Likewise, if a user is looking for “all documents that contain the words ‘Alligator’ or ‘Crocodile'”, a simple boolean logic union of the sets of documents from the individual queries would not generate results as good as a query that took into the TF/IDF scores of the documents for the individual queries, as well as considering which documents matches both queries. (The user is probably more interested in a document that discusses the similarities and differences between Alligators to Crocodiles then in documents that only mention one or the other a great many times).

This brings us to the crux of why I think it’s a bad idea to use the “Boolean Operators” in query strings: because it’s not how the underlying query structures actually work, and it’s not as expressive as the alternative for describing what you want.

## BooleanQuery: Great Class, Bad Name

To really understand why the boolean operators are inferior to the prefix operators, you have to start by considering the underlying implementation. The BooleanQuery class is probably one of the most misleading class names in the entire Lucene code base because it doesn’t model simple boolean logic query operations at all. The basic function of a BooleanQuery is:

1. A BooleanQuery consists of one or more BooleanClauses, each of which contains two pieces of information:

• A nested Query
• An Occur flag, which has one of three values
• `MUST` – indicating that documents must match this nested Query in order for the document to match the BooleanQuery, and the score from this subquery should contribute to the score for the BooleanQuery
• `MUST_NOT` – indicating that documents which match this nested Query are prohibited from matching the BooleanQuery
• `SHOULD` – indicating that documents which match this nested Query should have their score from the nested query contribute to the score from the BooleanQuery, but documents can be a match for the BooleanQuery even if they do not match the nested query
2. If a BooleanQuery contains no `MUST` BooleanClauses, then a document is only considered a match against the BooleanQuery if one or more of the `SHOULD` BooleanClauses is a match.
3. The final score of a document which matches a BooleanQuery is based on the sum of the scores from all the matching `MUST` and `SHOULD` BooleanClauses, multiplied by a “coord factor” based on the ratio of the number of matching BooleanClauses to the total number of BooleanClauses in the BooleanQuery.

These rules are not exactly simple to understand. They are certainly more complex then boolean logic truth tables, but that’s because they are more powerful. The examples below show how easy it is to implement “pure” boolean logic with BooleanQuery objects, but they only scratch the surface of what is possible with the BooleanQuery class:

• Conjunction: `(X ∧ Y)`

• Disjunction: `(X ∨ Y)`

• Negation: `(X ¬ Z)`

## Query Parser: Prefix Operators

In the Lucene QueryParser (and all of the other parsers that are based on it, like DisMax and EDisMax) the “prefix” operators “+” and “-” map directly to the Occur.MUST and Occur.MUST_NOT flags, while the absence of a prefix maps to the Occur.SHOULD flag by default. (If you have any suggestions for a one character prefix syntax that could be used to explicitly indicate Occur.SHOULD, please comment with your suggestions, I’ve been trying to think of a good one for years.) So using the prefix syntax, you can express all of the permutations that the BooleanQuery class supports — not just simple boolean logic:

• `(+X +Y)` … Conjunction, ie: (X ∧ Y)

• `(X Y)` … Disjunction, ie: (X ∨ Y)

• `(+X -Z)` … Negation, ie: (X ¬ Z)

• `((X Y) -Z)` … ((X ∨ Y) ¬ Z)

• `(X Y -Z)` … Not expressible in simple boolean logic

Note in particular the differences between the last two examples. `(X Y -Z)` is a single BooleanQuery object containing three clauses, while `((X Y) -Z)` is a BooleanQuery containing two clauses, one of which is a nested BooleanQuery containing two clauses. In both cases a document must match either “X” or “Y” and it can not match against “Z” (so the set of documents matched by each query will be the same) and in both cases the score of a document will be higher if it matches both the “X” and “Y” clauses; but because of the difference in their structures, the scores for these queries will be different for every document. In particular, the coord factor will cause documents matching only one of “X” or “Y” (but not both) to have extremely different scores from each of these queries. (This assumes that the DefaultSimilarity is being used; it would be possible to write a custom Similarity to force the scores to be equivalent)

## Query Parser: “Boolean Operators”

The query parser also supports the so called “boolean operators” which can also be used to express boolean logic, as demonstrated in these examples:

• `(X AND Y)` … Conjunction, ie: (X ∧ Y)

• `(X OR Y)` … Disjunction, ie: (X ∨ Y)

• `(X NOT Z)` … Negation, ie: (X ¬ Z)

• `((X AND Y) OR Z)` … ((X ∧ Y) ∨ Z)

• `((X OR Y) AND Z)` … ((X ∨ Y) ∧ Z)

• `(X AND (Y NOT Z))` … (X ∧ (Y ¬ Z))

Please note how import it is to use parentheses to combine multiple operators in order in order to generate queries that correctly model boolean logic. As mentioned before, the BooleanQuery class supports an arbitrary number of clauses, so `(X OR Y OR Z)` is a single BooleanQuery with three clauses — it is not equivalent to either `((X OR Y) OR Z)` or `(X OR (Y OR Z))` because those result in a BooleanQuery with two clauses, one of which is a nested BooleanQuery. As mentioned above when discussing the prefix operators, the scores from each of those queries will all be different depending on which clauses are matched by each document.

Things definitely get very confusing when these “boolean operators” are used in ways other then those described above. In some cases this is because the query parser is trying to be forgiving about “natural language” style usage of operators that many boolean logic systems would consider a parse error. In other cases, the behavior is bizarrely esoteric:

• Queries are parsed left to right
• `NOT` sets the Occurs flag of the clause to it’s right to `MUST_NOT`
• `AND` will change the Occurs flag of the clause to it’s left to `MUST` unless it has already been set to `MUST_NOT`
• `AND` sets the Occurs flag of the clause to it’s right to `MUST`
• If the default operator of the query parser has been set to “And”: `OR` will change the Occurs flag of the clause to it’s left to `SHOULD` unless it has already been set to `MUST_NOT`
• `OR` sets the Occurs flag of the clause to it’s right to `SHOULD`

Practically speaking this means that `NOT` takes precedence over `AND` which takes precedence over `OR` — but only if the default operator for the query parser has not been changed from the default (“Or”). If the default operator is set to “And” then the behavior is just plain weird.

## In Conclusion

I won’t try to defend or justify the way the query parser behaves when it encounters these “boolean operators”, because in many cases I don’t understand or agree with the behavior myself — but that’s not really the point of this article. My goal isn’t to convince you that the behavior of these operators makes sense, quite the contrary my goal today is to point out that regardless of how these operators are parsed, they aren’t a good representation of the underlying functionality available in the BooleanQuery class.

Do yourself a favor and start thinking about BooleanQuery as a container of arbitrary nested queries annotated with `MUST`, `MUST_NOT`, or `SHOULD` and discover the power that is available to you beyond simple boolean logic.

Many thanks to Bill Dueber whose recent related blog post reminded me that I had some draft notes on this subject floating around my laptop waiting to finished up and posted online.

Maria Vazquez

Great article!
The last one is very confusing… I usually write queries with prefix operators. Maybe for should you can use the prefix +- together :-)
Thanks!

hossman

anon: i’m struggling for a way to respond to your question in a way that wouldn’t just be a restatement of much of the article.

your question seems almost like you read a single line: “(X Y -Z) … Not expressible in simple boolean logic” w/o reading the following paragraph…

Note in particular the differences between the last two examples. (X Y -Z) is a single BooleanQuery object containing three clauses, while ((X Y) -Z) is a BooleanQuery containing two clauses, one of which is a nested BooleanQuery containing two clauses. In both cases a document must match either “X” or “Y” and it can not match against “Z” (so the set of documents matched by each query will be the same) and in both cases the score of a document will be higher if it matches both the “X” and “Y” clauses; but because of the difference in their structures, the scores for these queries will be different for every document. In particular, the coord factor will cause documents matching only one of “X” or “Y” (but not both) to have extremely different scores from each of these queries.

…i’m not sure what i can add to explain it more then that.

Zing

Grammar. “it’s right” -> “its right”.

Also, something I have never much liked about Lucene’s query syntax: If the default operator is set to AND, what is the syntax to get “SHOULD” behaviour in a query if you want it? :)

anon

Don’t resist the obvious! Sorry for skimming and commentong, but that line still seems confusing. The ranking differs, but sets are not ordered.

hossman

Zing: which is why i always advise people not to change the default op as well. as i mentioned “If you have any suggestions for a one character prefix syntax that could be used to explicitly indicate Occur.SHOULD, please comment with your suggestions, I’ve been trying to think of a good one for years”

anon: again, that’s a big point i tried to stress repeatedly in this article. Boolean Logic (and the boolean operators) make sense when all you care about is set membership — but good search behavior is about SOOOO much more then identifying sets. Order is *everything* in a good search experience (note the entire first section titled “Boolean Logic Makes For Terrible Scores”)

skamio

I think that the last line of the last example should be MUST. Am I wrong?

* (X AND (Y NOT Z)) … (X ∧ (Y ¬ Z))

q.add(inner, Occur.MUST_NOT); -> should be Occur.MUST?

hossman

skamio: you are most certainly correct. i’ve updated the post, thanks for pointing that out.

TimH

Very provocative article, and I’m experiencing the problems you describe with AND and OR. I’m using solrj in my application, though, and don’t understand how to develop and integrate a set of BooleanQuery objects into the SolrQuery. I CAN make changes to surround sets of fields with parentheses – is this the “answer” for users of solrj?

Thank you for your insights and guidance.

essential

Rather than simply saying boolean expressions are bad, if they are being used as a way to limit the set of documents to search (and not for any relevancy purposes) then shouldn’t they be moved to the fq parameter rather than the q?

hossman

TimH: the BooleanQuery code examples above are in reference to the underlying implementation in the Lucene code on the Solr server. In your SolrJ client you would construct strings representing the queries you want to execute using +/- or (if you really want to) the AND/OR syntax along with parentheses.

hossman

essential: whether you should use the “q” param or the “fq” param is a totally orthogonal question to the topic discussed in this post. You’ll note that at no point in the article do I refer to any specific Solr params. The advice and suggestions made here apply to any use of the lucene/dismax/edismax QParsers in Solr, regardless of which param they are used in (q, fq, bq, etc…)

TimH

hossman: I am a convert and am converting my Solrj query text to +/-. It appears from your examples that, using solr syntax, at most 2 operands can be grouped within parentheses to gain the relevancy I/we need. I have many situations wherein more than 2 operands need to be converted to SHOULD boolean clause flags in Lucene.

As an example, I might want “+(W X Y Z)”; my intent is to specify I MUST have at least one of those in my response. If only 2 operands are valid for SHOULD boolean clause flags in the underlying Lucene implementation, it seems I have to specify the query as “+(((W X) Y) Z)”.

I read the article you referenced from Bill Dueber and i came away with the impression that I can’t really have too many parentheses, but his contribution focused on the solr AND/OR syntax. I’m a bit confused here as to how that advice carries over to the +/- syntax under Solrj.

Thanks for your assistance, and I hope others benefit from this discussion. My relative inexperience with the toolset shows, I’m afraid.

hossman

TimH: I’m not sure where you got the impression that “at most 2 operands can be grouped within parentheses”. What I said was: “Please note how import it is to use parentheses to combine multiple operators in order in order to generate queries that correctly model boolean logic.” — if you aren’t trying to model boolean logic, then you don’t need to use nested parens in that way.

If you goal is to say “I want a Query where at least one of the clauses W, X, Y, and Z must match, and the more that match the better the score should be” then a query for “W X Y Z” is a perfectly valid way of expressing that.

You only need to wrap the clauses in parens and put a “+” in front if you want to combine that with another query such that the the entire BooleanQuery is required, ie: “i want a Query where N is mandatory, and at least one of W, X, Y, and Z must match, and the more that match the better the score should be” => “+N +(W X Y Z)”

patrick

“(If you have any suggestions for a one character prefix syntax that could be used to explicitly indicate Occur.SHOULD, please comment with your suggestions, I’ve been trying to think of a good one for years.) ”

What about ~ as a character for the SHOULD operator ?

Swami

I would suggest using a ? operator (it gets used in regular expressions as well to specify optional inclusion) for Disjunction.