Date Math, NOW and filter queries


Tags: #Solr

Or “How to never re-use cached filter query results even though you meant to”:

Filter queries (“fq” clauses) are a means to restrict the number of documents that are considered for scoring. A common use of “fq” clauses is to restrict the dates of documents returned, things like “in the last day”, “in the last week” etc. You find this pattern often used in conjunction with faceting. Filter queries make use of a filterCache (see solrconfig.xml) to calculate the set of documents satisfying the query once and then re-use that result set. Often, using NOW in filter queries causes this caching to be useless. Here’s why.

Solr maintains a filterCache, where it stores the results of “fq” clauses. You can think of it as a map, where the key is the “fq” clause and the value is the set of documents satisfying that clause. I’m going to skip the details of how the document set (the “value” in this map) is stored, since this post is really concentrating on the key.

So, let’s say you have two filter queries (whether they’re in the same query or not is irrelevant), something like: “fq=category:books&fq=source:library”. There will be two entries in the filterCache, something like:

category:books => 1, 2, 5, 89…
source:library => 2, 5, 7, 45, 101…

All well and good so far. I’ll add one short diversion here. This bears on why it is often better to have several “fq” clauses than a single one. The same results could be obtained by “fq=category:books AND source:library”, but then the filter cache would look like:

category:books AND source:library => 2, 5…..

and an fq like “fq=category:books” would NOT re-use the cache entry, since the key is different (not to mention the result set). That said, a clause containing OR cannot be split into separate fq clauses, since separate fq clauses are set intersection (AND) operations. But enough of a diversion…
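To make the “map” picture concrete, here is a toy sketch in Python. It is not how Solr implements the filterCache: the tiny index, `docs_for` and `search` are all made up for illustration, and only simple field:value clauses are handled.

```python
# Toy model of a filter cache: key = the fq string, value = a set of doc ids.
# Illustration only; the index contents and function names are hypothetical.
INDEX = {
    1: {"category": "books", "source": "store"},
    2: {"category": "books", "source": "library"},
    5: {"category": "books", "source": "library"},
    7: {"category": "music", "source": "library"},
}

filter_cache = {}
cache_misses = []  # records every fq that had to be computed from scratch


def docs_for(fq):
    """Return the doc ids matching one 'field:value' fq, cached by the fq string."""
    if fq not in filter_cache:
        cache_misses.append(fq)
        field, value = fq.split(":", 1)
        filter_cache[fq] = {
            doc_id for doc_id, doc in INDEX.items() if doc[field] == value
        }
    return filter_cache[fq]


def search(fqs):
    """Separate fq clauses are intersected (ANDed) with each other."""
    result = set(INDEX)
    for fq in fqs:
        result &= docs_for(fq)
    return result


first = search(["category:books", "source:library"])
second = search(["category:books"])  # re-uses the entry cached by the first search
```

After both searches, `cache_misses` contains each fq string exactly once: the second search found “category:books” already in the map. Had the first search used a single combined clause, that key would not have matched.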

OK, you mentioned dates. Get to the point.

It’s common to have date ranges as filter queries, things like “in the last day”, “in the last week”, etc. And there’s the convenient date math to make this easy. So it’s tempting, very tempting to have filter clauses on date ranges like “fq=date:[NOW-1DAY TO NOW]”. Be careful when using NOW!

Here’s the problem. In the above example, the literal string date:[NOW-1DAY TO NOW] is not what’s used as the key for the fq in the filterCache; the expansion is. NOW is replaced with the current time, so the key into the filter cache looks something like “date:[2012-01-26T08:56:23Z TO 2012-01-27T08:56:23Z]”. Now the user adds a term to the “q” and re-submits the query 30 seconds after the first one. The fq clause now expands to something like “date:[2012-01-26T08:56:53Z TO 2012-01-27T08:56:53Z]”. Note that the seconds went from 23 to 53!

The key for this fq does not match the key for the first, even though it’s often the case that the intent is that submitting this kind of fq 30 seconds later would result in the same set of documents matching the filter. Bare NOW entries in filter clauses will pretty much guarantee that the cached result sets will never be reused.
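A small Python sketch shows the effect. It only mimics the expansion of “date:[NOW-1DAY TO NOW]” at seconds resolution; `expand_last_day` is a hypothetical helper, not a Solr API, and the timestamps are example values.

```python
from datetime import datetime, timedelta, timezone

FMT = "%Y-%m-%dT%H:%M:%SZ"


def expand_last_day(now):
    """Mimic the expansion of date:[NOW-1DAY TO NOW] for a given instant."""
    lower = now - timedelta(days=1)
    return f"date:[{lower.strftime(FMT)} TO {now.strftime(FMT)}]"


first_request = datetime(2012, 1, 27, 8, 56, 23, tzinfo=timezone.utc)
second_request = first_request + timedelta(seconds=30)

key_1 = expand_last_day(first_request)
key_2 = expand_last_day(second_request)

# The two requests were built from the same fq text, yet the cache keys differ.
keys_match = key_1 == key_2
```

Here `keys_match` is False, so the second request cannot re-use the entry cached by the first.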

Fine. What do you do to make it better?

Here’s where rounding makes sense. Using midnight can make sense from two perspectives.

  • The sense you often want is “anything with a timestamp in a particular day” (or month or year or hour or….). So basing the lower bound on a bare NOW (NOW-1DAY, say) would miss anything published between midnight and whenever the user happens to submit the query on the day (in this example) of the lower bound.
  • Re-using the filter cache can substantially speed up your queries, especially if you’re providing links like “in the last day”, “1-7 days ago” etc.

So your fq clauses start to look like “fq=date:[NOW/DAY-7DAYS TO NOW/DAY+1DAY]”. The thing to note about the date math “/” operator is that it is a “round down” operator. So let’s break this up a bit: NOW/DAY rounds down to midnight last night. -7DAYS subtracts 7 whole days. So the lower limit is really “midnight 7 days ago”. Similarly, NOW/DAY rounds to midnight last night and +1DAY moves that to midnight tonight for the upper limit. These clauses are invariant until after midnight tonight, so they will return the same results all day today. Only the first submission of this fq will incur the cost of figuring out which documents satisfy it; all the queries after the first will just read the cached result set from the filterCache. Of course the caches are invalidated if you update your index and/or a replication happens, but that’s always the case.
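The same breakdown can be sketched in Python. This mimics the arithmetic only, assuming rounding happens in UTC; `expand_rounded_week` is a hypothetical helper and the timestamps are example values.

```python
from datetime import datetime, timedelta, timezone

FMT = "%Y-%m-%dT%H:%M:%SZ"


def expand_rounded_week(now):
    """Mimic the expansion of date:[NOW/DAY-7DAYS TO NOW/DAY+1DAY] (UTC assumed)."""
    # NOW/DAY: round down to the most recent midnight.
    midnight = now.replace(hour=0, minute=0, second=0, microsecond=0)
    lower = midnight - timedelta(days=7)  # midnight 7 days ago
    upper = midnight + timedelta(days=1)  # midnight tonight
    return f"date:[{lower.strftime(FMT)} TO {upper.strftime(FMT)}]"


morning = datetime(2012, 1, 27, 8, 56, 23, tzinfo=timezone.utc)
evening = datetime(2012, 1, 27, 23, 59, 59, tzinfo=timezone.utc)
next_day = datetime(2012, 1, 28, 0, 0, 1, tzinfo=timezone.utc)

same_all_day = expand_rounded_week(morning) == expand_rounded_week(evening)
changes_after_midnight = expand_rounded_week(morning) != expand_rounded_week(next_day)
```

Both flags come out True: every request made on the 27th produces the same key, and the key only changes once midnight passes.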

You will note that there is a bit of “slop” here. If your index has dates in the future, you may get them too. Suppose you have a situation where your index contains documents you don’t want the users to see until it’s later than their timestamp. I actually have a hard time contriving an example here, but let’s just assume it’s the case. Also say it’s noon and your index contains timestamps on documents through midnight tonight. The above technique will show documents that will be officially published at, say, 15:00 even though it’s only 12:00 and you may not want that. In that case, you’ll have to use a bare NOW clause and live with the fact that your cache isn’t being used for these clauses. Like I said, this is contrived, but I mention it for completeness’ sake.

A couple of notes about dates:

Before I finish, a couple of notes about dates.

  • Use the coarsest dates you can and still satisfy your use case. This is especially true if you’re sorting by dates. The sorting resource requirements go up with the number of unique terms. So storing millisecond resolution when all you care about is the day can be wasteful. This is also true when faceting.
  • It’s often useful to index multiple fields with some date data, especially if you intend to facet.
  • The above examples in the 3.x code line have a slight problem when more than one adjoining range is required. The range operator “[]” is inclusive, so if you have a document indexed at exactly midnight in these examples, it might be included in two ranges. Trunk Solr (4.0) allows mixing inclusive “[]” and exclusive “{}” endpoints, so expressions like “date:[NOW/DAY-1DAY TO NOW/DAY+1DAY}” are possible.
  • An exercise for the reader: What are the consequences of using different kinds of rounding? E.g. NOW/5MIN, NOW/72MIN (does this even work?).
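The double-counting issue with adjoining inclusive ranges can be illustrated with a short Python sketch. It models two adjoining buckets with plain datetime comparisons rather than running Solr; the bucket names and the `in_range` helper are made up for the example.

```python
from datetime import datetime, timedelta, timezone


def in_range(ts, lower, upper, upper_inclusive):
    """Lower bound is inclusive '['; upper bound is inclusive ']' or exclusive '}'."""
    if upper_inclusive:
        return lower <= ts <= upper
    return lower <= ts < upper


midnight = datetime(2012, 1, 27, tzinfo=timezone.utc)  # stands in for NOW/DAY
buckets = {
    "1-7 days ago": (midnight - timedelta(days=7), midnight),
    "today": (midnight, midnight + timedelta(days=1)),
}

doc_ts = midnight  # a document indexed at exactly midnight


def matching(upper_inclusive):
    """Names of the buckets that contain doc_ts."""
    return [
        name
        for name, (lower, upper) in buckets.items()
        if in_range(doc_ts, lower, upper, upper_inclusive)
    ]


inclusive_hits = matching(True)   # [..] style ranges
half_open_hits = matching(False)  # [..} style ranges
```

With inclusive endpoints the midnight document lands in both buckets; with an exclusive upper endpoint it lands only in “today”.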

This is a great post! I ran into this when debugging search performance problems and found that some of my caches were missing all the time because of the use of NOW(). Rounding to avoid the constantly changing time fields tremendously helped cache hit rates. Cache stats via the admin console are your friend :-)


There is a legitimate case where documents in the index might be dated in the future, and this is if you are indexing web content that won’t be “live” until the future. This will often happen for certain announcements, promotions, alerts, etc. The content will be created and approved, but won’t be live to the end users until a pre-defined date.

You could also argue that the content shouldn’t be indexed until it’s live. So I suppose it depends on whether the logic should be in the custom crawler or the search.

Erick Erickson

True. As always it’s a tradeoff. Personally I prefer the “don’t index it until it’s live” approach for no other reason than I can’t goof up and show things that shouldn’t be shown if they’re not in the index in the first place :).

One can still use the technique I outlined but, say, on a finer granularity, maybe NOW/HOUR+1HOUR as the end date, or even NOW/MINUTE+10MINUTES which would make your fq caches useful at least for some time rather than never being re-used at all….

But I agree, there will be some situations where none of this is possible and you just have to live with not re-using this form of filter, in which case you should probably use the {!cache=false} syntax on your fq so you don’t waste cache space on never-reused filters. Just make the decision a conscious one!
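As a rough illustration of what such a request could look like, here is a Python sketch that builds the query string. The “q” value is a made-up example; only the {!cache=false} prefix and the date clause come from the discussion above.

```python
from urllib.parse import urlencode, parse_qsl

# Repeated "fq" keys are passed as a list of pairs so that each becomes its own parameter.
params = [
    ("q", "title:solr"),      # hypothetical main query
    ("fq", "category:books"),  # cacheable filter, stored in the filterCache as usual
    ("fq", "{!cache=false}date:[NOW-1DAY TO NOW]"),  # never re-used, so skip the cache
]

query_string = urlencode(params)

# Decoding the string gives back the original parameters, which confirms the encoding.
round_trip = parse_qsl(query_string)
```

The local-params prefix has to be URL-encoded along with the rest of the clause, which `urlencode` takes care of.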


” all the queries after the first will just read the cached result set from the filterCache. Of course the caches are invalidated if you update your index and/or a replication happens, but that’s always the case.”

Actually, if we have set the autowarming attribute, opening the new IndexSearcher will warm the caches (the filter query cache as well).
So also after a commit, we can still see the old results cached (but of course this is a normal process).
So choosing the granularity of our dates is still really important.

Erick Erickson

“So also after a commit, we can still see the old results cached”..

What you actually see is the results of re-executing the most recent filter queries, where how many count as “most recent” is set by autoWarmCount. This has the curious consequence that, if you don’t do the rounding, the result set for the “same” filter query would contain different results.

The long and short of it is, as you mention, that taking care with the granularity of dates is important.