Solr TTL (Time to Live) & Document Expiration
Learn about two features related to the “expiration” of documents which can be used individually, or in combination.
Lucene & Solr 4.8 were released last week and you can download Solr 4.8 from the Apache mirror network. Today I’d like to introduce you to a small but powerful feature I worked on for 4.8: Document Expiration.
The DocExpirationUpdateProcessorFactory provides two features related to the “expiration” of documents, which can be used individually or in combination:
- Periodically deleting documents from the index based on an expiration field
- Computing expiration field values for documents from a “time to live” (TTL)
Auto-Delete Expired Documents
The biggest aspect of this Update Processor is its ability to automatically delete documents based on the values found in an “expiration date” field that you configure. This automatic deletion isn’t part of the normal Update Processor life cycle; it’s executed via a background timer thread created by the Factory.
To use this automatic deletion feature, you must configure two options on the Factory:
- expirationFieldName – The name of the expiration date field to use
- autoDeletePeriodSeconds – How often the factory’s timer should trigger a delete to remove expired documents
For example, with the configuration below, the DocExpirationUpdateProcessorFactory will create a timer thread that wakes up every 30 seconds. When the timer triggers, it will execute a deleteByQuery command to remove any documents with a value in the press_release_expiration_date field that is in the past:
<processor class="solr.processor.DocExpirationUpdateProcessorFactory">
  <int name="autoDeletePeriodSeconds">30</int>
  <str name="expirationFieldName">press_release_expiration_date</str>
</processor>
After the deleteByQuery has been executed, a soft commit is also executed using openSearcher=true so that search results will no longer see the expired documents.
While the basic logic of “timer goes off, delete docs with expiration prior to NOW” was fairly simple and straightforward to add, a key aspect of making this work well was a related issue (SOLR-5783) ensuring that the openSearcher=true soft commit doesn’t do anything unless there really are changes in the index. This means that you can configure autoDeletePeriodSeconds to very small values and still rest easy that your search caches won’t get blown away every few seconds for no reason: the openSearcher=true soft commits will only affect things if there really are changes in the index.
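The expiration field itself is just an ordinary stored date field in your schema. A minimal schema.xml sketch for the field used in the configuration above (the type name "date" is an assumption; use whatever date field type your schema already defines):

```xml
<!-- Holds each document's expiration date; the auto-delete timer's
     deleteByQuery matches any doc whose value here is in the past. -->
<field name="press_release_expiration_date" type="date"
       indexed="true" stored="true"/>
```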
Compute Expiration Date from TTL
The second feature implemented by this Factory (and the key reason it’s implemented as an UpdateProcessorFactory) is to use “TTL” (Time To Live) values associated with documents to automatically generate an expiration date value to put in the expirationFieldName when documents are indexed.
By default, the DocExpirationUpdateProcessorFactory will look for a _ttl_ request parameter on update requests, as well as a _ttl_ field in each doc that is indexed in that request. If either exists, it will be parsed as a Date Math Expression relative to NOW and used to populate the expirationFieldName. Per-document _ttl_ field values override the per-request _ttl_ parameter.
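To illustrate that precedence, suppose an update request is posted with _ttl_=+1HOUR as a request parameter (using the default parameter name) and contains the two documents below. The first doc would get an expiration of NOW+1HOUR from the request parameter, while the second doc’s own _ttl_ field would override that with NOW+10MINUTES. (The ids and TTL values here are purely illustrative.)

```json
[
  { "id" : "uses_request_ttl" },
  { "id" : "uses_own_ttl", "_ttl_" : "+10MINUTES" }
]
```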
Both the request parameter and field names used for specifying TTL values can be overridden by configuring ttlParamName and ttlFieldName on the DocExpirationUpdateProcessorFactory. They can also be completely disabled by configuring them as null. It’s also possible to use the TTL computation feature to generate expiration dates on documents without using the auto-deletion feature, simply by not configuring the autoDeletePeriodSeconds option (so that the timer will never run).
For example, in the configuration below, the Factory will look for a time_to_live field in each document and use that to compute an expiration value for the press_release_expiration_date field. No request parameters will be checked for a TTL override, and no automatic deletion will occur:
<processor class="solr.processor.DocExpirationUpdateProcessorFactory">
  <str name="expirationFieldName">press_release_expiration_date</str>
  <null name="ttlParamName"/> <!-- ignore _ttl_ request param -->
  <str name="ttlFieldName">time_to_live</str>
  <!-- NOTE: autoDeletePeriodSeconds not specified, no automatic deletes -->
</processor>
This sort of configuration may be handy if you only want to logically hide documents from some search clients based on a per-document TTL using something like fq=-press_release_expiration_date:[* TO NOW/DAY], but still retain the documents in the index for other search clients.
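One way to wire up that kind of logical hiding is a dedicated request handler that always appends the filter, so clients of that handler never see expired documents while other handlers still search the full index. A solrconfig.xml sketch (the handler name /public-search is hypothetical):

```xml
<!-- Hypothetical handler for clients that should never see expired docs;
     the appended fq excludes anything whose expiration is before today. -->
<requestHandler name="/public-search" class="solr.SearchHandler">
  <lst name="appends">
    <str name="fq">-press_release_expiration_date:[* TO NOW/DAY]</str>
  </lst>
</requestHandler>
```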
An In Depth Example
Let’s walk through a full example of both features by modifying the Solr 4.8 example solrconfig.xml to add the following update processor chain:
<updateRequestProcessorChain default="true">
  <processor class="solr.TimestampUpdateProcessorFactory">
    <str name="fieldName">timestamp_dt</str>
  </processor>
  <processor class="solr.processor.DocExpirationUpdateProcessorFactory">
    <int name="autoDeletePeriodSeconds">30</int>
    <str name="ttlFieldName">time_to_live_s</str>
    <str name="expirationFieldName">expire_at_dt</str>
  </processor>
  <processor class="solr.FirstFieldValueUpdateProcessorFactory">
    <str name="fieldName">expire_at_dt</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
A few things to note about this chain:
- It contains a simple TimestampUpdateProcessorFactory so that it will be easy to see when these documents were indexed in the query results I show below, but this is not needed for DocExpirationUpdateProcessorFactory to function
- The DocExpirationUpdateProcessorFactory instance uses an autoDeletePeriodSeconds of 30 seconds and overrides the ttlFieldName, but the _ttl_ request param is still enabled
- A FirstFieldValueUpdateProcessorFactory is configured on the expire_at_dt field; this means that if a document is added with an explicit value in the expire_at_dt field, it will be used instead of any value that might be added by the DocExpirationUpdateProcessorFactory using the _ttl_ request param
With this configuration in place, let’s start indexing some docs, and executing some queries.
First up, we’ll add 2 documents: one without any sort of TTL or expiration date, and one using a TTL field of 2 minutes.
$ date -u && curl -X POST -H 'Content-Type: application/json' \
    'http://localhost:8983/solr/collection1/update?commit=true' -d '[
  { "id" : "live_forever" },
  { "id" : "live_2_minutes_a", "time_to_live_s" : "+120SECONDS" }
]'
Wed May  7 22:17:46 UTC 2014
{"responseHeader":{"status":0,"QTime":574}}
A few seconds later, we’ll index 3 more documents using a “default” TTL request param of 5 minutes: one with an explicit expiration date far in the future; another doc with a 2 minute TTL field; and the third leveraging the TTL request parameter.
$ date -u && curl -X POST -H 'Content-Type: application/json' \
    'http://localhost:8983/solr/collection1/update?commit=true&_ttl_=%2B5MINUTES' -d '[
  { "id" : "live_a_long_time", "expire_at_dt" : "3000-01-01T00:00:00Z" },
  { "id" : "live_2_minutes_b", "time_to_live_s" : "+120SECONDS" },
  { "id" : "use_default_ttl" }
]'
Wed May  7 22:17:51 UTC 2014
{"responseHeader":{"status":0,"QTime":547}}
A few seconds after that, we execute a query and see all 5 docs in the index. Looking at the timestamp_dt field we can see exactly when each doc was added, and looking at the expire_at_dt field we can see which docs have an expiration date, either explicitly or via TTL calculation.
$ date -u && curl -X GET 'localhost:8983/solr/query?q=*:*'
Wed May  7 22:17:57 UTC 2014
{
  "responseHeader":{
    "status":0,
    "QTime":0,
    "params":{
      "q":"*:*"}},
  "response":{"numFound":5,"start":0,"docs":[
      {
        "id":"live_forever",
        "timestamp_dt":"2014-05-07T22:17:46.685Z",
        "_version_":1467483230500290560},
      {
        "id":"live_2_minutes_a",
        "time_to_live_s":"+120SECONDS",
        "timestamp_dt":"2014-05-07T22:17:46.685Z",
        "expire_at_dt":"2014-05-07T22:19:46.685Z",
        "_version_":1467483230503436288},
      {
        "id":"live_a_long_time",
        "expire_at_dt":"3000-01-01T00:00:00Z",
        "timestamp_dt":"2014-05-07T22:17:51.276Z",
        "_version_":1467483235314302976},
      {
        "id":"live_2_minutes_b",
        "time_to_live_s":"+120SECONDS",
        "timestamp_dt":"2014-05-07T22:17:51.276Z",
        "expire_at_dt":"2014-05-07T22:19:51.276Z",
        "_version_":1467483235318497280},
      {
        "id":"use_default_ttl",
        "timestamp_dt":"2014-05-07T22:17:51.276Z",
        "expire_at_dt":"2014-05-07T22:22:51.276Z",
        "_version_":1467483235320594432}]
  }}
Now we wait a little over 2.5 minutes, and run the query again — this time we only get 3 results, as the docs with a 2 minute TTL have already been deleted because their expiration date was reached.
$ date -u && curl -X GET 'localhost:8983/solr/query?q=*:*'
Wed May  7 22:20:30 UTC 2014
{
  "responseHeader":{
    "status":0,
    "QTime":1,
    "params":{
      "q":"*:*"}},
  "response":{"numFound":3,"start":0,"docs":[
      {
        "id":"live_forever",
        "timestamp_dt":"2014-05-07T22:17:46.685Z",
        "_version_":1467483230500290560},
      {
        "id":"live_a_long_time",
        "expire_at_dt":"3000-01-01T00:00:00Z",
        "timestamp_dt":"2014-05-07T22:17:51.276Z",
        "_version_":1467483235314302976},
      {
        "id":"use_default_ttl",
        "timestamp_dt":"2014-05-07T22:17:51.276Z",
        "expire_at_dt":"2014-05-07T22:22:51.276Z",
        "_version_":1467483235320594432}]
  }}
After another 2.5 minutes (roughly) we run the query again, and see that another doc (that was part of an update with the default TTL of 5 minutes) has now been deleted as well.
$ date -u && curl -X GET 'localhost:8983/solr/query?q=*:*'
Wed May  7 22:22:57 UTC 2014
{
  "responseHeader":{
    "status":0,
    "QTime":0,
    "params":{
      "q":"*:*"}},
  "response":{"numFound":2,"start":0,"docs":[
      {
        "id":"live_forever",
        "timestamp_dt":"2014-05-07T22:17:46.685Z",
        "_version_":1467483230500290560},
      {
        "id":"live_a_long_time",
        "expire_at_dt":"3000-01-01T00:00:00Z",
        "timestamp_dt":"2014-05-07T22:17:51.276Z",
        "_version_":1467483235314302976}]
  }}
Of the 2 docs remaining, one has no expiration date and the other won’t expire in our lifetimes, but as long as the Solr server is running, the DocExpirationUpdateProcessorFactory will keep checking every 30 seconds to see if anything needs to be deleted.