Lucene & Solr 4.8 were released last week and you can download Solr 4.8 from the Apache mirror network. Today I’d like to introduce you to a small but powerful feature I worked on for 4.8: Document Expiration.

The DocExpirationUpdateProcessorFactory provides two features related to the “expiration” of documents which can be used individually, or in combination:

  • Periodically delete documents from the index based on an expiration field
  • Compute expiration field values for documents from a “time to live” (TTL)

Auto-Delete Expired Documents

The biggest aspect of this Update Processor is its ability to automatically delete documents based on the values found in an “expiration date” field that you configure. This automatic deletion isn’t part of the normal Update Processor life cycle; it’s executed by a background timer thread created by the Factory.

To use this automatic deletion feature, you must configure two options on the Factory:

  • expirationFieldName – The name of the expiration field to use
  • autoDeletePeriodSeconds – How often the factory’s timer should trigger a delete to remove the documents

For example, with the configuration below the DocExpirationUpdateProcessorFactory will create a timer thread that wakes up every 30 seconds. When the timer triggers, it will execute a deleteByQuery command to remove any documents whose press_release_expiration_date field value is in the past:

 <processor class="solr.processor.DocExpirationUpdateProcessorFactory">
   <int name="autoDeletePeriodSeconds">30</int>
   <str name="expirationFieldName">press_release_expiration_date</str>
 </processor>

After the deleteByQuery has been executed, a soft commit is also executed using openSearcher=true so that search results will no longer see the expired documents.

While the basic logic of “timer goes off, delete docs with expiration prior to NOW” was fairly simple and straightforward to add, a key part of making this work well was a related issue (SOLR-5783), which ensures that a commit with openSearcher=true doesn’t do anything unless there really are changes in the index. This means that you can configure autoDeletePeriodSeconds to very small values and still rest easy that your search caches won’t get blown away every few seconds for no reason.
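
To make the timer’s work a little more concrete, here is a minimal Python sketch of a request body that does roughly the same thing by hand. The exact query string the Factory issues internally is an assumption on my part, and build_expired_delete is a hypothetical helper, not part of Solr:

```python
import json


def build_expired_delete(expiration_field: str) -> str:
    """Return a JSON update body deleting docs whose expiration is not after NOW."""
    # Assumption: an open-ended range up to NOW approximates the factory's query.
    query = f"{expiration_field}:[* TO NOW]"
    return json.dumps({"delete": {"query": query}})


body = build_expired_delete("press_release_expiration_date")
print(body)
```

Posting a body like this to the update handler with softCommit=true would approximate a single run of the timer, minus the guarantee that nothing happens when the index hasn’t changed.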

Compute Expiration Date from TTL

The second feature implemented by this Factory (and the key reason it’s implemented as an UpdateProcessorFactory) is to use “TTL” (Time To Live) values associated with documents to automatically generate an expiration date value to put in the expirationFieldName when documents are indexed.

By default, the DocExpirationUpdateProcessorFactory will look for a _ttl_ request parameter on update requests, as well as a _ttl_ field in each doc that is indexed in that request. If either exists, it will be parsed as a Date Math Expression relative to NOW and used to populate the expirationFieldName. The per-document _ttl_ field value overrides the per-request _ttl_ parameter.
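
The precedence rule can be sketched in a few lines of Python. This is only an illustration: parse_ttl is a deliberately simplified stand-in that understands nothing but “+<N><UNIT>” expressions (real Date Math is much richer), and both function names are hypothetical:

```python
import re
from datetime import datetime, timedelta, timezone
from typing import Optional

# Simplified stand-in for Solr date math: only "+<N><UNIT>" is understood.
_UNITS = {"SECONDS": "seconds", "MINUTES": "minutes", "HOURS": "hours", "DAYS": "days"}
_TTL_PATTERN = re.compile(r"^\+(\d+)(SECONDS|MINUTES|HOURS|DAYS)$")


def parse_ttl(ttl: str) -> timedelta:
    """Turn an expression such as '+120SECONDS' into a timedelta."""
    match = _TTL_PATTERN.match(ttl)
    if match is None:
        raise ValueError(f"unsupported TTL expression: {ttl!r}")
    amount, unit = match.groups()
    return timedelta(**{_UNITS[unit]: int(amount)})


def compute_expiration(doc: dict, request_ttl: Optional[str], now: datetime,
                       ttl_field: str = "_ttl_") -> Optional[datetime]:
    """Resolve the effective TTL for one doc and return its expiration date."""
    # The per-document field wins over the per-request parameter.
    ttl = doc.get(ttl_field, request_ttl)
    return None if ttl is None else now + parse_ttl(ttl)


now = datetime(2014, 5, 7, 22, 17, 51, 276000, tzinfo=timezone.utc)
print(compute_expiration({"id": "a", "_ttl_": "+120SECONDS"}, "+5MINUTES", now))
print(compute_expiration({"id": "b"}, "+5MINUTES", now))
print(compute_expiration({"id": "c"}, None, now))
```

The first doc uses its own 2 minute TTL, the second falls back to the 5 minute request parameter, and the third gets no expiration date at all.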

Both the request parameter and field names used for specifying TTL values can be overridden by configuring ttlParamName and ttlFieldName on the DocExpirationUpdateProcessorFactory. They can also be completely disabled by configuring them as null. It’s also possible to use the TTL computation feature to generate expiration dates on documents without using the auto-deletion feature, simply by not configuring the autoDeletePeriodSeconds option (so that the timer will never run).

For example, in the configuration below, the Factory will look for a time_to_live field in each document, and use that to compute an expiration value for the press_release_expiration_date field. No request parameters will be checked for a TTL override, and no automatic deletion will occur:

 <processor class="solr.processor.DocExpirationUpdateProcessorFactory">
   <str name="expirationFieldName">press_release_expiration_date</str>
   <null name="ttlParamName"/> <!-- ignore _ttl_ request param -->
   <str name="ttlFieldName">time_to_live</str>
   <!-- NOTE: autoDeletePeriodSeconds not specified, no automatic deletes -->
 </processor>

This sort of configuration may be handy if you only want to logically hide documents for search clients based on a per-document TTL using something like: fq=-press_release_expiration_date:[* TO NOW/DAY], but still retain the documents in the index for other search clients.
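
Since that filter contains brackets, spaces, and a slash, it needs URL-encoding before it goes on a request. A small sketch using only the Python standard library (hide_expired_params is a hypothetical helper name):

```python
from urllib.parse import parse_qs, urlencode


def hide_expired_params(expiration_field: str, user_query: str = "*:*") -> str:
    """Build a query string that filters out docs considered expired."""
    # Negative range filter: drop docs expiring at or before the start of today.
    fq = f"-{expiration_field}:[* TO NOW/DAY]"
    return urlencode({"q": user_query, "fq": fq})


query_string = hide_expired_params("press_release_expiration_date")
print(query_string)
```

Clients that should still see every document simply leave the fq off.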

An In-Depth Example

Let’s walk through a full example of both features by modifying the Solr 4.8 example solrconfig.xml to add the following update processor chain:

  <updateRequestProcessorChain default="true">
    <processor class="solr.TimestampUpdateProcessorFactory">
      <str name="fieldName">timestamp_dt</str>
    </processor>
    <processor class="solr.processor.DocExpirationUpdateProcessorFactory">
      <int name="autoDeletePeriodSeconds">30</int>
      <str name="ttlFieldName">time_to_live_s</str>
      <str name="expirationFieldName">expire_at_dt</str>
    </processor>
    <processor class="solr.FirstFieldValueUpdateProcessorFactory">
      <str name="fieldName">expire_at_dt</str>
    </processor>
    <processor class="solr.LogUpdateProcessorFactory" />
    <processor class="solr.RunUpdateProcessorFactory" />
  </updateRequestProcessorChain>

A few things to note about this chain:

  • It contains a simple TimestampUpdateProcessorFactory so that it will be easy to see when these documents were indexed in the query results I show below — but this is not needed for DocExpirationUpdateProcessorFactory to function
  • The DocExpirationUpdateProcessorFactory instance uses an autoDeletePeriodSeconds of 30 and overrides the ttlFieldName, but the _ttl_ request param is still enabled
  • A FirstFieldValueUpdateProcessorFactory is configured on the expire_at_dt field. This means that if a document is added with an explicit value in the expire_at_dt field, it will be used instead of any value that might be added by the DocExpirationUpdateProcessorFactory using the _ttl_ request param
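
The interplay described in that last bullet can be simulated in plain Python. This sketch assumes the computed expiration is appended after any explicit value, so that “keep the first value” preserves the explicit one; both helper functions are hypothetical stand-ins for the real processors:

```python
def add_computed_expiration(doc: dict, computed: str, field: str = "expire_at_dt") -> dict:
    """Stand-in for the TTL step: append the computed date to the field."""
    # Assumed behaviour: the computed value goes after any explicit value.
    result = dict(doc)
    existing = result.get(field, [])
    values = existing if isinstance(existing, list) else [existing]
    result[field] = values + [computed]
    return result


def keep_first_value(doc: dict, field: str = "expire_at_dt") -> dict:
    """Stand-in for the first-value step: reduce the field to its first value."""
    result = dict(doc)
    values = result.get(field)
    if isinstance(values, list) and values:
        result[field] = values[0]
    return result


computed = "2014-05-07T22:22:51.276Z"  # NOW + 5 minutes from the _ttl_ param
explicit = {"id": "live_a_long_time", "expire_at_dt": "3000-01-01T00:00:00Z"}
implicit = {"id": "use_default_ttl"}

for doc in (explicit, implicit):
    print(keep_first_value(add_computed_expiration(doc, computed)))
```

The doc with an explicit date keeps it, while the doc without one ends up with the TTL-derived date.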

With this configuration in place, let’s start indexing some docs, and executing some queries.

First up, we’ll add 2 documents: one without any sort of TTL or expiration date, and one using a TTL field of 2 minutes.

$ date -u && curl -X POST -H 'Content-Type: application/json' 'http://localhost:8983/solr/collection1/update?commit=true' -d '[
  { "id"             : "live_forever"           },
  { "id"             : "live_2_minutes_a",
    "time_to_live_s" : "+120SECONDS"            }
]'
Wed May  7 22:17:46 UTC 2014
{"responseHeader":{"status":0,"QTime":574}}

A few seconds later, we’ll index 3 more documents using a “default” TTL request param of 5 minutes: one with an explicit expiration date far in the future, another with a 2 minute TTL field, and a third leveraging the TTL request parameter.

$ date -u && curl -X POST -H 'Content-Type: application/json' 'http://localhost:8983/solr/collection1/update?commit=true&_ttl_=%2B5MINUTES' -d '[
  { "id"             : "live_a_long_time",
    "expire_at_dt"   : "3000-01-01T00:00:00Z"   },
  { "id"             : "live_2_minutes_b",
    "time_to_live_s" : "+120SECONDS"            },
  { "id"             : "use_default_ttl"        }
]'
Wed May  7 22:17:51 UTC 2014
{"responseHeader":{"status":0,"QTime":547}}

A few seconds after that, we execute a query and see all 5 docs in the index. Looking at the timestamp_dt field we can see exactly when each doc was added, and looking at the expire_at_dt field we can see which docs have an expiration date — either explicitly, or via TTL calculation.

$ date -u && curl -X GET 'localhost:8983/solr/query?q=*:*'
Wed May  7 22:17:57 UTC 2014
{
  "responseHeader":{
    "status":0,
    "QTime":0,
    "params":{
      "q":"*:*"}},
  "response":{"numFound":5,"start":0,"docs":[
      {
        "id":"live_forever",
        "timestamp_dt":"2014-05-07T22:17:46.685Z",
        "_version_":1467483230500290560},
      {
        "id":"live_2_minutes_a",
        "time_to_live_s":"+120SECONDS",
        "timestamp_dt":"2014-05-07T22:17:46.685Z",
        "expire_at_dt":"2014-05-07T22:19:46.685Z",
        "_version_":1467483230503436288},
      {
        "id":"live_a_long_time",
        "expire_at_dt":"3000-01-01T00:00:00Z",
        "timestamp_dt":"2014-05-07T22:17:51.276Z",
        "_version_":1467483235314302976},
      {
        "id":"live_2_minutes_b",
        "time_to_live_s":"+120SECONDS",
        "timestamp_dt":"2014-05-07T22:17:51.276Z",
        "expire_at_dt":"2014-05-07T22:19:51.276Z",
        "_version_":1467483235318497280},
      {
        "id":"use_default_ttl",
        "timestamp_dt":"2014-05-07T22:17:51.276Z",
        "expire_at_dt":"2014-05-07T22:22:51.276Z",
        "_version_":1467483235320594432}]
  }}

Now we wait a little over 2.5 minutes, and run the query again — this time we only get 3 results, as the docs with a 2 minute TTL have already been deleted because their expiration date was reached.

$ date -u && curl -X GET 'localhost:8983/solr/query?q=*:*'
Wed May  7 22:20:30 UTC 2014
{
  "responseHeader":{
    "status":0,
    "QTime":1,
    "params":{
      "q":"*:*"}},
  "response":{"numFound":3,"start":0,"docs":[
      {
        "id":"live_forever",
        "timestamp_dt":"2014-05-07T22:17:46.685Z",
        "_version_":1467483230500290560},
      {
        "id":"live_a_long_time",
        "expire_at_dt":"3000-01-01T00:00:00Z",
        "timestamp_dt":"2014-05-07T22:17:51.276Z",
        "_version_":1467483235314302976},
      {
        "id":"use_default_ttl",
        "timestamp_dt":"2014-05-07T22:17:51.276Z",
        "expire_at_dt":"2014-05-07T22:22:51.276Z",
        "_version_":1467483235320594432}]
  }}
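
The timing works out as expected. As a rough back-of-the-envelope check in Python, assuming the timer fires at fixed 30 second intervals and ignoring the time the delete and commit themselves take, an expired doc should be gone at most one timer period after its expiration date (latest_delete_time is a hypothetical helper):

```python
from datetime import datetime, timedelta, timezone


def latest_delete_time(expire_at: datetime, period_seconds: int) -> datetime:
    """Worst case: the timer ran just before expiration, so wait one full period."""
    return expire_at + timedelta(seconds=period_seconds)


# Expiration of live_2_minutes_b and the time of the second query, from above.
expire_b = datetime(2014, 5, 7, 22, 19, 51, 276000, tzinfo=timezone.utc)
queried_at = datetime(2014, 5, 7, 22, 20, 30, tzinfo=timezone.utc)

deadline = latest_delete_time(expire_b, 30)
print(deadline.isoformat(), queried_at >= deadline)
```

The query ran after that worst-case deadline, which is consistent with both 2 minute docs being absent from the results.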

After another 2.5 minutes (roughly) we run the query again, and see that another doc (that was part of an update with the default TTL of 5 minutes) has now been deleted as well.

$ date -u && curl -X GET 'localhost:8983/solr/query?q=*:*'
Wed May  7 22:22:57 UTC 2014
{
  "responseHeader":{
    "status":0,
    "QTime":0,
    "params":{
      "q":"*:*"}},
  "response":{"numFound":2,"start":0,"docs":[
      {
        "id":"live_forever",
        "timestamp_dt":"2014-05-07T22:17:46.685Z",
        "_version_":1467483230500290560},
      {
        "id":"live_a_long_time",
        "expire_at_dt":"3000-01-01T00:00:00Z",
        "timestamp_dt":"2014-05-07T22:17:51.276Z",
        "_version_":1467483235314302976}]
  }}

Of the 2 docs remaining, one has no expiration date and the other won’t expire in our lifetimes. But as long as the Solr server is running, the DocExpirationUpdateProcessorFactory will keep checking every 30 seconds to see if something needs to be deleted.