My work at Lucidworks primarily involves helping customers build their desired solutions.  Recently, more than one customer has inquired about doing “entity extraction”.  Entity extraction, as defined on Wikipedia, “seeks to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.”  When drilling down into the specifics of the requirements from our customers, it turns out that many of them have straightforward solutions using built-in (Solr 4.x) components, such as:

* Acronyms as facets
* Key words or phrases, from a fixed list, as facets
* Lat/long mentions as geospatial points

This article will describe and demonstrate how to do these, and as a bonus we’ll also extract URLs found in the text.  Let’s start with an example input and the corresponding output that all of the described techniques provide.

Example document textual content:

The CHO airport is at 38.1384683,-78.4527887.
See also: http://www.lat-long.com/Latitude-Longitude-1480221-Virginia-Charlottesville_Albemarle_Airport.html

After indexing this document in Solr, it will have several additional fields added to it:

* extracted_locations: lat/long points mentioned in the document content, indexed as geographically savvy points and stored on the document for easy use in maps or elsewhere
* links: a stored field containing http(s) links
* acronyms: a searchable/filterable/facetable (but not stored) field containing all three-or-more-letter CAPS acronyms
* key_phrases: a searchable/filterable/facetable (but also not stored) field containing any key phrases matching a provided list

With the example input, this results in a Solr response like this (/query?q=*:*&facet=on&facet.field=acronyms&facet.field=key_phrases):

{
  "responseHeader":{
    "status":0,
    "QTime":3,
    "params":{
      "facet":"on",
      "q":"*:*",
      "facet.field":["acronyms",
        "key_phrases"]}},
  "response":{"numFound":1,"start":0,"docs":[
      {
        "id":"/Users/erikhatcher/dev/lucene_4x/solr/example/exampledocs/gocho.txt",
        "content_type":["text/plain; charset=ISO-8859-1"],
        "resourcename":"/Users/erikhatcher/dev/lucene_4x/solr/example/exampledocs/gocho.txt",
        "content":["The CHO airport is at 38.1384683,-78.4527887.\nSee also: http://www.lat-long.com/Latitude-Longitude-1480221-Virginia-Charlottesville_Albemarle_Airport.html"],
        "extracted_locations":["38.1384683,-78.4527887"],
        "links":["http://www.lat-long.com/Latitude-Longitude-1480221-Virginia-Charlottesville_Albemarle_Airport.html"],
        "link_ss":["http://www.lat-long.com/Latitude-Longitude-1480221-Virginia-Charlottesville_Albemarle_Airport.html"],
        "_version_":1439028523249434624}]
  },
  "facet_counts":{
    "facet_queries":{},
    "facet_fields":{
      "acronyms":[
        "CHO",1],
      "key_phrases":[
        "airport",1]},
    "facet_dates":{},
    "facet_ranges":{}}}

Note the difference between the two styles of fields that get added.  One type (acronyms and key_phrases) adds indexed values to the document, but these values are not stored.  This type of field is useful for faceting, searching/filtering, boosting, and even sorting (provided you can ensure there’s at most one value per document).  The extracted values are not stored for this type of field because of how, and in particular when, they were extracted: if these fields were configured to be stored, they’d contain the full content text, since the content is copied to them wholesale by a copyField directive and stored values are captured before analysis runs.  The other type of field (links and extracted_locations) is stored, meaning the extracted values are retrievable on the document itself.  These values are extracted in a script prior to the analysis/indexing process, and thus can be stored.  Stored vs. indexed can be a confusing distinction, so both styles are provided to illustrate the difference clearly.

Acronym Extraction

In most domains these days, we’re inundated with TLAs (three-letter acronyms; “XML this, I.B.M. that”).  It can be handy to pull these out as facets, allowing users to navigate easily to documents that mention the ones of interest.

In Solr 4.4, a new TokenFilter was introduced that makes it easy to index chunks of text that match patterns. Here’s the configuration used for this example:

<field name="acronyms" type="caps" indexed="true" stored="false" 
       multiValued="true"/>
<copyField source="content" dest="acronyms"/>
<fieldType name="caps" class="solr.TextField" sortMissingLast="true" omitNorms="true">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.PatternCaptureGroupFilterFactory"
            pattern="((?:[A-Z]\.?){3,})" preserve_original="false"
    />
  </analyzer>
</fieldType>

Sending the example text above through the “caps” field type produces a single output token, “CHO”, which you can see in the Solr response above in the acronyms facet: "acronyms":["CHO",1]
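Outside Solr, the same capture-group pattern can be sanity-checked in plain JavaScript (a sketch of the pattern’s behavior; the actual token emission is done by PatternCaptureGroupFilter):

```javascript
// The same pattern the "caps" field type uses: three or more capital
// letters, each optionally followed by a period ("CHO", "I.B.M.", ...)
var acronymPattern = /((?:[A-Z]\.?){3,})/g;

function extractAcronyms(text) {
  var matches = [];
  var m;
  // with a /g regex, each exec() call advances to the next match
  while ((m = acronymPattern.exec(text)) !== null) {
    matches.push(m[1]);
  }
  return matches;
}

console.log(extractAcronyms("The CHO airport is at 38.1384683,-78.4527887."));
// -> [ 'CHO' ]
console.log(extractAcronyms("XML this, I.B.M. that"));
// -> [ 'XML', 'I.B.M.' ]
```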

Key Phrases

If you’ve got a select list of special terms or phrases for your domain that you’d like to turn into facets, making it easy to filter to the documents that contain them, the technique in this section is for you.

Here’s our schema additions, adding a field type, a corresponding field, and a directive to copy the document content to the new key_phrases field:

<field name="key_phrases" type="key_phrases" indexed="true" stored="false" multiValued="true"/>
<copyField source="content" dest="key_phrases"/>
<fieldType name="key_phrases" class="solr.TextField" sortMissingLast="true" omitNorms="true">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.ShingleFilterFactory"
            minShingleSize="2" maxShingleSize="5"
            outputUnigramsIfNoShingles="true"
    />
    <filter class="solr.KeepWordFilterFactory"
            words="keep_phrases.txt" ignoreCase="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

Where conf/keep_phrases.txt contains:

airport 
restaurant
toy store

The field type analysis configuration uses a simple whitespace tokenizer; you may need to adjust this to suit your content.  A shingle filter groups, in this case, 2 to 5 tokens into a single token (separated by a single space, by default).  The maxShingleSize should be as large as the largest number of words in a single phrase.  The “keep word” filter only passes through tokens that match the phrases in the keep_phrases.txt file.
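Conceptually, the shingle-plus-keep-words chain behaves like this little JavaScript sketch (an approximation for illustration, not Solr’s actual implementation; the real shingle filter also emits unigrams by default, which the sketch mimics):

```javascript
// Approximate the key_phrases analysis chain: whitespace-tokenize,
// build 2..5-token shingles (plus unigrams), then keep only tokens
// matching a fixed phrase list, lower-cased.
var keepPhrases = ["airport", "restaurant", "toy store"];

function keyPhrases(text) {
  var tokens = text.split(/\s+/).filter(function (t) { return t.length > 0; });
  var candidates = tokens.slice(); // unigrams
  for (var size = 2; size <= 5; size++) {
    for (var i = 0; i + size <= tokens.length; i++) {
      candidates.push(tokens.slice(i, i + size).join(" "));
    }
  }
  return candidates
    .map(function (t) { return t.toLowerCase(); })
    .filter(function (t) { return keepPhrases.indexOf(t) !== -1; });
}

console.log(keyPhrases("The CHO airport is at 38.1384683,-78.4527887."));
// -> [ 'airport' ]
```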

After running the example text through this analyzer, this facet exists: "key_phrases":["airport",1].  This lets users facet directly on documents that contain the word “airport”, or any of the other key phrases contained in the indexed documents.


Extracted Locations

We could do the same sort of index-time token extraction/normalization as we’ve seen with both the acronyms and key phrases.  But that just gives us a purely textual value as output from a TextField analysis chain, without the benefits of a stored value for UI purposes.  There may be use cases where that’s a good way to index it, but more than likely what you really want to do with lat/longs extracted from arbitrary text is attach them to the document as a formal, retrievable, filterable, geospatial field type.  Spatial search is a complex topic, and there are many choices for a variety of use cases, from geospatial proximity (“Sir, give me pizza near my house”) to complex polygon intersections, faceting by distance ranges, weighting or sorting by proximity, and the like.  See David Smiley’s talk at Lucene Revolution ’13 and the Lucene/Solr 4 Spatial wiki for more information.

In order to make a text string into a rich field type, it needs to come from the indexer*.  The indexer here means anything from your application code or connector that submits documents, up through the end of the update processor chain inside Solr’s update request handling (the RunUpdateProcessorFactory, to be exact).  Field analysis steps, including the PatternCaptureGroupFilter and KeepWordFilter tricks above, happen in this last update processor.  The stored values of a field come either from the submitted documents, from copyField schema directives, or are fabricated/altered by prior update processing steps.

* Nothing precludes generating these results from a textual analysis chain (anything Lucene’s TokenStream provides) at any stage.  Doing so is a more advanced topic, though not an unreasonable or unheard-of technique.

There’s a nifty script update processor that allows us to write a tiny bit of JavaScript.  In Solr’s 4.x shipping example/ collection1 configuration, a skeleton update-script.js can be activated by uncommenting a little bit of solrconfig.xml.  It’ll look something like this:

<updateRequestProcessorChain name="script">
  <processor class="solr.StatelessScriptUpdateProcessorFactory">
    <str name="script">update-script.js</str>
    <lst name="params">
      <str name="config_param">example config parameter</str>
    </lst>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>

The extracted_locations field definition is:

<field name="extracted_locations" type="location_rpt" 
       indexed="true" stored="true" multiValued="true"/>

where location_rpt is a spatial recursive prefix tree field type (solr.SpatialRecursivePrefixTreeFieldType) already defined in the example schema.

The new updateRequestProcessorChain, called “script”, is not set as the default, so the indexing pathway must explicitly request this update request processor chain.  I used Solr’s “simple post tool” (exampledocs/post.jar) like this:

java -Dauto -Dparams=update.chain=script -jar post.jar gocho.txt

And the update-script.js:

"use strict";
function getMatches(regex, value) {
  var matches = [];
  var captures;
  // With a /g regex, each exec() call advances past the previous match,
  // so this loop collects every match in the input, not just the first.
  while ((captures = regex.exec(value)) !== null) {
    for (var i = 1; i < captures.length; i++) { // skip captures[0], the whole match
      matches.push(captures[i]);
    }
  }
  return matches;
}

function processAdd(cmd) {
  var doc = cmd.solrDoc;  // org.apache.solr.common.SolrInputDocument
  var id = doc.getFieldValue("id");
  logger.info("update-script#processAdd: id=" + id);

  var content = doc.getFieldValue("content"); // Comes from /update/extract

  // very basic lat/long pattern, matching strings like "38.1384683,-78.4527887"
  var location_regexp = /(-?\d{1,2}\.\d{2,7},-?\d{1,3}\.\d{2,7})/g;
  doc.setField("extracted_locations", getMatches(location_regexp, content));

  // basic URL pattern: http(s) protocol up through the URL path, excluding query string params
  var links_regexp = /(https?:\/\/[a-zA-Z\-_0-9.]+(?:\/[a-zA-Z\-_0-9.]+)*\/?)/g;
  doc.setField("links", getMatches(links_regexp, content));
}

// The functions below must be defined, but there's rarely a need to implement
// anything in these.

function processDelete(cmd) {
  // no-op
}

function processMergeIndexes(cmd) {
  // no-op
}

function processCommit(cmd) {
  // no-op
}

function processRollback(cmd) {
  // no-op
}

function finish() {
  // no-op
}
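As a quick sanity check outside Solr, the same extraction logic runs in a standalone JavaScript engine such as Node.js (getMatches is repeated here so the snippet is self-contained):

```javascript
// Collect every capture-group match of a /g regex in the input.
function getMatches(regex, value) {
  var matches = [];
  var captures;
  while ((captures = regex.exec(value)) !== null) {
    for (var i = 1; i < captures.length; i++) {
      matches.push(captures[i]);
    }
  }
  return matches;
}

var content = "The CHO airport is at 38.1384683,-78.4527887.\n" +
    "See also: http://www.lat-long.com/Latitude-Longitude-1480221-Virginia-Charlottesville_Albemarle_Airport.html";

var location_regexp = /(-?\d{1,2}\.\d{2,7},-?\d{1,3}\.\d{2,7})/g;
var links_regexp = /(https?:\/\/[a-zA-Z\-_0-9.]+(?:\/[a-zA-Z\-_0-9.]+)*\/?)/g;

console.log(getMatches(location_regexp, content));
// -> [ '38.1384683,-78.4527887' ]
console.log(getMatches(links_regexp, content));
```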


After indexing the example document, it contains an indexed, multivalued geospatial point field: "extracted_locations":["38.1384683,-78.4527887"]

This field, because it is stored and indexed, allows not only document-level retrieval, but also geographic filtering using the spatial functions.
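As a sketch of that geographic filtering (against the example /query handler; the point and distance here are arbitrary, and the fq value should be URL-encoded when issued over HTTP), a geofilt filter query returns documents whose extracted points fall within 10 kilometers of a given location:

```
/query?q=*:*&fq={!geofilt sfield=extracted_locations pt=38.1384683,-78.4527887 d=10}
```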

Links

There’s already a “links” field defined in our example schema, so we’ll just co-opt that here.  It’s defined like this:

<field name="links" type="string" indexed="true" 
       stored="true" multiValued="true"/>

In the update-script.js processAdd method, we add this:

// basic URL pattern: http(s) protocol up through the URL path,
// excluding query string params
var links_regexp =
       /(https?:\/\/[a-zA-Z\-_0-9.]+(?:\/[a-zA-Z\-_0-9.]+)*\/?)/g;
doc.setField("links", getMatches(links_regexp, content));

After indexing with this additional step, our document contains a multivalued, stored (and even indexed, and thus searchable) links field: "links":["http://www.lat-long.com/Latitude-Longitude-1480221-Virginia-Charlottesville_Albemarle_Airport.html"].

Another approach to extracting links (and e-mail addresses)

Lucene/Solr sports a tokenizer that recognizes not only complete URLs, but also e-mail addresses.  And there’s a token filter that works well in conjunction with it, allowing only specific token types to pass through.  If you only need e-mail addresses or URLs made searchable and facetable, but not stored as document values, you can use the same index analysis techniques shown in the earlier sections.

Here are two field types that pull out e-mail addresses and URLs, respectively and independently.

<fieldType name="emails" class="solr.TextField" sortMissingLast="true" omitNorms="true">
  <analyzer>
    <tokenizer class="solr.UAX29URLEmailTokenizerFactory"/>
    <filter class="solr.TypeTokenFilterFactory"
            types="email_type.txt" useWhitelist="true"/>
  </analyzer>
</fieldType>
<fieldType name="urls" class="solr.TextField" sortMissingLast="true" omitNorms="true">
  <analyzer>
    <tokenizer class="solr.UAX29URLEmailTokenizerFactory"/>
    <filter class="solr.TypeTokenFilterFactory"
            types="url_type.txt" useWhitelist="true"/>
  </analyzer>
</fieldType>

Where email_type.txt contains <EMAIL> and url_type.txt contains <URL>, both literally.
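To get a feel for what type-based filtering does, here’s a toy JavaScript approximation (the real UAX29URLEmailTokenizer uses a full Unicode-aware grammar; these regexes are deliberately simplistic stand-ins for its type assignment):

```javascript
// Toy approximation of type-based token filtering: classify each
// whitespace-separated token as <URL>, <EMAIL>, or <ALPHANUM>, then
// keep only whitelisted types.
function typedTokens(text, keepTypes) {
  return text.split(/\s+/)
    .filter(function (t) { return t.length > 0; })
    .map(function (t) {
      var type = /^https?:\/\//.test(t) ? "<URL>"
               : /^[^@\s]+@[^@\s]+\.[^@\s]+$/.test(t) ? "<EMAIL>"
               : "<ALPHANUM>";
      return { token: t, type: type };
    })
    .filter(function (t) { return keepTypes.indexOf(t.type) !== -1; })
    .map(function (t) { return t.token; });
}

console.log(typedTokens("Contact info@example.com or see https://example.com/docs", ["<EMAIL>"]));
// -> [ 'info@example.com' ]
```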

There’s a key difference between the two approaches, one using an update processor script to extract links and the other using analysis tokenization and filtering techniques: what are the stored values of the field?  Stored values are directly associated with a document and directly retrievable as part of it.  In both approaches the indexed terms can be made identical, so the difference comes down to stored values; let the needs of your application dictate the right choice.

Exercises for the Reader

There are many refinements that can be made to the examples shown such as omitting any acronyms that don’t add value to your application (WTF?!), and injecting synonyms for key phrases.  Try your hand at adding these flourishes, using the Solr analysis wiki page for tips.

What about UIMA?

Solr includes UIMA integration.  UIMA (Unstructured Information Management Architecture) is a sophisticated text pipelining tool that annotates text with features or entities.  UIMA itself is a framework that requires plugging in annotators.  The Solr integration is complex, and frankly this type of heavier processing is often best done in a different stage of document processing before it is sent to Solr.  This article does not cover the Solr/UIMA capabilities.

Summary

There are a number of tools, both commercial and open source, that provide sophisticated entity extraction capabilities.  But before adding complexity, first see if some straightforward tricks like the ones shown here suffice.  I’m all about being pragmatic, and not over-engineering solutions.  Do the simplest possible thing that could possibly work, but no simpler!  While I was able to answer several of our customers’ requirements with the techniques shown, there were some more complex, truly machine-learned entity extraction needs that required other approaches.

These are just some examples.  There are many ways to accomplish interesting things using the built-in analysis components and update-time scripting.