Integrating Advanced Text Analytics into Solr/Lucene

“Metadata is king!” Thus proclaimed Steve Kearns of Basis Technology, Platinum Sponsor of Lucene Revolution, at the start of this standing-room-only session on Day 1 of the conference. Why? Because it provides a way to enhance otherwise unstructured data with a considerable amount of structure.

Here are the slides for this session.

With this premise in place, Steve discussed the use and integration of advanced analytics in the document-processing pipeline, focusing on the three levels at which they apply: the document, sub-document, and cross-document levels.

Metadata derivable at the document level includes the language in which the document is written and the category in which it is properly placed. Steve touched on some especially interesting challenges posed by Asian languages, and mentioned that this level of analysis is useful for building document search dashboards for those responsible for assessing and maintaining document search quality.

The amount of information that can be gleaned from sub-document analysis is immense. Some of the processes involved at this level include basic stemming and its cousin, lemmatization. Among the more advanced techniques are entity extraction, relationship and event extraction, sentiment analysis, and the mapping of extracted items to real-world concepts in a process called “co-reference resolution.”
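To make the difference between stemming and lemmatization concrete, here is a minimal sketch. It is a toy illustration only, not anything shown in the talk or taken from a Basis product: the suffix list, the `LEMMAS` table, and both function names are assumptions made up for this example.

```python
# Toy contrast between stemming and lemmatization.
# Everything here (suffix list, lemma table, names) is hypothetical.

SUFFIXES = ("ing", "ed", "es", "s")

# A real lemmatizer uses a full dictionary plus part-of-speech information;
# this tiny table stands in for that.
LEMMAS = {"studies": "study", "ran": "run", "better": "good"}


def crude_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def lemmatize(word: str) -> str:
    """Look the word up in the lemma table, falling back to the word itself."""
    word = word.lower()
    return LEMMAS.get(word, word)


if __name__ == "__main__":
    for w in ("studies", "ran", "better"):
        print(w, "stem:", crude_stem(w), "lemma:", lemmatize(w))
```

The stemmer only chops characters, so it can produce non-words and cannot relate irregular forms; the lemmatizer maps each form to a dictionary headword, at the cost of needing that dictionary.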

A key use of cross-document analysis is document clustering, i.e., finding sets of documents that are “more like” one another than like the documents outside the set.
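One common way to quantify “more like” is cosine similarity over term-count vectors. The sketch below assumes that measure and a deliberately crude whitespace tokenizer; neither is claimed to be what the talk described, and the function names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict


def term_vector(text: str) -> Counter:
    """Lowercased, whitespace-tokenized term counts (a crude stand-in analyzer)."""
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(doc_id: str, docs: Dict[str, str]) -> str:
    """Return the id of the other document closest to docs[doc_id]."""
    target = term_vector(docs[doc_id])
    others = (other for other in docs if other != doc_id)
    return max(others, key=lambda o: cosine_similarity(target, term_vector(docs[o])))


if __name__ == "__main__":
    docs = {
        "a": "solr lucene search index",
        "b": "lucene search index tuning",
        "c": "sentiment analysis of reviews",
    }
    print(most_similar("a", docs))
```

A clustering step would group documents whose pairwise similarity is high; this sketch shows only the similarity measure that such a step would build on.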

An aspect of the presentation of great interest to Solr users focused on how to integrate analytics, like those provided by Basis, into the Solr pipeline. Not surprisingly, there are many ways to do this. The biggest question you need to answer is: Should I run the analytics within Solr, or should I treat them as external calls?

Steve wrapped up this useful talk with some approaches to both techniques, including UpdateRequest processing and a list of some tools (e.g., UIMA, GATE, and OpenPipeline) to consider when the time to implement arrives.
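The in-process versus external-call choice can be sketched independently of any particular tool. The example below is plain Python, not Solr's API: `enrich`, `local_analyzer`, and `make_remote_analyzer` are hypothetical names, the "language detection" is a toy script check, and the transport passed to `make_remote_analyzer` stands in for whatever HTTP or RPC client a real deployment would use.

```python
from typing import Callable, Dict

# An analyzer takes raw text and returns metadata fields to add to the document.
Analyzer = Callable[[str], Dict[str, str]]


def local_analyzer(text: str) -> Dict[str, str]:
    """In-process stand-in: flags Japanese kana, otherwise reports 'unknown'.

    This is a toy check, not real language identification.
    """
    has_kana = any("\u3040" <= ch <= "\u30ff" for ch in text)
    return {"language": "ja" if has_kana else "unknown"}


def make_remote_analyzer(call_service: Callable[[Dict[str, str]], Dict[str, str]]) -> Analyzer:
    """Wrap an external analytics call behind the same Analyzer interface.

    call_service is a hypothetical transport supplied by the caller.
    """
    def analyze(text: str) -> Dict[str, str]:
        return call_service({"text": text})
    return analyze


def enrich(doc: Dict[str, str], analyzer: Analyzer, text_field: str = "body") -> Dict[str, str]:
    """Return a copy of doc with the analyzer's metadata fields merged in."""
    enriched = dict(doc)
    enriched.update(analyzer(doc.get(text_field, "")))
    return enriched


if __name__ == "__main__":
    doc = {"id": "1", "body": "こんにちは"}
    print(enrich(doc, local_analyzer))
    # A fake transport stands in for the external service in this sketch.
    remote = make_remote_analyzer(lambda payload: {"language": "en"})
    print(enrich({"id": "2", "body": "hello"}, remote))
```

Because both variants share one interface, the indexing code does not change when analytics move from inside the pipeline to an external service; what changes is latency, failure handling, and where the analytics are deployed.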

Cross-posted with Lucene Revolution Blog. Tony Barreca is a guest blogger. This is one of a series of presentation summaries from the conference.