Integrating Advanced Text Analytics into Solr/Lucene

“Metadata is king!” Thus proclaimed Steve Kearns of Basis Technology, Platinum Sponsor of Lucene Revolution, at the start of this standing-room-only session on Day 1 of the conference. Why? Because it provides a way to enhance otherwise unstructured data with a considerable amount of structure.

Here are the slides for this session.

With this premise in place, Steve discussed the use and integration of advanced analytics in the document-processing pipeline, focusing on the three levels at which they apply: the document, sub-document, and cross-document levels.

Metadata derivable at the document level includes the language in which the document is written and the category in which it is properly placed. Steve touched on some especially interesting challenges posed by Asian languages, and mentioned that this level of analysis is useful for building document search dashboards for those responsible for assessing and maintaining search quality.

The amount of information that can be gleaned from sub-document analysis is immense. Some of the processes involved at this level include basic stemming and its cousin, lemmatization. Among the more advanced techniques are entity extraction, relationship and event extraction, sentiment analysis, and the mapping of extracted items to real-world concepts in a process called “co-reference resolution.”
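To make the difference between stemming and lemmatization concrete, here is a minimal toy sketch. The suffix list and the lemma dictionary are illustrative assumptions made up for this example; they are not taken from the talk or from any Basis or Lucene component, and real stemmers and lemmatizers are far more sophisticated.

```python
# Toy contrast between stemming and lemmatization.
# SUFFIXES and LEMMAS are hypothetical, hand-made resources for illustration only.

SUFFIXES = ("ing", "ed", "es", "s")

LEMMAS = {
    "ran": "run",
    "running": "run",
    "better": "good",
    "mice": "mouse",
    "studies": "study",
}


def naive_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    lowered = word.lower()
    for suffix in SUFFIXES:
        if lowered.endswith(suffix) and len(lowered) - len(suffix) >= 3:
            return lowered[: -len(suffix)]
    return lowered


def naive_lemmatize(word):
    """Look the word up in a dictionary of known forms; fall back to the word itself."""
    lowered = word.lower()
    return LEMMAS.get(lowered, lowered)


if __name__ == "__main__":
    for w in ["running", "studies", "mice", "better"]:
        print(w, "stem:", naive_stem(w), "lemma:", naive_lemmatize(w))
```

The stemmer only chops characters, so it may return strings that are not words and cannot relate irregular forms such as "mice" and "mouse". The lemmatizer returns dictionary forms, but only for words its resource knows about.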

A key use of cross-document analysis is document clustering, i.e., finding sets of documents that are “more like” one another than they are like the rest of the collection.
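One common way to quantify "more like" is cosine similarity over term-frequency vectors. The sketch below is an assumption about how such a comparison could be done, not a description of the method presented in the session; the function names are hypothetical and the tokenization is deliberately crude.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Count lowercase alphanumeric tokens in the text."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two term-count vectors; 0.0 if either is empty."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(target, candidates):
    """Return the id of the candidate document closest to the target text."""
    target_vec = term_vector(target)
    return max(
        candidates,
        key=lambda doc_id: cosine_similarity(target_vec, term_vector(candidates[doc_id])),
    )


if __name__ == "__main__":
    docs = {
        "a": "solr indexes documents for search",
        "b": "the weather is sunny today",
    }
    print(most_similar("search documents with solr", docs))
```

A clustering algorithm would apply a similarity measure like this across all document pairs (or against cluster centroids) to group related documents.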

An aspect of the presentation of great interest to Solr users focused on how to integrate analytics, like those provided by Basis, into the Solr pipeline. Not surprisingly, there are many ways to do this. The biggest question you need to answer is: Should I run the analytics within Solr, or should I treat them as external calls?

Steve wrapped up this useful talk with some approaches to both techniques, including update request processing (Solr's UpdateRequestProcessor chain) and a list of some tools (e.g., UIMA, GATE, and OpenPipeline) to consider when the time to implement arrives.
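As a rough illustration of the "external" option, the sketch below enriches documents with derived metadata before they are sent to Solr. Everything in it is an assumption for illustration: the `detect_language` and `extract_entities` functions are trivial stand-ins for real analytics, and the field names `language_s` and `entities_ss` are hypothetical (they rely on dynamic-field suffixes that a given schema may or may not define). The code only builds the JSON body; actually posting it to a collection's update handler is left out.

```python
import json
import re

# Hypothetical, tiny stopword lists standing in for a real language identifier.
STOPWORDS = {
    "en": {"the", "and", "of", "is", "in"},
    "de": {"der", "die", "und", "ist", "das"},
}


def detect_language(text):
    """Guess a language code by counting stopword hits; 'unknown' if none match."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    scores = {lang: sum(t in words for t in tokens) for lang, words in STOPWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"


def extract_entities(text):
    """Stand-in entity extractor: runs of two or more capitalized words, deduplicated."""
    matches = re.findall(r"\b[A-Z][a-z]+(?: [A-Z][a-z]+)+\b", text)
    return list(dict.fromkeys(matches))


def enrich(doc):
    """Return a copy of the document with derived metadata fields added."""
    enriched = dict(doc)
    enriched["language_s"] = detect_language(doc["text"])
    enriched["entities_ss"] = extract_entities(doc["text"])
    return enriched


def to_update_payload(docs):
    """Serialize enriched documents as a JSON array suitable for an update request body."""
    return json.dumps([enrich(d) for d in docs])


if __name__ == "__main__":
    doc = {"id": "1", "text": "Steve Kearns of Basis Technology spoke in the session."}
    print(to_update_payload([doc]))
```

Keeping enrichment outside Solr means the analytics can be scaled and upgraded independently of the index. Running the same logic inside Solr, for example in a custom update request processor, keeps the pipeline in one place at the cost of tighter coupling.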

Cross-posted with Lucene Revolution Blog. Tony Barreca is a guest blogger. This is one of a series of presentation summaries from the conference.
