This blog post is about new features in the Lucidworks spark-solr open source toolkit. For an introduction to the spark-solr project, see Solr as an Apache Spark SQL DataSource.

Performing text analysis in Spark

The Lucidworks spark-solr open source toolkit now contains tools to break down full text into words a.k.a. tokens using Lucene’s text analysis framework. Lucene text analysis is used under the covers by Solr when you index documents, to enable search, faceting, sorting, etc. But text analysis external to Solr can drive processes that won’t directly populate search indexes, like building machine learning models. In addition, extra-Solr analysis can allow expensive text analysis processes to be scaled separately from Solr’s document indexing process.

Lucene text analysis, via LuceneTextAnalyzer

The Lucene text analysis framework, a Java API, can be used directly in code you run on Spark, but the process of building an analysis pipeline and using it to extract tokens can be fairly complex. The spark-solr LuceneTextAnalyzer class aims to simplify access to this API via a streamlined interface. All of the analyze*() methods produce only text tokens – that is, none of the metadata associated with tokens (so-called “attributes”) produced by the Lucene analysis framework is output: token position increment and length, beginning and ending character offset, token type, etc. If these are important for your use case, see the “Extra-Solr Text Analysis” section below.

LuceneTextAnalyzer uses a stripped-down JSON schema with two sections: the analyzers section configures one or more named analysis pipelines; and the fields section maps field names to analyzers. We chose to define a schema separately from Solr’s schema because many of Solr’s schema features aren’t applicable outside of a search context, e.g.: separate indexing and query analysis; query-to-document similarity; non-text fields; indexed/stored/doc values specification; etc.

Lucene text analysis consists of three sequential phases: character filtering – whole-text modification; tokenization, in which the resulting text is split into tokens; and token filtering – modification/addition/removal of the produced tokens.

Here’s the skeleton of a schema with two defined analysis pipelines:

{ "analyzers": [{ "name": "...",
                    "charFilters": [{ "type": "...", ...}, ... ], 
                    "tokenizer": { "type": "...", ... },
                    "filters": [{ "type": "...", ... } ... ] }] },
                { "name": "...", 
                    "charFilters": [{ "type": "...", ...}, ... ], 
                    "tokenizer": { "type": "...", ... },
                    "filters": [{ "type": "...", ... }, ... ] }] } ],
  "fields": [{"name": "...", "analyzer": "..."}, { "regex": ".+", "analyzer": "..." }, ... ] }

In each JSON object in the analyzers array, there may be:

  • zero or more character filters, configured via an optional charFilters array of JSON objects;
  • exactly one tokenizer, configured via the required tokenizer JSON object; and
  • zero or more token filters, configured by an optional filters array of JSON objects.

Each of these three kinds of analysis components is configured with a required type key, whose value is the implementing class’s SPI name: the class’s simple name, matched case-insensitively, with the -CharFilterFactory, -TokenizerFactory, or -(Token)FilterFactory suffix removed. See the javadocs for Lucene’s CharFilterFactory, TokenizerFactory and TokenFilterFactory classes for lists of subclasses; each subclass’s javadocs describe the configuration parameters that may be specified as key/value pairs in that component’s configuration JSON object in the schema.
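
For example, here is a hypothetical analyzer (not one used later in this post) that strips HTML markup via HTMLStripCharFilterFactory (SPI name htmlstrip), tokenizes with StandardTokenizerFactory (standard), then applies LowerCaseFilterFactory (lowercase) and StopFilterFactory (stop); component parameters, such as StopFilter’s ignoreCase, are passed as additional key/value pairs alongside type:

{ "analyzers": [{ "name": "html_lower_stop",
                  "charFilters": [{ "type": "htmlstrip" }],
                  "tokenizer": { "type": "standard" },
                  "filters": [{ "type": "lowercase" },
                              { "type": "stop", "ignoreCase": "true" }] }],
  "fields": [{ "regex": ".+", "analyzer": "html_lower_stop" }] }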

Below is a Scala snippet that displays counts for the top 10 most frequent words extracted from spark-solr’s top-level README.adoc file, using LuceneTextAnalyzer configured with an analyzer consisting of StandardTokenizer (which implements the word break rules from Unicode’s UAX#29 standard) followed by LowerCaseFilter, which downcases the extracted tokens. If you would like to play along at home: clone the spark-solr source code from GitHub; change directory to the root of the project; build the project (via mvn -DskipTests package); start the Spark shell (via $SPARK_HOME/bin/spark-shell --jars target/spark-solr-2.1.0-SNAPSHOT-shaded.jar); type :paste into the shell; and finally paste the code below into the shell after it prints // Entering paste mode (ctrl-D to finish):

import com.lucidworks.spark.analysis.LuceneTextAnalyzer
val schema = """{ "analyzers": [{ "name": "StdTokLower",
               |                  "tokenizer": { "type": "standard" },
               |                  "filters": [{ "type": "lowercase" }] }], 
               |  "fields": [{ "regex": ".+", "analyzer": "StdTokLower" }] }
             """.stripMargin
val analyzer = new LuceneTextAnalyzer(schema)
val file = sc.textFile("README.adoc")
val counts = file.flatMap(line => analyzer.analyze("anything", line))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)
                 .sortBy(_._2, false) // descending sort by count
println(counts.take(10).map(t => s"${t._1}(${t._2})").mkString(", "))

The top 10 token(count) tuples will be printed out:

the(158), to(103), solr(86), spark(77), a(72), in(44), you(44), of(40), for(35), from(34)

In the schema above, all field names are mapped to the StdTokLower analyzer via the "regex": ".+" mapping in the fields section – that’s why the call to analyzer.analyze() uses "anything" as the field name.

The results include lots of prepositions (“to”, “in”, “of”, “for”, “from”) and articles (“the” and “a”) – it would be nice to exclude those from our top 10 list. Lucene provides a token filter named StopFilter that removes words found on a blacklist, and it ships with a default set of English stopwords that includes several prepositions and articles. Let’s add another analyzer to our schema that builds on the original analyzer by adding StopFilter:

import com.lucidworks.spark.analysis.LuceneTextAnalyzer
val schema = """{ "analyzers": [{ "name": "StdTokLower",
               |                  "tokenizer": { "type": "standard" },
               |                  "filters": [{ "type": "lowercase" }] },
               |                { "name": "StdTokLowerStop",
               |                  "tokenizer": { "type": "standard" },
               |                  "filters": [{ "type": "lowercase" },
               |                              { "type": "stop" }] }], 
               |  "fields": [{ "name": "all_tokens", "analyzer": "StdTokLower" },
               |             { "name": "no_stopwords", "analyzer": "StdTokLowerStop" } ]}
             """.stripMargin
val analyzer = new LuceneTextAnalyzer(schema)
val file = sc.textFile("README.adoc")
val counts = file.flatMap(line => analyzer.analyze("no_stopwords", line))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)
                 .sortBy(_._2, false)
println(counts.take(10).map(t => s"${t._1}(${t._2})").mkString(", "))

In the schema above, instead of mapping all fields to the original analyzer, only the all_tokens field will be mapped to the StdTokLower analyzer, and the no_stopwords field will be mapped to our new StdTokLowerStop analyzer.

spark-shell will print:

solr(86), spark(77), you(44), from(34), source(32), query(25), option(25), collection(24), data(20), can(19)

As you can see, the list above contains more important tokens from the file.
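
To see the two field mappings side by side, you can also call analyze() directly on a single piece of text. The commented results below are what I would expect the schema above to produce – a sketch for illustration, not spark-shell output:

// Comparing the two field mappings on one input
analyzer.analyze("all_tokens",   "the Spark shell")  // the, spark, shell
analyzer.analyze("no_stopwords", "the Spark shell")  // spark, shell ("the" removed by StopFilter)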

For more details about the schema, see the annotated example in the LuceneTextAnalyzer scaladocs.

LuceneTextAnalyzer has several other analysis methods: analyzeMV() performs analysis on multi-valued input, and the analyze(MV)Java() convenience methods accept and emit Java-friendly data structures. Each of these methods is also overloaded to take a map keyed on field name, with the text values to be analyzed; these overloads return a map from field names to output token sequences.
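
Here is a sketch of how these methods might be called; the exact signatures are my assumptions based on the description above, so consult the LuceneTextAnalyzer scaladocs for the authoritative API:

// Single-valued and multi-valued analysis for one field (signatures assumed)
val tokens   = analyzer.analyze("body", "Spark and Solr")            // one token sequence
val mvTokens = analyzer.analyzeMV("body", Seq("Spark", "and Solr"))  // tokens from all values

// Map-based overload: analyze several fields at once, returning a map
// from field name to output token sequence
val byField = analyzer.analyze(Map("title" -> "Solr as a DataSource",
                                   "body"  -> "Spark and Solr"))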

Extracting text features in spark.ml pipelines

The spark.ml machine learning library includes a limited number of transformers that enable simple text analysis, but none support more than one input column, and none support multi-valued input columns.

The spark-solr project includes LuceneTextAnalyzerTransformer, which uses LuceneTextAnalyzer and its schema format, described above, to extract tokens from one or more DataFrame text columns, where each input column’s analysis configuration is specified by the schema.

If you don’t supply a schema (via e.g. the setAnalysisSchema() method), LuceneTextAnalyzerTransformer uses the default schema, below, which analyzes all fields in the same way: StandardTokenizer followed by LowerCaseFilter:

{ "analyzers": [{ "name": "StdTok_LowerCase",
                  "tokenizer": { "type": "standard" }, "filters": [{ "type": "lowercase" }] }],
  "fields": [{ "regex": ".+", "analyzer": "StdTok_LowerCase" }] }

LuceneTextAnalyzerTransformer puts all tokens extracted from all input columns into a single output column. If you want to keep the vocabulary from each column distinct from other columns’, you can prefix the tokens with the input column from which they came, e.g. word from column1 becomes column1=word – this option is turned off by default.
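
Configuration looks roughly like the sketch below. setInputCols and setOutputCol appear in the MLPipelineScala excerpt further down, and setAnalysisSchema was mentioned above; setPrefixTokensWithInputCol is my assumption from the prefixTokensWithInputCol param in that excerpt, following the usual spark.ml setter naming convention:

// A sketch of configuring the transformer (see MLPipelineScala for a complete pipeline)
val analyzer = new LuceneTextAnalyzerTransformer()
  .setAnalysisSchema(schema)                  // optional; omit to use the default schema above
  .setInputCols(Array("column1", "column2"))  // one or more DataFrame text columns
  .setOutputCol("words")                      // all extracted tokens go into this single column
  .setPrefixTokensWithInputCol(true)          // assumed setter: "word" from column1 becomes "column1=word"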

You can see LuceneTextAnalyzerTransformer in action in the spark-solr MLPipelineScala example, which shows how to use LuceneTextAnalyzerTransformer to extract text features to build a classification model to predict the newsgroup an article was posted to, based on the article’s text. If you wish to run this example, which expects the 20 newsgroups data to be indexed into a Solr cloud collection, follow the instructions in the scaladoc of the NewsgroupsIndexer example, then follow the instructions in the scaladoc of the MLPipelineScala example.

The MLPipelineScala example builds a Naive Bayes classifier by performing k-fold cross validation with hyper-parameter search over, among several other params’ values, whether or not to prefix tokens with the column from which they were extracted, and two different analysis schemas:

  val WhitespaceTokSchema =
    """{ "analyzers": [{ "name": "ws_tok", "tokenizer": { "type": "whitespace" } }],
      |  "fields": [{ "regex": ".+", "analyzer": "ws_tok" }] }""".stripMargin
  val StdTokLowerSchema =
    """{ "analyzers": [{ "name": "std_tok_lower", "tokenizer": { "type": "standard" },
      |                  "filters": [{ "type": "lowercase" }] }],
      |  "fields": [{ "regex": ".+", "analyzer": "std_tok_lower" }] }""".stripMargin
[...]
  val analyzer = new LuceneTextAnalyzerTransformer().setInputCols(contentFields).setOutputCol(WordsCol)
[...]
  val paramGridBuilder = new ParamGridBuilder()
    .addGrid(hashingTF.numFeatures, Array(1000, 5000))
    .addGrid(analyzer.analysisSchema, Array(WhitespaceTokSchema, StdTokLowerSchema))
    .addGrid(analyzer.prefixTokensWithInputCol)

When I run MLPipelineScala, the following log output shows that the std_tok_lower analyzer outperformed the ws_tok analyzer, and that not prefixing tokens with their input column worked better:

2016-04-08 18:17:38,106 [main] INFO  CrossValidator  - Best set of parameters:
{
	LuceneAnalyzer_9dc1a9c71e1f-analysisSchema: { "analyzers": [{ "name": "std_tok_lower", "tokenizer": { "type": "standard" },
                  "filters": [{ "type": "lowercase" }] }],
  "fields": [{ "regex": ".+", "analyzer": "std_tok_lower" }] },
	hashingTF_f24bc3f814bc-numFeatures: 5000,
	LuceneAnalyzer_9dc1a9c71e1f-prefixTokensWithInputCol: false,
	nb_1a5d9df2b638-smoothing: 0.5
}

Extra-Solr Text Analysis

Solr’s PreAnalyzedField field type enables the results of text analysis performed outside of Solr to be passed in and indexed/stored as if the analysis had been performed in Solr.

As of this writing, the spark-solr project depends on Solr 5.4.1, but prior to Solr 5.5.0, querying against fields of type PreAnalyzedField was not fully supported – see Solr JIRA issue SOLR-4619 for more information.

There is a branch on the spark-solr project, not yet committed to master or released, that adds the ability to produce JSON that can be parsed, then indexed and optionally stored, by Solr’s PreAnalyzedField.

Below is a Scala snippet that produces pre-analyzed JSON for a small piece of text, using LuceneTextAnalyzer configured with an analyzer consisting of StandardTokenizer+LowerCaseFilter. If you would like to try this at home: clone the spark-solr source code from GitHub; change directory to the root of the project; check out the branch (via git checkout SPAR-14-LuceneTextAnalyzer-PreAnalyzedField-JSON); build the project (via mvn -DskipTests package); start the Spark shell (via $SPARK_HOME/bin/spark-shell --jars target/spark-solr-2.1.0-SNAPSHOT-shaded.jar); type :paste into the shell; and finally paste the code below into the shell after it prints // Entering paste mode (ctrl-D to finish):

import com.lucidworks.spark.analysis.LuceneTextAnalyzer
val schema = """{ "analyzers": [{ "name": "StdTokLower",
               |                  "tokenizer": { "type": "standard" },
               |                  "filters": [{ "type": "lowercase" }] }], 
               |  "fields": [{ "regex": ".+", "analyzer": "StdTokLower" }] }
             """.stripMargin
val analyzer = new LuceneTextAnalyzer(schema)
val text = "Ignorance extends Bliss."
val fieldName = "myfield"
println(analyzer.toPreAnalyzedJson(fieldName, text, stored = true))

The following will be output (whitespace added):

{"v":"1","str":"Ignorance extends Bliss.","tokens":[
  {"t":"ignorance","s":0,"e":9,"i":1},
  {"t":"extends","s":10,"e":17,"i":1},
  {"t":"bliss","s":18,"e":23,"i":1}]}

If we make the value of the stored option false, then the str key, with the original text as its value, will not be included in the output JSON.
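
For example, calling the method with stored = false should produce the same tokens without the str key (whitespace added as before):

println(analyzer.toPreAnalyzedJson(fieldName, text, stored = false))
// {"v":"1","tokens":[
//   {"t":"ignorance","s":0,"e":9,"i":1},
//   {"t":"extends","s":10,"e":17,"i":1},
//   {"t":"bliss","s":18,"e":23,"i":1}]}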

Summary

LuceneTextAnalyzer simplifies Lucene text analysis, and enables use of Solr’s PreAnalyzedField. LuceneTextAnalyzerTransformer allows for better text feature extraction by leveraging Lucene text analysis.
