Better Feature Engineering with Spark, Solr, and Lucene Analyzers
This blog post is about new features in the Lucidworks spark-solr open source toolkit. For an introduction to the spark-solr project, see Solr as an Apache Spark SQL DataSource.
Performing text analysis in Spark
The Lucidworks spark-solr open source toolkit now contains tools to break down full text into words a.k.a. tokens using Lucene’s text analysis framework. Lucene text analysis is used under the covers by Solr when you index documents, to enable search, faceting, sorting, etc. But text analysis external to Solr can drive processes that won’t directly populate search indexes, like building machine learning models. In addition, extra-Solr analysis can allow expensive text analysis processes to be scaled separately from Solr’s document indexing process.
Lucene text analysis, via LuceneTextAnalyzer
The Lucene text analysis framework, a Java API, can be used directly in code you run on Spark, but the process of building an analysis pipeline and using it to extract tokens can be fairly complex. The spark-solr LuceneTextAnalyzer class aims to simplify access to this API via a streamlined interface. All of the analyze*() methods produce only text tokens – that is, none of the metadata associated with tokens (so-called “attributes”) produced by the Lucene analysis framework is output: token position increment and length, beginning and ending character offset, token type, etc. If these are important for your use case, see the “Extra-Solr Text Analysis” section below.
LuceneTextAnalyzer uses a stripped-down JSON schema with two sections: the analyzers section configures one or more named analysis pipelines, and the fields section maps field names to analyzers. We chose to define a schema separately from Solr’s schema because many of Solr’s schema features aren’t applicable outside of a search context, e.g.: separate indexing and query analysis; query-to-document similarity; non-text fields; indexed/stored/doc values specification; etc.
Lucene text analysis consists of three sequential phases: character filtering – whole-text modification; tokenization, in which the resulting text is split into tokens; and token filtering – modification/addition/removal of the produced tokens.
Here’s the skeleton of a schema with two defined analysis pipelines:
{ "analyzers": [{ "name": "...", "charFilters": [{ "type": "...", ...}, ... ], "tokenizer": { "type": "...", ... }, "filters": [{ "type": "...", ... } ... ] }] }, { "name": "...", "charFilters": [{ "type": "...", ...}, ... ], "tokenizer": { "type": "...", ... }, "filters": [{ "type": "...", ... }, ... ] }] } ], "fields": [{"name": "...", "analyzer": "..."}, { "regex": ".+", "analyzer": "..." }, ... ] }
In each JSON object in the analyzers array, there may be:
- zero or more character filters, configured via an optional charFilters array of JSON objects;
- exactly one tokenizer, configured via the required tokenizer JSON object; and
- zero or more token filters, configured by an optional filters array of JSON objects.
Classes implementing each of these three kinds of analysis components are referred to via the required type key in the component’s configuration object. Its value is the class’s SPI name: the class’s simple name, matched case-insensitively, with the -CharFilterFactory, -TokenizerFactory, or -(Token)FilterFactory suffix removed. See the javadocs for Lucene’s CharFilterFactory, TokenizerFactory and TokenFilterFactory classes for a list of subclasses; each subclass’s javadocs describe the configuration parameters that may be specified as key/value pairs in that component’s configuration JSON object in the schema.
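For example, under this naming convention HTMLStripCharFilterFactory becomes "htmlstrip", StandardTokenizerFactory becomes "standard", and LowerCaseFilterFactory becomes "lowercase". Below is an illustrative sketch of a single pipeline touching all three component kinds; the length filter’s min/max key/value pairs follow the parameter names documented in Lucene’s LengthFilterFactory javadocs, and the analyzer name is arbitrary:

{ "analyzers": [{ "name": "HtmlStdTokLower",
                  "charFilters": [{ "type": "htmlstrip" }],
                  "tokenizer": { "type": "standard" },
                  "filters": [{ "type": "lowercase" },
                              { "type": "length", "min": "2", "max": "32" }] }],
  "fields": [{ "regex": ".+", "analyzer": "HtmlStdTokLower" }] }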
Below is a Scala snippet to display counts for the top 10 most frequent words extracted from spark-solr’s top-level README.adoc file, using LuceneTextAnalyzer configured with an analyzer consisting of StandardTokenizer (which implements the word break rules from Unicode’s UAX#29 standard) and LowerCaseFilter, a filter to downcase the extracted tokens. If you would like to play along at home: clone the spark-solr source code from GitHub; change directory to the root of the project; build the project (via mvn -DskipTests package); start the Spark shell (via $SPARK_HOME/bin/spark-shell --jars target/spark-solr-2.1.0-SNAPSHOT-shaded.jar); type :paste into the shell; and finally paste the code below into the shell after it prints // Entering paste mode (ctrl-D to finish):
import com.lucidworks.spark.analysis.LuceneTextAnalyzer
val schema = """{ "analyzers": [{ "name": "StdTokLower",
                |                 "tokenizer": { "type": "standard" },
                |                 "filters": [{ "type": "lowercase" }] }],
                |  "fields": [{ "regex": ".+", "analyzer": "StdTokLower" }] }
              """.stripMargin
val analyzer = new LuceneTextAnalyzer(schema)
val file = sc.textFile("README.adoc")
val counts = file.flatMap(line => analyzer.analyze("anything", line))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)
                 .sortBy(_._2, false) // descending sort by count
println(counts.take(10).map(t => s"${t._1}(${t._2})").mkString(", "))
The top 10 token(count) tuples will be printed out:
the(158), to(103), solr(86), spark(77), a(72), in(44), you(44), of(40), for(35), from(34)
In the schema above, all field names are mapped to the StdTokLower analyzer via the "regex": ".+" mapping in the fields section – that’s why the call to analyzer.analyze() uses "anything" as the field name.
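For instance, because the ".+" regex matches any field name, the following calls (reusing the analyzer defined above) are analyzed identically – the field name only selects which analysis pipeline is applied; the input text here is arbitrary:

// Both field names match the ".+" regex, so both use the StdTokLower pipeline
println(analyzer.analyze("title", "Ignorance is Bliss")) // tokens: ignorance, is, bliss
println(analyzer.analyze("body", "Ignorance is Bliss"))  // same tokens as above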
The results include lots of prepositions (“to”, “in”, “of”, “for”, “from”) and articles (“the” and “a”) – it would be nice to exclude those from our top 10 list. Lucene includes a token filter named StopFilter that removes words matching a blacklist, and it ships with a default set of English stopwords containing several prepositions and articles. Let’s add another analyzer to our schema that builds on our original analyzer by adding StopFilter:
import com.lucidworks.spark.analysis.LuceneTextAnalyzer
val schema = """{ "analyzers": [{ "name": "StdTokLower",
                |                 "tokenizer": { "type": "standard" },
                |                 "filters": [{ "type": "lowercase" }] },
                |               { "name": "StdTokLowerStop",
                |                 "tokenizer": { "type": "standard" },
                |                 "filters": [{ "type": "lowercase" },
                |                             { "type": "stop" }] }],
                |  "fields": [{ "name": "all_tokens", "analyzer": "StdTokLower" },
                |             { "name": "no_stopwords", "analyzer": "StdTokLowerStop" } ]}
              """.stripMargin
val analyzer = new LuceneTextAnalyzer(schema)
val file = sc.textFile("README.adoc")
val counts = file.flatMap(line => analyzer.analyze("no_stopwords", line))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)
                 .sortBy(_._2, false) // descending sort by count
println(counts.take(10).map(t => s"${t._1}(${t._2})").mkString(", "))
In the schema above, instead of mapping all fields to the original analyzer, only the all_tokens field will be mapped to the StdTokLower analyzer, and the no_stopwords field will be mapped to our new StdTokLowerStop analyzer.
spark-shell will print:
solr(86), spark(77), you(44), from(34), source(32), query(25), option(25), collection(24), data(20), can(19)
As you can see, the list above now contains more meaningful tokens from the file.
For more details about the schema, see the annotated example in the LuceneTextAnalyzer scaladocs.
LuceneTextAnalyzer has several other analysis methods: analyzeMV() to perform analysis on multi-valued input, and analyze(MV)Java() convenience methods that accept and emit Java-friendly data structures. There is also an overloaded set of these methods that takes in a map keyed on field name, with text values to be analyzed – these methods return a map from field names to output token sequences.
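Here is a minimal sketch of how these variants might be called – the exact signatures (a Seq[String] for multi-valued input, and a Map keyed on field name for the per-field overload) are assumptions based on the description above:

// Multi-valued input: each value of the (hypothetical) field is run through its mapped analyzer
val mvTokens = analyzer.analyzeMV("anything", Seq("First value here.", "Second value."))

// Map overload: analyze several fields at once; returns a map from field name to token sequence
val tokensByField = analyzer.analyze(Map(
  "title" -> "Spark and Solr",
  "body"  -> "Feature engineering with Lucene analyzers"))
tokensByField.foreach { case (field, tokens) => println(s"$field: ${tokens.mkString(", ")}") }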
Extracting text features in spark.ml pipelines
The spark.ml machine learning library includes a limited number of transformers that enable simple text analysis, but none support more than one input column, and none support multi-valued input columns.
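For comparison, here is a minimal sketch of spark.ml’s built-in RegexTokenizer, which accepts exactly one string-typed input column per instance (the column names are just for illustration):

import org.apache.spark.ml.feature.RegexTokenizer

// One input column in, one output column out – tokenizing several text columns
// would require a separate RegexTokenizer (and output column) per input column.
val wsTokenizer = new RegexTokenizer()
  .setInputCol("content_txt")    // a single StringType column
  .setOutputCol("content_words")
  .setPattern("\\s+")            // split on whitespace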
The spark-solr project includes LuceneTextAnalyzerTransformer, which uses LuceneTextAnalyzer and its schema format, described above, to extract tokens from one or more DataFrame text columns, where each input column’s analysis configuration is specified by the schema.
If you don’t supply a schema (via e.g. the setAnalysisSchema() method), LuceneTextAnalyzerTransformer uses the default schema, below, which analyzes all fields in the same way: StandardTokenizer followed by LowerCaseFilter:
{ "analyzers": [{ "name": "StdTok_LowerCase", "tokenizer": { "type": "standard" }, "filters": [{ "type": "lowercase" }] }], "fields": [{ "regex": ".+", "analyzer": "StdTok_LowerCase" }] }
LuceneTextAnalyzerTransformer puts all tokens extracted from all input columns into a single output column. If you want to keep the vocabulary from each column distinct from other columns’, you can prefix the tokens with the input column from which they came, e.g. word from column1 becomes column1=word – this option is turned off by default.
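Putting this together, here is a hedged sketch of applying the transformer directly to a DataFrame. setInputCols and setOutputCol appear in the MLPipelineScala example below, while the import path, the exact setInputCols argument type, the setPrefixTokensWithInputCol setter name, and the articles DataFrame are assumptions for illustration:

// Import path assumed for illustration – check the spark-solr source for the actual package.
import com.lucidworks.spark.ml.feature.LuceneTextAnalyzerTransformer

val analyzer = new LuceneTextAnalyzerTransformer()
  .setInputCols(Array("Subject_t", "content_txt")) // hypothetical input columns; argument type assumed
  .setOutputCol("words")
  .setPrefixTokensWithInputCol(true) // setter name assumed from the prefixTokensWithInputCol param

// 'articles' is a hypothetical DataFrame containing the two text columns above; the result
// gains a single "words" column whose tokens are prefixed like "Subject_t=solr".
val withTokens = analyzer.transform(articles)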
You can see LuceneTextAnalyzerTransformer in action in the spark-solr MLPipelineScala example, which shows how to use LuceneTextAnalyzerTransformer to extract text features to build a classification model to predict the newsgroup an article was posted to, based on the article’s text. If you wish to run this example, which expects the 20 newsgroups data to be indexed into a Solr cloud collection, follow the instructions in the scaladoc of the NewsgroupsIndexer example, then follow the instructions in the scaladoc of the MLPipelineScala example.
The MLPipelineScala example builds a Naive Bayes classifier by performing K-fold cross validation with hyper-parameter search over, among several other params’ values, whether or not to prefix tokens with the column from which they were extracted, and two different analysis schemas:
val WhitespaceTokSchema =
  """{ "analyzers": [{ "name": "ws_tok", "tokenizer": { "type": "whitespace" } }],
    |  "fields": [{ "regex": ".+", "analyzer": "ws_tok" }] }""".stripMargin
val StdTokLowerSchema =
  """{ "analyzers": [{ "name": "std_tok_lower", "tokenizer": { "type": "standard" },
    |                  "filters": [{ "type": "lowercase" }] }],
    |  "fields": [{ "regex": ".+", "analyzer": "std_tok_lower" }] }""".stripMargin
[...]
val analyzer = new LuceneTextAnalyzerTransformer().setInputCols(contentFields).setOutputCol(WordsCol)
[...]
val paramGridBuilder = new ParamGridBuilder()
  .addGrid(hashingTF.numFeatures, Array(1000, 5000))
  .addGrid(analyzer.analysisSchema, Array(WhitespaceTokSchema, StdTokLowerSchema))
  .addGrid(analyzer.prefixTokensWithInputCol)
When I run MLPipelineScala, the following log output says that the std_tok_lower analyzer outperformed the ws_tok analyzer, and that not prepending the input column onto tokens worked better:
2016-04-08 18:17:38,106 [main] INFO CrossValidator - Best set of parameters:
{
  LuceneAnalyzer_9dc1a9c71e1f-analysisSchema:
    { "analyzers": [{ "name": "std_tok_lower", "tokenizer": { "type": "standard" },
                      "filters": [{ "type": "lowercase" }] }],
      "fields": [{ "regex": ".+", "analyzer": "std_tok_lower" }] },
  hashingTF_f24bc3f814bc-numFeatures: 5000,
  LuceneAnalyzer_9dc1a9c71e1f-prefixTokensWithInputCol: false,
  nb_1a5d9df2b638-smoothing: 0.5
}
Extra-Solr Text Analysis
Solr’s PreAnalyzedField field type enables the results of text analysis performed outside of Solr to be passed in and indexed/stored as if the analysis had been performed in Solr.
As of this writing, the spark-solr project depends on Solr 5.4.1, but prior to Solr 5.5.0, querying against fields of type PreAnalyzedField was not fully supported – see Solr JIRA issue SOLR-4619 for more information.
There is a branch on the spark-solr project, not yet committed to master or released, that adds the ability to produce JSON that can be parsed, then indexed and optionally stored, by Solr’s PreAnalyzedField.
Below is a Scala snippet to produce pre-analyzed JSON for a small piece of text using LuceneTextAnalyzer configured with an analyzer consisting of StandardTokenizer + LowerCaseFilter. If you would like to try this at home: clone the spark-solr source code from GitHub; change directory to the root of the project; check out the branch (via git checkout SPAR-14-LuceneTextAnalyzer-PreAnalyzedField-JSON); build the project (via mvn -DskipTests package); start the Spark shell (via $SPARK_HOME/bin/spark-shell --jars target/spark-solr-2.1.0-SNAPSHOT-shaded.jar); type :paste into the shell; and finally paste the code below into the shell after it prints // Entering paste mode (ctrl-D to finish):
import com.lucidworks.spark.analysis.LuceneTextAnalyzer
val schema = """{ "analyzers": [{ "name": "StdTokLower",
                |                 "tokenizer": { "type": "standard" },
                |                 "filters": [{ "type": "lowercase" }] }],
                |  "fields": [{ "regex": ".+", "analyzer": "StdTokLower" }] }
              """.stripMargin
val analyzer = new LuceneTextAnalyzer(schema)
val text = "Ignorance extends Bliss."
val fieldName = "myfield"
println(analyzer.toPreAnalyzedJson(fieldName, text, stored = true))
The following will be output (whitespace added for readability); in each token object, "t" is the token text, "s" and "e" are the start and end character offsets, and "i" is the position increment:
{"v":"1","str":"Ignorance extends Bliss.","tokens":[ {"t":"ignorance","s":0,"e":9,"i":1}, {"t":"extends","s":10,"e":17,"i":1}, {"t":"bliss","s":18,"e":23,"i":1}]}
If we make the value of the stored option false, then the str key, with the original text as its value, will not be included in the output JSON.
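For example, reusing the analyzer, fieldName, and text values from the snippet above:

// With stored = false, the output JSON omits the "str" entry and keeps only "v" and "tokens"
println(analyzer.toPreAnalyzedJson(fieldName, text, stored = false))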
Summary
LuceneTextAnalyzer simplifies Lucene text analysis from Spark and enables use of Solr’s PreAnalyzedField, while LuceneTextAnalyzerTransformer leverages the same Lucene text analysis for better text feature extraction in spark.ml pipelines.