As Apache Mahout is about to release its next version (0.3), I thought I would share some thoughts on how it might be integrated with Apache Lucene and Apache Solr. For those who aren’t aware of Mahout, it is an ASF project building out a library of machine learning algorithms that are designed to be scalable (often via Apache Hadoop) and licensed under the Apache Software License (i.e., commercially friendly). Mahout has a variety of algorithms already implemented, ranging from clustering to classification and collaborative filtering. For more on Mahout, see my TriJUG talk or my developerWorks article. Instead of going over the litany of things implemented in Mahout, I’ll give a quick recap of what the primary features of 0.3 are:
- New math, collections modules based on the time tested Colt project
- LLR (Log-likelihood ratio – See Lucidworks advisor Ted Dunning’s blog entry for more info) co-location implementation
- Hadoop-based Lanczos SVD (Singular Value Decomposition) solver — good for feature reduction, which is a common requirement at scale
- Shell scripts for easier running of algorithms, examples
- Faster Frequent Pattern Growth (FPGrowth) using FP-bonsai pruning
- Parallel Dirichlet process clustering (model-based clustering algorithm)
- Parallel co-occurrence based recommender
- Code cleanup, many bug fixes and performance improvements
- A new Logo:
Enough of the background; let’s get to what we can do right now. I’ll break it down into three groups:
- Lucene/Solr as a Data Source for Mahout batch processing
- Document/Results Augmentation (clustering, classification, recommendations)
- Learning about your data and your users (log analysis with Apache Mahout)
In Part I (this post), I’m going to focus on #1 as a way for people to get started without having to do any coding. In Part II, I’ll focus on #2 and finally, as you might guess, Part III will focus on #3.
Lucene/Solr as a Data Source for Mahout
Most Apache Mahout algorithms run off of Feature Vectors. For those in the Lucene world, a feature vector should feel very familiar. It is, more or less a document, or some subset of a document. Specifically, a feature vector is a tuple of features that are useful for the algorithm. It is up to you to determine what features work best. In many cases for Mahout, a vector is simply a tuple of weights for each of the words in a document. In other cases, they might be the values from the output of some manufacturing process. Do note that the features for having good search capabilities are often different than those needed for good machine learning. For instance, in my experiments with Mahout’s clustering capabilities, I need far more aggressive stopword removal to get good results than I do for search. (In fact, for search these days, I often don’t even remove stopwords, but instead deal with them at query time, but that is a whole other post.)
There are two different ways for Mahout to use Lucene/Solr as a data source:
- Utilize Lucene’s term vector capability to create Mahout feature vectors.
- Programmatically access low level Lucene features like TermEnum, TermDocs, TermPositions, etc. to construct features.
For this post, I’m going to focus on #1, as I have yet to even have a need for #2, even though in theory it could be done.
Mahout Vectors from Lucene Term Vectors
In order for Mahout to create vectors from a Lucene index, the first and foremost thing that must be done is that the index must contain Term Vectors. A term vector is a document centric view of the terms and their frequencies (as opposed to the inverted index, which is a term centric view) and is not on by default.
For this example, I’m going to use Solr’s example, located in <Solr Home>/example
In Solr, storing Term Vectors is as simple as setting termVectors=”true” on on the field in the schema, as in:
<field name=”text” type=”text” indexed=”true” stored=”true” termVectors=”true”/>
For pure Lucene, you will need to set the TermVector option on during Field creation, as in:
Field fld = new Field(“text”, “foo”, Field.Store.NO, Field.Index.ANALYZED, Field.TermVector.YES);
From here, it’s as simple as pointing Mahout’s new shell script (try running <MAHOUT HOME>/bin/mahout for a full listing of it’s capabilities) at the index and letting it rip:
<MAHOUT HOME>/bin/mahout lucene.vector –dir <PATH TO INDEX>/example/solr/data/index/ –output /tmp/foo/part-out.vec –field title-clustering –idField id –dictOut /tmp/foo/dict.out –norm 2
A few things to note about this command:
- This outputs a single vector file, title part-out.vec to the target/foo directory
- It uses the title-clustering field. If you want a combination of fields, then you will have to create a single “merged” field containing those fields. Solr’s <copyField> syntax can make this easy.
- The idField is used to provide a label to the Mahout vector such that the output from Mahout’s algorithms can be traced back to the actual documents.
- The –dictOut outputs the list of terms that are represented in the Mahout vectors. Mahout uses an internal, sparse vector representation for text documents (dense vector representations are also available) so this file contains the “key” for making sense of the vectors later. As an aside, if you ever have problems with Mahout, you can often share your vectors with the list and simply keep the dictionary to yourself, since it would be pretty difficult (not sure if it is impossible) to reverse engineer just the vectors.
- The –norm tells Mahout how to normalize the vector. For many Mahout applications, normalization is a necessary process for obtaining good results. In this case, I am using the Euclidean distance (aka the 2-norm) to normalize the vector because I intend to cluster the documents using the Euclidean distance similarity. Other approaches may require other norms.
Obviously, this script above can be run at any time, but I think it is even more interesting to hook it into Solr’s event system, with caveats. For those who aren’t familiar, Solr provides an event call back system for events like commit and optimize (see also the Lucidworks Reference Guide). Hooking into the event system is as simple as setting up the appropriate event listener. For this example, I’m going to hook into the commit listener by having it call out to the Mahout script above:
From here, one can easily extrapolate how a script could be written to then call Mahout’s other methods, namely things like clustering and Latent Dirichlet Allocation (LDA) for topic modeling. Alternatively, one could set up a process to watch for changes to the vector and then spawn a process to go and run the appropriate Mahout tasks.
So, what are the caveats with the above approach?
- If you are running in a commit heavy environment, you may not want to run Mahout on every commit. Mahout is designed for batch processing (well, most of it is, anyway) and most of these jobs are designed to run on Hadoop clusters. In order to do that, you would have to modify the above paths, etc. to have it output to Hadoop’s HDFS, which I’ll leave as an exercise to the reader (the Mathematician in me always enjoys saying that!)
- If you are running Solr in a distributed environment, you’re going to have to set things up appropriately on each node. Hopefully, as the Solr Cloud stuff matures, this will become even simpler and we should be able to do some really smart things to make Mahout and Solr work together in a distributed environment. For now, you’re on your own.
In the next posting, I’ll look at how we can more closely hook in Mahout into the indexing and search process. As a teaser, think about how you could use Mahout to classify and cluster large volumes of text and then have that information available for things like faceting, discovery and filtering on the search side.
As always, let me know if you have any questions or comments.