Introduction
As Apache Mahout is about to release its next version (0.3), I thought I would share some thoughts on how it might be integrated with Apache Lucene and Apache Solr. For those who aren’t aware of Mahout, it is an ASF project building out a library of machine learning algorithms that are designed to be scalable (often via Apache Hadoop) and licensed under the Apache Software License (i.e., commercially friendly). Mahout has a variety of algorithms already implemented, ranging from clustering to classification and collaborative filtering. For more on Mahout, see my TriJUG talk or my developerWorks article. Instead of going over the litany of things implemented in Mahout, I’ll give a quick recap of what the primary features of 0.3 are:
- New math, collections modules based on the time tested Colt project
- LLR (log-likelihood ratio; see LucidWorks advisor Ted Dunning’s blog entry for more info) collocation implementation
- Hadoop-based Lanczos SVD (Singular Value Decomposition) solver — good for feature reduction, which is a common requirement at scale
- Shell scripts for easier running of algorithms, examples
- Faster Frequent Pattern Growth (FPGrowth) using FP-bonsai pruning
- Parallel Dirichlet process clustering (model-based clustering algorithm)
- Parallel co-occurrence based recommender
- Code cleanup, many bug fixes and performance improvements
- A new logo
Enough of the background; let’s get to what we can do right now. I’ll break it down into three groups:
- Lucene/Solr as a Data Source for Mahout batch processing
- Document/Results Augmentation (clustering, classification, recommendations)
- Learning about your data and your users (log analysis with Apache Mahout)
In Part I (this post), I’m going to focus on #1 as a way for people to get started without having to do any coding. In Part II, I’ll focus on #2 and finally, as you might guess, Part III will focus on #3.
Lucene/Solr as a Data Source for Mahout
Most Apache Mahout algorithms run off of feature vectors. For those in the Lucene world, a feature vector should feel very familiar: it is, more or less, a document, or some subset of a document. Specifically, a feature vector is a tuple of features that are useful for the algorithm, and it is up to you to determine which features work best. In many cases for Mahout, a vector is simply a tuple of weights, one for each of the words in a document. In other cases, the features might be the values from the output of some manufacturing process. Do note that the features needed for good search are often different from those needed for good machine learning. For instance, in my experiments with Mahout’s clustering capabilities, I need far more aggressive stopword removal to get good results than I do for search. (In fact, for search these days, I often don’t remove stopwords at all and instead deal with them at query time, but that is a whole other post.)
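To make the idea of a feature vector concrete, here is a minimal sketch using Mahout’s math API (RandomAccessSparseVector and NamedVector; check your Mahout version, as these classes have moved around between releases). The dictionary positions and weights are made up purely for illustration:
import org.apache.mahout.math.NamedVector;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;

// Cardinality is the size of the dictionary (the number of distinct terms).
Vector features = new RandomAccessSparseVector(10000);
// Each index is a term's position in the dictionary; each value is its weight (e.g., TF or TF-IDF).
features.set(42, 2.0);   // weight for, say, "mahout"
features.set(311, 0.5);  // weight for, say, "lucene"
// Wrap the vector with a name so algorithm output can be traced back to the original document.
Vector doc = new NamedVector(features, "doc-1234");
The lucene.vector tool described below produces vectors along these lines for you, so in practice you rarely build them by hand.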
There are two different ways for Mahout to use Lucene/Solr as a data source:
- Utilize Lucene’s term vector capability to create Mahout feature vectors.
- Programmatically access low level Lucene features like TermEnum, TermDocs, TermPositions, etc. to construct features.
For this post, I’m going to focus on #1, as I have yet to even have a need for #2, even though in theory it could be done.
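For the curious, here is a rough, untested sketch of what #2 might look like with the Lucene 3.x APIs: walk the terms of a field and their postings, and accumulate whatever features you like per document. The index path and field name are placeholders:
import java.io.File;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.index.TermEnum;
import org.apache.lucene.store.FSDirectory;

IndexReader reader = IndexReader.open(FSDirectory.open(new File("/path/to/index")));
TermEnum terms = reader.terms(new Term("text", ""));  // positioned at the first term in the "text" field
do {
  Term term = terms.term();
  if (term == null || !"text".equals(term.field())) {
    break;  // ran off the end of the field
  }
  TermDocs docs = reader.termDocs(term);
  while (docs.next()) {
    int docId = docs.doc();
    int freq = docs.freq();
    // Build up your own per-document features from (term, docId, freq) here.
  }
  docs.close();
} while (terms.next());
terms.close();
reader.close();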
Mahout Vectors from Lucene Term Vectors
In order for Mahout to create vectors from a Lucene index, the first and foremost requirement is that the index contain term vectors. A term vector is a document-centric view of the terms in a document and their frequencies (as opposed to the inverted index, which is a term-centric view), and term vectors are not on by default.
For this example, I’m going to use Solr’s example, located in <Solr Home>/example
In Solr, storing term vectors is as simple as setting termVectors="true" on the field in the schema, as in:
<field name="text" type="text" indexed="true" stored="true" termVectors="true"/>
For pure Lucene, you will need to set the TermVector option on during Field creation, as in:
Field fld = new Field("text", "foo", Field.Store.NO, Field.Index.ANALYZED, Field.TermVector.YES);
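If you want to see the whole round trip, here is a small sketch (Lucene 3.0-era API, with a throwaway RAMDirectory) that indexes one document with term vectors turned on and then reads the per-document term/frequency pairs back out. It is only meant to illustrate the document-centric view described above; the field name and text are placeholders:
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.TermFreqVector;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

RAMDirectory dir = new RAMDirectory();
IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(Version.LUCENE_30),
    true, IndexWriter.MaxFieldLength.UNLIMITED);
Document doc = new Document();
// Term vectors must be enabled per field at index time.
doc.add(new Field("text", "the quick brown fox jumps over the lazy dog",
    Field.Store.NO, Field.Index.ANALYZED, Field.TermVector.YES));
writer.addDocument(doc);
writer.close();

IndexReader reader = IndexReader.open(dir);
// The term vector is the document-centric view: this document's terms and their frequencies.
TermFreqVector tfv = reader.getTermFreqVector(0, "text");
String[] termTexts = tfv.getTerms();
int[] freqs = tfv.getTermFrequencies();
for (int i = 0; i < termTexts.length; i++) {
  System.out.println(termTexts[i] + " -> " + freqs[i]);
}
reader.close();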
From here, it’s as simple as pointing Mahout’s new shell script (try running <MAHOUT HOME>/bin/mahout for a full listing of its capabilities) at the index and letting it rip:
<MAHOUT HOME>/bin/mahout lucene.vector --dir <PATH TO INDEX>/example/solr/data/index/ --output /tmp/foo/part-out.vec --field title-clustering --idField id --dictOut /tmp/foo/dict.out --norm 2
A few things to note about this command:
- This outputs a single vector file, named part-out.vec, to the /tmp/foo directory
- It uses the title-clustering field. If you want a combination of fields, then you will have to create a single “merged” field containing those fields. Solr’s <copyField> syntax can make this easy.
- The idField is used to provide a label to the Mahout vector such that the output from Mahout’s algorithms can be traced back to the actual documents.
- The --dictOut option outputs the list of terms that are represented in the Mahout vectors. Mahout uses an internal, sparse vector representation for text documents (dense vector representations are also available), so this file contains the "key" for making sense of the vectors later (the sketch after this list shows one way to peek inside the vector file). As an aside, if you ever have problems with Mahout, you can often share your vectors with the mailing list and simply keep the dictionary to yourself, since it would be pretty difficult (though perhaps not impossible) to reverse engineer the documents from just the vectors.
- The --norm option tells Mahout how to normalize the vectors. For many Mahout applications, normalization is a necessary step for obtaining good results. In this case, I am using the 2-norm (the Euclidean norm) because I intend to cluster the documents using a Euclidean distance measure. Other approaches may require other norms.
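If you want to sanity check what the tool wrote, the output is a Hadoop SequenceFile whose values are Mahout VectorWritables, keyed (as far as I recall) by the document id. Here is a small sketch of how you might peek inside it; the path is the one from the command above, and the key class is created reflectively so you don’t have to guess it:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;
import org.apache.mahout.math.VectorWritable;

Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
SequenceFile.Reader reader = new SequenceFile.Reader(fs, new Path("/tmp/foo/part-out.vec"), conf);
// Instantiate whatever key class this version of the tool used.
Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
VectorWritable value = new VectorWritable();
while (reader.next(key, value)) {
  // Print the document label and how many distinct terms ended up in its vector.
  System.out.println(key + " => " + value.get().getNumNondefaultElements() + " non-zero terms");
}
reader.close();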
Obviously, the lucene.vector command above can be run at any time, but I think it is even more interesting to hook it into Solr’s event system, with caveats. For those who aren’t familiar, Solr provides an event callback system for events like commit and optimize (see also the LucidWorks Reference Guide). Hooking into the event system is as simple as setting up the appropriate event listener. For this example, I’m going to hook into the commit listener by having it call out to the Mahout script above:
<listener event="postCommit">
  <str name="exe">/Volumes/User/grantingersoll/projects/lucene/mahout/clean/bin/mahout</str>
  <str name="dir">.</str>
  <bool name="wait">false</bool>
  <arr name="args">
    <str>lucene.vector</str>
    <str>--dir</str>
    <str>./solr/data/index/</str>
    <str>--output</str>
    <str>/tmp/mahout/vectors/part-out.vec</str>
    <str>--field</str>
    <str>text</str>
    <str>--idField</str>
    <str>id</str>
    <str>--dictOut</str>
    <str>/tmp/mahout/vectors/dict.dat</str>
    <str>--norm</str>
    <str>2</str>
    <str>--maxDFPercent</str>
    <str>90</str>
  </arr>
</listener>
From here, one can easily extrapolate how a script could be written to then call Mahout’s other methods, namely things like clustering and Latent Dirichlet Allocation (LDA) for topic modeling. Alternatively, one could set up a process to watch for changes to the vector and then spawn a process to go and run the appropriate Mahout tasks.
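As a rough illustration of that last idea, here is a sketch of spawning the k-Means job from plain Java once new vectors show up. The paths are placeholders, and the flag names are from memory, so double check them against bin/mahout kmeans --help for your version:
import java.io.File;

// All paths below are placeholders; point them at your own Mahout install and output dirs.
ProcessBuilder pb = new ProcessBuilder(
    "/path/to/mahout/bin/mahout", "kmeans",
    "-i", "/tmp/mahout/vectors/part-out.vec",     // vectors written by lucene.vector
    "-c", "/tmp/mahout/kmeans/initial-clusters",  // where the initial centroids go
    "-o", "/tmp/mahout/kmeans/output",
    "-k", "20",                                   // seed with 20 randomly chosen documents
    "-x", "10");                                  // cap the number of iterations
pb.directory(new File("/path/to/mahout"));
pb.redirectErrorStream(true);
Process proc = pb.start();
int exitCode = proc.waitFor();
System.out.println("kmeans finished with exit code " + exitCode);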
So, what are the caveats with the above approach?
- If you are running in a commit-heavy environment, you may not want to run Mahout on every commit. Mahout is designed for batch processing (well, most of it is, anyway) and most of these jobs are designed to run on Hadoop clusters. In order to do that, you would have to modify the paths above, etc., so the output goes to Hadoop’s HDFS, which I’ll leave as an exercise for the reader (the mathematician in me always enjoys saying that!)
- If you are running Solr in a distributed environment, you’re going to have to set things up appropriately on each node. Hopefully, as the Solr Cloud stuff matures, this will become even simpler and we should be able to do some really smart things to make Mahout and Solr work together in a distributed environment. For now, you’re on your own.
In the next posting, I’ll look at how we can more closely hook in Mahout into the indexing and search process. As a teaser, think about how you could use Mahout to classify and cluster large volumes of text and then have that information available for things like faceting, discovery and filtering on the search side.
As always, let me know if you have any questions or comments.
References
- Mahout In Action by Owen and Anil. Manning Publications.
- Various Solr and Lucene books, all linked via LucidWorks.
- http://lucene.apache.org/mahout
- http://cwiki.apache.org/MAHOUT
- Grant’s Blog has a number of articles on Mahout
Comments
Getting Started with Solr & Carrot2 Clustering | Thinknook
[...] can learn and go big!, Integrating Apache Mahout with Solr: An interesting approach combining Mahout’s awesome machine learning capabilities with [...]
Solr and Mahout « More Power Later
[...] lots of helpful information out there on using Mahout with Solr. Grant Ingersoll’s post, Integrating Apache Mahout with Apache Lucene and Solr – Part I (of 3) got me started, but like many of the commenters, I was pining for the missing sequels. Next came [...]
chinna
mahout lucene.vector --dir apache-solr-3.6.0/example/solr/resume_en/data/index/ --field text --maxPercentErrorDocs 0 --output out.vec --idField id --dictOut dict.txt --norm 2
text is the copyfield having term vectors on 'features' field.
Hi,
I am getting following error when executing above command.
lucene.LuceneIterator: There are too many documents that do not have a term vector for text
Exception in thread "main" java.lang.IllegalStateException: There are too many documents that do not have a term vector for text
Could you please help me out what could be the problem?
Thanks
chinna
Nizam
Is there a Part II?
hung
Hi,
Thanks for the post. I tried the command
bin/mahout lucene.vector -dir /home/notroot/solr/tcp/solr/data/index/ -output /tmp/part-out.vec -field text -idField id -dictOut /tmp/dict.out -norm 2
I obtained the error:
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/notroot/mahout1/trunk/examples/target/mahout-examples-0.7-SNAPSHOT-job.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/notroot/mahout1/trunk/examples/target/dependency/slf4j-jcl-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/notroot/mahout1/trunk/examples/target/dependency/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
12/05/18 17:54:16 ERROR lucene.Driver: Exception
org.apache.commons.cli2.OptionException: Unexpected 2 while processing Options
at org.apache.commons.cli2.commandline.Parser.parse(Parser.java:99)
at org.apache.mahout.utils.vectors.lucene.Driver.main(Driver.java:197)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:616)
at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:188)
Could you help me to solve it
Bests,
Hung
Clustering text using Mahout fed with Lucene n-gram termVectors « kopping
[...] http://www.lucidimagination.com/blog/2010/03/16/integrating-apache-mahout-with-apache-lucene-and-sol... [...]
Setting up Hadoop 0.20.203.0, Mahout 0.5 and Lucene 3.5.0 « kopping
[...] about using Lucene to create feature vectors for input to Mahout. This procedure is described in an article by Mr. Grant Ingersoll himself, however it has not been quite out of the box for me. He is not telling it [...]
Selvam
Hi,
Great post. I would like to learn about "Mahout to classify and cluster large volumes of text and then have that information available for things like faceting, discovery and filtering on the search side". Curious to find a article on it. Have you written on it ?
Sid
Hi,
Is it possible to create vectors out of multiple term fields from a Lucene index? The above example specifies
"It uses the title-clustering field. If you want a combination of fields, then you will have to create a single “merged” field containing those fields. Solr’s syntax can make this easy."
Fields which are termed cannot be combined in a single field as each field holds an integer value.
Any pointer will be appreciated?
Thanks,
Sid
Janki
Hi,
I am new to Mahout and Lucene. I want to do clustering of users. I have 7 dimensions (features) in data. I have tried kMeans clustering taking data from csv. Now I want to get data from Lucene. I have one question that while converting lucene documents to vectors, how will it consider dimensions? How should I generate Lucene documents if I want to generate vectors with n dimesions (features)?
Mark Rosenberg
Trying the cookbook example provided by the article with Mahout trunk and Solr 3.4.0. Looks like --field title-clustering doesn't have enough term vectors so I may be running afoul of https://issues.apache.org/jira/browse/MAHOUT-675.
11/11/02 14:16:41 ERROR lucene.LuceneIterator: There are too many documents that do not have a term vector for title-clustering
Exception in thread "main" java.lang.IllegalStateException: There are too many documents that do not have a term vector for title-clustering.
If I use --field text then mahout completes normally and writes 17 vectors. The recommendation to use copyField to accumulate field contents in a new title-clustering field appears to be mandatory if the article's mahout command line is to be used without modification.
Grant Ingersoll
Mahout trunk should now be on Lucene 3.4. In general, if you are replacing the jars, I think you need to make sure they are packaged in to Mahout's Job jars correctly.
Bob Stewart
I have the same problem, seems that Lucene version is out of sync between Solr and Mahout. Question is how exactly do I make them in sync? I have mahout having lucene-core-3.1.0.jar in mahout/lib directory. I have Solr 3.4. I downloaded Lucene 3.4 jar files and replaced lucene jars inside mahout/lib but that did not work (doesnt seem that mahout loads those lucene jars at all). So how to I make sure they use the same lucene version? I am somewhat new to java/linux world.
Grant Ingersoll
You should be fine upgrading Mahout's version. In fact, we should do it in Mahout. Feel free to open an issue there. Although, the Java 7 issue and the 3.4.0 issue are separate. The 3.4.0 issue was due to a fsync issue in Lucene 3.3.0
Mark Rosenberg
Hi Grant,
Thanks for the quick response! We seem to be in an awkward situation WRT Mahout and Solr Lucene version dependencies. I'm using Mahout 0.6 snapshot, which has a Lucene 3.3.0 dependency. Due to Oracle Java 7 sabotage, Lucene users are being advised to upgrade to 3.4.0. Do I have an alternative to using the Mahout 0.5 release?
Grant Ingersoll
Hi Mark,
the issue here is likely a version mismatch between the Lucene version in Mahout and the Lucene version you created your index with. If you sync those up, you should be fine.
Mark Rosenberg
I'm having some trouble getting this to work with my own data. I issue the following command:
mahout lucene.vector --dir /home/markr/shgs/apache-solr-3.4.0/example/solr/data/index/ --output /tmp/part-out.vec --field content_encoded --idField id --dictOut /tmp/dict.out --norm 2
My intent is to generate term vectors for the content_encoded field whose schema.xml entry has the termVectors="true" attribute setting. There is also a field named 'id'. My data was imported into a sqlite3 db, and id is 'not null', but content_encoded may be null. When I run, I get the SLF4J multiple binding warning (just a warning?), and then the following exception:
Exception in thread "main" org.apache.lucene.index.CorruptIndexException: unrecognized format -3 in file "_b.fnm"
at org.apache.lucene.index.FieldInfos.read(FieldInfos.java:351)
at org.apache.lucene.index.FieldInfos.<init>(FieldInfos.java:71)
at org.apache.lucene.index.SegmentCoreReaders.<init>(SegmentCoreReaders.java:72)
at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:114)
at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:92)
at org.apache.lucene.index.DirectoryReader.<init>(DirectoryReader.java:113)
at org.apache.lucene.index.ReadOnlyDirectoryReader.<init>(ReadOnlyDirectoryReader.java:29)
at org.apache.lucene.index.DirectoryReader$1.doBody(DirectoryReader.java:81)
at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:750)
at org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:75)
at org.apache.lucene.index.IndexReader.open(IndexReader.java:428)
at org.apache.lucene.index.IndexReader.open(IndexReader.java:288)
at org.apache.mahout.utils.vectors.lucene.Driver.dumpVectors(Driver.java:84)
at org.apache.mahout.utils.vectors.lucene.Driver.main(Driver.java:250)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:616)
at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:188)
Advise on how to debug this problem would be greatly appreciated.
Mark
Moshe Lichman
Hi, great post.
Have a Q though - I'm running the MAHOUT through the Eclipse and I created the vector from my Lucene index. Two file were created:
1. The vector file.
2. The Dict file.
When running the FuzzyKMeans on the vector file - I got Exception while the job was parsing it - NotANumber Exception - for the vec file is a 'compiled' file. Any ideas how to make it work?
Matthew Sacks
Hi Grant,
If you eventually wanted to dump results into Solritas (VelocityRepsonseWriter), what would the flow of data need to look like? Raw Data->Lucene->Mahout->Solr?
Thanks,
Matthew
Joyce Babu
Thanks for the interesting post. Will surely keep checking for the second part, even though the chances seem slim.
David
Hey,
Is there going to a part 2 and 3 of this series it very interesting
Regards,
Dave
Lucid Imagination » Apache Mahout 0.4 Released
[...] and will also be releasing Part II of my series on integrating Lucene/Solr with Mahout (part I is here) shortly after I get [...]
Frank Scholten
@Khoa: That's is a bug. See https://issues.apache.org/jira/browse/MAHOUT-501. If you rename conf/lucenevector.props to conf/lucene.vector.props it will work.
Khoa
Hi Grant,
Thanks for an interesting topic!
I got the warning "No lucene.vector.props found on classpath..." when running the command below. Can you please advise? Thanks!!!
/bin/mahout lucene.vector –dir /example/solr/data/index/ –output /tmp/foo/part-out.vec –field title-clustering –idField id –dictOut /tmp/foo/dict.out –norm 2
WARNING: No lucene.vector.props found on classpath, will use command-line argume
nts only
Aug 5, 2010 11:17:40 AM org.slf4j.impl.JCLLoggerAdapter error
SEVERE: Exception
org.apache.commons.cli2.OptionException: Unexpected 2 while processing Options
at org.apache.commons.cli2.commandline.Parser.parse(Parser.java:99)
at org.apache.mahout.utils.vectors.lucene.Driver.main(Driver.java:125)
Matt Mitchell
Thanks Grant. Some really interesting possibilities with this combination. I'm really interested in how you can integrate Solr with Mahout's clustering. Can't wait to read the next part in the series!
Matt
Grant Ingersoll
I've been working on it, slowly but surely, but unfortunately other more pressing issues have gotten in the way. Hope to get it out soon.
Amit
Hi Grant,
Thanks for great topic...
We are waiting for next part of this topic......
Could please tell us we can see that...
mohamad
how to apply the clustring algorithms on the vectors after being formed and how to convert vectors to sparse vectors :programming wise of course