Congrats to the Apache Mahout team on their 0.4 release. It brings lots of new features and bug fixes, including online learning algorithms (Stochastic Gradient Descent), more consistent command line interfaces, new clustering algorithms, and many speed improvements. I’ll be talking about Apache Lucene/Solr and Mahout at ApacheCon this week, and I’ll be releasing Part II of my series on integrating Lucene/Solr with Mahout (Part I is here) shortly after I get back.

Here’s the official announcement:

We’re pleased to announce we’ve finally completed the 0.4 release. It will
begin showing up on mirrors shortly, so check back if you can’t find it just
yet from the usual spot:

The complete news item is as follows:

We are pleased to announce release 0.4 of Mahout. Virtually every corner of
the project has changed, and significantly, since 0.3. Developers are
invited to use and depend on version 0.4 even as yet more change is to be
expected before the next release. Highlights include:

– Model refactoring and CLI changes to improve integration and consistency
– New ClusterEvaluator and CDbwClusterEvaluator offer new ways to
evaluate clustering effectiveness
– New Spectral Clustering and MinHash Clustering (still experimental)
– New VectorModelClassifier allows any set of clusters to be used for
classification
– Map/Reduce job to compute the pairwise similarities of the rows of a
matrix using a customizable similarity measure
– Map/Reduce job to compute the item-item-similarities for item-based
collaborative filtering
– RecommenderJob has been evolved to a fully distributed item-based
recommender
– Distributed Lanczos SVD implementation
– More support for distributed operations on very large matrices
– Easier access to Mahout operations via the command line
– New HMM-based sequence classification from GSoC (currently a
sequential version only and still experimental)
– Sequential logistic regression training framework
– New SGD classifier
– Experimental new type of NB classifier, and feature reduction options
for existing one
– New vector encoding framework for high speed vectorization without a
pre-built dictionary
– Additional elements of supervised model evaluation framework
– Promoted several pieces of old Colt framework to tested status (QR
decomposition, in particular)
– Can now save random forests and use them to classify new data
– Many, many small fixes, improvements, refactorings and cleanup

Details on what’s included can be found in the release notes.
Downloads are available from the Apache mirrors.
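
One highlight worth a quick illustration is the vector encoding framework that works "without a pre-built dictionary". The general idea behind that kind of encoder is feature hashing: each token is hashed straight to a vector index, so there is no need for a first pass over the corpus to build a term-to-index map. Below is a minimal sketch of that idea in plain Java. It is not Mahout's API; the class name `HashedEncoderSketch`, the `encode` method, and the choice of an FNV-1a hash are all my own assumptions for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration of feature hashing; not part of Mahout.
public class HashedEncoderSketch {

    // 32-bit FNV-1a hash, picked here only because it is short to write down.
    static int fnv1a(String s) {
        int h = 0x811C9DC5;
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            h ^= (b & 0xff);
            h *= 16777619;
        }
        return h;
    }

    // Adds 1.0 at the hashed index of each token.
    // Tokens that collide simply share a slot, which is the usual trade-off
    // of hashing into a fixed number of dimensions.
    static double[] encode(String[] tokens, int dims) {
        double[] v = new double[dims];
        for (String t : tokens) {
            int idx = Math.floorMod(fnv1a(t), dims);
            v[idx] += 1.0;
        }
        return v;
    }

    public static void main(String[] args) {
        String[] tokens = "the quick brown fox jumps over the lazy dog".split(" ");
        System.out.println(Arrays.toString(encode(tokens, 16)));
    }
}
```

Because the index comes from the hash alone, new documents can be vectorized as they arrive, which is what makes this approach a natural fit for online learners such as SGD. The cost is that collisions are possible and the mapping from index back to term is lost.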