Apache Solr 5.0 Highlights

The much anticipated Apache Lucene and Solr 5.0 was just released. It comes packed with tons of new features, stability improvements and bug fixes.

Usability improvements

A lot of effort has gone into making Solr more usable, mostly by introducing APIs and hiding implementation details from users who don’t need to know them. Solr 4.10 shipped with scripts to start, stop and restart a Solr instance; 5.0 takes that further in terms of what those scripts can do. For instance, the scripts now copy a configset on collection creation so that the original isn’t changed. There’s also a script to index documents, as well as the ability to delete collections. As an example, this is all you need to do to start SolrCloud, index lucidworks.com, browse through what’s been indexed, and clean up the collection:

bin/solr start -e cloud -noprompt
bin/post -c gettingstarted https://lucidworks.com
open http://localhost:8983/solr/gettingstarted/browse
bin/solr delete -c gettingstarted

Another important thing to note for new users is that Solr no longer has the default collection1; instead it comes with multiple example configsets and data.

Managing Solr

Like everything else, managing Solr, on both the application configuration and systems fronts, has only gotten easier. It’s now possible to define a set of parameters in Solr and use them at request time. This is similar to the defaults section in solrconfig.xml, but what’s really good is that these param sets can be modified live with an HTTP API, without requiring a core or collection reload. In addition, request handler parameters can be shared through initParams, which also supports overriding and extending params.
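To make the param-set idea concrete, here is a minimal sketch that builds (but does not send) the two HTTP requests involved: one that stores a param set, and one that uses it at query time. The host, the collection name and the param set name myQueries are assumptions for illustration; the /config/params endpoint and the useParams request parameter are how I understand the 5.0 API to work, so check the reference guide before relying on the details.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

SOLR = "http://localhost:8983/solr"  # assumption: default local Solr address
COLLECTION = "gettingstarted"        # assumption: collection from the example above


def param_set_request(name, params):
    """Build (but do not send) a POST that creates or updates a param set."""
    body = json.dumps({"set": {name: params}}).encode("utf-8")
    return Request(
        f"{SOLR}/{COLLECTION}/config/params",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def query_url(q, param_set):
    """Build a /select URL that pulls in a stored param set via useParams."""
    query = urlencode({"q": q, "useParams": param_set})
    return f"{SOLR}/{COLLECTION}/select?{query}"


# "myQueries" is a hypothetical param set name.
req = param_set_request("myQueries", {"defType": "edismax", "rows": 5})
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
print(query_url("solr", "myQueries"))
# To actually send the update: urllib.request.urlopen(req)
```

Because nothing is sent until you call urlopen, you can inspect the payload first and run it against a test cluster when you are happy with it.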

The Schema API hides a lot of implementation details under the covers, making it really easy for users to add field types and dynamic fields. The Config API allows editing common solrconfig values while leaving the implementation details (read: XML) behind the scenes.
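As a rough illustration of what these two APIs look like from the client side, the sketch below builds JSON command payloads for the Schema API and the Config API. The field type name text_simple, the *_simple dynamic field and the chosen property value are made up for the example, and it assumes the collection uses a managed (API-editable) schema.

```python
import json

SOLR = "http://localhost:8983/solr"  # assumption: default local Solr address
COLLECTION = "gettingstarted"        # assumption: an existing collection

SCHEMA_URL = f"{SOLR}/{COLLECTION}/schema"
CONFIG_URL = f"{SOLR}/{COLLECTION}/config"


def add_field_type(name):
    """Schema API command: define a simple tokenized text type (hypothetical name)."""
    return {
        "add-field-type": {
            "name": name,
            "class": "solr.TextField",
            "analyzer": {"tokenizer": {"class": "solr.StandardTokenizerFactory"}},
        }
    }


def add_dynamic_field(pattern, field_type):
    """Schema API command: map every field matching `pattern` to `field_type`."""
    return {"add-dynamic-field": {"name": pattern, "type": field_type, "stored": True}}


def set_property(name, value):
    """Config API command: override one common solrconfig value."""
    return {"set-property": {name: value}}


# Each payload would be POSTed as JSON to the URL printed next to it.
for url, payload in [
    (SCHEMA_URL, add_field_type("text_simple")),
    (SCHEMA_URL, add_dynamic_field("*_simple", "text_simple")),
    (CONFIG_URL, set_property("updateHandler.autoSoftCommit.maxTime", 15000)),
]:
    print(url, json.dumps(payload))
```

Note that no XML is involved anywhere: the client only ever sees small JSON commands.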

On the Collections API front, the new BALANCESHARDUNIQUE action enables even distribution of replica properties, whereas other APIs have become more robust thanks to numerous bug fixes.
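For reference, a BALANCESHARDUNIQUE call is just another Collections API URL. The sketch below only assembles that URL; the host and collection name are assumptions, and preferredLeader is used as the example property to spread across nodes.

```python
from urllib.parse import urlencode

SOLR = "http://localhost:8983/solr"  # assumption: default local Solr address


def balance_shard_unique_url(collection, prop):
    """Build the Collections API URL that evenly distributes a replica property."""
    params = {
        "action": "BALANCESHARDUNIQUE",
        "collection": collection,
        "property": prop,
    }
    return f"{SOLR}/admin/collections?" + urlencode(params)


print(balance_shard_unique_url("gettingstarted", "preferredLeader"))
```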

Solr 5.0 now logs transaction log replay status, so users are no longer left wondering what’s going on during a time-consuming node restart. It also optionally supports slow request logging.

For all the systems folks and DevOps engineers who deploy and manage Solr clusters, Solr 5.0 now ships with scripts to support installing and running Solr as a service on *nix platforms.

Scalability

Solr 5.0 now scales better with a large number of collections by changing the way cluster state is stored. Until the previous version, the entire state was written to a single file, which was then watched by every node and updated by each of them on every change. Starting with this release, each collection has its own cluster state by default, which means that every node only ends up watching the cluster states that it cares about.
Improved Solr-ZooKeeper communication, better default timeouts and speeding up of Overseer operations also help SolrCloud scale more easily and reliably.
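To see why splitting the cluster state helps, here is a toy model (not Solr code) that counts how many state-change notifications a single node receives under each layout. The ZooKeeper paths follow the naming I believe Solr uses (/clusterstate.json for the shared file, /collections/<name>/state.json per collection); the collection names and counts are invented.

```python
SHARED_STATE = "/clusterstate.json"


def state_path(collection):
    """ZooKeeper path of a collection's own state file."""
    return f"/collections/{collection}/state.json"


def watched_paths(hosted, per_collection):
    """Paths a node watches: one shared file, or one file per hosted collection."""
    if per_collection:
        return {state_path(name) for name in hosted}
    return {SHARED_STATE}


def notifications(all_collections, hosted, per_collection):
    """Count watch notifications a node gets if every collection changes once."""
    watched = watched_paths(hosted, per_collection)
    count = 0
    for name in all_collections:
        changed = state_path(name) if per_collection else SHARED_STATE
        if changed in watched:
            count += 1
    return count


collections = [f"coll{i}" for i in range(1000)]  # hypothetical cluster
hosted = ["coll1", "coll42"]                     # what this node serves

print(notifications(collections, hosted, per_collection=False))  # 1000
print(notifications(collections, hosted, per_collection=True))   # 2
```

With a single shared file, every node is woken up by every change anywhere in the cluster; with per-collection state, a node hosting two collections only hears about those two.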

Stability

Solr now allows for configuration that controls how Solr runs under the covers. The replication handler now has an option to throttle replication bandwidth consumption, to keep it from using up all of the internal bandwidth. timeAllowed, a parameter that has existed in Solr for a while, now also cuts off requests during the query-expansion stage. Earlier, a wildcard query could run away and consume all the CPU despite a timeAllowed value having been supplied, because the limit was only respected during document collection.
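As a small sketch of how a client might use timeAllowed, the code below builds a query URL with a time budget in milliseconds and checks a JSON response for the partialResults flag that Solr sets in the response header when the budget is exceeded. The host, collection name and the sample response dictionaries are assumptions for illustration, not captured output.

```python
from urllib.parse import urlencode

SOLR = "http://localhost:8983/solr"  # assumption: default local Solr address


def timed_query_url(collection, q, time_allowed_ms):
    """Build a /select URL whose processing is capped at time_allowed_ms."""
    params = {"q": q, "timeAllowed": time_allowed_ms, "wt": "json"}
    return f"{SOLR}/{collection}/select?" + urlencode(params)


def is_partial(response):
    """True if the parsed JSON response says the time budget was exceeded."""
    return bool(response.get("responseHeader", {}).get("partialResults", False))


# An expensive leading-wildcard query, limited to 500 ms.
print(timed_query_url("gettingstarted", "text:*solr*", 500))

# Hand-written stand-ins for parsed responses (not real Solr output).
print(is_partial({"responseHeader": {"status": 0, "partialResults": True}}))
print(is_partial({"responseHeader": {"status": 0}}))
```

Checking the flag matters because a cut-off request still comes back with a success status; the results are simply incomplete.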

Distributed IDF

A feature that was first envisioned almost 5 years ago and that many contributors spent a lot of time on, support for distributed IDF has also been released with Solr 5.0. It supports 4 implementations out of the box, including both local and global stats. By default it is set to use LocalStats, i.e. it does not use global IDF values, but another implementation can be selected by setting the value in solrconfig.xml.

Stats component

The stats component in Solr now allows statistics to be computed for arbitrary functions. It also allows stats to hang off pivots; in other words, each facet can be asked to compute stats by using tags.
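Here is a hedged sketch of the request parameters for both features: stats over a function, and stats attached to a pivot facet through a tag. The field names (price, popularity, cat, inStock) and the tag name piv1 are assumptions for the example; the {!func}, {!tag=...} and {!stats=...} local params are the mechanism as I understand it.

```python
from urllib.parse import urlencode


def function_stats_params(func):
    """Params asking the stats component for statistics over a function."""
    return [
        ("q", "*:*"),
        ("rows", "0"),
        ("stats", "true"),
        ("stats.field", "{!func}" + func),
    ]


def pivot_stats_params(stat_field, pivot_fields, tag="piv1"):
    """Params that tag a stats field and hang it off a pivot facet."""
    return [
        ("q", "*:*"),
        ("rows", "0"),
        ("stats", "true"),
        ("stats.field", "{!tag=%s}%s" % (tag, stat_field)),
        ("facet", "true"),
        ("facet.pivot", "{!stats=%s}%s" % (tag, ",".join(pivot_fields))),
    ]


# Hypothetical field names throughout.
print(urlencode(function_stats_params("product(price,popularity)")))
print(urlencode(pivot_stats_params("price", ["cat", "inStock"])))
```

The tag is what ties the two parameters together: stats.field declares it, and facet.pivot refers to it so that each pivot bucket carries its own stats.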

And more

The new DateRangeField allows indexing date ranges, especially multi-valued ones. Also, spatial fields that used to require degrees to compute distance now also accept kilometers and miles, making them easier to use.

The MoreLikeThis query parser helps in retrieving documents similar to a given document id. Though it uses the pre-existing MoreLikeThis logic from Lucene, it supports running in SolrCloud mode.
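A query using the new parser is an ordinary q parameter with local params. The sketch below builds one; the document id 1234 and the field name are placeholders, and qf is the local param I understand the parser uses to pick the similarity field.

```python
from urllib.parse import urlencode

SOLR = "http://localhost:8983/solr"  # assumption: default local Solr address


def mlt_query(doc_id, field):
    """Build a q value asking for documents similar to doc_id on `field`."""
    return "{!mlt qf=%s}%s" % (field, doc_id)


def mlt_url(collection, doc_id, field):
    """Build a /select URL that runs the MoreLikeThis query parser."""
    return f"{SOLR}/{collection}/select?" + urlencode({"q": mlt_query(doc_id, field)})


print(mlt_query("1234", "name"))                  # {!mlt qf=name}1234
print(mlt_url("gettingstarted", "1234", "name"))
```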

The Blob storage API allows users to upload custom handler jars into Solr and then use the Config API to register those, making it really easy to register and distribute custom handlers in a Solr cluster.

It doesn’t end here: SolrJ now has first-class support for the Collections API. Also, Tika has been updated to 1.7, which translates to support for parsing Outlook PST and MATLAB files.

Jepsen tests

Between the last release of Solr and this one, we at Lucidworks, especially Shalin Shekhar Mangar, worked on testing SolrCloud using the Jepsen tests. The results came back good and went on to show that Solr isn’t just a feature-rich search engine but also one that can be relied upon.

New identity

Though it’s not part of the change log, Solr also got a whole new identity, which included but wasn’t restricted to an awesome logo and website.

What’s next?

From the looks of it, Solr 5.1 already has interesting features and fixes committed, including spatial 2D heat-map faceting on RPT fields, the REBALANCESHARDS collections API and a lot more.

Lucene 5.0

And while this post has already become a long one, it’s important to realize that Solr, with every release, also packs in all the goodness of Lucene improvements by default. 5.0 in particular brings, among other things, stronger index safety, reduced heap usage, a suggester that works from multi-valued fields, and auto-IO-throttling in ConcurrentMergeScheduler. You can read more about Lucene 5.0 highlights in the blog by Mike McCandless here.

Try it now with the Official Reference Guide for Solr 5.0

If any of the above features excite you and you haven’t tried out Solr 5.0 yet, try it out now. You can download it from here.

To help you try out all the new features, the version-specific official Solr reference guide was, for the first time, released along with the code. You can download it here.
