The first alpha release of Solr 4 is quickly approaching, bringing powerful new features that enhance existing Solr-powered applications and enable new ones by further blurring the line between full-text search and NoSQL.
The largest set of features goes by the development code name “SolrCloud” and brings easy scalability to Solr. Distributed indexing with no single point of failure has been designed from the ground up to support near real-time (NRT) search and NoSQL features such as real-time get, optimistic locking, and durable updates.
We’ve incorporated Apache ZooKeeper, the rock-solid distributed coordination project designed to avoid issues like split-brain syndrome, which tend to plague hand-rolled solutions. ZooKeeper holds the Solr configuration and the cluster metadata, such as hosts, collections, shards, and replicas, which are core to providing an elastic search capability.
When a new node is brought up, it will automatically be assigned a role such as becoming an additional replica for a shard. A bounced node can do a quick “peer sync” by exchanging updates with its peers in order to bring itself back up to date. New nodes, or those that have been down too long, recover by replicating the whole index of a peer while concurrently buffering any new updates.
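The two recovery paths can be sketched with a toy model. This is only an illustration of the idea, not Solr's implementation: the `Node` class, the `recover` function, and the `PEER_SYNC_WINDOW` threshold are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass, field

# Hypothetical limit on how many missed updates a peer sync will bridge.
# The real threshold is not stated in this post.
PEER_SYNC_WINDOW = 100


@dataclass
class Node:
    """Toy replica: an index plus a log of the updates it has applied."""
    index: dict = field(default_factory=dict)  # doc id -> document
    log: list = field(default_factory=list)    # (version, doc_id, doc), oldest first

    def apply(self, version, doc_id, doc):
        self.index[doc_id] = doc
        self.log.append((version, doc_id, doc))

    def last_version(self):
        return self.log[-1][0] if self.log else 0


def recover(node, peer, incoming=()):
    """Bring `node` up to date from `peer` and return the strategy used.

    `incoming` stands in for updates that arrive while recovery is running.
    """
    missed = [u for u in peer.log if u[0] > node.last_version()]
    if node.log and len(missed) <= PEER_SYNC_WINDOW:
        # Bounced node: fetch only the updates it missed.
        strategy = "peer_sync"
        for update in missed:
            node.apply(*update)
    else:
        # New node, or one that was down too long: copy the whole index.
        strategy = "full_replication"
        node.index = dict(peer.index)
        node.log = list(peer.log)
    # Updates buffered during recovery are applied last.
    for update in incoming:
        node.apply(*update)
    return strategy
```

A node with an empty log always takes the full-replication path here, while a node that is only a few versions behind its peer replays just the missing updates.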
An update can be sent to any node in the cluster, and it’s automatically forwarded to the correct node and immediately replicated to a number of other nodes to enable fault tolerance, high availability, and query scalability. Likewise, queries may be sent to any node in a cluster and they will automatically be routed to the correct nodes and load balanced across replicas. This single-document push model of replication fits in well with the near real-time support that is exposed via Solr’s softCommit to quickly make updates visible to searches.
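As a rough mental model of this routing, consider the toy cluster below. It is a sketch under stated assumptions: the `ToyCluster` class is hypothetical, and the MD5-modulo hashing and round-robin replica selection are stand-ins chosen for brevity, not Solr's actual routing or load-balancing logic.

```python
import hashlib
import itertools


class ToyCluster:
    """In-memory stand-in for a sharded, replicated collection."""

    def __init__(self, num_shards, replicas_per_shard):
        # shards[i] is the list of replica stores (plain dicts) for shard i.
        self.shards = [
            [{} for _ in range(replicas_per_shard)] for _ in range(num_shards)
        ]
        # One round-robin counter per shard, used to spread reads.
        self._next_replica = [
            itertools.cycle(range(replicas_per_shard)) for _ in range(num_shards)
        ]

    def shard_for(self, doc_id):
        # Stand-in hash: any stable hash of the id would do for the sketch.
        digest = hashlib.md5(doc_id.encode("utf-8")).digest()
        return int.from_bytes(digest[:4], "big") % len(self.shards)

    def update(self, doc):
        """Accept an update 'at any node': route by id, then copy to every replica."""
        shard = self.shard_for(doc["id"])
        for replica in self.shards[shard]:
            replica[doc["id"]] = doc
        return shard

    def get(self, doc_id):
        """Route a lookup to the owning shard and pick the next replica in turn."""
        shard = self.shard_for(doc_id)
        replica_index = next(self._next_replica[shard])
        return self.shards[shard][replica_index].get(doc_id), replica_index
```

Because each document is pushed to every replica of its shard as it arrives, any replica can answer a read for it, which is what makes spreading queries across replicas possible.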
The SolrCloud wiki page is a good place to start learning more about Solr’s new distributed capabilities.
Solr 4 has more NoSQL features for applications wishing to use it as a primary data store, including:
- Update durability – A transaction log ensures that even uncommitted documents are never lost.
- Real-time Get – The ability to retrieve the latest version of a document, without the need to commit or open a new searcher.
- Versioning and Optimistic Locking – Combined with real-time get, this allows read-update-write functionality that ensures no conflicting changes were made concurrently by other clients.
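The read-update-write cycle that versioning enables can be illustrated with a small in-memory model. Everything here (`ToyStore`, `VersionConflict`, `increment`) is a hypothetical name invented for the sketch; it shows the general optimistic-locking pattern rather than Solr's request syntax.

```python
class VersionConflict(Exception):
    """Raised when a write is based on a stale version of the document."""


class ToyStore:
    def __init__(self):
        self._docs = {}   # doc id -> (document, version)
        self._clock = 0   # monotonically increasing version counter

    def get(self, doc_id):
        """Return a copy of the latest document and its version.

        Raises KeyError if the document does not exist.
        """
        doc, version = self._docs[doc_id]
        return dict(doc), version

    def update(self, doc_id, doc, expected_version=None):
        """Write `doc`; if `expected_version` is given, it must match the stored one."""
        current = self._docs.get(doc_id)
        current_version = current[1] if current else None
        if expected_version is not None and expected_version != current_version:
            raise VersionConflict(
                f"{doc_id}: expected {expected_version}, found {current_version}"
            )
        self._clock += 1
        self._docs[doc_id] = (dict(doc), self._clock)
        return self._clock


def increment(store, doc_id, field, max_retries=5):
    """Read-update-write loop: retry if another client changed the document.

    Assumes the document already exists in the store.
    """
    for _ in range(max_retries):
        doc, version = store.get(doc_id)
        doc[field] = doc.get(field, 0) + 1
        try:
            store.update(doc_id, doc, expected_version=version)
            return doc[field]
        except VersionConflict:
            continue  # someone else wrote first; re-read and try again
    raise RuntimeError("too many conflicting updates")
```

The client never takes a lock. It simply states which version its change was based on, and the store rejects the write if that version is no longer current.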
There are many other features coming in Solr 4, such as
- Pivot Faceting – Multi-level or hierarchical faceting where the top constraints for one field are found for each top constraint of a different field.
- Pseudo-fields – The ability to alias fields, or to add metadata along with returned documents, such as function query values and results of spatial distance calculations.
- A spell checker implementation that can work directly from the main index instead of creating a sidecar index.
- Pseudo-Join functionality – The ability to select a set of documents based on their relationship to a second set of documents.
- Function query enhancements including conditional function queries and relevancy functions.
- New update processors to facilitate modifying documents prior to indexing.
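To make the pivot faceting idea from the list above concrete, here is a minimal sketch that computes a two-level pivot over a handful of in-memory documents. The `pivot_facet` function, the field names, and the sample data are all made up for illustration; Solr computes this from the index rather than by scanning documents.

```python
from collections import Counter


def pivot_facet(docs, outer, inner, limit=2):
    """For each top value of `outer`, find the top values of `inner`."""
    outer_counts = Counter(d[outer] for d in docs if outer in d)
    result = []
    for outer_value, count in outer_counts.most_common(limit):
        # Count `inner` values only among documents matching this outer value.
        inner_counts = Counter(
            d[inner] for d in docs if d.get(outer) == outer_value and inner in d
        )
        result.append(
            {
                "value": outer_value,
                "count": count,
                "pivot": inner_counts.most_common(limit),
            }
        )
    return result


# Hypothetical sample documents.
docs = [
    {"cat": "book", "inStock": True},
    {"cat": "book", "inStock": True},
    {"cat": "book", "inStock": False},
    {"cat": "music", "inStock": False},
    {"cat": "music", "inStock": False},
    {"cat": "film", "inStock": True},
]

facets = pivot_facet(docs, "cat", "inStock")
```

With this sample data, `facets` holds an entry for `book` (3 documents, 2 in stock and 1 not) and one for `music` (2 documents, none in stock); `film` falls outside the top two categories.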
We’re not done yet! There are other features already on the drawing board for Solr 4.x, including
- Updatable documents – The ability to add fields to an existing document without having to send in the complete document again.
- More dynamic schema – The ability to add new fields on the fly.
- Enhanced elasticity – the ability to split existing shards in a cluster
- Rack awareness
- Index and shard aliases
The list of improvements in Solr 4 is too long to describe in full here, so I’ll leave you with some parting screenshots of the new admin pages.