TL;DR
We tested SolrCloud against bridge, random transitive, and fixed transitive network partitions using Jepsen and found no data loss for either compare-and-set operations or inserts. One major and a few minor bugs were found. Most have been fixed in Solr 4.10.2 and the others will be fixed soon. We’re working on writing better Jepsen tests for Solr and on integrating them into Solr’s build servers. This is a process and it’s not over yet.

Ever since Kyle Kingsbury, better known by his Twitter handle @aphyr, started publishing his excellent series of blog posts about network partitions and how distributed systems behave under them, users of Apache Solr have been asking how Solr performs in the face of such partitions. Many distributed systems tested by Jepsen failed with catastrophic data loss, non-linearizable histories, cluster lockups and “instantaneous” leader elections which weren’t so instantaneous. In some cases, the reasons behind the failures were simple bugs; in others, there were architectural problems where distributed systems were written without an actual peer-reviewed distributed consensus algorithm backing them.

When designing Solr’s distributed capabilities, we had good reasons to choose Apache ZooKeeper. It is based on sound theory, and the amount of testing and scale that Apache ZooKeeper has been put through, while being used by Apache Hadoop, Apache HBase and others, is exceptional in the open source world. When Jepsen tested Apache ZooKeeper and recommended it as a mature, well designed and battle-tested system, it was a strong validation of our design choices, but we still had our doubts. This post and its associated work aim to answer those doubts.

What is Jepsen and why should I care?

Jepsen is a tool which simulates network partitions and tests how distributed data stores behave under them. Put simply, Jepsen cuts off one or more nodes from talking to other nodes in a cluster while continuing to insert, update or look up data, both during the partition and after it heals, to find out whether the data store loses data, returns inconsistent data or becomes unavailable.

Why is this important? Because networks are unreliable. And it’s not just the network; garbage collection pauses in the JVM or heavy activity by your neighbour on a shared cloud server can also manifest as slowdowns which are virtually indistinguishable from a network partition. Even in the best managed data centers, things go wrong: disks fail, switches malfunction, power supplies get shorted out and RAM modules die. A distributed system that runs on a large scale should strive to work around such issues as much as possible.

Jepsen works by setting up the data store under test on 5 different hosts (typically Linux Containers on a single host for simplicity). It creates a client for the data store under test for each of the 5 nodes, and each client sends requests to its own node. It also creates one or more special clients, each called a “Nemesis”, which instead of talking to the data store wreak havoc in the cluster by, for example, cutting links between nodes using iptables. Then it proceeds to make requests concurrently against different nodes while alternately partitioning and healing the network. At the end of the test run, it heals the cluster, waits for the cluster to recover and then verifies whether the intermediate and final state of the system is as expected. It’s actually more complicated but this is enough for the purposes of this discussion.
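The partitioning step can be pictured as turning a “grudge” (a map from each node to the peers it should stop hearing from) into firewall rules. Below is a minimal sketch in Python, assuming inbound traffic is dropped with a plain iptables INPUT rule and that healing simply flushes that chain. The function names are illustrative and are not part of Jepsen, whose actual commands may differ.

```python
def partition_commands(grudge):
    """Map each node to the shell commands that make it ignore its grudged peers.

    `grudge` is {node: set_of_peers_to_drop}. Hypothetical helper, not Jepsen's API.
    """
    return {
        node: [f"iptables -A INPUT -s {peer} -j DROP" for peer in sorted(peers)]
        for node, peers in grudge.items()
    }


def heal_commands(nodes):
    """Map each node to the command that removes all INPUT rules (heals the partition)."""
    return {node: ["iptables -F INPUT"] for node in nodes}


if __name__ == "__main__":
    grudge = {"n1": {"n4", "n5"}, "n4": {"n1"}}
    for node, commands in partition_commands(grudge).items():
        for command in commands:
            print(node, "$", command)
```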

SolrCloud Architecture in a nutshell

Before we dig into the Jepsen testing, it helps to understand the SolrCloud architecture a bit.

A SolrCloud cluster holds one or more distributed indexes which are called Collections. Each Collection is divided into shards (to increase write capacity) and each shard has one or more replicas (to increase query capacity). One replica from each shard is elected as a leader, who performs the additional task of adding a ‘version’ to each update before streaming it to available replicas. This means that write traffic for a particular shard hits the shard’s leader first and is then synchronously replicated to all available replicas. One Solr node (a JVM instance) may host a few replicas belonging to different shards or even different collections.

All nodes in a SolrCloud cluster talk to an Apache ZooKeeper ensemble (an odd number of nodes, typically 3 or more for a production deployment). Among other things, ZooKeeper stores:

  1. The cluster state of the system (details of collections, shards, replicas and leaders)
  2. The set of live nodes at any given time (determined by heartbeat messages sent by Solr nodes to ZooKeeper)
  3. The state of each replica (active, recovering or down)
  4. Leader election queues (a queue of live replicas for each shard such that the first in the list attempts to become the leader assuming it fulfils certain conditions)
  5. Configurations (schema etc.) which are shared by each replica in a collection.

 

Each Solr replica keeps a Lucene index (partly in memory and partly on disk for performance) and a write-ahead transaction log. An update request to SolrCloud is first written to the transaction log, then to the Lucene index and finally, if the replica is a leader, streamed synchronously to all available replicas of the same shard. A commit command flushes the Lucene indexes to disk and makes new documents visible to searchers. Therefore, searches are eventually consistent. Real time “gets” of documents can be done at any time using the document’s unique key, returning content from either the Lucene index (for committed docs) or the transaction log (for uncommitted docs). Such real time “gets” are always answered by the leader for consistency.
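The difference between a search and a real time “get” can be illustrated with a toy model. This is a sketch only: the class and method names are made up for illustration, the Lucene index write is folded into the commit step, and nothing here reflects Solr’s actual classes.

```python
class ToyReplica:
    """Toy model of one replica: a transaction log plus a committed, searchable view."""

    def __init__(self, followers=()):
        self.tlog = {}                    # uncommitted updates, keyed by document id
        self.searchable = {}              # committed documents visible to searches
        self.followers = list(followers)  # non-empty only when this replica is the leader

    def update(self, doc_id, doc):
        self.tlog[doc_id] = doc           # written to the transaction log first
        for follower in self.followers:   # a leader then replicates synchronously
            follower.update(doc_id, doc)

    def commit(self):
        """Make everything logged so far visible to searches."""
        self.searchable.update(self.tlog)
        self.tlog.clear()

    def search(self, doc_id):
        return self.searchable.get(doc_id)

    def realtime_get(self, doc_id):
        """Prefer the uncommitted copy, fall back to the committed one."""
        if doc_id in self.tlog:
            return self.tlog[doc_id]
        return self.searchable.get(doc_id)


if __name__ == "__main__":
    follower = ToyReplica()
    leader = ToyReplica(followers=[follower])
    leader.update("doc1", {"values": [1]})
    print(leader.search("doc1"))        # not visible to searches before a commit
    print(leader.realtime_get("doc1"))  # visible to a real time get
```

In this toy, a search only sees a document after `commit()`, while `realtime_get` sees it as soon as the update returns, which is why the compare-and-set client described later can avoid explicit commits.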

If a leader dies, a new leader is elected from among the ‘live’ replicas as long as the last published state of the replica was active. The selected replica syncs from all other live replicas (in both directions) to make sure that it has the latest updates among all replicas in the cluster before becoming the leader.

If a leader is not able to send an update to a replica then it updates ZooKeeper to publish the replica as inactive and at the same time spawns a background thread to ask the replica to recover. Both of these actions together make sure that a replica which loses an update neither becomes a leader nor receives traffic from other Solr nodes and smart clients before recovering from a leader node.
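The eligibility rule described in the two paragraphs above can be sketched as a small function. This is an illustration under simplifying assumptions (it ignores the sync step and any other conditions Solr checks), and the names are hypothetical:

```python
def pick_leader(election_queue, live_nodes, last_published_state):
    """Return the first queued replica that is live and was last published as 'active'.

    election_queue: replicas in queue order
    live_nodes: set of replicas currently considered live
    last_published_state: {replica: 'active' | 'recovering' | 'down'}
    """
    for replica in election_queue:
        if replica in live_nodes and last_published_state.get(replica) == "active":
            return replica
    return None  # no eligible replica: the shard has no leader for now


if __name__ == "__main__":
    queue = ["r1", "r2", "r3"]
    live = {"r2", "r3"}  # r1 (the old leader) has died
    # r2 missed an update, so the old leader published it as 'down' before dying.
    states = {"r1": "active", "r2": "down", "r3": "active"}
    print(pick_leader(queue, live, states))
```

Marking a replica as not active before it recovers is what keeps it from being picked here, even though it sits earlier in the queue.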

Solr nodes (and Solrj clients) forward search requests to active replicas. The active replicas will continue to serve search traffic even if they lose connectivity to ZooKeeper. If the entire ZooKeeper cluster becomes unavailable then the cluster state is effectively frozen, which means that searches continue to be served by whoever was active at the time ZooKeeper went down, but replica recovery cannot happen during this time. However, replicas which are not active are allowed to return results if, somehow, a non-distributed request reaches them. This is needed for the sync that happens during leader election and is also useful for debugging at times.

In summary, SolrCloud chooses consistency over availability during writes and it prefers consistency for searches but under severe conditions such as ZooKeeper being unavailable, degrades to possibly serving stale data. In the CAP theorem world, this makes Solr a CP system, with some mitigating heuristics to try and keep availability in certain circumstances.

Jepsen environment

The Jepsen test environment consists of 5 hosts for the data store. We set these up as Linux Containers on an Ubuntu host and named them n1, n2, n3, n4 and n5. Each node runs Java 1.8.0_25 with Solr 4.10.2 and ZooKeeper 3.4.6. Therefore, both the ZooKeeper and Solr clusters have 5 nodes each. The reason behind this topology was to make sure that ZooKeeper is partitioned along with the Solr nodes to simulate a more accurate partition where Solr not only loses the ability to communicate with other Solr nodes but might also end up talking to a ZooKeeper node which has lost quorum.
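“Losing quorum” has a simple arithmetic meaning: a ZooKeeper ensemble keeps working only on the side of a partition that can still reach a strict majority of the ensemble. A minimal sketch (the function name is illustrative):

```python
def has_quorum(reachable_nodes, ensemble_size):
    """True if a group of mutually reachable nodes forms a strict majority."""
    return reachable_nodes > ensemble_size // 2


if __name__ == "__main__":
    # With the 5-node ensemble used here, a 3/2 split leaves exactly one working side.
    for side in (3, 2):
        print(side, "of 5 ->", has_quorum(side, 5))
```

This is also why ensembles use an odd number of nodes: with 4 nodes, an even 2/2 split would leave neither side with a majority.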

The only configuration change over stock Solr 4.10.2 was to set connection and read timeouts for inter-shard query and update requests. This is important as well as a recommended setting for all production deployments. Here is the complete solr.xml from one of the Solr nodes:

I used the schemaless configuration for Solr. This was uploaded to ZK using the following command:

The Solr nodes were started using the new Solr bin scripts:

Each test would start by deleting the collection created by the previous test and would then create a new collection with 5 shards and 3 replicas. We performed the same tests with a collection having just 1 shard and 5 replicas and the results were similar.

Flux, the Clojure client for Solr, was used to make requests to Solr. The Flux client wraps the Java-based Solrj client, and we used it to create the equivalent of a simple dumb HTTP client to Solr, i.e. a client which is not cluster aware and has no smart routing logic inside.

We wrote two Jepsen clients for Solr based on what already existed for other data stores:

  1. cas-set-client – This client simulates a set of integers by reading a document using real time get, adding an integer to a multi-valued field and updating it with the last version read from Solr.
  2. create-set-client – This client implements a set of integers by writing a new document for each integer as the unique key.
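The read-modify-write loop behind the cas-set-client can be sketched against an in-memory stand-in for Solr. Everything here is hypothetical: `ToyStore`, `VersionConflict` and `cas_add` are illustrative names, version 0 meaning “no document yet” is a convention of this toy only, and the real client is written in Clojure on top of Flux.

```python
class VersionConflict(Exception):
    """Raised when an update carries a version other than the current one."""


class ToyStore:
    """In-memory stand-in for a versioned document store (not Solr's API)."""

    def __init__(self):
        self.docs = {}   # doc_id -> (version, values)
        self.clock = 0   # source of increasing version numbers

    def realtime_get(self, doc_id):
        return self.docs.get(doc_id, (0, []))

    def update(self, doc_id, values, expected_version):
        current_version, _ = self.docs.get(doc_id, (0, []))
        if current_version != expected_version:
            raise VersionConflict(doc_id)
        self.clock += 1
        self.docs[doc_id] = (self.clock, list(values))
        return self.clock


def cas_add(store, doc_id, element, max_attempts=5):
    """Add `element` to the multi-valued document, retrying on version conflicts."""
    for _ in range(max_attempts):
        version, values = store.realtime_get(doc_id)
        try:
            store.update(doc_id, values + [element], expected_version=version)
            return True
        except VersionConflict:
            continue  # someone else won the race: re-read and try again
    return False


if __name__ == "__main__":
    store = ToyStore()
    cas_add(store, "set", 1)
    cas_add(store, "set", 2)
    print(store.realtime_get("set"))
```

The create-set-client needs no such loop, because every integer becomes its own document and inserts never conflict.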

 

Now that our environment is set up, we are ready to begin testing Solr against different nemeses, starting with the bridge nemesis.

Non-transitive network partitions a.k.a. the bridge

The bridge nemesis is best described by its official documentation in the code itself:

“A grudge which cuts the network in half, but preserves a node in the middle which has uninterrupted bidirectional connectivity to both components.”

For example, for a five node cluster {n1, n2, n3, n4, n5}, the bridge nemesis will create two equal network partitions {n1, n2} and {n4, n5} such that the node n3 can talk to both partitions. ZooKeeper does the right thing in this scenario and creates a majority partition by combining n3 with either {n1, n2} or with {n4, n5}. Similarly, our SolrCloud cluster is also divided into a majority and minority partition in one of the two configurations above.
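The shape of this partition can be written down as a grudge map, where each node lists the peers it can no longer talk to. The following is a sketch of the idea rather than Jepsen’s actual implementation. It assumes the node in the middle of the list acts as the bridge, and the function name is illustrative:

```python
def bridge_grudge(nodes):
    """Split nodes into two halves that cannot see each other, joined by one bridge node."""
    nodes = list(nodes)
    mid = len(nodes) // 2
    left, bridge, right = nodes[:mid], nodes[mid], nodes[mid + 1:]
    grudge = {node: set(right) for node in left}
    grudge.update({node: set(left) for node in right})
    grudge[bridge] = set()  # the bridge keeps talking to everyone
    return grudge


if __name__ == "__main__":
    for node, dropped in sorted(bridge_grudge(["n1", "n2", "n3", "n4", "n5"]).items()):
        print(node, "drops", sorted(dropped))
```

Note that connectivity is no longer transitive: n1 can reach n3 and n3 can reach n4, but n1 cannot reach n4.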

So let’s see how SolrCloud fares under such a partition:

Repeated runs yielded similar results. All acknowledged writes were available after the partitions had healed (recorded by the ok-fraction in the above output). Some requests that looked unsuccessful to the client (connection and read timeouts) had in fact been applied (recorded in the recovered-fraction).
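To make these fractions concrete, here is a sketch of how such a set check could be computed from the attempted writes, the acknowledged writes and the final read. The definitions follow the description above and are an assumption. Jepsen’s own checker may slice the sets somewhat differently, and the function name is illustrative.

```python
def check_set(attempted, acknowledged, final_read):
    """Classify every element and report each class as a fraction of all attempts."""
    attempted, acknowledged, final_read = set(attempted), set(acknowledged), set(final_read)
    classes = {
        "ok": acknowledged & final_read,                     # acknowledged and present
        "lost": acknowledged - final_read,                   # acknowledged but missing
        "recovered": (attempted - acknowledged) & final_read,  # timed out, yet present
        "unexpected": final_read - attempted,                # present but never written
    }
    total = len(attempted)
    return {f"{name}-fraction": len(members) / total for name, members in classes.items()}


if __name__ == "__main__":
    # Hypothetical run: 4 writes attempted, 2 acknowledged, 1 timed out but was applied.
    print(check_set(attempted={1, 2, 3, 4}, acknowledged={1, 2}, final_read={1, 2, 3}))
```

A run without data loss is one where the lost-fraction is zero. A non-zero recovered-fraction is harmless for a set, since a timeout only means the client could not tell whether the write was applied.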

In this scenario, when trying to send multiple updates with the same version, only one would succeed and others would fail with an error like the following:

So optimistic concurrency is well implemented in SolrCloud and if you specify the _version_ during an update, you are sure to have only one update succeed under concurrency and consistency is preserved under the bridge partition.

What happens when we start sending multiple (new) documents instead of using versioned updates? Since inserts don’t conflict with each other, we should be able to read each insert for which SolrCloud sent an acknowledgement.

The results were as expected:

As soon as a partition happens, we see a few read time outs as in-flight requests are impacted and new leaders are elected. Most failures are due to connection timeouts because of nodes not being able to route requests to leaders that are in another partition.

The interesting thing to note here is that all Solr nodes can talk to at least one ZooKeeper instance (n3) which is part of the majority ZooKeeper partition which means that all nodes are ‘live’ (from ZooKeeper’s perspective) but not necessarily available. This also means that in the face of such a partition, one may suddenly lose the ability to contact some or all replicas (depending on the cluster state) as well as lose the ability to forward the request to the leader from arbitrary nodes in the cluster. Remember that we are using dumb clients which do not route to the leader automatically. Presumably, availability can be a lot better if we use CloudSolrServer which is a cluster aware Solr client that can route requests automatically to the right leader. This is something that we plan to test and report back soon.

Random transitive partitions

Okay, so bridge partitions may be a bit exotic and rare. Transitive partitions, where the cluster is completely divided into two halves, are more common. So we create multiple partitions during the test run and each time a different set of nodes is selected to be cut off from the rest. For example, the set of nodes may be partitioned as {n3, n5} and {n1, n2, n4} in a run, creating a majority and a minority partition of both ZooKeeper and SolrCloud nodes.
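Picking a fresh random split each time can be sketched as follows. This is an illustration of the idea, not Jepsen’s code, and the function names are hypothetical:

```python
import random


def random_halves(nodes, rng=random):
    """Shuffle the nodes and cut the list in two; the second part is the larger one."""
    shuffled = list(nodes)
    rng.shuffle(shuffled)
    mid = len(shuffled) // 2
    return shuffled[:mid], shuffled[mid:]


def complete_grudge(components):
    """Every node drops traffic from every node outside its own component."""
    everyone = {node for component in components for node in component}
    return {
        node: everyone - set(component)
        for component in components
        for node in component
    }


if __name__ == "__main__":
    minority, majority = random_halves(["n1", "n2", "n3", "n4", "n5"])
    print("minority:", sorted(minority), "majority:", sorted(majority))
    for node, dropped in sorted(complete_grudge([minority, majority]).items()):
        print(node, "drops", sorted(dropped))
```

With 5 nodes this always yields a 2-node minority and a 3-node majority, so exactly one side retains a ZooKeeper quorum.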

How well does SolrCloud cope with compare-and-set and insert operations under such partitions?

Here’s a typical run for compare-and-set:

Here are the results for inserts (remember, these are ratios):

In this scenario, we would expect only the majority partition to accept writes and the minority partition should reject all writes because it cannot communicate with ZooKeeper. As we can see from the logs of the create-set-client test above, this is exactly what happens. A few requests time out during the time the leader is elected and then all updates to a node in the minority partition error out with the following:

 

Fixed transitive partitions

Instead of a changing set of isolated nodes, how about a fixed set of nodes which are partitioned each time? Granted that this is somewhat easier but since such a nemesis already existed inside Jepsen, we thought that we should execute it for the sake of completeness.

No problems again for compare-and-set updates:

as well as for inserts:

Single-node partitions

I tried running these tests but Jepsen fails with an error that I haven’t been able to fix yet. @aphyr said that I need passwordless sudo for it but that doesn’t help either. Given that SolrCloud has had leader failover and ChaosMonkey style tests since day one, that we use ZooKeeper for state management, and that we had already passed some of the more complex scenarios, I skipped these tests in the interest of time. But it’s still something I’d like to look into later for completeness.

Bugs, bugs and bugs

It wasn’t smooth sailing all the time. I did run into bugs, some minor and some not so minor. Some of the issues that were found using Jepsen were:

SOLR-6530 – Commits under network partition can put any node into ‘down’ state

This was a serious bug that was found during our tests. It affects Solr versions after 4.8 and is only experienced by clusters where explicit commit requests are invoked. Users who have autoCommit in their configurations are not affected. I found this bug because the first compare-and-set test implementation used explicit commits to read the latest version of a document before updating it. I now use the real-time-get API which doesn’t require commits. This bug is fixed in Solr 4.10.2.

SOLR-6583 – Resuming connection with ZooKeeper causes log replay

This is a minor issue where resuming connection with ZooKeeper causes log replay. All such replayed updates are discarded because they’re old and already applied. But it can delay a particular node from becoming ‘active’ by a short duration. Most of the time it is not an issue because the log replay code keeps the transaction log file names in memory at startup (it was, after all, written to recover from logs during startup) and never refreshes them. So, most users just get a warning in their logs about missing transaction files.

Request threads hang during a network partition

I found that a few update request threads would get stuck for as long as a network partition lasts. This happens for requests which hit a network partition after they’ve written the document to the local transaction log and Lucene index and then fail to update one or more replicas. At this point, some such update threads try to update ZooKeeper to put the failing replicas in the ‘down’ state, and that is where they get stuck. Once the partition heals, they return a successful response to the user. However, in the meantime another replica has become the leader (because the previous leader was partitioned from ZK), so the old leader recovers from the new leader and throws away the uncommitted updates.

This means that clients which are willing to wait much longer than the ZooKeeper session timeouts *may* get a successful response for a document that is lost. Most clients have sane read timeouts much less than the default ZooKeeper session timeout and will never experience this issue. A related issue was noticed by the community and posted as SOLR-6511 on the public Jira. I am still trying to figure out and fix the root cause.

Cluster Status API fails with weird exceptions during network partitions

The Cluster Status API in Solr fetches the latest cluster state from what Solr calls an “Overseer” node. This is a node elected from among all the nodes in the cluster which is responsible for publishing new cluster state to ZooKeeper. This API uses ZooKeeper as a queue to submit requests and read responses. This ensures that the latest cluster state is returned to the client. However, this also means that when talking to a ZooKeeper node which has lost quorum, this status API fails with ZooKeeper connection loss or session timeout exceptions and a huge stack trace. This is something we can improve on.

SOLR-6610 – Slow cluster startup

This is one of those things that developers often don’t notice as much as users. The first time I brought the Solr cluster online, I noticed that the nodes waited for 30 seconds before joining the cluster. Then I restarted the cluster and it came back immediately. I put a sleep in my init scripts to work around it but made a mental note to investigate it later because I was focused on getting the Jepsen tests run first. I almost forgot about it until someone in the Solr community opened SOLR-6610, which prompted a re-look. It turned out to be a silly bug which caused Solr to wait for its own state to be published as down while assuming that an Overseer node existed to process the publish. It only happens when a new cluster is initialized and by definition is seen by users who are bringing up a single node cluster to play with Solr. This is now fixed and will be released in the next bug fix release – Solr 4.10.3.

 

Where’s the code?

All our changes to Jepsen are available on our Jepsen fork on GitHub in the solr-jepsen branch.

Conclusion

These tests show that there was no data loss during these particular kinds of partitions but, as Edsger W. Dijkstra said, “Testing shows the presence, not the absence of bugs”. If we did not find the kind of bugs that we were looking for, that only means that we need to try harder and look in more places.

While Solr may require a little extra work in setting up ZooKeeper in the beginning, as you can see from these tests, this allows us to create a significantly more stable environment when it matters most: production. The Solr community made a conscious decision to trade a tiny bit of ease of getting started in exchange for a whole lot of “get finished”. This should result in significantly less data loss and more reliable operations in general.

What’s next?

This is the fun part!

Our next step is to integrate these tests with the continuous integration environment for Solr and Fusion, our commercial batteries-included platform for search and analytics. We are working through the kinks of our Jenkins setup to have Jepsen run against each Subversion commit and notify us of any failures. We plan to make our Jenkins setup public so that the broader Lucene/Solr community can know the impact of their changes on Solr’s stability.

Jepsen is much more powerful than how we’re using it right now. It can skew clocks across the cluster, flip a bit on the disk, slow down networks and wreak all kinds of havoc. We want to test more Solr features like deletes, shard splits, migrates etc. under different nemesis strategies as well as different cluster topologies. Also, we haven’t proved linearizability for compare-and-set operations, just that there’s no data loss in the final state. The Knossos checker needs a lot of resources to run, so we need to run our tests on some large AWS machines to know more.

SolrCloud has come a long way and with Lucidworks’ support and the pooled resources of the huge community, we will continue to build Solr and SolrCloud to be the most stable, scalable and reliable open source search system on the planet.

I want to thank my colleagues, Tim Potter and Matt Hoffman, for their help in getting Jepsen to run with Solr. Special mention to Matt Hoffman, our resident Clojure guru, for helping a Clojure newbie like me get up to speed and for answering my incessant questions about Clojure syntax and APIs. A big thanks to Matt Mitchell, also my colleague, who wrote Flux – the Clojure Solr client that was used for these tests. I’d like to thank Anshum Gupta, Chris Hostetter, Grant Ingersoll, Jim Walker, Noble Paul, Tim Potter and Steve Rowe for their feedback on this post. It is so great to be able to work with such awesome people on stuff that I love.

Finally, thanks to @aphyr himself for creating Jepsen and writing about network partitions and distributed systems. I greatly enjoyed his posts and learned a lot from them and I think Solr has improved and will continue to be improved because of them.