By Sean Mackrory, Cloudera Inc.
It’s widely accepted that you should never design or implement your own cryptographic algorithms but rather use well-tested, peer-reviewed libraries instead. The same can be said of distributed systems: Making up your own protocols for coordinating a cluster will almost certainly result in frustration and failure.
Architecting a distributed system is not a trivial problem; it is very prone to race conditions, deadlocks, and inconsistency. Making cluster coordination fast and scalable is just as hard as making it reliable. That’s where Apache ZooKeeper, a coordination service that gives you the tools you need to write correct distributed applications, comes in handy.
With ZooKeeper, these difficult problems are solved once, allowing you to build your application without trying to reinvent the wheel. ZooKeeper is already used by Apache HBase, HDFS, and other Apache Hadoop projects to provide highly-available services and, in general, to make distributed programming easier. In this blog post you’ll learn how you can use ZooKeeper to easily and safely implement important features in your distributed software.
How ZooKeeper Works
ZooKeeper runs on a cluster of servers called an ensemble that share the state of your data. (These may be the same machines that are running other Hadoop services or a separate cluster.) Whenever a change is made, it is not considered successful until it has been written to a quorum (at least half) of the servers in the ensemble. A leader is elected within the ensemble, and if two conflicting changes are made at the same time, the one that is processed by the leader first will succeed and the other will fail. ZooKeeper guarantees that writes from the same client will be processed in the order they were sent by that client. This guarantee, along with other features discussed below, allow the system to be used to implement locks, queues, and other important primitives for distributed queueing. The outcome of a write operation allows a node to be certain that an identical write has not succeeded for any other node.
A consequence of the way ZooKeeper works is that a server will disconnect all client sessions any time it has not been able to connect to the quorum for longer than a configurable timeout. The server has no way to tell if the other servers are actually down or if it has just been separated from them due to a network partition, and can therefore no longer guarantee consistency with the rest of the ensemble. As long as more than half of the ensemble is up, the cluster can continue service despite individual server failures. When a failed server is brought back online it is synchronized with the rest of the ensemble and can resume service.
It is best to run your ZooKeeper ensemble with an odd number of server; typical ensemble sizes are three, five, or seven. For instance, if you run five servers and three are down, the cluster will be unavailable (so you can have one server down for maintenance and still survive an unexpected failure). If you run six servers, however, the cluster is still unavailable after three failures but the chance of three simultaneous failures is now slightly higher. Also remember that as you add more servers, you may be able to tolerate more failures, but you also may begin to have lower write throughput. (Apache’s documentation has a nice illustration of the performance characteristics of various ZooKeeper ensemble sizes.)
You need to have Java installed before running ZooKeeper (client bindings are available in several other languages). Cloudera currently recommends Oracle Java 6 for production use, but these examples should work just fine on OpenJDK 6 / 7. You’ll then need to install the correct CDH4 package repository for your system and install the
zookeeper package (required for any machine connecting to ZooKeeper) and the
zookeeper-server package (required for any machine in the ZooKeeper ensemble). Be sure to look at the instructions for your specific system to get the correct URL and commands, but the installation on an Ubuntu system will look something like this:
$ wget http://archive.cloudera.com/.../cdh4-repository_1.0_all.deb $ sudo dpkg -i cdh4-repository_1.0_all.deb $ sudo apt-get update $ sudo apt-get install zookeeper zookeeper-server ... JMX enabled by default Using config: /etc/zookeeper/conf/zoo.cfg ZooKeeper data directory is missing at /var/lib/zookeeper fix the path or run initialize invoke-rc.d: initscript zookeeper-server, action "start" failed.
The warnings you will see indicate that the first time ZooKeeper is run on a given host, it needs to initialize some storage space. You can do that as shown below, and start a ZooKeeper server running in a single-node/standalone configuration.
$ sudo service zookeeper-server init No myid provided, be sure to specify it in /var/lib/zookeeper/myid if using non-standalone $ sudo service zookeeper-server start JMX enabled by default Using config: /etc/zookeeper/conf/zoo.cfg Starting zookeeper ... STARTED
The ZooKeeper CLI
ZooKeeper comes with a command-line client for interactive use, although in practice you would use one of the programming language bindings directly from your application. We’ll just demonstrate the basic principles of using ZooKeeper with the command-line client.
Launch the client by executing the zookeeper-client command. The initial prompt may be hidden by some log messages, so just hitif you don’t see it. Try typing ls / or help to see the other possible commands.
$ zookeeper-client ... [zk: localhost:2181(CONNECTED) 0] ls / [zookeeper] [zk: localhost:2181(CONNECTED) 1] help
This is similar to the shell and file system from UNIX-like systems. ZooKeeper stores its data in a hierarchy of znodes. Each znode can contain data (like a file) and have children (like a directory). ZooKeeper is intended to work with small chunks of data in each znode: the default limit is 1MB.
Reading and Writing Data
Creating a znode is as easy as specifying the path and the contents. Create an empty znode to serve as a parent ‘directory’, and another znode as its child:
[zk: localhost:2181(CONNECTED) 2] create /zk-demo '' Created /zk-demo [zk: localhost:2181(CONNECTED) 3] create /zk-demo/my-node 'Hello!' Created /zk-demo/my-node
You can then read the contents of these znodes with the get command. The data contained in the znode is printed on the first line, and metadata is listed afterwards (most of the metadata in these examples has been replaced with the textfor brevity).
[zk: localhost:2181(CONNECTED) 4] get /zk-demo/my-node 'Hello!' <metadata> dataVersion = 0 <metadata>
You can, of course, modify znodes after you create them. Notice that the dataVersion values have been modified (as well as the modified timestamp – other metadata has been omitted for brevity).
[zk: localhost:2181(CONNECTED) 5] set /zk-demo/my-node 'Goodbye!' <metadata> dataVersion = 1 <metadata> [zk: localhost:2181(CONNECTED) 6] get /zk-demo/my-node 'Goodbye!' <metadata> dataVersion = 1 <metadata>
You can also delete znodes. znodes that have children cannot be deleted (unless their children are deleted first). There is an rmr command that will do this for you.
[zk: localhost:2181(CONNECTED) 7] delete /zk-demo/my-node [zk: localhost:2181(CONNECTED) 8] ls /zk-demo 
Sequential and Ephemeral znodes
In addition to the standard znode type, there are two special types of znode: sequential and ephemeral. You can create these by passing the -s and -e flags to the create command, respectively. (You can also apply both types to a znode.) Sequential nodes will be created with a numerical suffix appended to the specified name, and ZooKeeper guarantees that two nodes created concurrently will not be given the same number.
[zk: localhost:2181(CONNECTED) 9] create -s /zk-demo/sequential one Created /zk-demo/sequential0000000002 [zk: localhost:2181(CONNECTED) 10] create -s /zk-demo/sequential two Created /zk-demo/sequential0000000003 [zk: localhost:2181(CONNECTED) 11] ls /zk-demo [sequential0000000003, sequential0000000002] [zk: localhost:2181(CONNECTED) 12] get /zk-demo/sequential0000000002 one one <metadata>
Note that the numbering is based on previous children that have had the same parent – so the first sequential node we created was actually # 2. This feature allows you to create distributed mutexes. If a client wants to hold the mutex, it creates a sequential node. If it is then the lowest number znode with that name, it holds the lock. If not, it waits. To release the mutex, it deletes the node, allowing the next znode in order to hold the lock.
You can implement a very simple master election system by making sequential znodes ephemeral. Ephemeral nodes are automatically deleted when the client that created them disconnects (which means that ZooKeeper can also help you with failure detection – another hard problem in distributed systems). Clients can disconnect intentionally when they shut down, or they can be considered disconnected by the cluster because the client exceeded the configured timeout without sending a heartbeat. The node that created the lowest-numbered sequential ephemeral node assumes the “master” role. If the machine crashes, or the JVM pauses too long for garbage collection, the ephemeral node is deleted and the next eligible node can assume its place.
[zk: localhost:2181(CONNECTED) 13] create -e -s /zk-demo/ephemeral data Created /zk-demo/ephemeral [zk: localhost:2181(CONNECTED) 14] ls /zk-demo [sequential0000000003, sequential0000000002, ephemeral0000000003] [zk: localhost:2181(CONNECTED) 15] quit Quitting... $ zookeeper-client Connecting to localhost:2181 Welcome to ZooKeeper! [zk: localhost:2181(CONNECTED) 0] ls /zk-demo [sequential0000000003, sequential0000000002]
ZooKeeper can also notify you of changes in a znode’s content or changes in a znode’s children. To register a “watch” on a znode’s data, you need to use the get or stat commands to access the current content or metadata, and pass an additional parameter requesting the watch. To register a “watch” on a znode’s children, you pass the same parameter when getting the children with ls.
[zk: localhost:2181(CONNECTED) 1] create /zk-demo/watch-this data Created /watch-this [zk: localhost:2181(CONNECTED) 2] get /zk-demo/watch-this true data <metadata>
Modify the same znode (either from the current ZooKeeper client or a separate one), and you will see the following message written to the terminal:
WATCHER:: WatchedEvent state:SyncConnected type:NodeDataChanged path:/watch-this
Note that watches fire only once. If you want to be notified of changes in the future, you must reset the watch each time it fires. Watches allow you to use ZooKeeper to implement asynchronous, event-based systems and to notify nodes when their local copies of the data in ZooKeeper is stale.
Versioning and ACLs
If you look at the metadata listed in previous commands, you will see items that are common in many file systems and features that have been discussed above: creation time, modification time (and corresponding transaction IDs), the size of the contents in bytes, and the creator of the node (if ephemeral). You will also see some metadata for features that help safeguard the integrity and security of the data: data versioning and ACLs. For example:
cZxid = 0x00 ctime = Sun Jan 00 00:00:00 UTC 2013 mZxid = 0x00 mtime = Sun Jan 00 00:00:00 UTC 2013 pZxid = 0x00 cversion = 0 dataVersion = 0 aclVersion = 0 ephemeralOwner = 0x0 dataLength = 0 numChildren = 0
The current version of the data is provided every time you read or write to it, and it can also be specified as part of a write command (a test-and-set operation). If a write is attempted with an out-of-date version specified, it will fail – which is useful to make sure you do not overwrite changes that your client has not yet processed.
ZooKeeper also supports the use of Access Control Lists (ACLs) and various authentication systems. ACLs allow you to specify finely-grained permissions to define which users and groups are allowed to create, read, update or delete each znode. Using ACLs in ZooKeeper is beyond the scope of this article, but you can read more about them on the project’s website.
Ideas for Using ZooKeeper
All of the mechanisms described above are accessible through various programming language bindings, allowing you to use ZooKeeper to write better distributed applications. Several Hadoop projects are already using ZooKeeper to coordinate the cluster and provide highly-available distributed services. Perhaps most famous of these is Apache HBase, which uses ZooKeeper to track the master, the region servers, and the status of data distributed throughout the cluster.
Here are some other examples of how ZooKeeper might be useful to you, and you can find details of the algorithms required for many of these use cases here.
- Group membership and name servicesBy having each node register an ephemeral znode for itself (and any roles it might be fulfilling), you can use ZooKeeper as a replacement for DNS within your cluster. Nodes that go down are automatically removed from the list, and your cluster always has an up-to-date directory of the active nodes.
- Distributed mutexes and master electionWe discussed these potential uses for ZooKeeper above in connection with sequential nodes. These features can help you implement automatic fail-over within your cluster, coordinate concurrent access to resources, and make other decisions in your cluster safely.
- Asynchronous message passing and event broadcastingAlthough other tools are better suited to message passing when throughput is the main concern, I’ve found ZooKeeper to be quite useful for building a simple pub/sub system when needed. In one case, a cluster needed a long sequence of actions to be performed in the hours after a node was added or removed in the cluster. On demand, the sequence of actions was loaded into ZooKeeper as a group of sequential nodes, forming a queue. The “master” node processed each action at the designated time and in the correct order. The process took several hours and there was a chance that the master node might crash or be decommissioned during that time. Because ZooKeeper recorded the progress on each action, another node could pick up where the master left off in the event of any problem.
- Centralized configuration managementUsing ZooKeeper to store your configuration information has two main benefits. First, new nodes only need to be told how to connect to ZooKeeper and can then download all other configuration information and determine the role they should play in the cluster for themselves. Second, your application can subscribe to changes in the configuration, allowing you to tweak the configuration through a ZooKeeper client and modify the cluster’s behavior at run-time.
Done wrong, distributed software can be the most difficult to debug. Done right, however, it can allow you to process more data, more reliably and in less time. Using Apache ZooKeeper allows you to confidently reason about the state of your data, and coordinate your cluster the right way. You’ve seen how easy it is to get a ZooKeeper server up and running. (In fact, if you’re a CDH user, you may already have an ensemble running!) Think about how ZooKeeper could help you build more robust systems, and have a look at CDH documentation and Apache’s project website for more information about running these packages in a production environment.