By Sean Mackrory, Cloudera Inc.

It’s widely accepted that you should never design or implement your own cryptographic algorithms but rather use well-tested, peer-reviewed libraries instead. The same can be said of distributed systems: Making up your own protocols for coordinating a cluster will almost certainly result in frustration and failure.

Architecting a distributed system is not a trivial problem; it is very prone to race conditions, deadlocks, and inconsistency. Making cluster coordination fast and scalable is just as hard as making it reliable. That’s where Apache ZooKeeper, a coordination service that gives you the tools you need to write correct distributed applications, comes in handy.

With ZooKeeper, these difficult problems are solved once, allowing you to build your application without trying to reinvent the wheel. ZooKeeper is already used by Apache HBase, HDFS, and other Apache Hadoop projects to provide highly-available services and, in general, to make distributed programming easier. In this blog post you’ll learn how you can use ZooKeeper to easily and safely implement important features in your distributed software.

How ZooKeeper Works

ZooKeeper runs on a cluster of servers called an ensemble that share the state of your data. (These may be the same machines that are running other Hadoop services or a separate cluster.) Whenever a change is made, it is not considered successful until it has been written to a quorum (a majority) of the servers in the ensemble. A leader is elected within the ensemble, and if two conflicting changes are made at the same time, the one that is processed by the leader first will succeed and the other will fail. ZooKeeper guarantees that writes from the same client will be processed in the order they were sent by that client. This guarantee, along with other features discussed below, allows the system to be used to implement locks, queues, and other important primitives for distributed computing. The outcome of a write operation also allows a node to be certain that an identical write has not succeeded for any other node.

A consequence of the way ZooKeeper works is that a server will disconnect all client sessions any time it has not been able to connect to the quorum for longer than a configurable timeout. The server has no way to tell if the other servers are actually down or if it has just been separated from them due to a network partition, and can therefore no longer guarantee consistency with the rest of the ensemble. As long as more than half of the ensemble is up, the cluster can continue service despite individual server failures. When a failed server is brought back online it is synchronized with the rest of the ensemble and can resume service.

It is best to run your ZooKeeper ensemble with an odd number of servers; typical ensemble sizes are three, five, or seven. For instance, if you run five servers and three are down, the cluster will be unavailable (so you can have one server down for maintenance and still survive an unexpected failure). If you run six servers, however, the cluster is still unavailable after three failures, but the chance of three simultaneous failures is now slightly higher. Also remember that as you add more servers, you may be able to tolerate more failures, but you may also begin to see lower write throughput. (Apache’s documentation has a nice illustration of the performance characteristics of various ZooKeeper ensemble sizes.)

Installing ZooKeeper

You need to have Java installed before running ZooKeeper (client bindings are available in several other languages). Cloudera currently recommends Oracle Java 6 for production use, but these examples should work just fine on OpenJDK 6 / 7. You’ll then need to install the correct CDH4 package repository for your system and install the zookeeper package (required for any machine connecting to ZooKeeper) and the zookeeper-server package (required for any machine in the ZooKeeper ensemble). Be sure to look at the instructions for your specific system to get the correct URL and commands, but the installation on an Ubuntu system will look something like this:

$ wget http://archive.cloudera.com/.../cdh4-repository_1.0_all.deb
$ sudo dpkg -i cdh4-repository_1.0_all.deb
$ sudo apt-get update
$ sudo apt-get install zookeeper zookeeper-server
...
JMX enabled by default
Using config: /etc/zookeeper/conf/zoo.cfg
ZooKeeper data directory is missing at /var/lib/zookeeper fix the path or run initialize
invoke-rc.d: initscript zookeeper-server, action "start" failed.

The warnings you will see indicate that the first time ZooKeeper is run on a given host, it needs to initialize some storage space. You can do that as shown below, and start a ZooKeeper server running in a single-node/standalone configuration.

$ sudo service zookeeper-server init
No myid provided, be sure to specify it in /var/lib/zookeeper/myid if using non-standalone
$ sudo service zookeeper-server start
JMX enabled by default
Using config: /etc/zookeeper/conf/zoo.cfg
Starting zookeeper ... STARTED
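
The standalone server above is enough to follow the examples in this post. To form the multi-server ensemble described earlier, every server’s /etc/zookeeper/conf/zoo.cfg must list all of the ensemble’s members. A minimal sketch for a three-server ensemble might look like the following (the host names are placeholders, and each server must also write its own ID, matching its server.N line, into the myid file mentioned in the output above):

# Basic timing and storage settings (tickTime is in milliseconds)
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/var/lib/zookeeper
clientPort=2181
# One entry per ensemble member: host, quorum port, leader-election port
server.1=zk1.example.com:2888:3888
server.2=zk2.example.com:2888:3888
server.3=zk3.example.com:2888:3888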

The ZooKeeper CLI

ZooKeeper comes with a command-line client for interactive use, although in practice you would use one of the programming language bindings directly from your application. We’ll just demonstrate the basic principles of using ZooKeeper with the command-line client.

Launch the client by executing the zookeeper-client command. The initial prompt may be hidden by some log messages, so just hit Enter if you don’t see it. Try typing ls / or help to see the other possible commands.

$ zookeeper-client
  ...
  [zk: localhost:2181(CONNECTED) 0] ls /
  [zookeeper]
  [zk: localhost:2181(CONNECTED) 1] help

This is similar to the shell and file system from UNIX-like systems. ZooKeeper stores its data in a hierarchy of znodes. Each znode can contain data (like a file) and have children (like a directory). ZooKeeper is intended to work with small chunks of data in each znode: the default limit is 1MB.

Reading and Writing Data

Creating a znode is as easy as specifying the path and the contents. Create an empty znode to serve as a parent ‘directory’, and another znode as its child:

[zk: localhost:2181(CONNECTED) 2] create /zk-demo ''
Created /zk-demo
[zk: localhost:2181(CONNECTED) 3] create /zk-demo/my-node 'Hello!'
Created /zk-demo/my-node

You can then read the contents of these znodes with the get command. The data contained in the znode is printed on the first line, and metadata is listed afterwards (most of the metadata in these examples has been replaced with the text <metadata> for brevity).

[zk: localhost:2181(CONNECTED) 4] get /zk-demo/my-node
'Hello!'
<metadata>
dataVersion = 0
<metadata>

You can, of course, modify znodes after you create them. Notice that the dataVersion value has been incremented (as has the modification timestamp; other metadata has been omitted for brevity).

[zk: localhost:2181(CONNECTED) 5] set /zk-demo/my-node 'Goodbye!'
<metadata>
dataVersion = 1
<metadata>
[zk: localhost:2181(CONNECTED) 6] get /zk-demo/my-node
'Goodbye!'
<metadata>
dataVersion = 1
<metadata>

You can also delete znodes, but a znode that has children cannot be deleted unless its children are deleted first. The rmr command will delete a znode and its children recursively for you.

[zk: localhost:2181(CONNECTED) 7] delete /zk-demo/my-node 
[zk: localhost:2181(CONNECTED) 8] ls /zk-demo
[]
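
If you are using the Java binding, the same create/get/set/delete sequence looks something like the following minimal sketch. (The connection string, session timeout, and error handling are simplified assumptions, and the /zk-demo parent znode is assumed to exist, as created above.)

import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.*;
import org.apache.zookeeper.data.Stat;

public class ReadWriteDemo {
    public static void main(String[] args) throws Exception {
        // Connect to a local server and wait until the session is up.
        final CountDownLatch connected = new CountDownLatch(1);
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, new Watcher() {
            public void process(WatchedEvent event) {
                if (event.getState() == Event.KeeperState.SyncConnected) {
                    connected.countDown();
                }
            }
        });
        connected.await();

        // create: like 'create /zk-demo/my-node Hello!' in the CLI
        zk.create("/zk-demo/my-node", "Hello!".getBytes(),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        // get: returns the data and fills in the metadata (Stat)
        Stat stat = new Stat();
        byte[] data = zk.getData("/zk-demo/my-node", false, stat);
        System.out.println(new String(data) + ", version " + stat.getVersion());

        // set and delete: a version of -1 means "any version"
        zk.setData("/zk-demo/my-node", "Goodbye!".getBytes(), -1);
        zk.delete("/zk-demo/my-node", -1);
        zk.close();
    }
}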

Sequential and Ephemeral znodes

In addition to the standard znode type, there are two special types of znode: sequential and ephemeral. You can create these by passing the -s and -e flags to the create command, respectively. (You can also apply both types to a znode.) Sequential nodes will be created with a numerical suffix appended to the specified name, and ZooKeeper guarantees that two nodes created concurrently will not be given the same number.

[zk: localhost:2181(CONNECTED) 9] create -s /zk-demo/sequential one
Created /zk-demo/sequential0000000002
[zk: localhost:2181(CONNECTED) 10] create -s /zk-demo/sequential two
Created /zk-demo/sequential0000000003
[zk: localhost:2181(CONNECTED) 11] ls /zk-demo
[sequential0000000003, sequential0000000002]
[zk: localhost:2181(CONNECTED) 12] get /zk-demo/sequential0000000002
one
<metadata>

Note that the numbering is based on all children ever created under the same parent, so the first sequential znode we created was actually #2. This feature allows you to implement distributed mutexes: if a client wants to hold the mutex, it creates a sequential znode. If its znode then has the lowest number with that name, it holds the lock; if not, it waits. To release the mutex, it deletes its znode, allowing the client that created the next znode in sequence to take the lock.
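
As a concrete illustration, here is a minimal, simplified sketch of that recipe using the Java binding. The /zk-demo/locks parent path and the lock- prefix are illustrative names, the parent znode is assumed to already exist, and a production implementation would set a watch on the next-lowest znode rather than polling as this sketch does:

import java.util.Collections;
import java.util.List;
import org.apache.zookeeper.*;

public class SimpleLock {
    private final ZooKeeper zk;
    private String myNode;  // full path of the znode this client created

    public SimpleLock(ZooKeeper zk) {
        this.zk = zk;
    }

    public void acquire() throws Exception {
        // Sequential: ZooKeeper appends a unique, increasing suffix.
        // Ephemeral: the lock is released automatically if we disconnect.
        myNode = zk.create("/zk-demo/locks/lock-", new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);
        while (true) {
            List<String> children = zk.getChildren("/zk-demo/locks", false);
            // The zero-padded suffixes sort lexicographically in
            // numeric order, so the first child holds the lock.
            Collections.sort(children);
            if (myNode.endsWith(children.get(0))) {
                return;  // our znode has the lowest number: we hold the lock
            }
            Thread.sleep(100);  // simplification: poll instead of watching
        }
    }

    public void release() throws Exception {
        zk.delete(myNode, -1);  // deleting our znode passes the lock on
    }
}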

You can implement a very simple master election system by making sequential znodes ephemeral. Ephemeral znodes are automatically deleted when the client that created them disconnects (which means that ZooKeeper can also help you with failure detection, another hard problem in distributed systems). Clients can disconnect intentionally when they shut down, or the cluster can consider them disconnected because they exceeded the configured timeout without sending a heartbeat. The node that created the lowest-numbered sequential ephemeral znode assumes the “master” role; if that machine crashes, or its JVM pauses too long for garbage collection, its znode is deleted and the next eligible node can take its place.

[zk: localhost:2181(CONNECTED) 13] create -e -s /zk-demo/ephemeral data
Created /zk-demo/ephemeral0000000003
[zk: localhost:2181(CONNECTED) 14] ls /zk-demo
[sequential0000000003, sequential0000000002, ephemeral0000000003]
[zk: localhost:2181(CONNECTED) 15] quit
Quitting...
$ zookeeper-client
Connecting to localhost:2181
Welcome to ZooKeeper!
[zk: localhost:2181(CONNECTED) 0] ls /zk-demo
[sequential0000000003, sequential0000000002]

Watches

ZooKeeper can also notify you of changes in a znode’s content or changes in a znode’s children. To register a “watch” on a znode’s data, you need to use the get or stat commands to access the current content or metadata, and pass an additional parameter requesting the watch. To register a “watch” on a znode’s children, you pass the same parameter when getting the children with ls.

[zk: localhost:2181(CONNECTED) 1] create /zk-demo/watch-this data
Created /zk-demo/watch-this
[zk: localhost:2181(CONNECTED) 2] get /zk-demo/watch-this true
data
<metadata>

Modify the same znode (either from the current ZooKeeper client or a separate one), and you will see the following message written to the terminal:

WATCHER::

WatchedEvent state:SyncConnected type:NodeDataChanged path:/zk-demo/watch-this

Note that watches fire only once. If you want to be notified of changes in the future, you must reset the watch each time it fires. Watches allow you to use ZooKeeper to implement asynchronous, event-based systems and to notify nodes when their local copies of the data in ZooKeeper are stale.
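
In the Java binding, resetting a watch each time it fires looks something like this minimal sketch (the znode path follows the CLI example above, and error handling is simplified):

import org.apache.zookeeper.*;
import org.apache.zookeeper.data.Stat;

public class DataWatcher implements Watcher {
    private final ZooKeeper zk;

    public DataWatcher(ZooKeeper zk) throws Exception {
        this.zk = zk;
        readAndWatch();  // register the initial watch
    }

    private void readAndWatch() throws Exception {
        // Passing "this" as the watcher sets a one-shot watch that
        // fires on the next change to the znode's data.
        byte[] data = zk.getData("/zk-demo/watch-this", this, new Stat());
        System.out.println("Current data: " + new String(data));
    }

    public void process(WatchedEvent event) {
        if (event.getType() == Event.EventType.NodeDataChanged) {
            try {
                readAndWatch();  // watches fire once, so re-register
            } catch (Exception e) {
                e.printStackTrace();  // simplification
            }
        }
    }
}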

Versioning and ACLs

If you look at the metadata listed in previous commands, you will see items that are common in many file systems and features that have been discussed above: creation time, modification time (and corresponding transaction IDs), the size of the contents in bytes, and the creator of the node (if ephemeral). You will also see some metadata for features that help safeguard the integrity and security of the data: data versioning and ACLs. For example:

cZxid = 0x00
ctime = Sun Jan 00 00:00:00 UTC 2013
mZxid = 0x00
mtime = Sun Jan 00 00:00:00 UTC 2013
pZxid = 0x00
cversion = 0
dataVersion = 0
aclVersion = 0
ephemeralOwner = 0x0
dataLength = 0
numChildren = 0

The current version of the data is provided every time you read or write it, and it can also be specified as part of a write command, making the write a test-and-set operation. If a write is attempted with an out-of-date version specified, it will fail, which is useful to make sure you do not overwrite changes that your client has not yet processed.
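
From the Java binding, such a test-and-set write looks something like the following sketch (the path and payload are illustrative):

import org.apache.zookeeper.*;
import org.apache.zookeeper.data.Stat;

public class VersionedWrite {
    public static void update(ZooKeeper zk) throws Exception {
        // Read the current data along with its metadata.
        Stat stat = new Stat();
        zk.getData("/zk-demo/my-node", false, stat);
        try {
            // Passing the version we read makes the write conditional:
            // it only succeeds if nobody has written the znode since.
            zk.setData("/zk-demo/my-node", "new value".getBytes(),
                    stat.getVersion());
        } catch (KeeperException.BadVersionException e) {
            // Someone else wrote first; re-read and retry as appropriate.
            System.out.println("Znode changed since we read it");
        }
    }
}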

ZooKeeper also supports the use of Access Control Lists (ACLs) and various authentication systems. ACLs allow you to specify fine-grained permissions that define which users and groups are allowed to create, read, update, or delete each znode. Using ACLs in ZooKeeper is beyond the scope of this article, but you can read more about them on the project’s website.

Ideas for Using ZooKeeper

All of the mechanisms described above are accessible through various programming language bindings, allowing you to use ZooKeeper to write better distributed applications. Several Hadoop projects are already using ZooKeeper to coordinate the cluster and provide highly-available distributed services. Perhaps most famous of these is Apache HBase, which uses ZooKeeper to track the master, the region servers, and the status of data distributed throughout the cluster.

Here are some other examples of how ZooKeeper might be useful to you, and you can find details of the algorithms required for many of these use cases in the recipes section of the project’s documentation.

    • Group membership and name services: By having each node register an ephemeral znode for itself (and any roles it might be fulfilling), you can use ZooKeeper as a replacement for DNS within your cluster. Nodes that go down are automatically removed from the list, and your cluster always has an up-to-date directory of the active nodes (see the sketch after this list).
    • Distributed mutexes and master election: We discussed these potential uses for ZooKeeper above in connection with sequential znodes. These features can help you implement automatic fail-over within your cluster, coordinate concurrent access to resources, and make other decisions in your cluster safely.
    • Asynchronous message passing and event broadcasting: Although other tools are better suited to message passing when throughput is the main concern, I’ve found ZooKeeper to be quite useful for building a simple pub/sub system when needed. In one case, a cluster needed a long sequence of actions to be performed in the hours after a node was added or removed. On demand, the sequence of actions was loaded into ZooKeeper as a group of sequential znodes, forming a queue. The “master” node processed each action at the designated time and in the correct order. The process took several hours, and there was a chance that the master node might crash or be decommissioned during that time. Because ZooKeeper recorded the progress of each action, another node could pick up where the master left off in the event of any problem.
    • Centralized configuration management: Using ZooKeeper to store your configuration information has two main benefits. First, new nodes only need to be told how to connect to ZooKeeper and can then download all other configuration information and determine the role they should play in the cluster for themselves. Second, your application can subscribe to changes in the configuration, allowing you to tweak the configuration through a ZooKeeper client and modify the cluster’s behavior at run-time.
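
For instance, the group-membership idea from the first bullet can be sketched in Java as follows (the /zk-demo/members parent path is an illustrative name and is assumed to already exist):

import java.net.InetAddress;
import java.util.List;
import org.apache.zookeeper.*;

public class Membership {
    // Register this machine under a well-known parent znode.
    public static void register(ZooKeeper zk) throws Exception {
        String name = InetAddress.getLocalHost().getHostName();
        // Ephemeral: the entry disappears automatically if this
        // node crashes or loses its session.
        zk.create("/zk-demo/members/" + name, new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
    }

    // List the currently active members; passing true also sets a
    // watch so we hear about the next membership change.
    public static List<String> activeMembers(ZooKeeper zk) throws Exception {
        return zk.getChildren("/zk-demo/members", true);
    }
}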

Wrapping Up

Done wrong, distributed software can be among the most difficult to debug. Done right, however, it can allow you to process more data, more reliably and in less time. Using Apache ZooKeeper allows you to confidently reason about the state of your data and coordinate your cluster the right way. You’ve seen how easy it is to get a ZooKeeper server up and running. (In fact, if you’re a CDH user, you may already have an ensemble running!) Think about how ZooKeeper could help you build more robust systems, and have a look at the CDH documentation and Apache’s project website for more information about running these packages in a production environment.
