SolrCloud on Docker

This is a follow-up to my Solr on Docker post. For this one, we’ll use a standalone ZooKeeper node, and three SolrCloud nodes, all in their own Docker containers.

I’m using Docker version 0.7, build 0d078b6, on Ubuntu 13.04.

ZooKeeper

The current version of ZooKeeper is 3.4.5, and there is a docker-zookeeper project which runs that in a single-node configuration.
Let’s build and run that in a container named “zookeeper”:

cd ~
mkdir zookeeper-docker
cd zookeeper-docker
wget https://raw.github.com/jplock/docker-zookeeper/master/Dockerfile
docker build -t makuk66/zookeeper:3.4.5 .
...
Successfully built 26871fd90d0c

docker run -name zookeeper -p 2181 -p 2888 -p 3888 makuk66/zookeeper:3.4.5

We see that ZooKeeper starts running, and after a few seconds we can verify it’s happy:

$ echo ruok | nc -q 2 localhost `docker port zookeeper 2181|sed 's/.*://'`; echo
imok
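
The check above can be wrapped in a small polling loop, so that later steps only run once ZooKeeper answers. This is a minimal sketch rather than part of my actual setup: the function names host_port and wait_for_zookeeper are made up for illustration, and it assumes the same nc -q option and docker port output format as the command above.

```shell
# Strip everything up to the last colon from a mapping such as "0.0.0.0:49153".
host_port() {
    printf '%s\n' "$1" | sed 's/.*://'
}

# Hypothetical helper: poll ZooKeeper with "ruok" until it answers "imok".
# Arguments: container name, number of attempts (default 10).
wait_for_zookeeper() {
    name=$1
    attempts=${2:-10}
    port=$(host_port "$(docker port "$name" 2181)")
    while [ "$attempts" -gt 0 ]; do
        if [ "$(echo ruok | nc -q 2 localhost "$port")" = "imok" ]; then
            return 0
        fi
        attempts=$((attempts - 1))
        sleep 1
    done
    return 1
}
```

With those defined, something like wait_for_zookeeper zookeeper 15 should return success as soon as the node reports imok, and failure if it never does.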

SolrCloud: Distributed Solr

The current version of Solr is 4.6.0, so we download that:

cd ~
mkdir solr-docker
cd solr-docker
wget http://www.mirrorservice.org/sites/ftp.apache.org/lucene/solr/4.6.0/solr-4.6.0.tgz

This locally cached copy will get added to the Docker image at build time.

Create a Dockerfile:

cat > Dockerfile <<'EOM'
#
# VERSION 0.2

FROM    ubuntu
MAINTAINER  Martijn Koster "mak-docker@greenhills.co.uk"

ENV SOLR solr-4.6.0
RUN mkdir -p /opt
ADD $SOLR.tgz /opt/$SOLR.tgz
RUN tar -C /opt --extract --file /opt/$SOLR.tgz
RUN ln -s /opt/$SOLR /opt/solr

RUN apt-get update
RUN apt-get --yes install openjdk-6-jdk
EXPOSE 8983
CMD ["/bin/bash", "-c", "cd /opt/solr/example; java -jar start.jar"]
EOM

and build:

docker build -rm=true -t makuk66/solr4:4.6.0 .

where makuk66 is my username; substitute your own.

If you don’t want to build your own image, you can pull makuk66/docker-solr, and use makuk66/docker-solr instead of makuk66/solr4:4.6.0 below.

Now we’ll manually run this with docker in the foreground.
The first node bootstraps the collection (like the SolrCloud Example A):

docker run -link zookeeper:ZK -i -p 8983 -t makuk66/solr4:4.6.0 \
    /bin/bash -c 'cd /opt/solr/example; java -Dbootstrap_confdir=./solr/collection1/conf -Dcollection.configName=myconf -DzkHost=$ZK_PORT_2181_TCP_ADDR:$ZK_PORT_2181_TCP_PORT -DnumShards=2 -jar start.jar'

The -link zookeeper:ZK makes the network information from the node named “zookeeper”
available as environment variables with the ZK_ prefix.

The other two nodes start the same way, but without the bootstrap options:

docker run -link zookeeper:ZK -i -p 8983 -t makuk66/solr4:4.6.0 \
    /bin/bash -c 'cd /opt/solr/example; java -DzkHost=$ZK_PORT_2181_TCP_ADDR:$ZK_PORT_2181_TCP_PORT -jar start.jar'
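
The zkHost value in both commands is simply the linked address and port joined with a colon. If the link is missing, those variables are empty and Solr gets a useless connect string, so it can help to fail early. Here is a minimal sketch of that check; zk_host is a name I made up for illustration, not something Docker or Solr provides.

```shell
# Build the ZooKeeper connect string from the variables that
# "-link zookeeper:ZK" injects; fail loudly if the link is missing.
zk_host() {
    if [ -z "${ZK_PORT_2181_TCP_ADDR:-}" ] || [ -z "${ZK_PORT_2181_TCP_PORT:-}" ]; then
        echo "ZK link variables are not set; was the container started with -link zookeeper:ZK?" >&2
        return 1
    fi
    printf '%s:%s\n' "$ZK_PORT_2181_TCP_ADDR" "$ZK_PORT_2181_TCP_PORT"
}
```

Inside a linked container this could be used as java -DzkHost=$(zk_host) -jar start.jar, assuming the function has been made available there.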

To show all the running containers:

$ docker ps
CONTAINER ID        IMAGE                     COMMAND                CREATED              STATUS              PORTS                                                                       NAMES
1cac635ec128        makuk66/solr4:4.6.0       /bin/bash -c cd /opt   3 seconds ago        Up 2 seconds        0.0.0.0:49158->8983/tcp                                                     prickly_mccarthy
bd23d3891dd6        makuk66/solr4:4.6.0       /bin/bash -c cd /opt   5 seconds ago        Up 4 seconds        0.0.0.0:49157->8983/tcp                                                     high_albattani
365a17a69176        makuk66/solr4:4.6.0       /bin/bash -c cd /opt   About a minute ago   Up About a minute   0.0.0.0:49156->8983/tcp                                                     elegant_bardeen
13805a493a79        makuk66/zookeeper:3.4.5   /opt/zookeeper-3.4.5   25 minutes ago       Up 25 minutes       0.0.0.0:49153->2181/tcp, 0.0.0.0:49154->2888/tcp, 0.0.0.0:49155->3888/tcp   elegant_bardeen/ZK,high_albattani/ZK,prickly_mccarthy/ZK,zookeeper
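
Since the host ports are assigned dynamically, it is handy to pull them out of that listing instead of reading them off by eye. A small sketch, assuming the PORTS column keeps the 0.0.0.0:PORT->8983/tcp format shown above; solr_host_ports is a hypothetical helper name.

```shell
# Read "docker ps" output on stdin and print the host port of every
# mapping that forwards to container port 8983, one per line.
solr_host_ports() {
    grep -o '[0-9][0-9]*->8983/tcp' | sed 's/->.*//'
}
```

Running docker ps | solr_host_ports against the listing above should print the three Solr ports and skip the ZooKeeper ones.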

We can now use one of the exposed ports to look at Solr, for example http://docker1:49158/solr/#/~cloud,
which shows the 3 Solr nodes in the cluster running on their own internal IP addresses. Neat.

Of course we won’t believe it’s real unless we see search in action.
So let’s run another Docker container to load some data, using the Docker host’s address and the mapped port for one of the nodes above:

docker run -link zookeeper:ZK -i -t makuk66/solr4:4.6.0 /bin/bash
cd /opt/solr/example/exampledocs
java -Durl=http://192.168.0.221:49158/solr/update -jar post.jar *.xml

and search:

apt-get install wget
wget -O - 'http://192.168.0.221:49158/solr/collection1/select?q=solr&wt=xml'

You can run the same query directly against a container’s internal address, which you can find using docker inspect:

docker inspect prickly_mccarthy
wget -O - 'http://172.17.0.37:8983/solr/collection1/select?q=solr&wt=xml'
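
To avoid reading through the whole inspect output, the address can be filtered out of it. This sketch assumes the JSON contains a line of the form "IPAddress": "172.17.0.37", which is what I see here; container_ip is a made-up helper name, and other Docker versions may lay the output out differently.

```shell
# Read "docker inspect" JSON on stdin and print the first IPAddress value.
container_ip() {
    sed -n 's/.*"IPAddress": *"\([0-9.][0-9.]*\)".*/\1/p' | head -n 1
}
```

For example, docker inspect prickly_mccarthy | container_ip should print just the address.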

You can see the shards in action by comparing:

$ wget -O - 'http://192.168.0.221:49158/solr/collection1/select?q=*:*&wt=xml' | sed 's/.*numFound="//' | sed 's/".*//'
32
$ wget -O - 'http://192.168.0.221:49158/solr/collection1/select?q=*:*&wt=xml&shards=shard1' | sed 's/.*numFound="//' | sed 's/".*//'
14
$ wget -O - 'http://192.168.0.221:49158/solr/collection1/select?q=*:*&wt=xml&shards=shard2' | sed 's/.*numFound="//' | sed 's/".*//'
18
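
The two shard counts should add up to the total (14 + 18 = 32 here). As a rough sketch, that comparison can be scripted; num_found and shards_add_up are names I invented for this example, and the parsing assumes the numFound attribute appears in the XML response as it does above.

```shell
# Print the numFound attribute from a Solr XML response read on stdin.
num_found() {
    sed -n 's/.*numFound="\([0-9][0-9]*\)".*/\1/p' | head -n 1
}

# Succeed only if the per-shard counts ($2, $3, ...) add up to the total ($1).
shards_add_up() {
    total=$1
    shift
    sum=0
    for count in "$@"; do
        sum=$((sum + count))
    done
    [ "$sum" -eq "$total" ]
}
```

Piping each wget response through num_found gives the three numbers, and shards_add_up 32 14 18 then confirms that no documents went missing between the shards.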

Also interesting to try is:

docker diff prickly_mccarthy

to see what changes were made to the filesystem.

Further work

We can do a bunch of further polish here:

  • we should be able to create images rather than specify command lines
  • to allow multiple clusters to co-exist on a single Docker host, we should use something more dynamic than a ‘ZK’ prefix
  • it’d be nice if we had a single script that deployed a whole cluster
  • we should probably use Data Volumes for index storage
  • we may want supervisord/upstart to monitor Java to recover from crashes
  • it might be nice to auto-discover the latest versions of ZooKeeper and Solr and use those
  • if we register containers, we could consider pre-expanding the Solr .war, for startup speed and to reduce diffs

but those all depend a bit on use-case, and are for another day.
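
To give a flavour of the single-script idea, here is a rough sketch that only prints the docker run lines for a cluster rather than executing them. It is an assumption-laden starting point, not a finished deploy tool: the function names and the IMAGE variable are made up, it uses -d to run detached instead of the foreground -i -t used earlier, and it keeps the single-dash -link spelling of this Docker version (newer releases spell it --link).

```shell
# Image to run; override by setting IMAGE in the environment.
IMAGE=${IMAGE:-makuk66/solr4:4.6.0}

# Print the java command for node N; only node 1 bootstraps the config.
solr_java_command() {
    # Single quotes keep the variables unexpanded until the container runs.
    zk='-DzkHost=$ZK_PORT_2181_TCP_ADDR:$ZK_PORT_2181_TCP_PORT'
    if [ "$1" -eq 1 ]; then
        echo "java -Dbootstrap_confdir=./solr/collection1/conf -Dcollection.configName=myconf $zk -DnumShards=2 -jar start.jar"
    else
        echo "java $zk -jar start.jar"
    fi
}

# Print one "docker run" line per node (default 3 nodes).
cluster_commands() {
    nodes=${1:-3}
    n=1
    while [ "$n" -le "$nodes" ]; do
        echo "docker run -d -link zookeeper:ZK -p 8983 $IMAGE /bin/bash -c 'cd /opt/solr/example; $(solr_java_command "$n")'"
        n=$((n + 1))
    done
}
```

Calling cluster_commands 3 prints three lines that can be reviewed and then piped to sh. In practice the first node probably needs a moment to upload the configuration before the others join, so a real script would want a pause or a readiness check between them.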

Conclusion

I can really see the value of this approach for certain use-cases.
The resource efficiency, startup speed and cleanliness make it ideal for proof-of-concept deployments, A/B testing,
and for application developers to use as a local sandbox.

I’m intrigued by production use-cases for this kind of setup. It’s obviously suitable for
multi-tenant deployments, and I’d be interested in how you could set up a SolrCloud deployment
across multiple Docker hosts.