This project contains a Docker image meant to facilitate the deployment of Apache ZooKeeper on Kubernetes using StatefulSets.
- Scaling is not currently supported. An ensemble's membership can not be updated in a safe way in ZooKeeper 3.4.10 (The current stable release).
- Observers are currently not supported. Contributions are welcome.
- Persistent Volumes must be used. emptyDirs will likely result in a loss of data.
The docker image contained in this repository is comprised of a base Ubuntu 16.04 image using the latest release of the OpenJDK JRE based on the 1.8 JVM (JDK 8u111) and the latest stable release of ZooKeeper, 3.4.10. Ubuntu is a much larger image than BusyBox or Alpine, but these images contain mucl or ulibc. This requires a custom version of OpenJDK to be built against a libc runtime other than glibc. No vendor of the ZooKeeper software supplies or verifies the software against such a JVM, and, while Alpine or BusyBox would provide smaller images, we have prioritized a well known environment.
The image is built such that the ZooKeeper process is designated to run as a non-root user. By default, this user is zookeeper. The ZooKeeper package is installed into the /opt/zookeeper directory, all configuration is sym linked into the /usr/etc/zookeeper/, and all executables are sym linked into /usr/bin. The ZooKeeper data directories are contained in /var/lib/zookeeper. This is identical to the RPM distribution that users should be familiar with.
The ZooKeeper Stateful Set requires a Headless Service to control the network domain for the ZooKeeper processes. An example configuration is provided below.
apiVersion: v1
kind: Service
metadata:
name: zk-svc
labels:
app: zk-svc
spec:
ports:
- port: 2888
name: server
- port: 3888
name: leader-election
clusterIP: None
selector:
app: zk-svc
Note that the Service contains two ports. The server port is used for followers to tail the leaders even log, and the leader-election port is used by the ensemble to perform leader election.
The Stateful Set configuration must match the Headless Service, and it must provide the number of replicas. In the example below we request a ZooKeeper ensemble of size 3. As weighted quorums are not supported, it is imperative that an odd number of replicas be chosen. Moreover, the number of replicas should be either 1, 3, 5, or 7. Ensembles may be scaled to larger membership for read fan out, but, as this will adversely impact write performance, careful thought should be given to selecting a larger value.
apiVersion: apps/v1beta1
kind: StatefulSet
metadata:
name: zk
spec:
serviceName: zk-svc
replicas: 3
The zkGenConfig.sh script will generate the ZooKeeper configuration (zoo.cfg), Log4J configuration (log4j.properties), and JVM configuration (jvm.env). These will be written to the /opt/zookeeper/conf directory with correct read permissions for the zookeeper user. These files are generated from environment variables that are injected into the container as in the example, minimal configuration below.
containers:
- name: k8szk
imagePullPolicy: Always
image: gcr.io/google_samples/k8szk:v3
ports:
- containerPort: 2181
name: client
- containerPort: 2888
name: server
- containerPort: 3888
name: leader-election
env:
- name : ZK_ENSEMBLE
value: "zk-0;zk-1;zk-2"
- name: ZK_CLIENT_PORT
value: "2181"
- name: ZK_SERVER_PORT
value: "2888"
- name: ZK_ELECTION_PORT
value: "3888"
Variable | Type | Default | Description |
---|---|---|---|
ZK_ENSEMBLE | string | N/A | A colon separated list of servers in the ensemble. |
This is a mandatory configuration variable that is used to configure the membership of the | |||
ZooKeeper ensemble. It is also used to prevent data loss during accidental scale operations. The | |||
set can be computed as follows. For all integers in the range [0,replicas), prepend the name of | |||
service followed by a dash to the integer. So for the Stateful Set above, the name is zk and we have | |||
3 replicas. for the set {0,1,2} we prepend zk- giving us zk-0;zk-1;zk-2. |
Variable | Type | Default | Description |
---|---|---|---|
ZK_CLIENT_PORT | integer | 2181 | The port on which the server will accept client requests. |
ZK_SERVER_PORT | integer | 2888 | The port on which the leader will send events to followers. |
ZK_ELECTION_PORT | integer | 3888 | The port on which the ensemble performs leader election. |
ZK_MAX_CLIENT_CNXNS | integer | 60 | The maximum number of concurrent client connections that a server in the ensemble will accept. |
The ZK_CLIENT_PORT, ZK_ELECTION_PORT, and ZK_SERVERS_PORT must be set to the containerPorts specified in the container configuration, and the ZK_SERVER_PORT and ZK_ELECTION_PORT must match the Headless Service configuration. However, if the default values of the environment variables are used for both the containerPorts and the Headless Service, the environment variables may be omitted from the configuration.
Variable | Type | Default | Description |
---|---|---|---|
ZK_TICK_TIME | integer | 2000 | The number of wall clock ms that corresponds to a Tick for the ensembles internal time. |
ZK_INIT_LIMIT | integer | 5 | The number of Ticks that an ensemble member is allowed to perform leader election. |
ZK_SYNC_LIMIT | integer | 10 | The number of Tick by which a follower may lag behind the ensembles leader. |
Variable | Type | Default | Description |
---|---|---|---|
ZK_MIN_SESSION_TIMEOUT | integer | 2 * ZK_TICK_TIME | The minimum session timeout that the ensemble will allow a client to request. |
ZK_MAX_SESSION_TIMEOUT | integer | 20 * ZK_TICK_TIME | The maximum session timeout that the ensemble will allow a client to request. |
ZooKeeper does not, by default, purge old transactions logs or snapshots. This can cause the disk to become full. If you have backup procedures and retention policies that rely on external systems, the snapshots can be retrieved manually from the /var/lib/zookeeper/data directory, and the logs can be retrieved manually from the /var/lib/zookeeper/log directory. These will be stored on the persistent volume. The zkCleanup.sh script can be used to manually purge outdated logs and snapshots.
If you do not have an existing retention policy and backup procedure, and if you are comfortable with an automatic procedure, you can use the environment variables below to enable and configure automatic data purge policies.
Variable | Type | Default | Description |
---|---|---|---|
ZK_SNAP_RETAIN_COUNT | integer | 3 | The number of snapshots that the ZooKeeper process will retain if ZK_PURGE_INTERVAL is set to a value greater than 0. |
ZK_PURGE_INTERVAL | integer | 0 | The delay, in hours, between ZooKeeper log and snapshot cleanups. |
Currently the only supported JVM configuration is the JVM heap size. Be sure that the heap size you request does not cause the process to swap out.
Variable | Type | Default | Description |
---|---|---|---|
ZK_HEAP_SIZE | integer | 2 | The JVM heap size in Gibibytes. |
Variable | Type | Default | Description |
---|---|---|---|
ZK_LOG_LEVEL | enum(TRACE,DEBUG,INFO,WARN,ERROR,FATAL) | INFO | The Log Level that for the ZooKeeper processes logger. |
The zkOk.sh script can be used to check the liveness and readiness of ZooKeeper process. The example below demonstrates how to configure liveness and readiness probes for the Pods in the Stateful Set.
readinessProbe:
exec:
command:
- sh
- -c
- "zkOk.sh"
initialDelaySeconds: 15
timeoutSeconds: 5
livenessProbe:
exec:
command:
- sh
- -c
- "zkOk.sh"
initialDelaySeconds: 15
timeoutSeconds: 5
volumeMounts for the container should be defined as below.
volumeMounts:
- name: datadir
mountPath: /var/lib/zookeeper
Currently, the use of Persistent Volumes to provide durable, network attached storage is mandatory. If you use the provided image with emptyDirs, you will likely suffer a data loss. The example below demonstrates how to request a dynamically provisioned persistent volume of 20 GiB.
volumeClaimTemplates:
- metadata:
name: datadir
annotations:
volume.alpha.kubernetes.io/storage-class: anything
spec:
accessModes: [ "ReadWriteOnce" ]
resources:
requests:
storage: 20Gi
The Log Level configuration may be modified via the ZK_LOG_LEVEL environment variable as described above. However, the location of the log output is not modifiable. The ZooKeeper process must be run in the foreground, and the log information will be shipped to the stdout. This is considered to be a best practice for containerized applications, and it allows users to make use of the log rotation and retention infrastructure that already exists for K8s.
The zkMetrics script can be used to retrieve metrics from the ZooKeeper process and print them to stdout. A recurring Kubernetes job can be used to collect these metrics and provide them to a collector.
bash$ kubectl exec zk-0 zkMetrics.sh
zk_version 3.4.9-1757313, built on 08/23/2016 06:50 GMT
zk_avg_latency 0
zk_max_latency 0
zk_min_latency 0
zk_packets_received 21
zk_packets_sent 20
zk_num_alive_connections 1
zk_outstanding_requests 0
zk_server_state follower
zk_znode_count 4
zk_watch_count 0
zk_ephemerals_count 0
zk_approximate_data_size 27
zk_open_file_descriptor_count 39
zk_max_file_descriptor_count 1048576