This project simulates network partitions within a Kafka cluster and observes how the cluster behaves. The goal is to evaluate the durability guarantees Kafka provides over an unreliable network. The scenarios emulate a Kafka stretch cluster spanning two datacenters.
- Kafka can operate properly without ZooKeeper as long as there is no need to change the ISR
- ISR updates fail without ZooKeeper, so a producer using acks=all would be blocked by out-of-sync replicas
- It's possible to have multiple leaders for a single partition during a network partition, but:
  - Only one leader can accept writes with acks=all
  - Using acks=1 can lead to acknowledged messages being truncated
- To ensure durability of messages:
  - acks must be set to all, otherwise you might write to a stale leader
  - min.insync.replicas must be set so that data is replicated to every desired physical location
- If the ZooKeeper quorum is rebuilt, there is no guarantee that the new quorum has all the latest data:
  - this can result in data loss or in messages being duplicated across leaders
  - it's probably better to rely on a hierarchical quorum to avoid those issues
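A hedged sketch of the settings implied by the durability findings above; the replica counts are illustrative for a 2-DC stretch cluster, not a recommendation from this project:

```properties
# producer configuration: wait for all in-sync replicas before
# acknowledging a write
acks=all

# topic / broker configuration -- with 4 replicas spread over two DCs,
# min.insync.replicas=3 forces at least one acknowledged copy in the
# remote DC before a write succeeds (replica counts illustrative)
min.insync.replicas=3
```

Note that acks=all alone is not enough: without a matching min.insync.replicas, a leader whose ISR has shrunk to itself can still acknowledge writes.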
3 Kafka brokers (kafka-1, kafka-2 and kafka-3) and one ZooKeeper node.
The current leader is on kafka-1; kafka-1 then blocks all incoming traffic from kafka-2, kafka-3 and ZooKeeper.
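The truncation risk from the findings above can be illustrated with a toy model of this scenario (plain Python, not real Kafka; broker names follow the setup, message contents are invented):

```python
# Toy model of how acks=1 can lose data when the leader is partitioned
# away: the isolated leader keeps acknowledging writes that are never
# replicated, and truncates them when it rejoins as a follower.

class Broker:
    def __init__(self, name):
        self.name = name
        self.log = []  # replicated message log

def replicate(leader, followers):
    """Copy the leader's log to each in-sync follower."""
    for f in followers:
        f.log = list(leader.log)

kafka1, kafka2, kafka3 = Broker("kafka-1"), Broker("kafka-2"), Broker("kafka-3")

# Healthy cluster: kafka-1 is leader and all replicas are in sync.
kafka1.log.append("m1")
replicate(kafka1, [kafka2, kafka3])

# kafka-1 is partitioned away. With acks=1 it still acknowledges
# writes even though nothing is replicated.
kafka1.log.append("m2-acked-but-unreplicated")

# On the other side, kafka-2 is elected leader and producers fail over.
kafka2.log.append("m3")
replicate(kafka2, [kafka3])

# When the partition heals, kafka-1 rejoins as a follower and truncates
# its log to match the new leader: the acks=1 write is gone.
kafka1.log = list(kafka2.log)

print(kafka1.log)  # ['m1', 'm3']
```

With acks=all and a proper min.insync.replicas, the isolated kafka-1 could never have acknowledged the lost write in the first place.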
4 Kafka brokers (kafka-1, kafka-2, kafka-3 and kafka-4) and one ZooKeeper node.
The current leader is on kafka-1; a network partition is then simulated:
- on one side kafka-1, kafka-2 and kafka-3
- on another side kafka-4 and zookeeper
4 Kafka brokers (kafka-1, kafka-2, kafka-3 and kafka-4) and 3 ZooKeeper nodes.
For some reason, you decide to rebuild the ZooKeeper quorum (e.g. you lost a rack or a DC).
There is no guarantee, after rebuilding a quorum, that the nodes have all the required information.
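The hierarchical quorum suggested in the findings above avoids the rebuild entirely; a rough sketch of what it could look like in zoo.cfg (server names, ports and grouping are all illustrative, and the syntax is an assumption based on ZooKeeper's hierarchical quorum support):

```properties
# zoo.cfg (sketch) -- hierarchical quorum over two DCs
server.1=zk-1:2888:3888
server.2=zk-2:2888:3888
server.3=zk-3:2888:3888
# group.N lists the server ids that form group N
group.1=1:2
group.2=3
# every server needs a weight when groups are defined
weight.1=1
weight.2=1
weight.3=1
```

With groups defined, a quorum requires a majority of votes from a majority of groups, rather than a bare majority of servers.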
4 Kafka brokers (kafka-1, kafka-2, kafka-3 and kafka-4) and 3 ZooKeeper nodes.
A complete network outage between every component is simulated.
When the network comes back, the quorum is re-formed and the cluster is healthy.
Network setup:
- DC-A: kafka-1, kafka-2, ZK-1, ZK-2
- DC-B: kafka-3, kafka-4, ZK-3
We simulate a DC network split.
When the network comes back, the quorum is re-formed and the cluster is healthy.
Network setup:
- DC-A: ZK-1, Kafka-1
- DC-B: ZK-2, Kafka-2
- DC-C: ZK-3, Kafka-3
We simulate the following connectivity loss:
- Kafka-1 --> X Kafka-3
- Kafka-2 --> X Kafka-3
All other connections are still up. All partitions where Kafka-3 is the leader become unavailable. If we stop Kafka-3, they remain unavailable, because unclean leader election is not enabled and Kafka-3 is the only broker in the ISR.
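The unavailability in this last scenario follows from the setting below, shown at its default value (false has been the Kafka default since 0.11, which is what this scenario assumes):

```properties
# broker or topic-level setting; when false, only replicas still in the
# ISR can be elected leader, so partitions stay offline rather than
# accept an out-of-sync leader and lose acknowledged data
unclean.leader.election.enable=false
```

Setting it to true would restore availability in this scenario at the cost of truncating any messages only Kafka-3 holds.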