Increase test coverage for recovery from dataloss #9960
We lack a similar cluster setup to test this.
We will start with a long-running E2E test, as this has the highest coverage and is the closest to the production use case. A second step would be to extend the randomized raft tests to also simulate data loss, as failures found there will most likely be much easier to diagnose and fix than in the E2E tests, reducing the effort we have to put in when bugs or edge cases are found. For the E2E test, we'll create a custom cluster plan in INT with 4 nodes; generations can be configured later to support different replication factors. To simulate data loss, we have to (in order):
1. Delete the PVC/PV.
2. Delete the pod.
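Concretely, the two steps could look like the following with kubectl. This is a hypothetical sketch: the pod name `zeebe-1` and the PVC name `data-zeebe-1` are assumptions based on typical StatefulSet volume claim naming, and will differ per deployment:

```sh
# Hypothetical sketch: simulate data loss for broker 1.
# Step 1: delete the PVC first; pvc-protection holds the actual removal back
# until the pod is gone, so return immediately instead of waiting.
kubectl delete pvc data-zeebe-1 --wait=false

# Step 2: delete the pod; the PVC/PV are now removed, and the StatefulSet
# recreates the pod with fresh storage.
kubectl delete pod zeebe-1
```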
Deleting the PVC/PV first marks them as to-be-deleted, but Kubernetes waits until the pod is deleted before actually removing them. This ensures that when the pod is recreated, a fresh PV/PVC is created. If that doesn't work, we could also look into the 1.23+ StatefulSet PVC retention policy feature. To control pod restarts, we can add a small init container which busy-loops based on the node ID extracted from the pod name. For example:

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: zeebe
spec:
  template:
    spec:
      initContainers:
        - name: busybox
          image: busybox:1.28
          # loop for nodes 0 and 1; add more conditions if you wish to loop for
          # more nodes; remember this is applied to all nodes in the stateful set
          command: ['sh', '-c', 'while [ "${K8S_NAME##*-}" -eq 1 ] || [ "${K8S_NAME##*-}" -eq 0 ]; do sleep 1; done']
          env:
            - name: K8S_NAME # we will extract the node ID from this
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
```

To control restarts, just patch the StatefulSet and edit the condition to loop for the node you want to prevent from fully starting. Make sure to disable the controller's reconciliation if you do this, of course. If the controller eventually supports adding arbitrary init containers, we could instead make that one dynamic (e.g. pull from some backend whether it needs to start or not), but at the moment I think the above will be the simplest.
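For illustration, patching the condition could be done from the command line as below. This is a hypothetical sketch, assuming the StatefulSet is named `zeebe` as in the manifest above and the busybox init container is the first in the list:

```sh
# Hypothetical sketch: rewrite the init container's command so that only
# node 2 is blocked from starting. Assumes StatefulSet "zeebe" and the
# busybox init container at index 0.
kubectl patch statefulset zeebe --type='json' -p='[
  {
    "op": "replace",
    "path": "/spec/template/spec/initContainers/0/command",
    "value": ["sh", "-c", "while [ \"${K8S_NAME##*-}\" -eq 2 ]; do sleep 1; done"]
  }
]'
```

After patching, deleting the targeted pod lets the StatefulSet recreate it with the updated condition.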
To enable the testing, we have added some supporting commands in zbchaos: camunda/zeebe-chaos#230. The next step is to automate the test using Testbench.
…loss and recover (#230)

Related to camunda/camunda#9960. This PR adds commands to support testing multi-region data loss and recovery. We have added commands for:

- `zbchaos prepare` — Prepares the cluster by adding an init container to the StatefulSet and adding a ConfigMap. With the help of the init container and the ConfigMap, we can control when a broker that suffered from data loss is restarted. The ConfigMap contains flags `block_{nodeId} = true|false` corresponding to each nodeId. These are available to the init container in the mounted volume via one file per flag, `/etc/config/block_{nodeId}`. The init container is blocked while the flag is set to true. When the ConfigMap is updated, this is reflected in the container eventually; there might be a delay, but eventually `/etc/config/block_{nodeId}` will have the updated value and the init container can break out of its loop.
- `zbchaos dataloss delete` — Deletes the broker and its data. Sets the flag in the ConfigMap to true to block the startup in the init container.
- `zbchaos dataloss recover` — Restarts a broker and waits until it has recovered its data. Resets the flag in the ConfigMap so that the init container can exit and the broker container can start. Also waits until the pods are ready, which is necessary to ensure that the broker has recovered all data.

`prepare` is added as a generic command, not as part of `dataloss`, because it can be used to apply other patches (e.g. a patch for enabling network permissions). We only have to run this command once per cluster and can repeat the tests without re-running `prepare`.

To test data loss and recovery, we want to set up a cluster with 4 brokers and replication factor 4. Nodes 0 and 1 belong to region 1, and nodes 2 and 3 belong to region 2. Assuming there is such a cluster in the given namespace, we can simulate the data loss and recovery by running the following commands:

1. `zbchaos prepare --namespace NAMESPACE` // Need to run only once in the namespace
2. `zbchaos dataloss delete --namespace NAMESPACE --nodeId 0`
3. `zbchaos dataloss delete --namespace NAMESPACE --nodeId 1`
4. `zbchaos dataloss recover --namespace NAMESPACE --nodeId 0` // Wait until one node is fully recovered before recovering the second one
5. `zbchaos dataloss recover --namespace NAMESPACE --nodeId 1`

The above commands simulate full data loss of region 1 and moving the failed pods from region 1 to region 2. You can then repeat steps 2-5 to simulate moving those pods back to region 1, or run steps 2-5 with nodes 2 and 3 to simulate data loss of region 2.

PS: It works for clusters in our benchmarks deployed via helm. It is not fully tested for SaaS yet.

PPS: This PR contains only the supporting commands. The test is not automated yet.
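To make the blocking mechanism concrete, here is a minimal sketch of the gate the init container runs, assuming the flag file layout described above (`/etc/config/block_{nodeId}`) and that the node ID is the pod-name suffix; the exact script zbchaos installs may differ:

```sh
# Minimal sketch of the init container gate, assuming the mounted ConfigMap
# exposes one file per node at /etc/config/block_{nodeId}.
NODE_ID="${K8S_NAME##*-}"                 # extract the node ID from the pod name
FLAG_FILE="/etc/config/block_${NODE_ID}"

# Block while this node's flag is "true"; the kubelet refreshes the mounted
# ConfigMap eventually, so the loop exits once the flag is reset to false.
while [ "$(cat "$FLAG_FILE" 2>/dev/null)" = "true" ]; do
  sleep 1
done
```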
A new cluster plan, "Multiregion test simulation", is available on INT and can be used by Testbench. It has 4 brokers, 4 partitions, and replication factor 4.
With https://github.com/camunda/zeebe-e2e-test/pull/209 we can now run an E2E test that simulates the deployment of a cluster in two regions and failover to the other region when one region fails with full data loss.
Everything required for running this test is ready. The next step is to automate running it regularly. My recommendation is to run it once a week, similar to the E2E tests, but for a shorter time (< 24 hrs). Before automating, it would be better to resolve the following issues:
In addition, we have also added a chaos test for production-s cluster types, which will be run daily once zbchaos is up.
This is now automated with zbchaos v1.0.0.
Data loss is not like other failure cases such as node restarts or network partitions. We can only recover automatically from data loss on a limited number of replicas.
In general, our raft implementation can recover from data loss if the loss happened on <= (replicationFactor - quorum) replicas of a single partition. That means if you have 3 replicas, we can handle data loss of up to 1 node.
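As a quick worked example of that bound (a sketch; it assumes the usual raft majority quorum of floor(n/2) + 1):

```sh
# Worked example of the recoverability bound, using the usual raft majority
# quorum = floor(replicationFactor / 2) + 1.
for REPLICATION_FACTOR in 3 4 5; do
  QUORUM=$(( REPLICATION_FACTOR / 2 + 1 ))
  TOLERABLE=$(( REPLICATION_FACTOR - QUORUM ))
  echo "replicas=$REPLICATION_FACTOR quorum=$QUORUM tolerable data loss=$TOLERABLE"
done
# replicas=3 quorum=2 tolerable data loss=1
# replicas=4 quorum=3 tolerable data loss=1
# replicas=5 quorum=3 tolerable data loss=2
```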
However, some customers have a special requirement: they have only two data centers available and must be able to recover from the complete loss of one data center. With replication factor 4 spread across two regions, losing a region means losing 2 of 4 replicas, more than the (replicationFactor - quorum) = 1 that raft can recover from on its own, so a dedicated recovery procedure is required. How to achieve this is outlined here. Since this is now a recommended procedure, we should also test it in our regular QA. In the past, some bugs were identified in this scenario; having a regular test would help us catch such bugs earlier.
In summary, we should test the following scenarios with data loss.
Verify that
We should add a chaos test or a long-running e2e+chaos test for these scenarios. The tests must run regularly.
To uncover unknown bugs, we should also
Related: #9820, support