Increase test coverage for recovery from dataloss #9960

Closed
deepthidevaki opened this issue Aug 3, 2022 · 7 comments · Fixed by #11523
Assignees: deepthidevaki
Labels: area/reliability, component/raft, kind/toil, version:8.2.0-alpha5, version:8.2.0

Comments

@deepthidevaki (Contributor) commented Aug 3, 2022

Data loss is not like other failure cases such as node restarts or network partitions. We can only recover from data loss on a limited number of replicas.

In general, our raft implementation can recover from data loss as long as it happened on at most (replicationFactor - quorum) replicas of a single partition. That means with 3 replicas we can handle data loss on up to 1 node.

However, some customers have a special requirement: they have only two data centers available and must be able to recover from the complete loss of one data center. How to achieve this is outlined here. Since this is now a recommended procedure, we should also test it in our regular QA. In the past, some bugs were identified in this scenario; a regular test would help us catch such bugs earlier.

In summary, we should test the following scenarios with data loss.

  1. A cluster with replication factor 3, data loss happens in a single node.
    • We have a raft test that verifies a raft node can join the cluster after data loss, but we don't have a test against the whole system.
  2. A cluster with replication factor 4 (simulating deployment in 2 data centers). Data loss happens in 2 nodes from the same data center.
    • 2 nodes are restarted (simulating moving the nodes to the healthy data center)
    • 2 nodes are restarted with data loss (simulating moving the nodes back to the original data center)

Verify that

  1. All nodes in the cluster are healthy
  2. All partitions have leaders
  3. There is no data loss (Instances that are started before the data loss should continue executing.)
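
As a rough manual spot check for these conditions (a sketch only: it assumes kubectl access to the target namespace and a port-forwarded gateway for zbctl, both of which depend on the concrete setup):

# all broker pods should be Running and Ready
kubectl get pods --namespace "$NAMESPACE"
# the topology should list a leader for every partition and report the brokers as healthy
zbctl status --insecure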

We should add a chaos test or a long-running e2e + chaos test for these scenarios. The tests must run regularly.
To uncover unknown bugs we should also:

  • Test with different workloads (high load, low load, no load)
  • Repeat the test multiple times, and verify that after the cluster has recovered from data loss it can handle another data loss.

Related: #9820, support

@deepthidevaki added the kind/toil and area/reliability labels on Aug 3, 2022
@deepthidevaki self-assigned this on Oct 14, 2022
@megglos (Contributor) commented Oct 14, 2022

We lack a similar cluster setup to test this.
We need to evaluate how to break this down/do a PoC.

@npepinpe (Member) commented Oct 27, 2022

We will start with a long-running E2E test, as this has the highest coverage and is the closest to the production use case. A second step would be to extend the randomized raft tests to also simulate data loss, as failures there will most likely be much easier to diagnose and fix than in the E2E tests, reducing the effort required when bugs or edge cases are found.

For the E2E test, we'll create a custom cluster plan in INT with 4 nodes - generations can be configured later to support different replication factors.

To simulate data loss, we have to (in order):

  1. Delete the PVC
  2. Delete the PV
  3. Delete the pod

Deleting the PVC/PV first only marks them as to-be-deleted; Kubernetes waits until the pod is deleted before actually removing them. This ensures that a fresh PV/PVC is created when the pod is recreated. We could also look into the Kubernetes 1.23+ feature around StatefulSet PVC retention policies if that doesn't work.
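
A minimal sketch of that sequence with kubectl, assuming broker 0 is the pod `zeebe-0` and its claim follows the usual `data-zeebe-0` naming (both names are assumptions about the concrete deployment):

# 1. mark the PVC for deletion; it stays around until the pod using it is gone
kubectl delete pvc data-zeebe-0 --wait=false
# 2. mark the bound PV for deletion as well
kubectl delete pv "$(kubectl get pvc data-zeebe-0 -o jsonpath='{.spec.volumeName}')" --wait=false
# 3. delete the pod; the PVC/PV are now removed and the StatefulSet recreates the pod with fresh storage
kubectl delete pod zeebe-0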

To control pod restarts, we can add a small init container which busy-loops whenever the node ID extracted from the pod name matches one of the blocked nodes. For example:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: zeebe
spec:
  template:
    spec:
      initContainers:
        - name: busybox
          image: busybox:1.28
          # loop for node 0 and 1; add more conditions if you wish to loop for more nodes; remember this is applied to all nodes in the statefulset
          command: ['sh', '-c', 'while [ "${K8S_NAME##*-}" -eq 1 ] || [ "${K8S_NAME##*-}" -eq 0 ]; do sleep 1; done']
          env:
            - name: K8S_NAME # we will extract the node ID from this
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name

To control restarts, just patch the StatefulSet and edit the condition to loop for the node you want to prevent from fully starting. Make sure to disable the controller's reconciliation if you do this, of course.
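
For example, a rough sketch assuming the StatefulSet and init container from the manifest above (the path index /command/2 points at the shell snippet, i.e. the third element of the command array):

# block only node 2 from starting by replacing the busy-loop condition
kubectl patch statefulset zeebe --type=json -p='[
  {"op": "replace",
   "path": "/spec/template/spec/initContainers/0/command/2",
   "value": "while [ \"${K8S_NAME##*-}\" -eq 2 ]; do sleep 1; done"}
]'
# depending on the update strategy, the affected pod may need to be recreated to pick up the new template
kubectl delete pod zeebe-2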

If the controller eventually supports adding arbitrary init containers, then we could instead make that one dynamic (e.g. pull from some backend whether it needs to start or not), but at the moment I think the above will be the simplest.

@deepthidevaki (Contributor, Author) commented Nov 14, 2022

To enable the testing, we have added some supporting commands to zbchaos: camunda/zeebe-chaos#230

The next step is to automate the test using Testbench:

  • Allow Testbench to create a cluster with clusterSize 4 and replicationFactor 4
  • Define a chaos experiment with dataloss and recovery and add it to the e2e test

deepthidevaki added a commit to camunda/zeebe-chaos that referenced this issue Nov 16, 2022
…loss and recover (#230)

Related to camunda/camunda#9960 

This PR adds commands to support testing multi-region dataloss and recovery. We have added commands for:
- `zbchaos prepare` prepares the cluster by adding an init container to the statefulset and adding a ConfigMap. With the help of the init container and the ConfigMap, we can control when a broker that suffered from dataloss is restarted. The ConfigMap contains flags `block_{nodeId} = true|false` corresponding to each nodeId. These are available to the init container in a mounted volume, via one file per flag at `/etc/config/block_{nodeId}`. The init container is blocked while its flag is set to true. When the ConfigMap is updated, the change is reflected in the container eventually; there might be a delay, but eventually `/etc/config/block_{nodeId}` will have the updated value and the init container can break out of its loop (a minimal sketch of such a gate loop follows this list).
- `zbchaos dataloss delete` deletes the broker and its data, and sets the flag in the ConfigMap to true so that the init container blocks the broker's startup.
- `zbchaos dataloss recover` restarts a broker and waits until it has recovered the data. It resets the flag in the ConfigMap so that the init container can exit and the broker container can start. It also waits until the pods are ready, which is necessary to ensure that the broker has recovered all data.
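
For illustration, the gate inside the init container could look roughly like the following sketch; the flag file path matches the description above, while the variable names and sleep interval are assumptions:

# init container entrypoint: block while the flag file for this node reads "true"
NODE_ID="${K8S_NAME##*-}"   # e.g. pod zeebe-1 -> node ID 1
while [ "$(cat /etc/config/block_${NODE_ID} 2>/dev/null)" = "true" ]; do
  sleep 5   # ConfigMap updates eventually show up in the mounted file
done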

`prepare` is added as a generic command, not as part of `dataloss`, because it can be used to apply other patches (e.g. a patch for enabling network permissions). We have to run this command only once per cluster and can then repeat the tests without re-running `prepare`.

To test the dataloss and recovery, we want to set up a cluster with 4 brokers and replication factor 4. Nodes 0 and 1 belong to region 1 and nodes 2 and 3 belong to region 2. Assuming there is such a cluster in the given namespace, we can simulate the data loss and recovery by running the following commands:
1. zbchaos dataloss prepare --namespace NAMESPACE // Needs to run only once in the namespace
2. zbchaos dataloss delete --namespace NAMESPACE --nodeId 0
3. zbchaos dataloss delete --namespace NAMESPACE --nodeId 1
4. zbchaos dataloss recover --namespace NAMESPACE --nodeId 0 // Wait until one node is fully recovered before recovering the second one
5. zbchaos dataloss recover --namespace NAMESPACE --nodeId 1

The above commands simulate full data loss of region 1 and moving the failed pods from region 1 to region 2. You can then repeat steps 2-5 to simulate moving those pods back to region 1, or run steps 2-5 with nodes 2 and 3 to simulate data loss of region 2.

PS: This works for clusters in our benchmarks deployed via Helm. It is not fully tested for SaaS yet.

PPS: This PR contains only the supporting commands. The test is not automated yet.
@deepthidevaki (Contributor, Author)

A new cluster plan "Multiregion test simulation" is available on INT which can be used by Testbench. It has 4 brokers, 4 partitions, and replication factor 4.

@deepthidevaki (Contributor, Author)

With https://github.com/camunda/zeebe-e2e-test/pull/209 we can now run an e2e test that simulates the deployment of a cluster in two regions and failover to the other region when one region fails with full data loss.

@deepthidevaki (Contributor, Author)

Everything required for running this test is ready. The next step is to automate running it regularly. My recommendation is to run it once a week, similar to the e2e tests, but for a shorter time (< 24 hrs).

Before automating, it would be better to resolve the following issues:

  1. Issues with zbchaos: "Zbchaos worker has concurrency issues" zeebe-chaos#288
  2. "Streamline Testbench based workflows" #11021

In addition, we have also added a chaos test for production-s cluster types, which will run daily once zbchaos is up.

@ChrisKujawa (Member) commented Jan 3, 2023

This is now automated with Zbchaos v1.0.0

@ghost closed this as completed in 87cc50e on Feb 2, 2023
@deepthidevaki added the version:8.2.0-alpha5 label on Mar 7, 2023
@npepinpe added the version:8.2.0 label on Apr 5, 2023