Increase test coverage for recovery from dataloss #9960

Closed
deepthidevaki opened this issue Aug 3, 2022 · 7 comments · Fixed by #11523
Assignees: deepthidevaki
Labels: area/reliability, component/raft, kind/toil, version:8.2.0-alpha5, version:8.2.0

Comments

@deepthidevaki (Contributor) commented Aug 3, 2022

Data loss is not like other failure cases such as node restarts or network partitions. We can only recover from data loss on a limited number of replicas.

In general, our raft implementation can recover from data loss as long as it happened on at most (replicationFactor - quorum) replicas of a single partition. That means with 3 replicas we can handle data loss on up to 1 node.

However, some customers have a special requirement: they have only two data centers available and must be able to recover from the complete loss of one data center. How to achieve this is outlined here. Since this is now a recommended procedure, we should also test it in our regular QA. In the past, some bugs were identified in this scenario; a regular test would help us catch such bugs earlier.

In summary, we should test the following scenarios with data loss.

  1. A cluster with replication factor 3, data loss happens in a single node.
    • We have a raft test that verifies a raft node can join the cluster after data loss, but we don't have a test against the whole system.
  2. A cluster with replication factor 4 (simulating deployment in 2 data centers). Data loss happens in 2 nodes from the same data center.
    • 2 nodes are restarted (simulating moving the nodes to the healthy data center)
    • 2 nodes are restarted with data loss (simulating moving the nodes back to the original data center)

Verify that

  1. All nodes in the cluster are healthy
  2. All partitions have leaders
  3. There is no data loss (Instances that are started before the data loss should continue executing.)
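
As a rough manual spot check for these conditions (a sketch only: it assumes kubectl access to the target namespace and a port-forwarded gateway for zbctl, both of which depend on the concrete setup):

# all broker pods should be Running and Ready
kubectl get pods --namespace "$NAMESPACE"
# the topology should list a leader for every partition and report the brokers as healthy
zbctl status --insecure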

We should add a chaos test or a long-running e2e + chaos test for these scenarios. The tests must run regularly.
To uncover unknown bugs we should also:

  • Test with different workloads (high load, low load, no load)
  • Repeat the test multiple times, and verify that after the cluster has recovered from data loss it can handle another data loss.

Related: #9820, support

@deepthidevaki added the kind/toil and area/reliability labels on Aug 3, 2022
@deepthidevaki self-assigned this on Oct 14, 2022
@megglos (Contributor) commented Oct 14, 2022

We lack a similar cluster setup to test this.
We need to evaluate how to break this down/do a PoC.

@npepinpe (Member) commented Oct 27, 2022

We will start with a long-running E2E test, as this has the highest coverage and is the closest to the production use case. A second step would be to extend the randomized raft tests to also simulate data loss, as failures there will most likely be much easier to diagnose and fix than in the E2E tests, reducing the effort required when bugs or edge cases are found.

For the E2E test, we'll create a custom cluster plan in INT with 4 nodes - generations can be configured later to support different replication factors.

To simulate data loss, we have to (in order):

  1. Delete the PVC
  2. Delete the PV
  3. Delete the pod

Deleting the PVC/PV first only marks them as to-be-deleted; Kubernetes waits until the pod is deleted before actually removing them. This ensures that a fresh PV/PVC is created when the pod is recreated. We could also look into the Kubernetes 1.23+ feature around StatefulSet PVC retention policies if that doesn't work.
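
A minimal sketch of that sequence with kubectl, assuming broker 0 is the pod `zeebe-0` and its claim follows the usual `data-zeebe-0` naming (both names are assumptions about the concrete deployment):

# 1. mark the PVC for deletion; it stays around until the pod using it is gone
kubectl delete pvc data-zeebe-0 --wait=false
# 2. mark the bound PV for deletion as well
kubectl delete pv "$(kubectl get pvc data-zeebe-0 -o jsonpath='{.spec.volumeName}')" --wait=false
# 3. delete the pod; the PVC/PV are now removed and the StatefulSet recreates the pod with fresh storage
kubectl delete pod zeebe-0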

To control pod restarts, we can add a small init container which busy-loops whenever the node ID extracted from the pod name matches one of the blocked nodes. For example:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: zeebe
spec:
  template:
    spec:
      initContainers:
        - name: busybox
          image: busybox:1.28
          # loop for node 0 and 1; add more conditions if you wish to loop for more nodes; remember this is applied to all nodes in the statefulset
          command: ['sh', '-c', 'while [ "${K8S_NAME##*-}" -eq 1 ] || [ "${K8S_NAME##*-}" -eq 0 ]; do sleep 1; done']
          env:
            - name: K8S_NAME # we will extract the node ID from this
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name

To control restarts, just patch the StatefulSet and edit the condition to loop for the node you want to prevent from fully starting. Make sure to disable the controller's reconciliation if you do this, of course.
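
For example, a rough sketch assuming the StatefulSet and init container from the manifest above (the path index /command/2 points at the shell snippet, i.e. the third element of the command array):

# block only node 2 from starting by replacing the busy-loop condition
kubectl patch statefulset zeebe --type=json -p='[
  {"op": "replace",
   "path": "/spec/template/spec/initContainers/0/command/2",
   "value": "while [ \"${K8S_NAME##*-}\" -eq 2 ]; do sleep 1; done"}
]'
# depending on the update strategy, the affected pod may need to be recreated to pick up the new template
kubectl delete pod zeebe-2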

If the controller eventually supports adding arbitrary init containers, then we could instead make that one dynamic (e.g. pull from some backend whether it needs to start or not), but at the moment I think the above will be the simplest.

@deepthidevaki (Contributor, Author) commented Nov 14, 2022

To enable the testing, we have added some supporting commands to zbchaos: camunda/zeebe-chaos#230

The next step is to automate the test using Testbench:

  • Allow Testbench to create a cluster with clusterSize 4 and replicationFactor 4
  • Define a chaos experiment with dataloss and recovery and add it to the e2e test

deepthidevaki added a commit to camunda/zeebe-chaos that referenced this issue Nov 16, 2022
…loss and recover (#230)

Related to camunda/camunda#9960 

This PR adds commands to support testing multi-region dataloss and recovery. We have added commands for:
- `zbchaos prepare` prepares the cluster by adding an init container to the statefulset and adding a ConfigMap. With the help of the init container and the ConfigMap, we can control when a broker that suffered from dataloss is restarted. The ConfigMap contains flags `block_{nodeId} = true|false` corresponding to each nodeId. These are available to the init container in a mounted volume, via one file per flag at `/etc/config/block_{nodeId}`. The init container is blocked while its flag is set to true. When the ConfigMap is updated, the change is reflected in the container eventually; there might be a delay, but eventually `/etc/config/block_{nodeId}` will have the updated value and the init container can break out of its loop (a minimal sketch of such a gate loop follows this list).
- `zbchaos dataloss delete` deletes the broker and its data, and sets the flag in the ConfigMap to true so that the init container blocks the broker's startup.
- `zbchaos dataloss recover` restarts a broker and waits until it has recovered the data. It resets the flag in the ConfigMap so that the init container can exit and the broker container can start. It also waits until the pods are ready, which is necessary to ensure that the broker has recovered all data.
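
For illustration, the gate inside the init container could look roughly like the following sketch; the flag file path matches the description above, while the variable names and sleep interval are assumptions:

# init container entrypoint: block while the flag file for this node reads "true"
NODE_ID="${K8S_NAME##*-}"   # e.g. pod zeebe-1 -> node ID 1
while [ "$(cat /etc/config/block_${NODE_ID} 2>/dev/null)" = "true" ]; do
  sleep 5   # ConfigMap updates eventually show up in the mounted file
done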

`prepare` is added as a generic command, not as part of `dataloss`, because it can be used to apply other patches (e.g. a patch for enabling network permissions). We have to run this command only once per cluster and can then repeat the tests without re-running `prepare`.

To test the dataloss and recovery, we want to set up a cluster with 4 brokers and replication factor 4. Nodes 0 and 1 belong to region 1 and nodes 2 and 3 belong to region 2. Assuming there is such a cluster in the given namespace, we can simulate the data loss and recovery by running the following commands:
1. zbchaos dataloss prepare --namespace NAMESPACE // Needs to run only once in the namespace
2. zbchaos dataloss delete --namespace NAMESPACE --nodeId 0
3. zbchaos dataloss delete --namespace NAMESPACE --nodeId 1
4. zbchaos dataloss recover --namespace NAMESPACE --nodeId 0 // Wait until one node is fully recovered before recovering the second one
5. zbchaos dataloss recover --namespace NAMESPACE --nodeId 1

The above commands simulate full data loss of region 1 and moving the failed pods from region 1 to region 2. You can then repeat steps 2-5 to simulate moving those pods back to region 1, or run steps 2-5 with nodes 2 and 3 to simulate data loss of region 2.

PS: This works for clusters in our benchmarks deployed via Helm. It is not fully tested for SaaS yet.

PPS: This PR contains only the supporting commands. The test is not automated yet.
@deepthidevaki (Contributor, Author)

A new cluster plan "Multiregion test simulation" is available on INT which can be used by Testbench. It has 4 brokers, 4 partitions, and replication factor 4.

@deepthidevaki (Contributor, Author)

With https://github.com/camunda/zeebe-e2e-test/pull/209 we can now run an e2e test that simulates the deployment of a cluster in two regions and failover to the other region when one region fails with full data loss.

@deepthidevaki (Contributor, Author)

Everything required for running this test is ready. The next step is to automate running it regularly. My recommendation is to run it once a week, similar to the e2e tests, but for a shorter time (< 24 hrs).

Before automating, it would be better to resolve the following issues:

  1. Issues with zbchaos: "Zbchaos worker has concurrency issues" zeebe-chaos#288
  2. "Streamline Testbench based workflows" #11021

In addition, we have also added a chaos test for production-s cluster types, which will run daily once zbchaos is up.

@ChrisKujawa (Member) commented Jan 3, 2023

This is now automated with Zbchaos v1.0.0

@ghost closed this as completed in 87cc50e on Feb 2, 2023
@deepthidevaki added the version:8.2.0-alpha5 label on Mar 7, 2023
@npepinpe added the version:8.2.0 label on Apr 5, 2023