
[BUG] Restart of first cluster member can lead to quorum loss while scaling up a non-HA etcd cluster. #646

Closed
Tracked by #642
ishan16696 opened this issue Jul 14, 2023 · 1 comment · Fixed by #649
Labels: kind/bug (Bug), status/closed (Issue is closed, either delivered or triaged)

Comments


ishan16696 commented Jul 14, 2023

Describe the bug:
It has been observed that, while scaling up a non-HA etcd cluster to an HA etcd cluster (1 -> 3 replicas), a restart of etcd's first member pod (for whatever reason) can lead to permanent quorum loss. This happens when the second etcd cluster member has already joined the cluster successfully but the third member has not joined or has not even started yet.
At that point the first and third members are down at the same time, which leads to permanent quorum loss.
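For illustration only (this arithmetic is not part of the original report, and it assumes the second member has already been promoted from learner to voting member): the cluster then has two voting members and quorum requires both of them, so with the first member down and the third not yet joined, quorum is lost. A minimal Go sketch of the quorum arithmetic:

package main

import "fmt"

// quorum returns the number of voting members that must be healthy for an
// etcd cluster with the given number of voting members to make progress
// (a strict majority).
func quorum(votingMembers int) int {
	return votingMembers/2 + 1
}

func main() {
	// After the second member is promoted, the cluster has 2 voting members,
	// so quorum is 2: both members must be up.
	fmt.Println(quorum(2)) // prints 2
	// In the scenario above only the second member is healthy (1 < 2),
	// so the cluster cannot reach quorum until another member rejoins.
}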

etcd CR status:

  status:
    clusterSize: 1
    conditions:
    - lastTransitionTime: "2023-07-11T10:16:09Z"
      lastUpdateTime: "2023-07-11T11:12:40Z"
      message: At least one member is not ready
      reason: NotAllMembersReady
      status: "False"
      type: AllMembersReady
    - lastTransitionTime: "2023-07-11T10:29:39Z"
      lastUpdateTime: "2023-07-11T11:12:40Z"
      message: Stale snapshot leases. Not renewed in a long time
      reason: BackupFailed
      status: "False"
      type: BackupReady
    - lastTransitionTime: "2023-07-11T10:16:09Z"
      lastUpdateTime: "2023-07-11T11:12:40Z"
      message: The majority of ETCD members is not ready
      reason: QuorumLost
      status: "False"
      type: Ready
    currentReplicas: 1
    etcd:
      apiVersion: apps/v1
      kind: StatefulSet
      name: etcd-main
    members:
    - id: bd585ad8a06f8cfb
      lastTransitionTime: "2023-07-11T10:16:09Z"
      name: etcd-main-0
      reason: UnknownGracePeriodExceeded
      role: Leader
      status: NotReady
    observedGeneration: 1
    ready: false
    replicas: 1
    serviceName: etcd-main-client
    updatedReplicas: 2

Expected behaviour:
We cannot control the restart of an etcd member pod, since a restart can be caused by an infrastructure issue or some other side effect, but we can prevent it from resulting in permanent quorum loss.

Logs:
backup-restore logs of etcd-main-0 pod:

2023-07-11T11:10:59.231764125Z stderr F time="2023-07-11T11:10:59Z" level=error msg="Failed to connect to etcd KV client: context deadline exceeded" actor=member-add
2023-07-11T11:10:59.231770196Z stderr F time="2023-07-11T11:10:59Z" level=error msg="unable to check presence of member in cluster: context deadline exceeded" actor=member-add
2023-07-11T11:10:59.235463591Z stderr F time="2023-07-11T11:10:59Z" level=info msg="Etcd cluster scale-up is detected" actor=initializer
.
.
.
2023-07-11T11:12:33.848388409Z stderr F time="2023-07-11T11:12:33Z" level=fatal msg="unable to add a learner in a cluster: error while adding member as a learner: context deadline exceeded" actor=initializer

Environment (please complete the following information):

  • Etcd version/commit ID :
  • Etcd-backup-restore version/commit ID:
  • Cloud Provider [All/AWS/GCS/ABS/Swift/OSS]:

Anything else we need to know?:
Failed Prow jobs:

  1. https://gcsweb.gardener.cloud/gcs/gardener-prow/logs/ci-gardener-e2e-kind-ha-multi-zone-upgrade/1678703053483544576/artifacts/
  2. https://gcsweb.gardener.cloud/gcs/gardener-prow/logs/ci-gardener-e2e-kind-ha-single-zone-upgrade-release-v1-74/1679656031631708160/artifacts/
  3. https://gcsweb.gardener.cloud/gcs/gardener-prow/logs/ci-gardener-e2e-kind-ha-single-zone-upgrade-release-v1-75/1679699568997961728/artifacts/
ishan16696 added the kind/bug label on Jul 14, 2023

ishan16696 commented Jul 14, 2023

RCA:

  • It happened because the scale-up annotation gardener.cloud/scaled-to-multi-node was present on the etcd StatefulSet. It was added by etcd-druid when the cluster was marked for scale-up, and from etcd-druid's perspective the scale-up has not completed yet, so it will not remove the annotation until the scale-up is complete.
  • IMO, the problem occurs because the restarted first etcd cluster member also detects a scale-up (as shown by the logs) due to the gardener.cloud/scaled-to-multi-node annotation on the etcd StatefulSet. This feeds the first cluster member the false information that it, too, should be added as a learner, so it takes the wrong code path; see the sketch below.
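For illustration only (this is not the actual etcd-backup-restore code; the helper and variable names are hypothetical): a minimal sketch of an annotation-based scale-up check against the owning StatefulSet. Note that such a check by itself cannot tell a newly added member apart from the original first member that merely restarted, which is the root cause described above.

package scaleup

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// scaleUpAnnotationKey is the annotation etcd-druid keeps on the etcd
// StatefulSet while a 1 -> 3 scale-up is in progress.
const scaleUpAnnotationKey = "gardener.cloud/scaled-to-multi-node"

// isScaleUpInProgress reports whether the owning StatefulSet still carries
// the scale-up annotation.
func isScaleUpInProgress(ctx context.Context, cs kubernetes.Interface, namespace, stsName string) (bool, error) {
	sts, err := cs.AppsV1().StatefulSets(namespace).Get(ctx, stsName, metav1.GetOptions{})
	if err != nil {
		return false, err
	}
	_, found := sts.Annotations[scaleUpAnnotationKey]
	return found, nil
}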

Proposed solution:

  • IMO, etcd's first cluster member can never be part of a scale-up scenario, so there is no reason for backup-restore to run the scale-up detection checks for it. Therefore, when the first etcd cluster member restarts, backup-restore can simply skip the scale-up detection checks and move on to data-dir validation and the remaining checks (a sketch follows the screenshot below).
  • By doing this, the first cluster member only undergoes a restart and avoids taking the wrong code path, which turns the issue from permanent quorum loss into transient quorum loss 😄
Screenshot 2023-07-14 at 4 35 45 PM
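A minimal sketch of the proposed skip (hypothetical names, not the actual implementation), assuming the member's ordinal can be derived from its StatefulSet pod name such as etcd-main-0:

package scaleup

import (
	"strconv"
	"strings"
)

// memberOrdinal extracts the StatefulSet ordinal from a pod name such as
// "etcd-main-0".
func memberOrdinal(podName string) (int, error) {
	idx := strings.LastIndex(podName, "-")
	return strconv.Atoi(podName[idx+1:])
}

// shouldCheckScaleUp encodes the proposed fix: the first cluster member
// (ordinal 0) can never be the member being added by a scale-up, so on a
// restart it skips scale-up detection entirely and proceeds directly to
// data-dir validation and the remaining checks.
func shouldCheckScaleUp(podName string) (bool, error) {
	ordinal, err := memberOrdinal(podName)
	if err != nil {
		return false, err
	}
	return ordinal != 0, nil
}

With this, a restarted etcd-main-0 never re-enters the learner-addition path, so the failure degrades from permanent to transient quorum loss, as noted above.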
