
[BUG] Restart of first cluster member can lead to quorum loss while scaling up a non-HA etcd cluster. #646

Closed
Tracked by #642
ishan16696 opened this issue Jul 14, 2023 · 1 comment · Fixed by #649
Labels: kind/bug (Bug), status/closed (Issue is closed, either delivered or triaged)

Comments


ishan16696 commented Jul 14, 2023

Describe the bug:
It has been observed that, while scaling up a non-HA etcd cluster to an HA etcd cluster (1 -> 3 replicas), a restart of etcd's first member pod (for whatever reason) can lead to permanent quorum loss. This happens when the second etcd cluster member has already joined the cluster successfully but the third member has not joined or has not even started yet.
At that point the first and third members are down at the same time, which leads to permanent quorum loss.
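For illustration only (this arithmetic is not part of the original report, and it assumes the second member has already been promoted from learner to voting member): the cluster then has two voting members and quorum requires both of them, so with the first member down and the third not yet joined, quorum is lost. A minimal Go sketch of the quorum arithmetic:

package main

import "fmt"

// quorum returns the number of voting members that must be healthy for an
// etcd cluster with the given number of voting members to make progress
// (a strict majority).
func quorum(votingMembers int) int {
	return votingMembers/2 + 1
}

func main() {
	// After the second member is promoted, the cluster has 2 voting members,
	// so quorum is 2: both members must be up.
	fmt.Println(quorum(2)) // prints 2
	// In the scenario above only the second member is healthy (1 < 2),
	// so the cluster cannot reach quorum until another member rejoins.
}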

etcd CR status:

  status:
    clusterSize: 1
    conditions:
    - lastTransitionTime: "2023-07-11T10:16:09Z"
      lastUpdateTime: "2023-07-11T11:12:40Z"
      message: At least one member is not ready
      reason: NotAllMembersReady
      status: "False"
      type: AllMembersReady
    - lastTransitionTime: "2023-07-11T10:29:39Z"
      lastUpdateTime: "2023-07-11T11:12:40Z"
      message: Stale snapshot leases. Not renewed in a long time
      reason: BackupFailed
      status: "False"
      type: BackupReady
    - lastTransitionTime: "2023-07-11T10:16:09Z"
      lastUpdateTime: "2023-07-11T11:12:40Z"
      message: The majority of ETCD members is not ready
      reason: QuorumLost
      status: "False"
      type: Ready
    currentReplicas: 1
    etcd:
      apiVersion: apps/v1
      kind: StatefulSet
      name: etcd-main
    members:
    - id: bd585ad8a06f8cfb
      lastTransitionTime: "2023-07-11T10:16:09Z"
      name: etcd-main-0
      reason: UnknownGracePeriodExceeded
      role: Leader
      status: NotReady
    observedGeneration: 1
    ready: false
    replicas: 1
    serviceName: etcd-main-client
    updatedReplicas: 2

Expected behaviour:
We cannot control the restart of an etcd member pod, since a restart can be caused by an infrastructure issue or some other side effect, but we can prevent it from resulting in permanent quorum loss.

Logs:
backup-restore logs of etcd-main-0 pod:

2023-07-11T11:10:59.231764125Z stderr F time="2023-07-11T11:10:59Z" level=error msg="Failed to connect to etcd KV client: context deadline exceeded" actor=member-add
2023-07-11T11:10:59.231770196Z stderr F time="2023-07-11T11:10:59Z" level=error msg="unable to check presence of member in cluster: context deadline exceeded" actor=member-add
2023-07-11T11:10:59.235463591Z stderr F time="2023-07-11T11:10:59Z" level=info msg="Etcd cluster scale-up is detected" actor=initializer
.
.
.
2023-07-11T11:12:33.848388409Z stderr F time="2023-07-11T11:12:33Z" level=fatal msg="unable to add a learner in a cluster: error while adding member as a learner: context deadline exceeded" actor=initializer

Environment (please complete the following information):

  • Etcd version/commit ID :
  • Etcd-backup-restore version/commit ID:
  • Cloud Provider [All/AWS/GCS/ABS/Swift/OSS]:

Anything else we need to know?:
Failed Prow jobs:

  1. https://gcsweb.gardener.cloud/gcs/gardener-prow/logs/ci-gardener-e2e-kind-ha-multi-zone-upgrade/1678703053483544576/artifacts/
  2. https://gcsweb.gardener.cloud/gcs/gardener-prow/logs/ci-gardener-e2e-kind-ha-single-zone-upgrade-release-v1-74/1679656031631708160/artifacts/
  3. https://gcsweb.gardener.cloud/gcs/gardener-prow/logs/ci-gardener-e2e-kind-ha-single-zone-upgrade-release-v1-75/1679699568997961728/artifacts/
ishan16696 added the kind/bug label on Jul 14, 2023

ishan16696 commented Jul 14, 2023

RCA:

  • It happened because the scale-up annotation gardener.cloud/scaled-to-multi-node was present on the etcd StatefulSet. It was added by etcd-druid when the cluster was marked for scale-up, and from etcd-druid's perspective the scale-up has not completed yet, so it will not remove the annotation until the scale-up is complete.
  • IMO, the problem occurs because the restarted first etcd cluster member also detects a scale-up (as shown by the logs) due to the gardener.cloud/scaled-to-multi-node annotation on the etcd StatefulSet. This feeds the first cluster member the false information that it, too, should be added as a learner, so it takes the wrong code path; see the sketch below.
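For illustration only (this is not the actual etcd-backup-restore code; the helper and variable names are hypothetical): a minimal sketch of an annotation-based scale-up check against the owning StatefulSet. Note that such a check by itself cannot tell a newly added member apart from the original first member that merely restarted, which is the root cause described above.

package scaleup

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// scaleUpAnnotationKey is the annotation etcd-druid keeps on the etcd
// StatefulSet while a 1 -> 3 scale-up is in progress.
const scaleUpAnnotationKey = "gardener.cloud/scaled-to-multi-node"

// isScaleUpInProgress reports whether the owning StatefulSet still carries
// the scale-up annotation.
func isScaleUpInProgress(ctx context.Context, cs kubernetes.Interface, namespace, stsName string) (bool, error) {
	sts, err := cs.AppsV1().StatefulSets(namespace).Get(ctx, stsName, metav1.GetOptions{})
	if err != nil {
		return false, err
	}
	_, found := sts.Annotations[scaleUpAnnotationKey]
	return found, nil
}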

Proposed solution:

  • IMO, etcd's first cluster member can never be part of a scale-up scenario, so there is no reason for backup-restore to run the scale-up detection checks for it. Therefore, when the first etcd cluster member restarts, backup-restore can simply skip the scale-up detection checks and move on to data-dir validation and the remaining checks (a sketch follows the screenshot below).
  • By doing this, the first cluster member only undergoes a restart and avoids taking the wrong code path, which turns the issue from permanent quorum loss into transient quorum loss 😄
Screenshot 2023-07-14 at 4 35 45 PM
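A minimal sketch of the proposed skip (hypothetical names, not the actual implementation), assuming the member's ordinal can be derived from its StatefulSet pod name such as etcd-main-0:

package scaleup

import (
	"strconv"
	"strings"
)

// memberOrdinal extracts the StatefulSet ordinal from a pod name such as
// "etcd-main-0".
func memberOrdinal(podName string) (int, error) {
	idx := strings.LastIndex(podName, "-")
	return strconv.Atoi(podName[idx+1:])
}

// shouldCheckScaleUp encodes the proposed fix: the first cluster member
// (ordinal 0) can never be the member being added by a scale-up, so on a
// restart it skips scale-up detection entirely and proceeds directly to
// data-dir validation and the remaining checks.
func shouldCheckScaleUp(podName string) (bool, error) {
	ordinal, err := memberOrdinal(podName)
	if err != nil {
		return false, err
	}
	return ordinal != 0, nil
}

With this, a restarted etcd-main-0 never re-enters the learner-addition path, so the failure degrades from permanent to transient quorum loss, as noted above.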
