[BUG] While scaling up a non-HA etcd cluster: intermediate clusterSize of 2 is most vulnerable #641

Closed
Tracked by #642
ishan16696 opened this issue Jul 14, 2023 · 1 comment · Fixed by gardener/etcd-backup-restore#661
Labels: kind/bug, status/closed

Comments

@ishan16696
Member

ishan16696 commented Jul 14, 2023

Describe the bug:
While scaling up a non-HA etcd cluster to an HA etcd cluster (1 -> 3 replicas), it has been observed that if the second member joins the cluster successfully but the third member has not yet joined for some reason (for example, a delayed volume attachment), then .status.clusterSize becomes 2. This is the most vulnerable state of the etcd cluster, because if either of the two running members goes down, it can cause quorum loss:

etcd CR status:

  status:
    clusterSize: 2
    conditions:
    - lastTransitionTime: "2023-07-11T01:24:15Z"
      lastUpdateTime: "2023-07-11T02:19:22Z"
      message: At least one member is not ready
      reason: NotAllMembersReady
      status: "False"
      type: AllMembersReady
    - lastTransitionTime: "2023-07-11T01:37:47Z"
      lastUpdateTime: "2023-07-11T02:19:22Z"
      message: Stale snapshot leases. Not renewed in a long time
      reason: BackupFailed
      status: "False"
      type: BackupReady
    - lastTransitionTime: "2023-07-11T01:24:15Z"
      lastUpdateTime: "2023-07-11T02:19:22Z"
      message: The majority of ETCD members is not ready
      reason: QuorumLost
      status: "False"
      type: Ready
    currentReplicas: 1
    etcd:
      apiVersion: apps/v1
      kind: StatefulSet
      name: etcd-main
    members:
    - id: 7768cc11101227e8
      lastTransitionTime: "2023-07-11T01:24:15Z"
      name: etcd-main-0
      reason: UnknownGracePeriodExceeded
      role: Leader
      status: NotReady
    - id: 52e91580c6c42626
      lastTransitionTime: "2023-07-11T01:23:35Z"
      name: etcd-main-1
      reason: LeaseSucceeded
      role: Member
      status: Ready
    observedGeneration: 1
    ready: false
    replicas: 1
    serviceName: etcd-main-client
    updatedReplicas: 2
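
For context, the quorum arithmetic below illustrates why an intermediate clusterSize of 2 is the worst state: a 2-member etcd cluster needs both members for quorum, so it tolerates zero member failures, whereas a 3-member cluster tolerates one. The following is a minimal, illustrative Go sketch of that arithmetic (not code from etcd or etcd-druid):

  // Quorum arithmetic for etcd clusters of different sizes.
  // Illustrative sketch only; not taken from etcd or etcd-druid.
  package main

  import "fmt"

  // quorum returns the number of members that must be healthy for the
  // cluster to make progress: floor(n/2) + 1.
  func quorum(clusterSize int) int {
      return clusterSize/2 + 1
  }

  // faultTolerance returns how many members may fail before quorum is lost.
  func faultTolerance(clusterSize int) int {
      return clusterSize - quorum(clusterSize)
  }

  func main() {
      for _, n := range []int{1, 2, 3} {
          fmt.Printf("clusterSize=%d quorum=%d faultTolerance=%d\n",
              n, quorum(n), faultTolerance(n))
      }
      // Output:
      // clusterSize=1 quorum=1 faultTolerance=0
      // clusterSize=2 quorum=2 faultTolerance=0
      // clusterSize=3 quorum=2 faultTolerance=1
  }

So the 1 -> 2 transition doubles the number of members that can fail without adding any fault tolerance, which matches the QuorumLost condition shown in the status above.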

Expected behaviour:

How To Reproduce (as minimally and precisely as possible):

Logs:

Environment (please complete the following information):

  • Etcd version/commit ID :
  • Etcd-druid version/commit ID :
  • Cloud Provider [All/AWS/GCS/ABS/Swift/OSS]:

Anything else we need to know?:
This will not be resolved by the harmonise scale-up approach.

@ishan16696
Member Author

Following an offline discussion between @shreyas-s-rao, @abdasgupta and @ishan16696:
We decided to go ahead with PR gardener/etcd-backup-restore#661, since it introduces neither a new edge case nor any new technical debt, and the design discussion for the member-state custom resource implementation is still in progress.

@gardener-robot gardener-robot added the status/closed Issue is closed (either delivered or triaged) label Nov 10, 2023
@shreyas-s-rao shreyas-s-rao added this to the v0.21.0 milestone Nov 10, 2023