[BUG] While scaling up a non-HA etcd cluster: intermediate clusterSize of 2 is most vulnerable #641

Closed
Tracked by #642
ishan16696 opened this issue Jul 14, 2023 · 1 comment · Fixed by gardener/etcd-backup-restore#661
Labels: kind/bug, status/closed

Comments

@ishan16696
Member

ishan16696 commented Jul 14, 2023

Describe the bug:
While scaling up a non-HA etcd cluster to an HA etcd cluster (1 -> 3 replicas), it has been observed that if the second member joins the cluster successfully but the third member has not yet joined for some reason (for example, a delayed volume attachment), then .status.clusterSize becomes 2. This is the most vulnerable state of the etcd cluster, because if either of the two running members goes down, it can cause quorum loss:

etcd CR status:

  status:
    clusterSize: 2
    conditions:
    - lastTransitionTime: "2023-07-11T01:24:15Z"
      lastUpdateTime: "2023-07-11T02:19:22Z"
      message: At least one member is not ready
      reason: NotAllMembersReady
      status: "False"
      type: AllMembersReady
    - lastTransitionTime: "2023-07-11T01:37:47Z"
      lastUpdateTime: "2023-07-11T02:19:22Z"
      message: Stale snapshot leases. Not renewed in a long time
      reason: BackupFailed
      status: "False"
      type: BackupReady
    - lastTransitionTime: "2023-07-11T01:24:15Z"
      lastUpdateTime: "2023-07-11T02:19:22Z"
      message: The majority of ETCD members is not ready
      reason: QuorumLost
      status: "False"
      type: Ready
    currentReplicas: 1
    etcd:
      apiVersion: apps/v1
      kind: StatefulSet
      name: etcd-main
    members:
    - id: 7768cc11101227e8
      lastTransitionTime: "2023-07-11T01:24:15Z"
      name: etcd-main-0
      reason: UnknownGracePeriodExceeded
      role: Leader
      status: NotReady
    - id: 52e91580c6c42626
      lastTransitionTime: "2023-07-11T01:23:35Z"
      name: etcd-main-1
      reason: LeaseSucceeded
      role: Member
      status: Ready
    observedGeneration: 1
    ready: false
    replicas: 1
    serviceName: etcd-main-client
    updatedReplicas: 2
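
For context, the quorum arithmetic below illustrates why an intermediate clusterSize of 2 is the worst state: a 2-member etcd cluster needs both members for quorum, so it tolerates zero member failures, whereas a 3-member cluster tolerates one. The following is a minimal, illustrative Go sketch of that arithmetic (not code from etcd or etcd-druid):

  // Quorum arithmetic for etcd clusters of different sizes.
  // Illustrative sketch only; not taken from etcd or etcd-druid.
  package main

  import "fmt"

  // quorum returns the number of members that must be healthy for the
  // cluster to make progress: floor(n/2) + 1.
  func quorum(clusterSize int) int {
      return clusterSize/2 + 1
  }

  // faultTolerance returns how many members may fail before quorum is lost.
  func faultTolerance(clusterSize int) int {
      return clusterSize - quorum(clusterSize)
  }

  func main() {
      for _, n := range []int{1, 2, 3} {
          fmt.Printf("clusterSize=%d quorum=%d faultTolerance=%d\n",
              n, quorum(n), faultTolerance(n))
      }
      // Output:
      // clusterSize=1 quorum=1 faultTolerance=0
      // clusterSize=2 quorum=2 faultTolerance=0
      // clusterSize=3 quorum=2 faultTolerance=1
  }

So the 1 -> 2 transition doubles the number of members that can fail without adding any fault tolerance, which matches the QuorumLost condition shown in the status above.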

Expected behaviour:

How To Reproduce (as minimally and precisely as possible):

Logs:

Environment (please complete the following information):

  • Etcd version/commit ID :
  • Etcd-druid version/commit ID :
  • Cloud Provider [All/AWS/GCS/ABS/Swift/OSS]:

Anything else we need to know?:
This will not be resolved by the harmonise scale-up approach.

@ishan16696
Member Author

Following an offline discussion between @shreyas-s-rao, @abdasgupta and @ishan16696:
We decided to go ahead with PR gardener/etcd-backup-restore#661, since it introduces neither a new edge case nor any new technical debt, and the design discussion for the member-state custom resource implementation is still in progress.

@gardener-robot gardener-robot added the status/closed Issue is closed (either delivered or triaged) label Nov 10, 2023
@shreyas-s-rao shreyas-s-rao added this to the v0.21.0 milestone Nov 10, 2023