Abort Control Plane Component CrashLoopBackOff when ETCD Returns From Maintenance #148

Closed
vlerenc opened this issue Apr 17, 2019 · 0 comments · Fixed by gardener/gardener#990
Labels
area/control-plane Control plane related component/etcd-backup-restore ETCD Backup & Restore kind/enhancement Enhancement, improvement, extension priority/2 Priority (lower number equals higher priority) topology/seed Affects Seed clusters

vlerenc commented Apr 17, 2019

When ETCD is offline during maintenance (e.g. the underlying worker gets replaced, or the configuration changes and the pod must be rescheduled), the control plane components start crash looping: first the API server, because it can't reach ETCD, then the other components, because they can't reach the API server. This triggers a cascade of failures from which the control plane recovers only slowly, due to the prolonged back-off times.

We therefore concluded that it would make a lot of sense to modify the control plane components so that they don't fail while ETCD is offline, but actively wait for it to become available (polling even once per second is negligible compared to the normal load ETCD must support anyway). This way, the impact of ETCD maintenance (configuration change) or rescheduling (OS update) would be truly minimal.
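To illustrate the "actively wait" idea, here is a minimal sketch, not taken from any control plane component. It polls a dependency at a fixed interval instead of exiting, so no restart back-off accumulates. The names `tcp_probe` and `wait_until_available` are hypothetical, and a plain TCP connect is only an assumed stand-in for a real ETCD health check.

```python
import socket
import time


def tcp_probe(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port can be opened.

    Assumption: reachability of the client port stands in for "ETCD is back".
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def wait_until_available(probe, interval=1.0, max_wait=None,
                         sleep=time.sleep, clock=time.monotonic):
    """Poll `probe` at a fixed interval until it returns True.

    Returns the number of attempts made. Raises TimeoutError once `max_wait`
    seconds have elapsed; with max_wait=None it waits indefinitely.
    `sleep` and `clock` are injectable so the loop can be tested without delay.
    """
    start = clock()
    attempts = 0
    while True:
        attempts += 1
        if probe():
            return attempts
        if max_wait is not None and clock() - start >= max_wait:
            raise TimeoutError(
                f"dependency still unavailable after {attempts} attempts")
        # Fixed interval on purpose: one probe per second is cheap, and
        # recovery is noticed within about one interval.
        sleep(interval)
```

A component could call something like `wait_until_available(lambda: tcp_probe(host, 2379))` before starting its main loop, where `host` is whatever address the ETCD client service has in the given setup.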

However, that may not be possible because of the liveness/readiness probes. An alternative is a separate controller, or code in the ETCD backup & restore sidecar (which watches over ETCD anyway), that actively deletes crash-looping control plane components once ETCD is back online.
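The selection logic of such a controller could look roughly like the sketch below. It is an illustration under assumptions, not the implementation that closed this issue. Pods are modelled as plain dicts shaped like Kubernetes Pod JSON, where a crash-looping container reports `state.waiting.reason` as `CrashLoopBackOff`. The function names and the `delete_pod` callback are hypothetical.

```python
def is_crash_looping(pod):
    """True if any container in the pod is waiting with reason CrashLoopBackOff."""
    statuses = pod.get("status", {}).get("containerStatuses") or []
    for status in statuses:
        waiting = (status.get("state") or {}).get("waiting") or {}
        if waiting.get("reason") == "CrashLoopBackOff":
            return True
    return False


def restart_crash_looping(etcd_ready, pods, delete_pod):
    """Delete crash-looping pods, but only once ETCD is ready again.

    `delete_pod` is a caller-supplied callback (hypothetical) that removes the
    pod by name, e.g. through a Kubernetes client. Returns the deleted names.
    """
    if not etcd_ready:
        # Deleting earlier would only produce new pods that crash as well.
        return []
    deleted = []
    for pod in pods:
        if is_crash_looping(pod):
            name = pod["metadata"]["name"]
            delete_pod(name)
            deleted.append(name)
    return deleted
```

Deleting a pod owned by a Deployment makes its controller create a replacement whose restart back-off starts from zero, which is why deletion shortens recovery compared to waiting out the existing back-off.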

Eventually, we may decide to implement multi-node ETCD (#33), or to scale out prior to maintenance and scale in again afterwards, but neither is something we can achieve in the short term.

@vlerenc vlerenc added kind/enhancement Enhancement, improvement, extension area/control-plane Control plane related component/etcd-backup-restore ETCD Backup & Restore priority/critical Needs to be resolved soon, because it impacts users negatively topology/shoot Affects Shoot clusters topology/seed Affects Seed clusters and removed topology/shoot Affects Shoot clusters labels Apr 17, 2019
@georgekuruvillak georgekuruvillak self-assigned this Apr 17, 2019
@vlerenc vlerenc changed the title from "Hold Control Plane Components while ETCD is Offline" to "Abort Control Plane Component CrashLoopBackOff when ETCD Returns From Maintenance" Apr 18, 2019
@PadmaB PadmaB added this to the 1904b milestone Apr 26, 2019
@gardener-robot gardener-robot added priority/2 Priority (lower number equals higher priority) and removed priority/critical Needs to be resolved soon, because it impacts users negatively labels Mar 8, 2021
4 participants