Abort Control Plane Component CrashLoopBackOff when ETCD Returns From Maintenance #148

vlerenc · 2019-04-17T05:12:25Z

When ETCD is offline during maintenance (e.g. the underlying worker gets replaced, or a configuration change requires the pod to be rescheduled), the control plane components start crash looping: first the API server because it can't reach ETCD, then the other components because they can't reach the API server. This triggers a cascade of failures from which the control plane, due to prolonged back-off times, recovers only slowly.

We therefore agreed that it would make a lot of sense to modify the control plane components so that they don't fail when ETCD is offline, but actively wait for it to become available (even polling once per second doesn't matter compared to the normal load ETCD must support anyway). This way the impact of ETCD maintenance (configuration change) or rescheduling (OS update) would be truly minimal.
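The "actively wait instead of failing" idea can be sketched as a fixed-interval polling loop. This is only an illustration in Python, not how the actual control plane components are written; `wait_until_available` and its `check` callback are hypothetical names, and the assumption is that an unreachable dependency surfaces as a connection error (`OSError`).

```python
import time
from typing import Callable, Optional


def wait_until_available(
    check: Callable[[], bool],
    interval: float = 1.0,
    timeout: Optional[float] = None,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll `check` at a fixed interval until it succeeds.

    Returns True once `check` reports availability, or False if `timeout`
    seconds pass first. With timeout=None the loop waits indefinitely.
    """
    deadline = None if timeout is None else clock() + timeout
    while True:
        try:
            if check():
                return True
        except OSError:
            # Assumption: an unreachable dependency raises a connection
            # error. Treat it as "not available yet" instead of exiting.
            pass
        if deadline is not None and clock() >= deadline:
            return False
        sleep(interval)
```

A component would call this before its first request to the dependency, so that an outage leads to waiting in place rather than to a process exit and a restart with growing back-off.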

However, that may not be possible because of the liveness/readiness probes. An alternative is a separate controller, or code in the ETCD backup & restore sidecar (which watches over ETCD anyway), that actively deletes crash-looping control plane components once ETCD is back online.
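The selection step of such a controller might look roughly like the sketch below. It assumes pods are available as dictionaries shaped like the Kubernetes Pod JSON, where a crash-looping container reports `state.waiting.reason == "CrashLoopBackOff"` in `status.containerStatuses`. The function names are hypothetical, and the actual deletion call (which lets the pod be recreated without the accumulated back-off) is left to the caller.

```python
from typing import Any, Dict, List


def is_crash_looping(pod: Dict[str, Any]) -> bool:
    """Return True if any container of the pod waits in CrashLoopBackOff."""
    statuses = (pod.get("status") or {}).get("containerStatuses") or []
    for container_status in statuses:
        waiting = (container_status.get("state") or {}).get("waiting") or {}
        if waiting.get("reason") == "CrashLoopBackOff":
            return True
    return False


def pods_to_delete(pods: List[Dict[str, Any]], etcd_ready: bool) -> List[str]:
    """Names of crash-looping pods to delete, but only once ETCD is back."""
    if not etcd_ready:
        # Deleting while ETCD is still offline would just restart the loop.
        return []
    return [pod["metadata"]["name"] for pod in pods if is_crash_looping(pod)]
```

Gating on `etcd_ready` reflects the point made above: the pods are only worth deleting after ETCD has returned, since before that they would fail again immediately.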

Eventually, we may decide to implement multi-node (#33), or to scale out prior to maintenance and scale in again afterwards, but neither is something we can achieve short-term.

georgekuruvillak self-assigned this Apr 17, 2019

vlerenc changed the title ~~Hold Control Plane Components while ETCD is Offline~~ Abort Control Plane Component CrashLoopBackOff when ETCD Returns From Maintenance Apr 18, 2019

PadmaB added this to the 1904b milestone Apr 26, 2019

georgekuruvillak mentioned this issue May 7, 2019

Added dependency-watchdog gardener/gardener#990

Merged

rfranzke closed this as completed in gardener/gardener#990 May 13, 2019

gardener-robot added priority/2 Priority (lower number equals higher priority) and removed priority/critical Needs to be resolved soon, because it impacts users negatively labels Mar 8, 2021
