Abort Control Plane Component CrashLoopBackOff when ETCD Returns From Maintenance #148
Labels
area/control-plane
Control plane related
component/etcd-backup-restore
ETCD Backup & Restore
kind/enhancement
Enhancement, improvement, extension
priority/2
Priority (lower number equals higher priority)
topology/seed
Affects Seed clusters
When ETCD is offline during maintenance (e.g. the underlying worker node is replaced, or a configuration change requires the pod to be rescheduled), the control plane components start crash looping: first the API server, because it cannot reach ETCD, then the other components, because they cannot reach the API server. This triggers a cascade of failures from which the control plane recovers only slowly because of the prolonged back-off times.
It would therefore make a lot of sense to modify the control plane components so that they do not fail when ETCD is offline but actively wait for it to become available (polling even once per second is negligible compared to the normal load ETCD must support anyway). This way, the impact of ETCD maintenance (configuration change) or rescheduling (OS update) would be truly minimal.
However, that may not be possible because of the liveness/readiness probes. An alternative is a separate controller, or code in the ETCD backup & restore sidecar (which watches over ETCD anyway), that actively deletes crash-looping control plane components once ETCD is back online.
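The decision logic of such a controller could look roughly like the Go sketch below. It is deliberately self-contained: the `Pod` type and the `podsToDelete` function are hypothetical simplifications, and a real implementation would read pod status from, and issue deletions through, the Kubernetes API. The only detail taken from Kubernetes itself is the `CrashLoopBackOff` waiting reason reported in container statuses.

```go
package main

// crashLoopBackOff is the waiting reason Kubernetes reports for a container
// that is being restarted with back-off.
const crashLoopBackOff = "CrashLoopBackOff"

// Pod is a hypothetical, simplified view of a control plane pod: its name and
// the waiting reason of each of its containers (empty if not waiting).
type Pod struct {
	Name           string
	WaitingReasons []string
}

// isCrashLooping reports whether any container of the pod is in
// CrashLoopBackOff.
func isCrashLooping(p Pod) bool {
	for _, reason := range p.WaitingReasons {
		if reason == crashLoopBackOff {
			return true
		}
	}
	return false
}

// podsToDelete returns the names of crash-looping pods, but only at the
// moment ETCD transitions from offline to online. Deleting such a pod lets
// its owning controller recreate it right away instead of waiting out the
// remaining back-off time.
func podsToDelete(etcdWasOnline, etcdIsOnline bool, pods []Pod) []string {
	if etcdWasOnline || !etcdIsOnline {
		return nil // no offline-to-online transition, nothing to do
	}
	var names []string
	for _, p := range pods {
		if isCrashLooping(p) {
			names = append(names, p.Name)
		}
	}
	return names
}
```

Restricting deletion to the offline-to-online transition is a design assumption of this sketch: it keeps the controller from interfering with pods that crash loop for reasons unrelated to ETCD maintenance.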
Eventually, we may decide to implement multi-node ETCD (#33), or to scale out before maintenance and scale in again afterwards, but neither is achievable in the short term.