
etcd defrag + backup: Avoid too many leader changes #384

Closed

Description (@garloff)

As a k8s cluster user, I want the k8s control plane to always be responsive, stable, and safe.

We have a nightly job that defragments etcd and backs it up on all control plane nodes, with a randomized start time so that defragmentation does not happen on all nodes at once.
This job has existed for many months, but due to a missing `--now` in the `systemctl enable` call, the timer was never actually started, so the job has not really been active before.
As @matofeder points out, defragmentation may block access to etcd for a while (a few seconds on typically sized etcd databases), causing etcd leader changes (on multi-node etcd clusters) or temporary kube-apiserver failures (on single-node etcd clusters).
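
The fix for the inactive timer itself is a one-liner: `--now` makes `systemctl enable` also start the unit immediately, rather than only creating the symlink that takes effect on the next boot. A minimal sketch, assuming a dedicated timer unit (`etcd-defrag.timer` is a placeholder name, not necessarily the actual unit):

```bash
# Enable the timer for future boots AND start it immediately.
# Plain `systemctl enable` only creates the symlink; without --now
# the timer never fires on an already-running node.
# (etcd-defrag.timer is a placeholder for the actual unit name.)
systemctl enable --now etcd-defrag.timer
```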

Things to consider:

  1. Stronger protection against concurrent defragmentation on multiple etcd nodes, e.g. by scheduling all etcd nodes from the leader instead of relying on the randomized timer start times.
  2. Scheduling the leader's defragmentation last, as it will likely cause a leader change and we want to minimize these. (Starting with the leader could cause several leader changes.) See the sketch after this list.
  3. Skipping the leader's defragmentation (for up to a week, or indefinitely?) to cause fewer leader changes?
  4. Skipping defragmentation on single-node etcd installations?
  5. Leaving this disabled for R4 and doing more real-world testing before R5. (This is not without risk either: we have already seen heavily fragmented etcd databases cause trouble in real life.)
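
A minimal sketch of points 1, 2, and 4, assuming `etcdctl` (API v3) with endpoints and TLS configured via the usual environment variables, and `jq` available on the node. It identifies the leader from `etcdctl endpoint status`, defragments followers one at a time, handles the leader last, and skips single-member clusters:

```bash
#!/usr/bin/env bash
# Sketch only: leader-aware etcd defragmentation (assumes ETCDCTL_API=3
# and endpoint/TLS settings provided via the usual etcdctl env vars).
set -euo pipefail

status=$(etcdctl endpoint status --cluster -w json)

# Point 4: on a single-member etcd, any defrag blocks the only member,
# so skip it (or require an explicit opt-in).
members=$(echo "$status" | jq 'length')
if [ "$members" -le 1 ]; then
    echo "Single-member etcd cluster, skipping defragmentation."
    exit 0
fi

# A member is the leader iff its own member_id equals the leader id
# it reports.
followers=$(echo "$status" | jq -r \
    '.[] | select(.Status.header.member_id != .Status.leader) | .Endpoint')
leader=$(echo "$status" | jq -r \
    '.[] | select(.Status.header.member_id == .Status.leader) | .Endpoint')

# Points 1 + 2: defragment strictly one member at a time, followers
# first and the leader last, so at most one leader change can result.
for ep in $followers; do
    etcdctl --endpoints="$ep" defrag
    sleep 10   # let the member settle before touching the next one
done
etcdctl --endpoints="$leader" defrag
```

Point 3 would simply replace the final `defrag` call with a no-op or a less frequent (e.g. weekly) schedule for the leader.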

Metadata

Labels: Container (Team 2: Container Infra and Tooling), bug (Something isn't working)

Status: Done
