etcd defrag + backup: Avoid too many leader changes #384

Closed
garloff opened this issue Mar 14, 2023 · 12 comments · Fixed by #401


garloff commented Mar 14, 2023

As a k8s cluster user, I want the k8s control plane to always be responsive, stable and safe.

We have a nightly job that defragments etcd and backs it up on all control plane nodes, with slightly randomized start times so that the defragmentation does not happen on all nodes at the same time.
This job has existed for many months, but due to a missing --now in systemctl enable, it has not really been active until recently.
As @matofeder points out, the defragmentation may block access to etcd for a while (seconds on typically sized etcd DBs), causing etcd leader changes (on multi-node etcd clusters) or temporary kube-api failures (on single-node etcd clusters).

Things to consider:

  1. Possibly stronger protection against concurrent defragmentation on multiple etcd nodes by scheduling all etcd nodes from the leader instead of relying on the configured randomness in the timer start time.
  2. Scheduling the leader's etcd defragmentation last, as it will likely cause a leader change and we want to minimize these. (Starting with the leader would cause several leader changes ...)
  3. Skipping the leader's defragmentation (for up to a week or indefinitely?) to cause fewer leader changes?
  4. Skipping defragmentation on single-node etcd installations?
  5. Leaving this disabled for R4 and doing more real-world tests before R5. (This is not without risk either; we have already seen heavily fragmented etcds causing trouble in real life.)

garloff commented Mar 14, 2023

To point 1: It should be noted that the backup and fstrim operations cannot be done centrally but would need a local job on the non-leader control-plane nodes, possibly triggered via ssh.


garloff commented Mar 14, 2023

Point 2 is easy to do.


garloff commented Mar 14, 2023

Points 3+4: If we decide to simply skip defrag on the etcd leader (indefinitely), we'd cover both of these points with a single change.

garloff added the bug and Container labels Mar 14, 2023
garloff added this to the R4 (v5.0.0) milestone Mar 14, 2023

garloff commented Mar 14, 2023

Your feedback is welcome, especially @matofeder, @chess-knight, @batistein, @janiskemper, @ajfriesen, @flyersa, @curx, @fynluk, @mxmxchere.

@batistein

@guettli This is what we talked about. Maybe we could contribute here.


guettli commented Mar 14, 2023

I think the best solution would be an official solution from etcd.io.

Maybe it is enough to update the docs: I created an issue for that: etcd-io/etcd#15477

Maybe it is enough to skip defragmentation if the current instance is the leader. After N hours the cron job would try again. AFAIK defragmentation is not that pressing, so this simple solution might be OK.
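
A minimal sketch of that skip-on-leader check, in Python for illustration only. It assumes the JSON shape printed by `etcdctl endpoint status -w json` for the local endpoint (a list with one entry whose `Status` holds `header.member_id` and `leader`); the function names are hypothetical and not part of any existing script.

```python
import json


def is_leader(status_json: str) -> bool:
    """Return True if the queried member reports itself as the leader.

    Assumption: `status_json` is the output of
    `etcdctl endpoint status -w json` for exactly one (local) endpoint.
    """
    entries = json.loads(status_json)
    status = entries[0]["Status"]
    # A member is the leader when its own ID equals the reported leader ID.
    return status["header"]["member_id"] == status["leader"]


def should_defrag(status_json: str) -> bool:
    """Skip defragmentation on the leader; the next timer run retries."""
    return not is_leader(status_json)
```

A periodic job would exit early whenever `should_defrag` returns false, so a member is defragmented on the first run after it has lost leadership.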

There is a K8s cronjob etcd-defrag-cronjob, but this depends on Prometheus. I personally would prefer a solution which does not depend on a third-party tool like Prometheus.


matofeder commented Mar 14, 2023

> There is a K8s cronjob etcd-defrag-cronjob, but this depends on Prometheus. I personally would prefer a solution which does not depend on a third-party tool like Prometheus.

I agree. But the script from the etcd-defrag-cronjob project could be adopted as a solid base for our implementation.

I would propose the following (with respect to points 1..3 in the description):

Write a short script that is deployed to each control plane node and executed periodically (at the same time on all nodes). For example, the etcd-defrag.service could be adjusted here so that it runs a defrag.sh script instead of etcdctl ... defrag. The script does the following:

  1. Check whether the control plane node where the script is executed is the etcd leader.
  2. Exit if the node is not the etcd leader.
  3. If the node is the etcd leader:
    • Defrag all non-leader etcd members, one by one.
    • Move leadership with etcdctl move-leader to an etcd member whose defragmentation has completed (optional, but this way we don't have to wait for the election timeout, which could avoid losing writes already sent to the old leader; see docs).
    • Defrag the member where the script is executed (which is no longer the leader).

Next periodic execution does the same.
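
The steps above can be sketched as a pure planning function, shown here in Python for illustration. The `Member` type and `plan_defrag` are hypothetical names; the sketch assumes the member list has already been parsed from something like `etcdctl endpoint status --cluster`, and it returns an empty plan for a single-member cluster as an extra guard.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Member:
    """One etcd member (hypothetical container for parsed status data)."""
    member_id: str   # ID as expected by `etcdctl move-leader`
    endpoint: str    # client URL used for `etcdctl defrag`
    is_leader: bool


def plan_defrag(members, local_endpoint, rng=random):
    """Return ordered (action, target) steps for one maintenance run.

    The plan is empty unless the local member is the leader of a
    multi-member cluster.
    """
    leaders = [m for m in members if m.is_leader]
    if len(members) < 2 or len(leaders) != 1:
        return []
    leader = leaders[0]
    if leader.endpoint != local_endpoint:
        return []
    followers = [m for m in members if not m.is_leader]
    # 1. Defragment every non-leader member, one by one.
    steps = [("defrag", m.endpoint) for m in followers]
    # 2. Hand leadership to an already defragmented member.
    steps.append(("move-leader", rng.choice(followers).member_id))
    # 3. Defragment the local member, which is no longer the leader.
    steps.append(("defrag", leader.endpoint))
    return steps
```

Keeping the ordering logic separate from the `etcdctl` calls makes it easy to check that the leader is always handled last and that only one leader change is planned per run.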

> 4. Skipping defragmentation on single-node etcd installations?

I would say YES, as defragmenting there may cause unwanted issues on the client side. Optionally, a "special" flag could be introduced to allow it, e.g. --allow-etcd-single-node-defrag


garloff commented Mar 14, 2023

I like the approach described by @matofeder.

I have not yet looked at the implementation of defrag.sh and how it compares to what we are currently doing.

In addition to defragmenting, we create a local backup (for manual disaster recovery) and do a discard (fstrim) on the filesystem to combat FTL fragmentation. I think both are useful. It is also fine for both to happen only on the previous leader, as we will change the leader at least daily (in the absence of other leader-changing events). We would need to ensure that the leader changes don't ping-pong between just two etcd nodes but are either arbitrary or round-robin.
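
One way to rule out ping-pong is a fixed ring order over the member IDs. A small sketch in Python, with the hypothetical helper `next_leader`, that always hands leadership to the next member in sorted order:

```python
def next_leader(member_ids, current_leader):
    """Pick the member that follows the current leader in sorted order.

    Wraps around at the end, so repeated calls visit every member
    before returning to the first one.
    """
    ring = sorted(member_ids)
    if len(ring) < 2:
        raise ValueError("need at least two members to move leadership")
    index = ring.index(current_leader)
    return ring[(index + 1) % len(ring)]
```

With three members, leadership and therefore the local backup and fstrim would cycle through all nodes every three runs, as long as no other event changes the leader in between.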


garloff commented Mar 14, 2023

PS: We could just document that single control-plane nodes have the risk of slowly degrading due to fragmentation and that they are not meant for long-term operation. Workarounds would be manual intervention (not recommended) or periodic temporary upgrades to a three-control-plane-node scenario overnight.

garloff added a commit that referenced this issue Mar 14, 2023
This is the quick-fix: We just don't do defragmentation on the etcd
leader. This avoids etcd service interruption on single-node etcd
clusters and spurious leader changes on multi-node ones.

Note that this is the intermediate step until we have a more complete
solution as depicted in
#384 (comment)

Signed-off-by: Kurt Garloff <kurt@garloff.de>

garloff commented Mar 14, 2023

As we have had etcd defragmentation unconditionally enabled since merging #355 a few days ago, I want to ensure we don't cause trouble for users of the main branch, so I am introducing a quick fix.
PR #387 simply exits the defrag script on the leader, so we only defragment (and back up) etcd on the non-leader etcd cluster members. If we happen to get occasional leader changes, this would even come close to the desired state.
This should be significantly safer than the current state in main, but of course not as good as the approach described above. We should merge it IMVHO if the better solution cannot be created, validated and merged within the next few days.
@matofeder, @chess-knight, @batistein, @guettli -- thanks for your great input!
Are any of you looking at implementing the suggested approach in the next few days?


matofeder commented Mar 14, 2023

> PS: We could just document that single control-plane nodes have the risk of slowly degrading due to fragmentation and that they are not meant for long-term operation. Workarounds would be manual intervention (not recommended) or periodic temporary upgrades to a three-control-plane-node scenario overnight.

I agree.

> We should merge it IMVHO if the better solution cannot be created, validated and merged within the next few days.

Let's merge #387 as a hotfix.

> Are any of you looking at implementing the suggested approach in the next few days?

I will take care of that.

garloff added a commit that referenced this issue Mar 15, 2023
matofeder self-assigned this Mar 17, 2023
matofeder added a commit that referenced this issue Mar 17, 2023
This commit updates the etcd defrag and backup script as follows:
- The script exits without any defrag/backup/trim action if:
  - It is executed on a non-leader etcd member
  - It is executed on a single-member etcd cluster
- The script defragments the etcd cluster as follows:
  - Defrag the non-leader etcd members first
  - Move leadership to a randomly selected etcd member whose defragmentation has completed
  - Defrag the local (ex-leader) etcd member
- The script then backs up and trims the local (ex-leader) etcd member

This script executes etcdctl commands like `etcdctl move-leader` or
`etcdctl endpoint status --cluster`, which were introduced in etcdctl
version 3.3.0. The previous etcdctl client was installed as an `apt`
package. The latest etcdctl version available in the Ubuntu 20.04
repositories is v3.2.26, hence this commit also introduces an `etcdctl_version`
variable that contains the desired version of the etcdctl client.
This etcdctl client is then used for etcd DB maintenance tasks.

Issue #384

Signed-off-by: Matej Feder <matej.feder@dnation.cloud>
chess-knight pushed a commit that referenced this issue Mar 22, 2023

guettli commented Mar 23, 2023

BTW, Gardener has a related repository, and they plan this feature: Defragmentor of backup-restore should also consider the etcd db size along with scheduled defrag

garloff pushed a commit that referenced this issue Mar 31, 2023
* Update etcd defrag and backup

This commit updates etcd defrag and backup script as follows:
- Script exits without any defrag/backup/trim action if:
  - It is executed on non leader etcd member
  - It is executed on single member etcd cluster
- Script defragment the etcd cluster as follows:
  - Defrag the non leader etcd members first
  - Change the leadership to the randomly selected and defragmentation completed etcd member
  - Defrag the local (ex-leader) etcd member
- Script then backup & trim local (ex-leader) etcd member

This script executes etcdctl commands like `etcdctl move-leader` or
`etcdctl endpoint status --cluster` which were introduced in etcdctl
version 3.3.0. The previous etcdctl client was installed as an `apt`
package. The latest etcdctl version available in Ubuntu 20.04
repositories is v3.2.26, hence this commit also introduces `etcdctl_version`
variable that contains the desired version of etcdctl client.
Etcdctl client is then used for etcd DB maintenance tasks.

Issue #384

* Fix etcd defrag script

Cloud-init doesn't like '{#' - jinja template comment

Fix also installation of etcdctl tool

* Add check to avoid defragmentation on unhealthy etcd cluster

* Add `force-*` optional arguments to the etcd defragmentation script

This commit adds optional arguments to the `etcd-defrag.sh` script that
allow skipping the script's checks.
The optional arguments are:
- `--force-single` (allows defragmentation on a single-member etcd cluster)
- `--force-unhealthy` (allows defragmentation on an unhealthy etcd member)
- `--force-nonleader` (allows defragmentation on a non-leader etcd member)

* Add etcd maintenance section into Maintenance_and_Troubleshooting.md

This commit adds an etcd maintenance section to the
Maintenance_and_Troubleshooting docs. For now, the section describes
the etcd defragmentation and backup script `etcd-defrag.sh`.

* fixup! Add etcd maintenance section into Maintenance_and_Troubleshooting.md

Signed-off-by: Matej Feder <matej.feder@dnation.cloud>
Signed-off-by: Roman Hros <roman.hros@dnation.cloud>
Signed-off-by: Kurt Garloff <kurt@garloff.de>
Co-authored-by: Roman Hros <roman.hros@dnation.cloud>
garloff pushed a commit that referenced this issue May 1, 2023
* Update etcd defrag and backup

garloff pushed a commit that referenced this issue May 1, 2023
* Update etcd defrag and backup
