
Script for defragmentation #15477

Closed

guettli opened this issue Mar 14, 2023 · 38 comments

@guettli

guettli commented Mar 14, 2023

What would you like to be added?

I would like to see an official solution for how to defragment etcd.

AFAIK a one-line cron-job is not enough, since you should not defragment the current leader.

Related: #14975

Maybe it is enough to add a simple example script to the docs.

Why is this needed?

Defragmenting the leader can lead to performance degradation and should be avoided.

I don't think it makes sense that every company running etcd invents its own way to solve this.

@guettli
Author

guettli commented Mar 14, 2023

Here is one possible solution: https://github.com/ugur99/etcd-defrag-cronjob

@jmhbnz
Member

jmhbnz commented Mar 14, 2023

Thanks for raising the discussion on this. One complication is that, while Kubernetes is the largest user of etcd, it is not the only one.

With this in mind, I think we would need to consider what would be best suited to sit under the etcd operations guide docs versus the Kubernetes etcd operations docs.

Would it make more sense for this issue, or a tandem issue, to be raised against the Kubernetes etcd operations docs?

@guettli
Author

guettli commented Mar 14, 2023

I would like to have a solution (or documentation) for etcd.io first. I got bitten by outdated etcd docs on kubernetes.io once, and I think having docs in two places is confusing.

@chaochn47
Member

chaochn47 commented Mar 14, 2023

Defragmenting the leader can lead to performance degradation and should be avoided.

Hi @guettli

I think defragmenting the leader is equivalent to defragmenting a follower. Generally speaking, Raft is not blocked by the rewriting of the db file.

For example, while a defragmentation is running on the leader, etcdctl endpoint status will show the Raft Index (committed index) incrementing while the Raft Applied Index does not.
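To observe this yourself, here is a minimal sketch that polls the status in a loop, assuming a local endpoint and jq (field names per the v3.4+ JSON output of etcdctl; the endpoint is a placeholder):

# Watch committed vs. applied index while a defrag runs on the leader.
watch -n1 'etcdctl --endpoints=http://127.0.0.1:2379 endpoint status -w json | jq ".[0].Status | {raftIndex, raftAppliedIndex}"'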

@guettli
Author

guettli commented Mar 15, 2023

@chaochn47 thank you for your answer. What is your advice for defragmenting etcd? How do you handle it?

@chaochn47
Member

Hi @guettli, here is what I would suggest:

Every couple of minutes, evaluate whether etcd should run a defrag.

It will run defrag if:

  • More than 500 MB of space can be freed, AND
  • The DB size breaches a high-water mark of the quota (80%), OR it has been 24 hours since the last defrag on the node (each cluster runs 3 etcd nodes; the cutoff is local-timezone midnight)

It is guaranteed that defrag won't occur on more than one node at any given time.
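A minimal shell sketch of that policy check, assuming etcdctl v3.4+ (which reports dbSizeInUse in the JSON status) and jq; the endpoint, quota, and state file are illustrative placeholders, not part of any official tooling:

#!/usr/bin/env bash
# Hypothetical policy check: defrag only if >500 MB is reclaimable AND
# (DB size exceeds 80% of the quota OR the last defrag ran >24h ago).
set -euo pipefail

ENDPOINT="http://127.0.0.1:2379"            # placeholder endpoint
QUOTA=$((8 * 1024 * 1024 * 1024))           # example 8 GiB backend quota
STATE_FILE="/var/lib/etcd-defrag/last-run"  # illustrative timestamp file

status=$(etcdctl --endpoints="$ENDPOINT" endpoint status -w json)
db_size=$(echo "$status" | jq '.[0].Status.dbSize')
in_use=$(echo "$status" | jq '.[0].Status.dbSizeInUse')
freeable=$((db_size - in_use))

last_run=$(cat "$STATE_FILE" 2>/dev/null || echo 0)
age=$(( $(date +%s) - last_run ))

if (( freeable > 500 * 1024 * 1024 )) && \
   (( db_size > QUOTA * 80 / 100 || age > 24 * 3600 )); then
  etcdctl --endpoints="$ENDPOINT" --command-timeout=60s defrag
  date +%s > "$STATE_FILE"
fi

The one-node-at-a-time guarantee is deliberately left out of this sketch; it needs external coordination (e.g. staggered schedules per node).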

@tjungblu
Contributor

Generally speaking, Raft is not blocked by the rewriting of the db file.

That's true. In OpenShift we recommend doing the leader last, because the additional IO/memory/cache churn can impact performance negatively. If a defrag takes down the leader, the other nodes are at least safely defragged already and can continue with the next election. We also do not defrag if any member is unhealthy.

@guettli Are you looking for a simple bash script in etcd/contrib or something more official as part of the CLI?

@guettli
Author

guettli commented Mar 17, 2023

@tjungblu I don't have a preference about what the solution looks like. It could be a shell script, something added to etcdutl, or maybe just some docs.

@chaochn47 explained the steps, but I am not familiar enough with etcd to write a corresponding script to implement them. I hope that someone with more knowledge of etcd can provide an executable solution.

@jmhbnz
Member

jmhbnz commented Mar 17, 2023

Taking a quick look at how etcdctl defrag currently works, I'm wondering if we should make func defragCommandFunc more opinionated, so that when passed the --cluster flag it completes the defrag on all non-leader members first and then does the leader.

This would simplify downstream implementations of defrag functionality, as each one would not have to reinvent how to order the cluster-wide defrag, provided it was built on top of etcdctl.

We could then update the website docs or add a contrib/defrag reference implementation for Kubernetes.
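For reference, a rough shell sketch of that ordering (followers first, leader last, with the health gate mentioned earlier in the thread); this is not the etcdctl implementation, and the endpoints are placeholders:

#!/usr/bin/env bash
# Sketch: defragment followers first, the leader last.
# JSON field names are from etcdctl v3.5 output.
set -euo pipefail

ENDPOINTS="http://10.0.0.1:2379,http://10.0.0.2:2379,http://10.0.0.3:2379"

# Bail out if any member is unhealthy (exit status is non-zero on failure).
etcdctl --endpoints="$ENDPOINTS" endpoint health > /dev/null

statuses=$(etcdctl --endpoints="$ENDPOINTS" endpoint status -w json)

# An endpoint is the leader when its own member_id equals the reported leader id.
followers=$(echo "$statuses" | jq -r '.[] | select(.Status.header.member_id != .Status.leader) | .Endpoint')
leader=$(echo "$statuses" | jq -r '.[] | select(.Status.header.member_id == .Status.leader) | .Endpoint')

for ep in $followers "$leader"; do
  etcdctl --endpoints="$ep" --command-timeout=60s defrag
done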

@guettli
Author

guettli commented Mar 17, 2023

@jmhbnz would it be possible to get this into etcdctl:

It will run defrag if:
More than 500 MB of space can be freed, AND
The DB size breaches a high-water mark of the quota (80%), OR it has been 24 hours since the last defrag on the node (each cluster runs 3 etcd nodes; the cutoff is local-timezone midnight).
It is guaranteed that defrag won't occur on more than one node at any given time.

Then the defragmentation call would not need to be wrapped in a "dirty" shell script.

@jmhbnz
Member

jmhbnz commented Mar 19, 2023

Hey @guettli - I don't think we can build all of that into etcdctl; for example, etcdctl doesn't currently do any cron-style scheduling, to my knowledge.

As mentioned, I do think we can solve the part about defragmenting members one at a time, leader last, as built-in behavior in etcdctl, provided the --cluster flag is used.

For some of the other requirements we have been working out in this issue, like scheduling or perhaps some of the monitoring-based checks, I think those will need to be handled either as documentation or as additional resources in etcd/contrib, for example a Kubernetes CronJob or an example shell script implementation.

@ahrtr, @serathius - Keen for maintainer input on this. If what I have suggested makes sense feel free to assign to me and I can work on it.
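On the scheduling piece: for plain-Linux deployments it can be as simple as a crontab entry invoking a policy-check script like the sketch earlier in this thread (the script path is hypothetical):

# Illustrative crontab entry: evaluate the defrag policy every 5 minutes.
*/5 * * * * /usr/local/bin/etcd-defrag-check.sh >> /var/log/etcd-defrag.log 2>&1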

@jmhbnz jmhbnz assigned jmhbnz and unassigned jmhbnz Apr 8, 2023
@jmhbnz
Member

jmhbnz commented Apr 11, 2023

Apologies, removing my assignment for this as I am about to be traveling for several weeks and attending KubeCon, so I likely won't have much capacity for a while. If anyone else has capacity, they are welcome to pick it up.

@serathius
Member

I would recommend looking into reducing bbolt fragmentation so we can get rid of defrag altogether, instead of adding another feature/subproject that increases maintenance cost.

@geetasg

geetasg commented Apr 11, 2023

#9222 looks related. It seems reducing bbolt fragmentation would be a third option, in addition to options 1 and 2 mentioned here - #9222 (comment). Has this been discussed before, and was there any conclusion on the preferred design approach? Should contributors interested in solving this start from scratch or build on prior guidance? @serathius @ptabor /cc @chaochn47 @cenkalti

@cenkalti
Member

To me, it makes more sense to fix this at the BoltDB layer. By design (based on LMDB), the database should not require any maintenance operation. BoltDB has a FillPercent parameter to control page utilization when adding items to a page, but no control when removing items from a page. Related: etcd-io/bbolt#422

@tjungblu
Contributor

tjungblu commented Apr 12, 2023

From the K8s perspective, most fragmentation we see comes from Events; OpenShift also suffers from Images (CRDs for container image builds) on build-heavy clusters.

On larger clusters we advise sharding those to a separate etcd instance within the cluster, but maybe we could offer some "ephemeral keys" with more relaxed storage and consistency guarantees? Or which use a different storage engine than bbolt, e.g. rocksdb/leveldb (or anything LSM-based)...

@guettli
Author

guettli commented Apr 12, 2023

I would recommend looking into reducing bbolt fragmentation so we can get rid of defrag altogether, instead of adding another feature/subproject that increases maintenance cost.

@serathius this would be great. Having a cron job which defragments the non-leaders first, then the leader, is extra overhead, especially since there is no official version of such a script and people solve the same task again and again.

Let me know if I can help somehow.

@serathius
Member

cc @ptabor who mentioned some ideas to limit bbolt fragmentation.

@ahrtr
Member

ahrtr commented Apr 12, 2023

Note that I don't expect a bbolt-side change, at least in the near future, because we are still struggling to reproduce etcd-io/bbolt#402 and etcd-io/bbolt#446.

I think it makes sense to provide an official reference (just a reference!) on how to perform defragmentation. The rough idea (on top of all the inputs in this thread from e.g. @tjungblu, @chaochn47, etc.) is:

  1. Defragmentation is a time-consuming task, so it's recommended to do it for members one by one;
  2. Please do not run defragmentation if any member is unhealthy;
  3. It's recommended to defragment the leader last, because it might stop-the-world & cause transferring leadership multiple times, and cause additional performance impact (although usually it isn't a big deal);
  4. There is a known issue where etcd might run into data inconsistency if it crashes in the middle of an online defragmentation operation using etcdctl or the clientv3 API. All existing v3.5 releases are affected, including 3.5.0 ~ 3.5.5. So please use etcdutl to perform the defragmentation offline; this requires taking each member offline one at a time. That means you need to stop each etcd instance first, perform the defragmentation using etcdutl, and finally restart the instance (a sketch follows below). Please refer to issue 1 in the public statement

Please also see Compaction & Defragmentation

I might spend some time to provide such a script for reference.
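A per-member sketch of the offline route from point 4, assuming a systemd-managed member; the service name and data directory are placeholders:

# Offline defragmentation of one member with etcdutl (etcd >= 3.5).
systemctl stop etcd                       # placeholder service name
etcdutl defrag --data-dir /var/lib/etcd   # placeholder data directory
systemctl start etcd

# Verify the member is healthy before moving on to the next one.
etcdctl --endpoints=http://127.0.0.1:2379 endpoint health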

@guettli
Author

guettli commented Apr 13, 2023

@ahrtr "I might spend some time to provide a such script for reference."

An official script would really help here. The topic is too hot to let everybody re-solve it on their own.

@bradjones1320

I've been writing my own script for this. I guess my biggest question is:
If I get all the members of my cluster and then loop through them in bash, executing etcdctl --user root: --endpoints="$my_endpoint" defrag, will it wait for the defrag to finish before moving on to the next member?

@chaochn47
Member

execute etcdctl --user root: --endpoints="$my_endpoint" defrag will it wait for the defrag to finish

It does not. Please take a look at the discussion in #15664.

You could use curl -sL http://localhost:2379/metrics | grep "etcd_disk_defrag_inflight" to determine whether the defrag has completed in your script.

isDefragActive.Set(1)
defer isDefragActive.Set(0)
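If you go the metrics route, a small wait loop might look like this; a sketch that assumes the metrics are served on the client port:

# Wait until the in-flight defrag gauge drops back to 0.
while curl -s http://localhost:2379/metrics | grep -q '^etcd_disk_defrag_inflight 1'; do
  sleep 1
done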

@ahrtr
Member

ahrtr commented Apr 18, 2023

execute etcdctl --user root: --endpoints="$my_endpoint" defrag will it wait for the defrag to finish

It does not.

That isn't correct. It waits for the defrag to finish before moving on to the next member.

FYI, I am implementing a tool to do defragmentation. Hopefully the first version can be ready next week.

@cenkalti
Member

It does wait, but it times out after some duration. IIRC it's 30 seconds.

@ahrtr
Member

ahrtr commented Apr 18, 2023

It does wait, but it times out after some duration

Yes, that's another story. The default command timeout is 5s. It's recommended to set a bigger value (e.g. 1m) for defragmentation, because it may take a long time to defragment a large DB. I don't have performance data for now on how much time it may need for different DB sizes.

@cenkalti
Member

The pattern I usually see is about 10s per GB. @bradjones1320 you can set a larger timeout with the --command-timeout=60s flag.
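A sketch that scales the timeout with the DB size using that rough 10s/GB rate, plus some headroom; ENDPOINT is a placeholder and jq is required:

# Derive a defrag timeout from the current DB size (~10s per GB + 30s headroom).
db_size=$(etcdctl --endpoints="$ENDPOINT" endpoint status -w json | jq '.[0].Status.dbSize')
timeout_s=$(( db_size / (1024 * 1024 * 1024) * 10 + 30 ))
etcdctl --endpoints="$ENDPOINT" --command-timeout="${timeout_s}s" defrag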

@ahrtr
Member

ahrtr commented Apr 19, 2023

FYI. https://github.com/ahrtr/etcd-defrag

Just as I mentioned in #15477 (comment), the tool etcd-defrag:

  • runs defragmentation only when all members are healthy (note that it ignores the NOSPACE alarm);
  • runs defragmentation on the leader last.

@cenkalti
Member

@ahrtr

It's recommended to defragment the leader last, because it might stop-the-world & cause transferring leadership multiple times, and cause additional performance impact (although usually it isn't a big deal);

When you say "stop-the-world", are you referring only to the following check:

if ci > ai+maxGapBetweenApplyAndCommitIndex {
	return nil, errors.ErrTooManyRequests
}

or are there other reasons that might stop the world?

@ahrtr
Member

ahrtr commented Apr 21, 2023

When etcdserver is processing the defragmentation, it can't serve any client requests; see:

b.batchTx.LockOutsideApply()
defer b.batchTx.Unlock()
// lock database after lock tx to avoid deadlock.
b.mu.Lock()
defer b.mu.Unlock()
// block concurrent read requests while resetting tx
b.readTx.Lock()
defer b.readTx.Unlock()

The main functionality of https://github.com/ahrtr/etcd-defrag is ready; the remaining work is to add more utilities (e.g. a Dockerfile, a manifest for K8s, etc.). Please feel free to let me know if you have any suggestions or questions.

@miancheng7

miancheng7 commented Apr 22, 2023

@ahrtr

It's recommended to defragment the leader last, because it might stop-the-world & cause transferring leadership multiple times, and cause additional performance impact (although usually it isn't a big deal);

Could you share why defragmentation would cause leadership transfers? My understanding is that when the leader is processing the defragmentation, it blocks the system from reading and writing data; however, raft is not blocked, so defrag should not cause a leadership transfer.

FYI, I ran a test on a 3-node cluster. While defragmenting, the etcd leader node's health check failed, but there was no leader election.

Test logic

1. Feed 8 GiB of data to the etcd cluster.
2. Set up clients that continuously send reads/writes to all nodes.
3. Start defrag on the leader.
4. Check cluster health.
5. Check whether a leader election occurred.

test output

  • Before defragmenting, the cluster raft term is 7
% etcdctl endpoint status --cluster
+------------------------+------------------+---------+---------+-----------+-----------+------------+
|        ENDPOINT        |        ID        | VERSION | DB SIZE | IS LEADER | RAFT TERM | RAFT INDEX |
+------------------------+------------------+---------+---------+-----------+-----------+------------+
| http://127.0.0.1:22379 | 91bc3c398fb3c146 |   3.5.7 |  8.0 GB |      true |         7 |    1394738 |
+------------------------+------------------+---------+---------+-----------+-----------+------------+
  • While defragmenting the leader, the leader is unhealthy; clients connecting to the leader are blocked or receive a "too many requests" error.
status = StatusCode.UNKNOWN
details = "etcdserver: too many requests"
  • The defragmentation takes 6m17s
{
  "msg": "finished defragmenting directory",
  "current-db-size": "8.0 GB",
  "took": "6m17.969308853s"
}
  • After the defragmentation, the leader becomes healthy again and the raft term is still 7, which means no leader transfer occurred.
% etcdctl endpoint status --cluster -w table
+------------------------+------------------+---------+---------+-----------+-----------+------------+
|        ENDPOINT        |        ID        | VERSION | DB SIZE | IS LEADER | RAFT TERM | RAFT INDEX |
+------------------------+------------------+---------+---------+-----------+-----------+------------+
|  http://127.0.0.1:2379 | 8211f1d0f64f3269 |   3.5.7 |  8.1 GB |     false |         7 |    1808612 |
| http://127.0.0.1:22379 | 91bc3c398fb3c146 |   3.5.7 |  8.0 GB |      true |         7 |    1808612 |
| http://127.0.0.1:32379 | fd422379fda50e48 |   3.5.7 |  8.1 GB |     false |         7 |    1808612 |
+------------------------+------------------+---------+---------+-----------+-----------+------------+

@ahrtr
Member

ahrtr commented Apr 22, 2023

It turned out that the leader doesn't stop the world while processing defragmentation, because the apply workflow is executed asynchronously:

f := schedule.NewJob("server_applyAll", func(context.Context) { s.applyAll(&ep, &ap) })

so defrag will not cause a leadership transfer.

Confirmed that it doesn't cause a leadership transfer no matter how long the leader is blocked on processing defragmentation. That is an issue in itself, and we should fix it. I have a pending PR #15440; let me think about how to resolve them together.

@ahrtr
Member

ahrtr commented Apr 22, 2023

Again, it's still recommended to run defragmentation on the leader last, because the leader has more responsibilities (e.g. sending snapshots) than followers; once it's blocked for a long time, all the responsibilities dedicated to the leader stop working.

Please also read https://github.com/ahrtr/etcd-defrag

@ahrtr
Member

ahrtr commented Apr 22, 2023

Since we already have https://github.com/ahrtr/etcd-defrag, can we close this ticket? @guettli

FYI, I might formally release etcd-defrag v0.1.0 in the following 1 ~ 2 weeks.

@tjungblu
Contributor

Could you share why defragmentation would cause leadership transfers? My understanding is that when the leader is processing the defragmentation, it blocks the system from reading and writing data; however, raft is not blocked, so defrag should not cause a leadership transfer.

A defrag call will not cause a leadership transfer, but the resulting IO+CPU load might. Try again on a machine with a very slow disk or limited CPU; we've definitely seen this happen on loaded control planes.

@guettli
Author

guettli commented Apr 27, 2023

Closing this, since https://github.com/ahrtr/etcd-defrag exists.

@guettli guettli closed this as completed Apr 27, 2023
@TechDufus

It would still be awesome to get built-in / official support for this, yeah?

Or will https://github.com/ahrtr/etcd-defrag be the official and supported cluster defrag tool?

@guettli
Author

guettli commented Apr 27, 2023

@TechDufus good question. If you know the answer, please write it here in this issue. Thank you.

@ahrtr
Member

ahrtr commented Apr 27, 2023

Or will https://github.com/ahrtr/etcd-defrag be the official and supported cluster defrag tool?

  • I am the owner of the tool, so I will definitely support it;
  • It's an open source project, so any contribution is welcome.
