
Script for defragmentation #15477

Closed

guettli opened this issue Mar 14, 2023 · 38 comments

@guettli

guettli commented Mar 14, 2023

What would you like to be added?

I would like to see an official solution for how to defragment etcd.

AFAIK a one-line cron-job is not enough, since you should not defragment the current leader.

Related: #14975

Maybe it is enough to add a simple example script to the docs.

Why is this needed?

Defragmenting the leader can lead to performance degradation and should be avoided.

I don't think it makes sense that every company running etcd invents its own way to solve this.

@guettli
Author

guettli commented Mar 14, 2023

Here is one possible solution: https://github.com/ugur99/etcd-defrag-cronjob

@jmhbnz
Member

jmhbnz commented Mar 14, 2023

Thanks for raising the discussion on this. One complication is that, while Kubernetes is the largest user of etcd, it is not the only one.

With this in mind, I think we would need to consider what would be best suited to sit under the etcd operations guide docs versus the Kubernetes etcd operations docs.

Would it make more sense for this issue, or a tandem issue, to be raised against the Kubernetes etcd operations docs?

@guettli
Author

guettli commented Mar 14, 2023

I would like to have a solution (or documentation) for etcd.io first. I got bitten by outdated etcd docs on kubernetes.io once, and I think having docs in two places is confusing.

@chaochn47
Member

chaochn47 commented Mar 14, 2023

Defragmenting the leader can lead to performance degradation and should be avoided.

Hi @guettli

I think defragmenting the leader is equivalent to defragmenting a follower. Generally speaking, Raft is not blocked by the rewriting of the db file.

For example, while a defragmentation is running on the leader, etcdctl endpoint status will show the Raft Index (committed index) incrementing while the Raft Applied Index does not.
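To observe this yourself, here is a minimal sketch that polls the status in a loop, assuming a local endpoint and jq (field names per the v3.4+ JSON output of etcdctl; the endpoint is a placeholder):

# Watch committed vs. applied index while a defrag runs on the leader.
watch -n1 'etcdctl --endpoints=http://127.0.0.1:2379 endpoint status -w json | jq ".[0].Status | {raftIndex, raftAppliedIndex}"'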

@guettli
Author

guettli commented Mar 15, 2023

@chaochn47 thank you for your answer. What is your advice for defragmenting etcd? How do you handle it?

@chaochn47
Member

Hi @guettli, here is what I would suggest:

Every couple of minutes, evaluate whether etcd should run a defrag.

It will run defrag if:

  • More than 500 MB of space can be freed, AND
  • The DB size breaches a high-water mark of the quota (80%), OR it has been 24 hours since the last defrag on the node (each cluster runs 3 etcd nodes; the cutoff is local-timezone midnight)

It is guaranteed that defrag won't occur on more than one node at any given time.
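A minimal shell sketch of that policy check, assuming etcdctl v3.4+ (which reports dbSizeInUse in the JSON status) and jq; the endpoint, quota, and state file are illustrative placeholders, not part of any official tooling:

#!/usr/bin/env bash
# Hypothetical policy check: defrag only if >500 MB is reclaimable AND
# (DB size exceeds 80% of the quota OR the last defrag ran >24h ago).
set -euo pipefail

ENDPOINT="http://127.0.0.1:2379"            # placeholder endpoint
QUOTA=$((8 * 1024 * 1024 * 1024))           # example 8 GiB backend quota
STATE_FILE="/var/lib/etcd-defrag/last-run"  # illustrative timestamp file

status=$(etcdctl --endpoints="$ENDPOINT" endpoint status -w json)
db_size=$(echo "$status" | jq '.[0].Status.dbSize')
in_use=$(echo "$status" | jq '.[0].Status.dbSizeInUse')
freeable=$((db_size - in_use))

last_run=$(cat "$STATE_FILE" 2>/dev/null || echo 0)
age=$(( $(date +%s) - last_run ))

if (( freeable > 500 * 1024 * 1024 )) && \
   (( db_size > QUOTA * 80 / 100 || age > 24 * 3600 )); then
  etcdctl --endpoints="$ENDPOINT" --command-timeout=60s defrag
  date +%s > "$STATE_FILE"
fi

The one-node-at-a-time guarantee is deliberately left out of this sketch; it needs external coordination (e.g. staggered schedules per node).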

@tjungblu
Contributor

Generally speaking, Raft is not blocked by the rewriting of the db file.

That's true. In OpenShift we recommend doing the leader last, because the additional IO/memory/cache churn can impact performance negatively. If a defrag takes down the leader, the other nodes are at least safely defragged already and can continue with the next election. We also do not defrag if any member is unhealthy.

@guettli Are you looking for a simple bash script in etcd/contrib or something more official as part of the CLI?

@guettli
Author

guettli commented Mar 17, 2023

@tjungblu I don't have a preference about what the solution looks like. It could be a shell script, something added to etcdutl, or maybe just some docs.

@chaochn47 explained the steps, but I am not familiar enough with etcd to write a corresponding script to implement them. I hope that someone with more knowledge of etcd can provide an executable solution.

@jmhbnz
Member

jmhbnz commented Mar 17, 2023

Taking a quick look at how etcdctl defrag currently works, I'm wondering if we should make func defragCommandFunc more opinionated, so that when passed the --cluster flag it completes the defrag on all non-leader members first and then does the leader.

This would simplify downstream implementations of defrag functionality, as each one would not have to reinvent how to order the cluster-wide defrag, provided it was built on top of etcdctl.

We could then update the website docs or add a contrib/defrag reference implementation for Kubernetes.
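For reference, a rough shell sketch of that ordering (followers first, leader last, with the health gate mentioned earlier in the thread); this is not the etcdctl implementation, and the endpoints are placeholders:

#!/usr/bin/env bash
# Sketch: defragment followers first, the leader last.
# JSON field names are from etcdctl v3.5 output.
set -euo pipefail

ENDPOINTS="http://10.0.0.1:2379,http://10.0.0.2:2379,http://10.0.0.3:2379"

# Bail out if any member is unhealthy (exit status is non-zero on failure).
etcdctl --endpoints="$ENDPOINTS" endpoint health > /dev/null

statuses=$(etcdctl --endpoints="$ENDPOINTS" endpoint status -w json)

# An endpoint is the leader when its own member_id equals the reported leader id.
followers=$(echo "$statuses" | jq -r '.[] | select(.Status.header.member_id != .Status.leader) | .Endpoint')
leader=$(echo "$statuses" | jq -r '.[] | select(.Status.header.member_id == .Status.leader) | .Endpoint')

for ep in $followers "$leader"; do
  etcdctl --endpoints="$ep" --command-timeout=60s defrag
done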

@guettli
Author

guettli commented Mar 17, 2023

@jmhbnz would it be possible to get this into etcdctl:

It will run defrag if:
More than 500 MB of space can be freed, AND
The DB size breaches a high-water mark of the quota (80%), OR it has been 24 hours since the last defrag on the node (each cluster runs 3 etcd nodes; the cutoff is local-timezone midnight).
It is guaranteed that defrag won't occur on more than one node at any given time.

Then the defragmentation call would not need to be wrapped in a "dirty" shell script.

@jmhbnz
Member

jmhbnz commented Mar 19, 2023

Hey @guettli - I don't think we can build all of that into etcdctl; for example, etcdctl doesn't currently do any cron-style scheduling, to my knowledge.

As mentioned, I do think we can solve the part about defragmenting members one at a time, leader last, as built-in behavior in etcdctl, provided the --cluster flag is used.

For some of the other requirements we have been working out in this issue, like scheduling or perhaps some of the monitoring-based checks, I think those will need to be handled either as documentation or as additional resources in etcd/contrib, for example a Kubernetes CronJob or an example shell script implementation.

@ahrtr, @serathius - Keen for maintainer input on this. If what I have suggested makes sense feel free to assign to me and I can work on it.
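On the scheduling piece: for plain-Linux deployments it can be as simple as a crontab entry invoking a policy-check script like the sketch earlier in this thread (the script path is hypothetical):

# Illustrative crontab entry: evaluate the defrag policy every 5 minutes.
*/5 * * * * /usr/local/bin/etcd-defrag-check.sh >> /var/log/etcd-defrag.log 2>&1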

@jmhbnz jmhbnz assigned jmhbnz and unassigned jmhbnz Apr 8, 2023
@jmhbnz
Member

jmhbnz commented Apr 11, 2023

Apologies, removing my assignment for this as I am about to be traveling for several weeks and attending KubeCon, so I likely won't have much capacity for a while. If anyone else has capacity, they are welcome to pick it up.

@serathius
Member

I would recommend looking into reducing bbolt fragmentation so we can get rid of defrag altogether, instead of adding another feature/subproject that increases maintenance cost.

@geetasg

geetasg commented Apr 11, 2023

#9222 looks related. It seems reducing bbolt fragmentation would be a third option, in addition to options 1 and 2 mentioned here - #9222 (comment). Has this been discussed before, and was there any conclusion on the preferred design approach? Should contributors interested in solving this start from scratch or build on prior guidance? @serathius @ptabor /cc @chaochn47 @cenkalti

@cenkalti
Member

To me, it makes more sense to fix this at the BoltDB layer. By design (based on LMDB), the database should not require any maintenance operation. BoltDB has a FillPercent parameter to control page utilization when adding items to a page, but no control when removing items from a page. Related: etcd-io/bbolt#422

@tjungblu
Contributor

tjungblu commented Apr 12, 2023

From the K8s perspective, most fragmentation we see comes from Events; OpenShift also suffers from Images (CRDs for container image builds) on build-heavy clusters.

On larger clusters we advise sharding those to a separate etcd instance within the cluster, but maybe we could offer some "ephemeral keys" with more relaxed storage and consistency guarantees? Or which use a different storage engine than bbolt, e.g. rocksdb/leveldb (or anything LSM-based)...

@guettli
Author

guettli commented Apr 12, 2023

I would recommend looking into reducing bbolt fragmentation so we can get rid of defrag altogether, instead of adding another feature/subproject that increases maintenance cost.

@serathius this would be great. Having a cron job which defragments the non-leaders first, then the leader, is extra overhead, especially since there is no official version of such a script and people solve the same task again and again.

Let me know if I can help somehow.

@serathius
Member

cc @ptabor who mentioned some ideas to limit bbolt fragmentation.

@ahrtr
Member

ahrtr commented Apr 12, 2023

Note that I don't expect a bbolt-side change, at least in the near future, because we are still struggling to reproduce etcd-io/bbolt#402 and etcd-io/bbolt#446.

I think it makes sense to provide an official reference (just a reference!) on how to perform defragmentation. The rough idea (on top of all the inputs in this thread from e.g. @tjungblu, @chaochn47, etc.) is:

  1. Defragmentation is a time-consuming task, so it's recommended to do it for members one by one;
  2. Please do not run defragmentation if any member is unhealthy;
  3. It's recommended to defragment the leader last, because it might stop-the-world & cause transferring leadership multiple times, and cause additional performance impact (although usually it isn't a big deal);
  4. There is a known issue where etcd might run into data inconsistency if it crashes in the middle of an online defragmentation operation using etcdctl or the clientv3 API. All existing v3.5 releases are affected, including 3.5.0 ~ 3.5.5. So please use etcdutl to perform the defragmentation offline; this requires taking each member offline one at a time. That means you need to stop each etcd instance first, perform the defragmentation using etcdutl, and finally restart the instance (a sketch follows below). Please refer to issue 1 in the public statement

Please also see Compaction & Defragmentation

I might spend some time to provide such a script for reference.
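A per-member sketch of the offline route from point 4, assuming a systemd-managed member; the service name and data directory are placeholders:

# Offline defragmentation of one member with etcdutl (etcd >= 3.5).
systemctl stop etcd                       # placeholder service name
etcdutl defrag --data-dir /var/lib/etcd   # placeholder data directory
systemctl start etcd

# Verify the member is healthy before moving on to the next one.
etcdctl --endpoints=http://127.0.0.1:2379 endpoint health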

@guettli
Author

guettli commented Apr 13, 2023

@ahrtr "I might spend some time to provide a such script for reference."

An official script would really help here. The topic is too hot to let everybody re-solve it on their own.

@bradjones1320

I've been writing my own script for this. I guess my biggest question is:
If I get all the members of my cluster and then loop through them in bash, executing etcdctl --user root: --endpoints="$my_endpoint" defrag, will it wait for the defrag to finish before moving on to the next member?

@chaochn47
Member

execute etcdctl --user root: --endpoints="$my_endpoint" defrag will it wait for the defrag to finish

It does not. Please take a look at the discussion in #15664.

You could use curl -sL http://localhost:2379/metrics | grep "etcd_disk_defrag_inflight" to determine whether the defrag has completed in your script.

isDefragActive.Set(1)
defer isDefragActive.Set(0)
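If you go the metrics route, a small wait loop might look like this; a sketch that assumes the metrics are served on the client port:

# Wait until the in-flight defrag gauge drops back to 0.
while curl -s http://localhost:2379/metrics | grep -q '^etcd_disk_defrag_inflight 1'; do
  sleep 1
done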

@ahrtr
Member

ahrtr commented Apr 18, 2023

execute etcdctl --user root: --endpoints="$my_endpoint" defrag will it wait for the defrag to finish

It does not.

That isn't correct. It waits for the defrag to finish before moving on to the next member.

FYI, I am implementing a tool to do defragmentation. Hopefully the first version can be ready next week.

@cenkalti
Member

It does wait, but it times out after some duration. IIRC it's 30 seconds.

@ahrtr
Member

ahrtr commented Apr 18, 2023

It does wait, but it times out after some duration

Yes, that's another story. The default command timeout is 5s. It's recommended to set a bigger value (e.g. 1m) for defragmentation, because it may take a long time to defragment a large DB. I don't have performance data for now on how much time it may need for different DB sizes.

@cenkalti
Member

The pattern I usually see is about 10s per GB. @bradjones1320 you can set a larger timeout with the --command-timeout=60s flag.
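A sketch that scales the timeout with the DB size using that rough 10s/GB rate, plus some headroom; ENDPOINT is a placeholder and jq is required:

# Derive a defrag timeout from the current DB size (~10s per GB + 30s headroom).
db_size=$(etcdctl --endpoints="$ENDPOINT" endpoint status -w json | jq '.[0].Status.dbSize')
timeout_s=$(( db_size / (1024 * 1024 * 1024) * 10 + 30 ))
etcdctl --endpoints="$ENDPOINT" --command-timeout="${timeout_s}s" defrag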

@ahrtr
Member

ahrtr commented Apr 19, 2023

FYI. https://github.com/ahrtr/etcd-defrag

Just as I mentioned in #15477 (comment), the tool etcd-defrag:

  • runs defragmentation only when all members are healthy (note that it ignores the NOSPACE alarm);
  • runs defragmentation on the leader last.

@cenkalti
Member

@ahrtr

It's recommended to defragment the leader last, because it might stop-the-world & cause transferring leadership multiple times, and cause additional performance impact (although usually it isn't a big deal);

When you say "stop-the-world", are you referring only to the following check:

if ci > ai+maxGapBetweenApplyAndCommitIndex {
	return nil, errors.ErrTooManyRequests
}

or are there other reasons that might stop the world?

@ahrtr
Member

ahrtr commented Apr 21, 2023

When etcdserver is processing the defragmentation, it can't serve any client requests; see:

b.batchTx.LockOutsideApply()
defer b.batchTx.Unlock()
// lock database after lock tx to avoid deadlock.
b.mu.Lock()
defer b.mu.Unlock()
// block concurrent read requests while resetting tx
b.readTx.Lock()
defer b.readTx.Unlock()

The main functionality of https://github.com/ahrtr/etcd-defrag is ready; the remaining work is to add more utilities (e.g. a Dockerfile, a manifest for K8s, etc.). Please feel free to let me know if you have any suggestions or questions.

@miancheng7

miancheng7 commented Apr 22, 2023

@ahrtr

It's recommended to defragment the leader last, because it might stop-the-world & cause transferring leadership multiple times, and cause additional performance impact (although usually it isn't a big deal);

Could you share why defragmentation would cause leadership transfers? My understanding is that when the leader is processing the defragmentation, it blocks the system from reading and writing data; however, raft is not blocked, so defrag should not cause a leadership transfer.

FYI, I ran a test on a 3-node cluster. While defragmenting, the etcd leader node's health check failed, but there was no leader election.

Test logic

1. Feed 8 GiB of data to the etcd cluster.
2. Set up clients that continuously send reads/writes to all nodes.
3. Start defrag on the leader.
4. Check cluster health.
5. Check whether a leader election occurred.

test output

  • Before defragmenting, the cluster raft term is 7
% etcdctl endpoint status --cluster
+------------------------+------------------+---------+---------+-----------+-----------+------------+
|        ENDPOINT        |        ID        | VERSION | DB SIZE | IS LEADER | RAFT TERM | RAFT INDEX |
+------------------------+------------------+---------+---------+-----------+-----------+------------+
| http://127.0.0.1:22379 | 91bc3c398fb3c146 |   3.5.7 |  8.0 GB |      true |         7 |    1394738 |
+------------------------+------------------+---------+---------+-----------+-----------+------------+
  • While defragmenting the leader, the leader is unhealthy; clients connecting to the leader are blocked or receive a "too many requests" error.
status = StatusCode.UNKNOWN
details = "etcdserver: too many requests"
  • The defragmentation takes 6m17s
{
  "msg": "finished defragmenting directory",
  "current-db-size": "8.0 GB",
  "took": "6m17.969308853s"
}
  • After the defragmentation, the leader becomes healthy again and the raft term is still 7, which means no leader transfer occurred.
% etcdctl endpoint status --cluster -w table
+------------------------+------------------+---------+---------+-----------+-----------+------------+
|        ENDPOINT        |        ID        | VERSION | DB SIZE | IS LEADER | RAFT TERM | RAFT INDEX |
+------------------------+------------------+---------+---------+-----------+-----------+------------+
|  http://127.0.0.1:2379 | 8211f1d0f64f3269 |   3.5.7 |  8.1 GB |     false |         7 |    1808612 |
| http://127.0.0.1:22379 | 91bc3c398fb3c146 |   3.5.7 |  8.0 GB |      true |         7 |    1808612 |
| http://127.0.0.1:32379 | fd422379fda50e48 |   3.5.7 |  8.1 GB |     false |         7 |    1808612 |
+------------------------+------------------+---------+---------+-----------+-----------+------------+

@ahrtr
Member

ahrtr commented Apr 22, 2023

It turned out that the leader doesn't stop the world while processing defragmentation, because the apply workflow is executed asynchronously:

f := schedule.NewJob("server_applyAll", func(context.Context) { s.applyAll(&ep, &ap) })

so defrag will not cause a leadership transfer.

Confirmed that it doesn't cause a leadership transfer no matter how long the leader is blocked on processing defragmentation. That is an issue in itself, and we should fix it. I have a pending PR #15440; let me think about how to resolve them together.

@ahrtr
Member

ahrtr commented Apr 22, 2023

Again, it's still recommended to run defragmentation on the leader last, because the leader has more responsibilities (e.g. sending snapshots) than followers; once it's blocked for a long time, all the responsibilities dedicated to the leader stop working.

Please also read https://github.com/ahrtr/etcd-defrag

@ahrtr
Member

ahrtr commented Apr 22, 2023

Since we already have https://github.com/ahrtr/etcd-defrag, can we close this ticket? @guettli

FYI, I might formally release etcd-defrag v0.1.0 in the following 1 ~ 2 weeks.

@tjungblu
Contributor

Could you share why defragmentation would cause leadership transfers? My understanding is that when the leader is processing the defragmentation, it blocks the system from reading and writing data; however, raft is not blocked, so defrag should not cause a leadership transfer.

A defrag call will not cause a leadership transfer, but the resulting IO+CPU load might. Try again on a machine with a very slow disk or limited CPU; we've definitely seen this happen on loaded control planes.

@guettli
Author

guettli commented Apr 27, 2023

Closing this, since https://github.com/ahrtr/etcd-defrag exists.

@guettli guettli closed this as completed Apr 27, 2023
@TechDufus

It would still be awesome to get built-in / official support for this, yeah?

Or will https://github.com/ahrtr/etcd-defrag be the official and supported cluster defrag tool?

@guettli
Author

guettli commented Apr 27, 2023

@TechDufus good question. If you know the answer, please write it here in this issue. Thank you.

@ahrtr
Member

ahrtr commented Apr 27, 2023

Or will https://github.com/ahrtr/etcd-defrag be the official and supported cluster defrag tool?

  • I am the owner of the tool, so I will definitely support it;
  • It's an open source project, so any contribution is welcome.
