etcdserver: Implement running defrag if freeable space will exceed provided threshold (on boot) #12941
Thank you. This will be very helpful for infrequent defragmentation that does not impact the tail latency of requests served by nodes in the cluster. If a member is updated or restarted (e.g. in response to a NO_SPACE alarm), it can be configured to perform some auto-healing action, like 'defrag'. This mitigation was discussed at the community meeting on July 30, 2020 in response to Marek: Please add a test (e.g. e2e) to at least guarantee that setting this flag does not crash the server.
Yes, defragging a live node is quite disruptive and can even cause "too many requests" errors. Defrag on bootstrap, on the other hand, seems safe.
Added an e2e test.
Overall this makes sense. While this could fairly trivially be added as an init container, it does have value as part of the server runtime.
Just a thought: do we have any concerns about race conditions and file locks? I am thinking about OSes that do not support flock. Should we consider gating this on supported architecture(s)?
What scenario do you envision? My reasoning: I assume we are running this before any 'concurrent' code within etcd is initialized. We could even move it before
Just at a high level, I have not dug into this yet, but if this code runs before the listeners are started, would anything block defrag from attempting to run against an already running etcd process?
zap.String("experimental-bootstrap-defrag-threshold", humanize.Bytes(uint64(thresholdBytes))),
)
return nil
}
We probably also should log here for the non-skipping case.
I used to think so, but be.Defrag internally has pretty decent logging already.
The reclamation of disk space can be important. But I guess the most important role of defrag is to rearrange the on-disk pages and reduce the size of the freelist. Thus it can also help with write throughput and latency (by reducing random writes). So I am not sure that size is the most important factor to configure. We could check how many small holes there are and let users configure that instead. But I do not really want to make the flag configuration too complicated either.
@xiang90 I assume that size is a good proxy for the number of pages being cleaned up (size/4096), and so for the number of entries being deleted from the free-pages list. From my perspective the goals are:
Because of these two, defragmenting only at boot time might not be very effective for reclaiming space or saving RAM. If we really want to reduce the size of the snapshot being sent, we'd better compact it before sending (do not send empty pages, as happens now?). If we really want to save RAM, we need to be smarter about mmap control (do not map large holes) and page allocation. Most of the issues we see in production for large clusters are a huge freelist (which increases search time) and random writes (multiple non-leaf branches need to be updated as well, compared to sequential writes).
Not entirely true when there are a lot of small holes vs. big holes. Should the items in the freelist be spans instead of items? But I guess most of the time it can be a good indicator.
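To make the "small holes vs. big holes" point concrete, here is a minimal sketch (not code from this PR or from bbolt). It assumes a 4096-byte page size, as in the size/4096 estimate above, and uses two hypothetical helpers, `freePages` and `countSpans`, to show that the same number of free pages can form one contiguous span or many scattered ones.

```go
package main

import (
	"fmt"
	"sort"
)

// pageSize is an assumption matching the size/4096 estimate in the discussion.
const pageSize = 4096

// freePages estimates how many pages a given number of freeable bytes covers.
func freePages(freeableBytes int64) int64 {
	return freeableBytes / pageSize
}

// countSpans counts maximal runs of consecutive page IDs.
// It assumes the IDs are unique; the input slice is not modified.
func countSpans(ids []uint64) int {
	if len(ids) == 0 {
		return 0
	}
	sorted := append([]uint64(nil), ids...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
	spans := 1
	for i := 1; i < len(sorted); i++ {
		if sorted[i] != sorted[i-1]+1 {
			spans++ // a gap in the page IDs starts a new span
		}
	}
	return spans
}

func main() {
	oneBigHole := []uint64{10, 11, 12, 13}
	manySmallHoles := []uint64{10, 20, 30, 40}

	// Both layouts free the same number of bytes (4 pages = 16384 bytes)...
	fmt.Println(freePages(4 * pageSize)) // 4
	// ...but the fragmentation they represent is very different.
	fmt.Println(countSpans(oneBigHole))     // 1
	fmt.Println(countSpans(manySmallHoles)) // 4
}
```

Under these assumptions, a size-based threshold treats both layouts identically, which is why size is only a proxy for fragmentation.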
xiang90
Defragmentation is an expensive operation to execute while serving traffic, as it requires locking the database for multiple seconds. Instead of defragmenting while etcd is operating, we can mitigate this cost by moving the process to server bootstrap. This also has the benefit of reducing maintenance cost for users who haven't set up full automation to trigger defrag periodically, but who still perform other operations such as upgrades.
This PR adds a new --experimental-bootstrap-defrag-threshold-megabytes flag to etcdserver that allows users to set a disk size in megabytes. During bootstrap, if the disk size that would be freed by defrag is greater than the configured threshold, etcdserver will automatically execute defrag before starting to serve traffic.
@ptabor
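The threshold check described above can be sketched as follows. This is an illustration under stated assumptions, not the PR's actual implementation: the helper names `freeableBytes` and `shouldDefragOnBoot` are hypothetical, a megabyte is assumed to be 1024*1024 bytes, and a threshold of 0 is assumed to disable the feature.

```go
package main

import "fmt"

// freeableBytes returns how many bytes a defrag could reclaim, given the
// on-disk size of the database and the size logically in use.
// (Hypothetical helper; both inputs are assumed to be reported by the backend.)
func freeableBytes(size, sizeInUse int64) int64 {
	if sizeInUse > size {
		return 0
	}
	return size - sizeInUse
}

// shouldDefragOnBoot reports whether the freeable space exceeds the threshold.
// Assumptions: thresholdMB == 0 disables the check, and 1 MB = 1024*1024 bytes.
func shouldDefragOnBoot(size, sizeInUse int64, thresholdMB uint) bool {
	if thresholdMB == 0 {
		return false
	}
	thresholdBytes := int64(thresholdMB) * 1024 * 1024
	return freeableBytes(size, sizeInUse) > thresholdBytes
}

func main() {
	const mb = 1024 * 1024
	// 400 MB freeable against a 200 MB threshold: defrag before serving.
	fmt.Println(shouldDefragOnBoot(500*mb, 100*mb, 200)) // true
	// 100 MB freeable against a 200 MB threshold: skip defrag.
	fmt.Println(shouldDefragOnBoot(500*mb, 400*mb, 200)) // false
}
```

The comparison is strict, following the description's wording that the freed size must be "greater than" the threshold.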