Proposal Apply Rate spikes #8096
Address etcd-io#8096. Signed-off-by: Gyu-Ho Lee <gyuhox@gmail.com>
Randomize the very first expiry on lease recovery to prevent recovered leases from expiring all at the same time. Address etcd-io#8096. Signed-off-by: Gyu-Ho Lee <gyuhox@gmail.com>
You need to wait one or two hours to let leases accumulate before triggering an election. Then wait another hour to check for spikes.
Also, the k8s tests need to be running against etcd.
@gyuho The spike doesn't appear for every leader election. Actually, events are generated every second in the Kubernetes cluster.
I can confirm it is fixed with our new test result. @heyitsanthony is working on a smoother algorithm, though.
Instead of unconditionally randomizing, extend leases on promotion if too many leases expire within the same time span. If the server has few leases or spread out expires, there will be no extension.

Squashed previous commits for #8149.

This is a combination of 4 commits below:

lease: randomize expiry on initial refresh call

Randomize the very first expiry on lease recovery to prevent recovered leases from expiring all at the same time. Address #8096.

Signed-off-by: Gyu-Ho Lee <gyuhox@gmail.com>

integration: remove lease exist checking on randomized expiry

Lease with TTL 5 should be renewed with randomization, thus it's still possible to exist after 3 seconds.

Signed-off-by: Gyu-Ho Lee <gyuhox@gmail.com>

lessor: extend leases on promote if expires will be rate limited

Instead of unconditionally randomizing, extend leases on promotion if too many leases expire within the same time span. If the server has few leases or spread out expires, there will be no extension.

Revert "integration: remove lease exist checking on randomized expiry"

This reverts commit 95bc33f. The new lease extension algorithm should pass this test.
Instead of unconditionally randomizing, extend leases on promotion if too many leases expire within the same time span. If the server has few leases or spread out expires, there will be no extension.

Squashed previous commits for #8149.

Author: Anthony Romano <anthony.romano@coreos.com>

This is a combination of 4 commits below:

lease: randomize expiry on initial refresh call

Randomize the very first expiry on lease recovery to prevent recovered leases from expiring all at the same time. Address #8096.

integration: remove lease exist checking on randomized expiry

Lease with TTL 5 should be renewed with randomization, thus it's still possible to exist after 3 seconds.

lessor: extend leases on promote if expires will be rate limited

Instead of unconditionally randomizing, extend leases on promotion if too many leases expire within the same time span. If the server has few leases or spread out expires, there will be no extension.

Revert "integration: remove lease exist checking on randomized expiry"

This reverts commit 95bc33f. The new lease extension algorithm should pass this test.
@xiang90 and I found the root cause and have a solution. I'm creating this issue for reference.
Our testing cluster with kubemark showed spikes in sum(rate(etcd_server_proposals_applied_total[1m])).

Logs around the spike:
Nothing special on etcd server logs.
So we investigated the WAL entries, and found that all spikes are from lease revokes:
Dashboard with rate(etcd_debugging_server_lease_expired_total[1m]) also confirms this in the run-etcd-1-2 node:

If we further investigate the specific logs from the run-etcd-1-2 node (since the latest snapshot log dump only showed lease revokes):

TTL was 3600 (1-hour), and there was a leader election an hour before.
When a leader election happens, we renew all the leases, and those renewed leases are then all revoked at the same time an hour later, hence the spikes.
We plan to randomize leases on recovery, in addition to rate limiting lease revokes.