Forcefully expire lease in integration test to fix flake #129812
base: master
Conversation
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: Jefftree. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
	}
	if lease.Spec.HolderIdentity != nil && *lease.Spec.HolderIdentity != leaseName {
		lease.Spec.RenewTime = &metav1.MicroTime{Time: time.Now().Add(-30 * time.Second)}
		_, err = clientset.CoordinationV1().Leases("kube-system").Update(ctx, lease, metav1.UpdateOptions{})
could this conflict?
I think it can, if the first apiserver takes over in the meantime.
Another thought: could the first apiserver go into some backoff when it repeatedly gets validation errors while trying to take over?
> I think it can, if the first apiserver takes over in the meantime.
Yes, you're right. I think we will avoid a conflict if we expire the lease while the VAP is still active, preventing a write from the first apiserver.
> Another thought: could the first apiserver go into some backoff when it repeatedly gets validation errors while trying to take over?
Ugh, I think that's exactly what's causing the flakes. I noticed considerable delay in renewal in the flaking runs and originally attributed it to degraded performance in CI. Given our jitter factor of 1.2 and retryPeriod of 2s, we have a max delay of approximately 2 * (1 + 1.2) = 4.4s. Thus, with a lease duration of 5s and a max delay of 4.4s, the maximum time it takes for renewal is approximately 9.4s, which is technically under the 10s wait timeout of the integration test but fragile if resources are constrained.
Reducing the retryPeriod to 1s and expiring the lease instantly yields a max delay of 1 * (1 + 1.2) = 2.2s, which is well under the 10s limit. I've documented this a bit better; let me know if the solution suffices.
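A quick sketch of that arithmetic (illustrative only; it assumes the retry period is jittered multiplicatively, i.e. delay ≤ retryPeriod * (1 + jitterFactor), as described above):

```go
package main

import (
	"fmt"
	"time"
)

// maxRenewalTime estimates the worst-case time until the lease is renewed:
// the remaining lease duration plus one maximally jittered retry period.
func maxRenewalTime(leaseDuration, retryPeriod time.Duration, jitterFactor float64) time.Duration {
	maxDelay := time.Duration(float64(retryPeriod) * (1 + jitterFactor))
	return leaseDuration + maxDelay
}

func main() {
	// Old settings: 5s lease duration + 2s*(1+1.2) = 4.4s retry delay ≈ 9.4s worst case.
	fmt.Println(maxRenewalTime(5*time.Second, 2*time.Second, 1.2))
	// New settings: the lease is expired up front (no 5s wait) and the retry
	// period is 1s, so the worst case is 1s*(1+1.2) = 2.2s.
	fmt.Println(maxRenewalTime(0, 1*time.Second, 1.2))
}
```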
test/integration/apiserver/coordinatedleaderelection/leaderelection_test.go
/test pull-kubernetes-cmd
What type of PR is this?
/kind flake
What this PR does / why we need it:
This test could flake because we rely on a very tight timing window for a lease to expire and another client to pick up the lease. This PR forcefully expires the lease, so we do not need to wait until the lease expires, and reduces the retryPeriod, reducing the maximum interval until a lease is picked up from 7s to 1s. Ideally we wouldn't need to deal with any time constants at all, but I feel that mocking everything would defeat the purpose of the test.
Per the configurations:
LeaseDuration = 5s
RetryPeriod = 2s
In theory, the maximum interval until the lease is picked up is 5s + 2s = 7s. However, because apiserver requests are not instantaneous, and due to APF and shared resources in integration tests, this can occasionally surpass 10s. Investigating the logs of flaking tests shows that the actual write times differ from the prescribed intervals and can be an entire order of magnitude slower. 10s is the exact threshold where we time out this test and fail with the error
"Expected the cle lease lock to transition to the first apiserver".
Which issue(s) this PR fixes:
Fixes #129802
Special notes for your reviewer:
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:
/triage accepted
/assign @sttts @jpbetz