Skip to content

Forcefully expire lease in integration test to fix flake #129812

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

Jefftree
Copy link
Member

What type of PR is this?

/kind flake

What this PR does / why we need it:

This test could flake because we rely on a very tight timing window for a lease to expire and another client to pick up the lease. This PR forcefully expires the lease so we do not need to wait until lease is expired and reduces the retryPeriod, reducing the maximum interval until a lease is picked up from 7s to 1s. Ideally we don't need to deal with any time constants at all, but I feel that mocking everything will defeat the purpose of the test .

Per the configurations:
LeaseDuration = 5s
RetryPeriod = 2s

In theory, the maximum interval until the lease is picked up is 5s+2s = 7s. However, because apiserver requests are not instantaneous and due to apf and sharing resources in integration tests, there could be occasional occurrences of this surpassing 10s. Investigating the logs of flaking tests show that the the actual write times differ compared to the prescribed intervals and could be an entire magnitude slower. 10s is the exact threshold where we timeout this test and fail with the error "Expected the cle lease lock to transition to the first apiserver".

Which issue(s) this PR fixes:

Fixes #129802

Special notes for your reviewer:

Does this PR introduce a user-facing change?

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


/triage accepted
/assign @sttts @jpbetz

@k8s-ci-robot k8s-ci-robot added the release-note-none Denotes a PR that doesn't merit a release note. label Jan 24, 2025
@k8s-ci-robot k8s-ci-robot added kind/flake Categorizes issue or PR as related to a flaky test. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Jan 24, 2025
@k8s-ci-robot k8s-ci-robot added area/test sig/testing Categorizes an issue or PR as relevant to SIG Testing. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jan 24, 2025
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Jefftree

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 24, 2025
}
if lease.Spec.HolderIdentity != nil && *lease.Spec.HolderIdentity != leaseName {
lease.Spec.RenewTime = &metav1.MicroTime{Time: time.Now().Add(-30 * time.Second)}
_, err = clientset.CoordinationV1().Leases("kube-system").Update(ctx, lease, metav1.UpdateOptions{})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could this conflict?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it can when the first apiserver took over meanwhile.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another thought: could the first apiserver go into some backoff when getting the validation errors repeated while trying to take over?

Copy link
Member Author

@Jefftree Jefftree Jan 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it can when the first apiserver took over meanwhile.

Yes you're right. I think it will avoid conflict if we expire the lease while the VAP is still active to prevent a write from the first apiserver.

Another thought: could the first apiserver go into some backoff when getting the validation errors repeated while trying to take over?

ugh I think that's exactly what's causing the flakes. I noticed that there was considerable delay in renewal in the flakes and originally attributed it to degraded performance in CI. Given our jitter factor of 1.2 and retryperiod of 2s, we have a max delay of approx 2 * (1+1.2) = 4.4s. Thus, given the lease duration of 5s and max delay of 4.4s, the max time it takes for renewal is approx 9.4s which is technically under the 10s limit of the integration test wait timeout but is fragile if resources are constrained.

Reducing the retryperiod to 1s and expiring the lease instantly will yield a max delay of 1 * (1+1.2) = 2.2s which is well under the 10s limit of the integration test timeout. I've documented this a bit better, lmk if the solution suffices.

@Jefftree
Copy link
Member Author

/test pull-kubernetes-cmd

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/test cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/flake Categorizes issue or PR as related to a flaky test. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. release-note-none Denotes a PR that doesn't merit a release note. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
None yet
4 participants