Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pkg/search/controllers_test: reduce the probability of flaky test cases by waiting for cache to sync #5936

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

mohamedawnallah
Copy link
Contributor

@mohamedawnallah mohamedawnallah commented Dec 11, 2024

What type of PR is this?

/kind flake

What this PR does / why we need it:

This PR reduce the probability of flaky test cases by waiting for cache to sync. It seems fake client has some flaky issues with watch implementation that is different from real implementation as seen in kubernetes/kubernetes#87472 (comment)?

Which issue(s) this PR fixes:
Fixes #5926

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

NONE

Signed-off-by: Mohamed Awnallah <mohamedmohey2352@gmail.com>
@karmada-bot karmada-bot added the kind/flake Categorizes issue or PR as related to a flaky test. label Dec 11, 2024
@karmada-bot karmada-bot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Dec 11, 2024
@codecov-commenter
Copy link

⚠️ Please install the 'codecov app svg image' to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 48.11%. Comparing base (8457cd2) to head (627b464).

❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #5936   +/-   ##
=======================================
  Coverage   48.10%   48.11%           
=======================================
  Files         663      663           
  Lines       54769    54769           
=======================================
+ Hits        26349    26352    +3     
+ Misses      26711    26709    -2     
+ Partials     1709     1708    -1     
Flag Coverage Δ
unittests 48.11% <ø> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@XiShanYongYe-Chang
Copy link
Member

There is indeed such a setting in the controller startup, and I think this should work:

c.informerFactory.WaitForCacheSync(stopCh)

@zhzhuang-zju
Copy link
Contributor

I executed go test -v pkg/search/controller.go pkg/search/controllers_test.go --count=2000 twice in my environment, and it finally failed.

@XiShanYongYe-Chang
Copy link
Member

In terms of implementation, the WaitForCacheSync method should be called. There may be other problems with the ut failure. I feel that I can continue to push this pr and explore other problems.

Hi @mohamedawnallah can you revise it according to the comments?

@mohamedawnallah
Copy link
Contributor Author

I executed go test -v pkg/search/controller.go pkg/search/controllers_test.go --count=2000 twice in my environment, and it finally failed.

Thanks, @zhzhuang-zju, for testing this 2000 times, twice! You're right. I’ve also tried this, and as mentioned in the PR description, the proposed changes reduce the likelihood of flaky test cases. Before, the tests were failing more frequently. It would be interesting to gather more empirical observations from the GitHub CI test failures as @XiShanYongYe-Chang pointed, if any. What are your thoughts?

@mohamedawnallah
Copy link
Contributor Author

mohamedawnallah commented Dec 20, 2024

In terms of implementation, the WaitForCacheSync method should be called. There may be other problems with the ut failure. I feel that I can continue to push this pr and explore other problems.

Currently, the Start method on the controller is busy-waiting for a signal from stopCh in go wait.Until(c.worker, time.Second, stopCh), and during this process, it caches the next element every second. After testing this, I encountered some timing issues. We might need to introduce static sleep in the verification code to allow time for the busy-waiting goroutine to execute the caching. This could reduce control over the verification code in the test cases. For instance, when a cluster is upserted, we would want to cache it immediately after.

What do you think, @XiShanYongYe-Chang?

func (c *Controller) Start(stopCh <-chan struct{}) {
klog.Infof("Starting karmada search controller")
defer runtime.HandleCrash()
c.informerFactory.WaitForCacheSync(stopCh)
go wait.Until(c.worker, time.Second, stopCh)
go func() {
<-stopCh
genericmanager.StopInstance()
klog.Infof("Shutting down karmada search controller")
}()
}

@mohamedawnallah
Copy link
Contributor Author

Apologies for the delayed response to the PR feedback. I've been quite busy recently, but I’ve now added comments addressing all the feedback. I’d love to hear your thoughts on them. 🙏

@XiShanYongYe-Chang
Copy link
Member

We might need to introduce static sleep in the verification code to allow time for the busy-waiting goroutine to execute the caching.

I don't quite understand why we need to add static time to wait. Events are blocked waiting, and as long as events can be fetched from the queue, processing is executed. So I understand that the changes now are sufficient.

@mohamedawnallah
Copy link
Contributor Author

mohamedawnallah commented Dec 20, 2024

I don't quite understand why we need to add static time to wait. Events are blocked waiting, and as long as events can be fetched from the queue, processing is executed. So I understand that the changes now are sufficient.

Sorry for not providing enough context! 🙏 I was referring to what happens if we use controller.Start(test.stopCh) but the changes now are sufficient as you said.

@XiShanYongYe-Chang
Copy link
Member

Oh! Thanks @mohamedawnallah

Apologies for the delayed response to the PR feedback.

It's okay. As long as you have time, you can always contribute.

@XiShanYongYe-Chang
Copy link
Member

/approve
/cc @zhzhuang-zju

@karmada-bot
Copy link
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: XiShanYongYe-Chang

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@karmada-bot karmada-bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Dec 20, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. kind/flake Categorizes issue or PR as related to a flaky test. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

pkg/search/controller: flaky tests
5 participants