[Meta] Fix random test failures #1715
Comments
I did a quick experiment overnight on my dev machine where I ran the
Results:
All 3 failures were caused by "Suite timeout exceeded (>= 1200000 msec)." From this I'll make a couple hypotheses:
I'm going to repeat my experiment but run the full Dev environment:
|
Another flaky test:
|
Looking into it. |
A simple plan to begin with can involve below steps:
|
I think it is a good idea to collect this data. It might be a bit hard to separate out the failures that were caused by the change in the PR that triggered the build. Setting up a test machine to run checks continually should be able to get similar data, and will have the benefit of running against a static code base.
We've probably seen enough of these to know they aren't reproducible when re-run in isolation. We have open issues with quite a few errors, and none of them can be reproduced even when re-running the individual test many, many times. I think running the entire test suite is the way to go, but we probably don't need to worry about the Jenkins stuff and can just trigger the |
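For anyone setting up a machine to run the checks continually, here is a rough sketch of a repeat-run harness. It is not the setup used in this thread. The command in the demo is a hypothetical stand-in, and the names `run_repeatedly` and `summarize` are made up for illustration. It records timeouts separately from ordinary failures, since suite timeouts came up above.

```python
import subprocess
import sys
from typing import Dict, List, Optional, Sequence


def run_repeatedly(cmd: Sequence[str], runs: int,
                   timeout_s: float = 3600.0) -> List[Optional[int]]:
    """Run `cmd` `runs` times and return one result per run.

    Each result is the process exit code, or None if the run exceeded
    `timeout_s` and was killed.
    """
    results: List[Optional[int]] = []
    for _ in range(runs):
        try:
            completed = subprocess.run(cmd, timeout=timeout_s)
            results.append(completed.returncode)
        except subprocess.TimeoutExpired:
            results.append(None)
    return results


def summarize(results: Sequence[Optional[int]]) -> Dict[str, int]:
    """Count passes, timeouts, and other failures."""
    timed_out = sum(1 for r in results if r is None)
    passed = sum(1 for r in results if r == 0)
    return {
        "runs": len(results),
        "passed": passed,
        "timed_out": timed_out,
        "failed": len(results) - passed - timed_out,
    }


if __name__ == "__main__":
    # Hypothetical stand-in command; substitute the real test invocation.
    demo_cmd = [sys.executable, "-c", "import sys; sys.exit(0)"]
    print(summarize(run_repeatedly(demo_cmd, runs=3)))
```

Running against a fixed commit keeps the code base static, so any failure in the tally is a flaky test rather than a regression from a PR.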
Another one, coming from: #1766
|
I ran another experiment over the weekend, the theory being that maybe
but the results were 7 failures out of 330, which is in line with the ~2% failure rate of the integ tests in isolation. The failures were:
There are likely bugs within ClusterHealthIT, ShardIndexingPressureIT, and ShardIndexingPressureSettingsIT that cause rare failures. But it remains a mystery what is causing |
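As a sanity check on "7 failures out of 330 is in line with ~2%", here is a small sketch that computes the observed rate and a 95% Wilson score interval. The Wilson interval is my choice for illustration, not something used in the experiment above.

```python
import math
from typing import Tuple


def wilson_interval(failures: int, runs: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial failure rate (z=1.96 is ~95%)."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    p = failures / runs
    denom = 1.0 + z * z / runs
    center = (p + z * z / (2.0 * runs)) / denom
    half_width = z * math.sqrt(p * (1.0 - p) / runs + z * z / (4.0 * runs * runs)) / denom
    return center - half_width, center + half_width


if __name__ == "__main__":
    failures, runs = 7, 330  # numbers from the weekend experiment
    low, high = wilson_interval(failures, runs)
    print(f"observed rate: {failures / runs:.4f}")
    print(f"95% interval:  [{low:.4f}, {high:.4f}]")
```

With only 330 runs the interval is wide, so this sample can't distinguish a 2% rate from, say, a 3% rate. It only shows the result is not inconsistent with the earlier figure.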
/cc @getsaurabh02 ShardIndexingPressureSettingsIT is a problem child. Can y'all investigate the recurring |
Suraj @dreamer-89 has been digging into the ShardIndexingPressureSettingsIT failures, tracked in #1843 |
👍 Also note open PR #1592 |
I copied some links into the body of this issue... it's quite a list. |
Another one: #2176. |
Between gradle check 6786 and 6688 (100 builds) the following tests failed more than once:
Another ~100 failed once. |
I am aiming to bring these flaky tests down to zero by Dec 30, 2022. If anyone wants to help with this effort, please feel free to pick one of the flaky test issues in this list |
I wrote a script to crawl the Jenkins output for unstable builds: https://gist.github.com/andrross/ee07a8a05beb63f1173bcb98523918b9 Below are the results for the last 1000 builds. There is a long tail of tests with a few failures, but the top 4 failures have issues already (#5219, #4212, #5157, #3603).
|
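For reference, the tallying step of a crawl like this can be sketched as below. This is not the linked gist. It assumes the failed test names per build have already been extracted into a dict, and the function names and sample data are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def tally_failures(builds: Dict[int, Iterable[str]]) -> List[Tuple[str, int]]:
    """Count in how many builds each test failed, most frequent first.

    `builds` maps a build number to the names of tests that failed in it.
    A test is counted at most once per build.
    """
    counts: Counter = Counter()
    for failed_tests in builds.values():
        counts.update(set(failed_tests))
    return counts.most_common()


def failed_more_than_once(builds: Dict[int, Iterable[str]]) -> List[Tuple[str, int]]:
    """Keep only tests that failed in at least two builds."""
    return [(name, n) for name, n in tally_failures(builds) if n > 1]


if __name__ == "__main__":
    # Made-up sample data, only to show the shape of the input.
    sample = {
        101: ["ExampleIT.testA", "OtherIT.testB"],
        102: ["ExampleIT.testA"],
        103: ["ThirdIT.testC"],
    }
    for name, n in failed_more_than_once(sample):
        print(f"{n:3d}  {name}")
```

Splitting the list at "more than once" separates the repeat offenders from the long tail of one-off failures.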
@andrross I swear I wrote very similar code to produce #1715 (comment), but where did I put it? :) thank you! |
Thanks @andrross for the script. I ran @andrross's script to get all flaky tests from the past 2 months (Sep 30, 2022 to Dec 5, 2022). Here is the list of 104 flaky tests found:
|
How flaky is acceptable? I closed #6739 after calculating the expected failure rate of a random-alpha-of-length-5 collision at 1 in 19,164. It failed once on run 12,467. It'll probably fail again in a few years. Is that OK? |
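To put the 1-in-19,164 figure in context, here is a quick sketch of how likely at least one failure is after a given number of runs. It assumes each run fails independently with the same probability, and it takes the 1-in-19,164 rate as given rather than re-deriving it.

```python
def prob_at_least_one_failure(p_fail: float, runs: int) -> float:
    """Probability of one or more failures in `runs` independent runs."""
    if not 0.0 <= p_fail <= 1.0:
        raise ValueError("p_fail must be between 0 and 1")
    if runs < 0:
        raise ValueError("runs must be non-negative")
    return 1.0 - (1.0 - p_fail) ** runs


def expected_runs_to_first_failure(p_fail: float) -> float:
    """Mean of the geometric distribution: 1 / p_fail."""
    if not 0.0 < p_fail <= 1.0:
        raise ValueError("p_fail must be in (0, 1]")
    return 1.0 / p_fail


if __name__ == "__main__":
    p = 1.0 / 19164  # rate quoted in the comment above
    print(f"P(failure within 12,467 runs): {prob_at_least_one_failure(p, 12467):.3f}")
    print(f"expected runs to first failure: {expected_runs_to_first_failure(p):.0f}")
```

Under those assumptions, a failure by run 12,467 is close to a coin flip, so one failure by then is unremarkable and doesn't by itself suggest the estimated rate is wrong.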
Closing this campaign. |
PRs were blocked by transient gradle check errors multiple times. Provide a plan to stabilize the tests.