Umbrella issue for flaky tests #10700
As mentioned in the etcd community meeting, I have already collected the flaky tests. Here they are. Flaky tests:
We fixed the following tests. If you see them in CI failure after the fix PR, please let us know.
another one txn cancel
semaphoreci sometimes ran out of the 20m limit. |
another TestBalancerUnderBlackholeKeepAliveWatch
|
TestCtlV3AuthFromKeyPerm
https://semaphoreci.com/etcd-io/etcd/branches/pull-request-10683/builds/5 |
TestBarrierSingleNode
Noticed these failures in two different PRs today (basically same failure point for both tests):
https://semaphoreci.com/etcd-io/etcd/branches/pull-request-10631/builds/14 for |
Fixed by #10800. |
=== RUN TestLeasingGetWithOpts |
This one is pretty new; I have only observed it in two very recent test runs (might be related to recently merged PRs?):
--- FAIL: TestKVPutError (0.18s)
As reported under #10787 (I believe we should close it and track the test failure here)
This run *should* certainly pass, but it's consistently the one that fails with a regularity that essentially blocks the CI pipeline. Someone needs to take a look at etcd-io#10700, but in the meantime, the show must go on.
Just as a heads up, the linux-amd64-integration-4-cpu is now skipped as a temporary measure. From what I saw, it is usually the target that incurred the failures. In my book, this also increases the priority of resolving these flakes. |
@jingyih in response to #10700 (comment) these errors still appear today: https://semaphoreci.com/etcd-io/etcd/branches/pull-request-10903/builds/2
I found this issue, which is perhaps related: creack/pty#21
Edit (jingyih@, Fri 19 Jul 2019 05:47:20 PM PDT): fixed by #10908.
^- I pulled on this thread because these tests seem to somehow be related to version compatibility, and found that I broke this #10889. I probably didn't notice because the semaphore tests were failing for unrelated reasons. Will investigate more. I assume there's a better error message that just gets lost somehow. PS I am using |
Ha, took me 45 minutes to figure out how to get the output:
Now that I have it, here we go:
This was indeed introduced in #10889. Now to find out whether the test is doing something silly.
Still poking, but confirming that at least from the vantage point of raft there's a voter being removed:
OK, I got it: this is shady behavior on the part of etcd. There's this code in Lines 573 to 580 in 905aefc
Unfortunately that code removes before it adds, so if our node is B and the config is A, then it will first remove A and then add B, ending up with something empty in the middle. It needs to add everyone first, then remove. I'll send a patch. |
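To make the ordering problem concrete, here is a minimal sketch in Go. It is not the etcd code referenced above: `confChange`, `minVoters`, `removeThenAdd`, and `addThenRemove` are hypothetical names invented for illustration, and the voter set is modeled as a plain map rather than a real Raft configuration. It only shows why removing before adding can pass through an empty voter set, while adding first cannot.

```go
package main

import "fmt"

// confChange is a hypothetical, simplified stand-in for a single
// configuration change: either add or remove one voter ID.
type confChange struct {
	add bool
	id  uint64
}

// contains reports whether id is present in ids.
func contains(ids []uint64, id uint64) bool {
	for _, x := range ids {
		if x == id {
			return true
		}
	}
	return false
}

// removeThenAdd mimics the problematic ordering: removals first, then additions.
func removeThenAdd(oldIDs, newIDs []uint64) []confChange {
	var ccs []confChange
	for _, id := range oldIDs {
		if !contains(newIDs, id) {
			ccs = append(ccs, confChange{add: false, id: id})
		}
	}
	for _, id := range newIDs {
		if !contains(oldIDs, id) {
			ccs = append(ccs, confChange{add: true, id: id})
		}
	}
	return ccs
}

// addThenRemove emits the same changes, but additions come first.
func addThenRemove(oldIDs, newIDs []uint64) []confChange {
	var ccs []confChange
	for _, id := range newIDs {
		if !contains(oldIDs, id) {
			ccs = append(ccs, confChange{add: true, id: id})
		}
	}
	for _, id := range oldIDs {
		if !contains(newIDs, id) {
			ccs = append(ccs, confChange{add: false, id: id})
		}
	}
	return ccs
}

// minVoters applies the changes one at a time and returns the smallest
// voter count observed at any point, including the starting configuration.
func minVoters(initial []uint64, ccs []confChange) int {
	voters := make(map[uint64]struct{})
	for _, id := range initial {
		voters[id] = struct{}{}
	}
	lowest := len(voters)
	for _, cc := range ccs {
		if cc.add {
			voters[cc.id] = struct{}{}
		} else {
			delete(voters, cc.id)
		}
		if len(voters) < lowest {
			lowest = len(voters)
		}
	}
	return lowest
}

func main() {
	oldIDs := []uint64{1} // config is A
	newIDs := []uint64{2} // our node is B
	fmt.Println("remove-then-add: minimum voters =", minVoters(oldIDs, removeThenAdd(oldIDs, newIDs)))
	fmt.Println("add-then-remove: minimum voters =", minVoters(oldIDs, addThenRemove(oldIDs, newIDs)))
}
```

With config A (ID 1) and our node B (ID 2), the first ordering dips to zero voters in the middle, while the second never goes below one.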
It created a sequence of conf changes that could intermittently cause an empty set of voters, which Raft asserts against as of etcd-io#10889. This fixes TestCtlV2BackupSnapshot and TestCtlV2BackupV3Snapshot, see: etcd-io#10700 (comment)
Let's continue this on #10979 |
The current test CI is very flaky. We need to start fixing the flaky tests.
I will look over the recent CI failures and post the flaky test cases here.