Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kv/kvserver: TestReplicateQueueDeadNonVoters and TestReplicateQueueSwapVotersWithNonVoters timing out #65932

Closed
sumeerbhola opened this issue Jun 1, 2021 · 4 comments · Fixed by #66487
Assignees
Labels
branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). skipped-test

Comments

@sumeerbhola sumeerbhola added C-test-failure Broken test (automatically or manually discovered). branch-master Failures and bugs on the master branch. skipped-test labels Jun 1, 2021
sumeerbhola added a commit to sumeerbhola/cockroach that referenced this issue Jun 1, 2021
Refs: cockroachdb#65932

Reason: flaky test

Generated by bin/skip-test.

Release justification: non-production code changes

Release note: None
sumeerbhola added a commit to sumeerbhola/cockroach that referenced this issue Jun 1, 2021
Refs: cockroachdb#65932

Reason: flaky test

Generated by bin/skip-test.

Release justification: non-production code changes

Release note: None
@aayushshah15 aayushshah15 self-assigned this Jun 1, 2021
craig bot pushed a commit that referenced this issue Jun 1, 2021
65867: changefeedccl: Fix flaky tests. r=miretskiy a=miretskiy

Fix flaky test and re-enable it to run under stress.
The problem was that the transaction executed by the table feed can
be restarted.  If that happens, then we would see the same keys again,
but because we had side effects inside transaction (marking the keys
seen), we would not emit those keys causing the test to be hung.
The stress race was failing because of both transaction restarts and
the 10ms resolved timestamp frequency (with so many resolved timestamps
being generated, the table feed transaction was always getting
restarted).

Fixes #57754
Fixes #65168

Release Notes: None

65868: storage: expose pebble.IteratorStats through {MVCC,Engine}Iterator r=sumeerbhola a=sumeerbhola

These will potentially be aggregated before exposing in trace
statements, EXPLAIN ANALYZE etc.

Release note: None

65900: roachtest: fix ruby-pg test suite r=rafiss a=RichardJCai

Update blocklist with passing test.
The not run test causing a failure is because the test is no longer failing.
Since it is not failing, it shows up under not run.

Release note: None

65910: sql/gcjob: retry failed GC jobs r=ajwerner a=sajjadrizvi

In the previous implementation, failed GC jobs were not being retried regardless
whether the failure is permanent or transient. As a result, a GC job's failure
risked orphaned data, which cannot be reclaimed.

This commit adds a mechanism to retry failed GC jobs that are not permanent. No
limit is set on the number of retries. For the time being, the failure type is
determined based on the failure categorization of schema-change jobs. This
behavior is expected to change once exponential backoff mechanism is
implemented for failed jobs (#44594).

Release note: None

Fixes: #65000

Release note (<category, see below>): <what> <show> <why>

65925: ccl/importccl: skip TestImportPgDumpSchemas/inject-error-ensure-cleanup r=tbg a=adityamaru

Refs: #65878

Reason: flaky test

Generated by bin/skip-test.

Release justification: non-production code changes

Release note: None

65933: kv/kvserver: skip TestReplicateQueueDeadNonVoters under race r=sumeerbhola a=sumeerbhola

Refs: #65932

Reason: flaky test

Generated by bin/skip-test.

Release justification: non-production code changes

Release note: None

65934: kv/kvserver: skip TestReplicateQueueSwapVotersWithNonVoters under race r=sumeerbhola a=sumeerbhola

Refs: #65932

Reason: flaky test

Generated by bin/skip-test.

Release justification: non-production code changes

Release note: None

65936: jobs: fix flakey TestMetrics r=fqazi a=ajwerner

Fixes #65735

The test needed to wait for the job to be fully marked as paused.

Release note: None

Co-authored-by: Yevgeniy Miretskiy <yevgeniy@cockroachlabs.com>
Co-authored-by: sumeerbhola <sumeer@cockroachlabs.com>
Co-authored-by: richardjcai <caioftherichard@gmail.com>
Co-authored-by: Sajjad Rizvi <sajjad@cockroachlabs.com>
Co-authored-by: Aditya Maru <adityamaru@gmail.com>
Co-authored-by: Andrew Werner <awerner32@gmail.com>
@aayushshah15
Copy link
Contributor

When I attempt to reproduce these failures under high concurrency, these tests simply do not make progress due to either node liveness getting wedged or gossip getting wedged. Under low concurrency (i.e. with -p 2 or -p 4 on beefy roachprod machines), I cannot get any failures/timeouts whatsoever.

Funnily enough, almost all the other TestReplicateQueue... tests in that file are already skipped for the same reason and the one that isn't (TestReplicateQueueDecommissioningNonVoters) is failing the exact same way like the two tests this issue is about.

These seem to be another instance of spurious testrace timeouts that are a result of CI machines being too pegged. @tbg @erikgrinaker I'd like to understand how y'all feel about declaring bankruptcy here again. It's unsatisfying, but I'm also not sure there's much else to be done.

@ajwerner
Copy link
Contributor

What happens if you make raft timeouts effectively not happen by setting the timeout to something very large? I often wonder whether setting these timeouts to be extremely large out to be the norm under stress/stressrace.

--- a/pkg/kv/kvserver/replicate_queue_test.go
+++ b/pkg/kv/kvserver/replicate_queue_test.go
@@ -522,6 +522,9 @@ func TestReplicateQueueDeadNonVoters(t *testing.T) {
                        base.TestClusterArgs{
                                ReplicationMode: base.ReplicationAuto,
                                ServerArgs: base.TestServerArgs{
+                                       RaftConfig: base.RaftConfig{
+                                               RaftElectionTimeoutTicks: 1000,
+                                       },
                                        Knobs: base.TestingKnobs{
                                                NodeLiveness: kvserver.NodeLivenessTestingKnobs{
                                                        StorePoolNodeLivenessFn: func(

@aayushshah15
Copy link
Contributor

aayushshah15 commented Jun 15, 2021

That's a good point.

I tried with RaftElectionTimeoutTicks: 1000 but kept running into a bunch of other slowness that impedes the TestCluster setup. For instance, with ReplicationAuto, the test cluster will try to upreplicate before returning from its Start() method. However, it's hard to make this succeed, as the replicate queue keeps butting its head against a bunch of our context cancellation deadlines.

kv/kvserver/queue.go:1093  [n1,raftsnapshot,s1,r4/1:/System{/tsd-tse}] 2809  operation "raftsnapshot queue process replica 4" timed out after 1m0s: context deadline exceeded
...
[n2,replicate,s2,r19/2:/Table/2{3-4}] 2894  failed to rollback LEARNER n3,s3, abandoning it for the replicate queue: operation "learner rollback" timed out after 10s: change replicas of r19 failed: log-range-event: context deadline exceeded

We have a ton of these tests that create some scenario and drain the {replicate,merge,split}Queue to assert that the queue actually took some action, and its hard to see how we could make these tests non-flakey if they're always going to be running on overloaded CI machines.

@erikgrinaker
Copy link
Contributor

These seem to be another instance of spurious testrace timeouts that are a result of CI machines being too pegged. @tbg @erikgrinaker I'd like to understand how y'all feel about declaring bankruptcy here again. It's unsatisfying, but I'm also not sure there's much else to be done.

Running timing-sensitive tests with >= 3 nodes under race has never worked well for me, and tends to create far more noise than signal. Skipping them under race seems fine to me.

craig bot pushed a commit that referenced this issue Jun 15, 2021
66487: kvserver: skip TestReplicateQueueDecommissioningNonVoters under race r=aayushshah15 a=aayushshah15

These tests frequently time out under race when our CI machines are too pegged.
See discussion on the linked issue.

Closes #65932

Release note: None

Co-authored-by: Aayush Shah <aayush.shah15@gmail.com>
@craig craig bot closed this as completed in c9849d6 Jun 15, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). skipped-test
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants