kv/kvserver: TestReplicateQueueDeadNonVoters and TestReplicateQueueSwapVotersWithNonVoters timing out #65932
Refs: cockroachdb#65932 Reason: flaky test Generated by bin/skip-test. Release justification: non-production code changes Release note: None
65867: changefeedccl: Fix flaky tests. r=miretskiy a=miretskiy

Fix flaky test and re-enable it to run under stress. The problem was that the transaction executed by the table feed can be restarted. If that happens, we would see the same keys again, but because we had side effects inside the transaction (marking the keys seen), we would not emit those keys, causing the test to hang. The stress race was failing because of both transaction restarts and the 10ms resolved timestamp frequency (with so many resolved timestamps being generated, the table feed transaction was always getting restarted). Fixes #57754. Fixes #65168. Release note: None

65868: storage: expose pebble.IteratorStats through {MVCC,Engine}Iterator r=sumeerbhola a=sumeerbhola

These will potentially be aggregated before being exposed in trace statements, EXPLAIN ANALYZE, etc. Release note: None

65900: roachtest: fix ruby-pg test suite r=rafiss a=RichardJCai

Update the blocklist with a passing test. The "not run" test causing a failure is a test that is no longer failing; since it is not failing, it shows up under "not run". Release note: None

65910: sql/gcjob: retry failed GC jobs r=ajwerner a=sajjadrizvi

In the previous implementation, failed GC jobs were not retried, regardless of whether the failure was permanent or transient. As a result, a GC job's failure risked orphaned data, which could never be reclaimed. This commit adds a mechanism to retry failed GC jobs whose failures are not permanent. No limit is set on the number of retries. For the time being, the failure type is determined based on the failure categorization of schema-change jobs. This behavior is expected to change once an exponential backoff mechanism is implemented for failed jobs (#44594). Fixes #65000. Release note: None

65925: ccl/importccl: skip TestImportPgDumpSchemas/inject-error-ensure-cleanup r=tbg a=adityamaru

Refs: #65878 Reason: flaky test Generated by bin/skip-test. Release justification: non-production code changes Release note: None

65933: kv/kvserver: skip TestReplicateQueueDeadNonVoters under race r=sumeerbhola a=sumeerbhola

Refs: #65932 Reason: flaky test Generated by bin/skip-test. Release justification: non-production code changes Release note: None

65934: kv/kvserver: skip TestReplicateQueueSwapVotersWithNonVoters under race r=sumeerbhola a=sumeerbhola

Refs: #65932 Reason: flaky test Generated by bin/skip-test. Release justification: non-production code changes Release note: None

65936: jobs: fix flaky TestMetrics r=fqazi a=ajwerner

Fixes #65735. The test needed to wait for the job to be fully marked as paused. Release note: None

Co-authored-by: Yevgeniy Miretskiy <yevgeniy@cockroachlabs.com>
Co-authored-by: sumeerbhola <sumeer@cockroachlabs.com>
Co-authored-by: richardjcai <caioftherichard@gmail.com>
Co-authored-by: Sajjad Rizvi <sajjad@cockroachlabs.com>
Co-authored-by: Aditya Maru <adityamaru@gmail.com>
Co-authored-by: Andrew Werner <awerner32@gmail.com>
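(The 65867 fix above is a classic retryable-transaction hazard worth sketching. The snippet below is a hypothetical reconstruction, not the actual tablefeed code: crdb.ExecuteTx re-runs the closure on transaction restarts, so any state mutated inside the closure that outlives a single attempt — like the seen map here — poisons the retry.)

package tablefeed

import (
	"context"
	"database/sql"

	"github.com/cockroachdb/cockroach-go/v2/crdb"
)

// pollKeys is a hypothetical stand-in for the table feed's poller: it
// scans a table inside a retryable transaction and emits keys it has
// not delivered before.
func pollKeys(ctx context.Context, db *sql.DB, seen map[string]bool, emit func(string)) error {
	var batch []string
	err := crdb.ExecuteTx(ctx, db, nil, func(tx *sql.Tx) error {
		batch = batch[:0] // per-attempt state is correctly reset on retry...
		rows, err := tx.QueryContext(ctx, `SELECT k FROM kv`)
		if err != nil {
			return err
		}
		defer rows.Close()
		for rows.Next() {
			var k string
			if err := rows.Scan(&k); err != nil {
				return err
			}
			if !seen[k] {
				seen[k] = true // ...but this side effect survives a txn restart.
				batch = append(batch, k)
			}
		}
		return rows.Err()
	})
	if err != nil {
		return err
	}
	// After a restart, seen already contains the keys but batch was
	// reset, so nothing is emitted and a test waiting on those keys
	// hangs. The fix is to mark keys seen only after the txn commits.
	for _, k := range batch {
		emit(k)
	}
	return nil
}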
When I attempt to reproduce these failures under high concurrency, these tests simply do not make progress, due to either node liveness or gossip getting wedged. Under low concurrency (i.e. with …). Funnily enough, almost all the other … These seem to be another instance of spurious testrace timeouts that result from CI machines being too pegged. @tbg @erikgrinaker, I'd like to understand how y'all feel about declaring bankruptcy here again. It's unsatisfying, but I'm also not sure there's much else to be done.
What happens if you make raft timeouts effectively never fire by setting the timeout to something very large? I often wonder whether setting these timeouts to be extremely large ought to be the norm under stress/stressrace.

--- a/pkg/kv/kvserver/replicate_queue_test.go
+++ b/pkg/kv/kvserver/replicate_queue_test.go
@@ -522,6 +522,9 @@ func TestReplicateQueueDeadNonVoters(t *testing.T) {
base.TestClusterArgs{
ReplicationMode: base.ReplicationAuto,
ServerArgs: base.TestServerArgs{
+ RaftConfig: base.RaftConfig{
+ RaftElectionTimeoutTicks: 1000,
+ },
Knobs: base.TestingKnobs{
NodeLiveness: kvserver.NodeLivenessTestingKnobs{
StorePoolNodeLivenessFn: func(
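For context, a self-contained version of that override would look roughly like the sketch below (the test name and body are hypothetical). If I remember right, RaftElectionTimeoutTicks is multiplied by the raft tick interval (200ms by default), so 1000 ticks pushes the election timeout out to roughly 200s, effectively disabling elections triggered by a pegged machine:

package kvserver_test

import (
	"context"
	"testing"

	"github.com/cockroachdb/cockroach/pkg/base"
	"github.com/cockroachdb/cockroach/pkg/testutils/testcluster"
)

func TestWithEffectivelyDisabledRaftElections(t *testing.T) {
	tc := testcluster.StartTestCluster(t, 3, base.TestClusterArgs{
		ReplicationMode: base.ReplicationAuto,
		ServerArgs: base.TestServerArgs{
			RaftConfig: base.RaftConfig{
				// Large enough that elections never fire during the
				// test, even on a heavily loaded CI machine.
				RaftElectionTimeoutTicks: 1000,
			},
		},
	})
	defer tc.Stopper().Stop(context.Background())
	// ... test body ...
}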
That's a good point. I tried with …
We have a ton of these tests that create some scenario and drain the …
Running timing-sensitive tests with >= 3 nodes under race has never worked well for me, and tends to create far more noise than signal. Skipping them under race seems fine to me.
66487: kvserver: skip TestReplicateQueueDecommissioningNonVoters under race r=aayushshah15 a=aayushshah15

These tests frequently time out under race when our CI machines are too pegged. See discussion on the linked issue. Closes #65932. Release note: None

Co-authored-by: Aayush Shah <aayush.shah15@gmail.com>
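(The skip itself is the usual pattern from pkg/testutils/skip; a rough sketch, not the verbatim change from 66487:)

package kvserver_test

import (
	"testing"

	"github.com/cockroachdb/cockroach/pkg/testutils/skip"
)

func TestReplicateQueueDecommissioningNonVoters(t *testing.T) {
	// Timing-sensitive multi-node tests create more noise than signal
	// under the race detector on pegged CI machines; see #65932.
	skip.UnderRace(t, "flaky under race")
	// ... rest of the test ...
}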
Under race:
https://teamcity.cockroachdb.com/viewLog.html?buildId=3036495&tab=buildResultsDiv&buildTypeId=Cockroach_UnitTests_Testrace
https://teamcity.cockroachdb.com/viewLog.html?buildId=3036336&tab=buildResultsDiv&buildTypeId=Cockroach_UnitTests_Testrace
https://teamcity.cockroachdb.com//viewLog.html?buildId=3034938&buildTypeId=Cockroach_MergedExtendedCi