storage: TestReplicateQueueDownReplicate timed out under stress #32256

Closed
cockroach-teamcity opened this issue Nov 13, 2018 · 2 comments
@cockroach-teamcity

SHA: https://github.com/cockroachdb/cockroach/commits/96aafe6226579176f496dfadae78b52d687c3faa

Parameters:

TAGS=
GOFLAGS=-race

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
make stressrace TESTS=(unknown) PKG=github.com/cockroachdb/cockroach/pkg/storage TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1009150&tab=buildLog

Slow failing tests:
TestAmbiguousResultErrorOnRetry/non-txn-put - 0.00s

Slow passing tests:
TestAllocatorFullDisks - 11.05s
TestReplicaCommandQueueCancellation - 5.26s
TestGCQueueChunkRequests - 4.45s
TestReplicaCommandQueueCancellationCascade - 2.94s
TestReplicaCommandQueueCancellationRandom - 2.75s
TestGCQueueMakeGCScoreInvariantQuick - 2.63s
TestReplicaCommandQueueCancellationLocal - 1.52s
TestProactiveRaftLogTruncate - 1.39s
TestRaftSSTableSideloadingProposal - 1.32s
TestReplicaCommandQueue - 0.68s

@cockroach-teamcity added this to the 2.2 milestone Nov 13, 2018
@cockroach-teamcity added the labels C-test-failure (Broken test, automatically or manually discovered) and O-robot (Originated from a bot) Nov 13, 2018
@cockroach-teamcity

SHA: https://github.com/cockroachdb/cockroach/commits/b775ca0318125c996305e4ab560e75d2d3471547

Parameters:

TAGS=
GOFLAGS=-race

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=(unknown) PKG=github.com/cockroachdb/cockroach/pkg/storage TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1037228&tab=buildLog

Slow failing tests:
TestReplicateQueueDownReplicate - 0.26s

Slow passing tests:
TestMergeQueue - 105.00s
TestReplicateQueueRebalance - 72.65s
TestRemovePlaceholderRace - 16.94s
TestStoreRangeSplitBackpressureWrites - 14.36s
TestReplicateQueueUpReplicate - 13.05s
TestSplitTriggerRaftSnapshotRace - 12.18s
TestGossipHandlesReplacedNode - 11.24s
TestAllocatorFullDisks - 10.91s
TestWedgedReplicaDetection - 10.87s
TestSnapshotRaftLogLimit - 9.09s
TestStoreSplitFailsAfterMaxRetries - 7.62s
TestReplicaCommandQueueCancellation - 5.21s
TestConsistencyQueueRecomputeStats - 4.37s
TestGCQueueChunkRequests - 4.25s
TestStoreRangeMergeRaftSnapshot - 4.17s
TestNodeLivenessStatusMap - 4.06s
TestGossipFirstRange - 3.88s
TestLogSplits - 3.72s
TestStoreRangeMergeUninitializedLHSFollower - 3.45s
TestStoreSplitOnRemovedReplica - 3.34s

@tbg changed the title from "storage: package timed out under stress" to "storage: TestReplicateQueueDownReplicate timed out under stress" Dec 3, 2018
@tbg

tbg commented Dec 3, 2018

The test spent close to half an hour stuck like this, retrying in WaitForFullReplication while the test cluster was starting up:

goroutine 122477 [select]:
github.com/cockroachdb/cockroach/pkg/util/retry.(*Retry).Next(0xc435eea8d8, 0xc435eea898)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/retry/retry.go:120 +0x200
github.com/cockroachdb/cockroach/pkg/testutils/testcluster.(*TestCluster).WaitForFullReplication(0xc420f41ab0, 0x3ea8720, 0xc424878ff0)
	/go/src/github.com/cockroachdb/cockroach/pkg/testutils/testcluster/testcluster.go:552 +0x2f0
github.com/cockroachdb/cockroach/pkg/testutils/testcluster.StartTestCluster(0x3f0e4e0, 0xc43657c780, 0x5, 0x3ea6260, 0xc420796ee0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/testutils/testcluster/testcluster.go:172 +0xa03
github.com/cockroachdb/cockroach/pkg/storage_test.TestReplicateQueueDownReplicate(0xc43657c780)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/replicate_queue_test.go:195 +0x1a4
testing.tRunner(0xc43657c780, 0x37201a0)
	/usr/local/go/src/testing/testing.go:777 +0x16e
created by testing.(*T).Run
	/usr/local/go/src/testing/testing.go:824 +0x565

Perhaps something is getting starved by the scanner timings?

	tc := testcluster.StartTestCluster(t, replicaCount+2,
		base.TestClusterArgs{
			ReplicationMode: base.ReplicationAuto,
			ServerArgs: base.TestServerArgs{
				ScanMinIdleTime: time.Millisecond,
				ScanMaxIdleTime: time.Millisecond,
				Knobs: base.TestingKnobs{
					Store: &storage.StoreTestingKnobs{
						// Prevent the merge queue from immediately discarding our splits.
						DisableMergeQueue: true,
					},
				},
			},
		},
	)

tbg added a commit to tbg/cockroach that referenced this issue Dec 11, 2018
Before this change,

> make test PKG=./pkg/storage/ TESTS=TestReplicateQueueDownReplicate
> TESTFLAGS='-count 10'

takes ~109s on my laptop. After this change, it takes ~18s.

This is in line with a test failure on CI which looked like the test
had just never managed to schedule the goroutines that matter in time
for things to wrap up.

Fixes cockroachdb#32256.

Release note: None
@tbg tbg assigned tbg and unassigned andreimatei Dec 11, 2018
craig bot pushed a commit that referenced this issue Dec 11, 2018
33013: storage: un-starve TestReplicateQueueDownReplicate r=petermattis a=tbg

Before this change,

> make test PKG=./pkg/storage/ TESTS=TestReplicateQueueDownReplicate
> TESTFLAGS='-count 10'

takes ~109s on my laptop. After this change, it takes ~18s.

This is in line with a test failure on CI which looked like the test
had just never managed to schedule the goroutines that matter in time
for things to wrap up.

Fixes #32256.

Release note: None

Co-authored-by: Tobias Schottdorf <tobias.schottdorf@gmail.com>
@craig craig bot closed this as completed in #33013 Dec 11, 2018
tbg added a commit to tbg/cockroach that referenced this issue Dec 29, 2018