Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

roachtest: kv/splits/nodes=3/quiesce=false failed #31125

Closed
cockroach-teamcity opened this issue Oct 9, 2018 · 8 comments
Closed

roachtest: kv/splits/nodes=3/quiesce=false failed #31125

cockroach-teamcity opened this issue Oct 9, 2018 · 8 comments
Assignees
Labels
A-kv-client Relating to the KV client and the KV interface. C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot.
Milestone

Comments

@cockroach-teamcity
Copy link
Member

SHA: https://github.com/cockroachdb/cockroach/commits/dc0e73c728e533fdb3bec63e53eec174e920ff22

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stressrace instead of stress and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
make stress TESTS=kv/splits/nodes=3/quiesce=false PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-stderr=false -maxtime 20m -timeout 10m'

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=952220&tab=buildLog

The test failed on release-2.1:
	test.go:570,cluster.go:975,kv.go:192,cluster.go:1297,errgroup.go:58: /home/agent/work/.go/bin/roachprod run teamcity-952220-kv-splits-nodes-3-quiesce-false:4 -- ./workload run kv --init --max-ops=1 --concurrency=192 --splits=500000 {pgurl:1-3} returned:
		stderr:
		
		stdout:
		) TO (-5752107573081187328): dial tcp 10.128.0.29:26257: connect: connection refused
		W181009 08:09:06.229009 171 workload/workload.go:434  ALTER TABLE kv SCATTER FROM (-5671569249532018688) TO (-5671569249532018688): dial tcp 10.128.0.29:26257: connect: connection refused
		W181009 08:09:06.230111 30 workload/workload.go:434  ALTER TABLE kv SCATTER FROM (-5625710735481804800) TO (-5625710735481804800): dial tcp 10.128.0.29:26257: connect: connection refused
		W181009 08:09:06.230345 322 workload/workload.go:434  ALTER TABLE kv SCATTER FROM (-5759412469124583424) TO (-5759412469124583424): dial tcp 10.128.0.29:26257: connect: connection refused
		W181009 08:09:06.230405 250 workload/workload.go:434  ALTER TABLE kv SCATTER FROM (-5689204301596381184) TO (-5689204301596381184): dial tcp 10.128.0.29:26257: connect: connection refused
		W181009 08:09:06.230571 120 workload/workload.go:434  ALTER TABLE kv SCATTER FROM (-5763802785433494528) TO (-5763802785433494528): dial tcp 10.128.0.29:26257: connect: connection refused
		: signal: killed
	test.go:570,cluster.go:1318,kv.go:195: unexpected node event: 1: dead

@cockroach-teamcity cockroach-teamcity added this to the 2.2 milestone Oct 9, 2018
@cockroach-teamcity cockroach-teamcity added C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. labels Oct 9, 2018
@tbg
Copy link
Member

tbg commented Oct 9, 2018

Fell apart as expected.

fatal error: runtime: out of memory is the final outcome.

What's interesting is that the closed timestamp code seems to really matter:

Total: 5.81GB
ROUTINE ======================== github.com/cockroachdb/cockroach/pkg/storage/closedts/storage.merge in /go/src/github.com/cockroachdb/cockroach/pkg/storage/closedts/storage/storage_mem.go
    1.16GB     1.16GB (flat, cum) 19.92% of Total
         .          .    179:
         .          .    180:	// Initialize re as a deep copy of e.
         .          .    181:	re := e
         .          .    182:	re.MLAI = map[roachpb.RangeID]ctpb.LAI{}
         .          .    183:	for rangeID, mlai := range e.MLAI {
    1.16GB     1.16GB    184:		re.MLAI[rangeID] = mlai
         .          .    185:	}
         .          .    186:	// The result is full if either operand is.
         .          .    187:	re.Full = e.Full || ee.Full
         .          .    188:
         .          .    189:	// Use the larger of both timestamps with the union of the MLAIs, preferring larger

This is somewhat unexpected because this code should scale like the number of active ranges (active as in, max lease index grows). But perhaps, due to the lease requests that no doubt all ranges are trying to send due to the cluster falling over, the LAIs actually grow.

Two take-aways here: that code should try to allocate less (or at least bulk allocate the map), and we should look into not assigning a LAI for LeaseRequest (which we do today just because).

@tbg tbg self-assigned this Oct 9, 2018
@tbg tbg added the A-kv-client Relating to the KV client and the KV interface. label Oct 9, 2018
@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/4d878135975265d185b4d0011ba1d2dbaee0adcb

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stressrace instead of stress and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
make stress TESTS=kv/splits/nodes=3/quiesce=false PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-stderr=false -maxtime 20m -timeout 10m'

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=970685&tab=buildLog

The test failed on release-2.1:
	test.go:606,cluster.go:1098,kv.go:192,cluster.go:1420,errgroup.go:58: /home/agent/work/.go/bin/roachprod run teamcity-970685-kv-splits-nodes-3-quiesce-false:4 -- ./workload run kv --init --max-ops=1 --concurrency=192 --splits=500000 {pgurl:1-3} returned:
		stderr:
		
		stdout:
		6659463168) TO (5466884586659463168): dial tcp 10.128.0.23:26257: connect: connection refused
		W181017 11:39:38.451139 432 workload/workload.go:434  ALTER TABLE kv SCATTER FROM (5404829863704950784) TO (5404829863704950784): dial tcp 10.128.0.23:26257: connect: connection refused
		W181017 11:39:38.451163 350 workload/workload.go:434  ALTER TABLE kv SCATTER FROM (3583586463794405376) TO (3583586463794405376): dial tcp 10.128.0.23:26257: connect: connection refused
		W181017 11:39:38.451232 623 workload/workload.go:434  ALTER TABLE kv SCATTER FROM (3541712438495135744) TO (3541712438495135744): dial tcp 10.128.0.23:26257: connect: connection refused
		W181017 11:39:38.451253 281 workload/workload.go:434  ALTER TABLE kv SCATTER FROM (3587792313031514112) TO (3587792313031514112): dial tcp 10.128.0.23:26257: connect: connection refused
		W181017 11:39:38.451356 579 workload/workload.go:434  ALTER TABLE kv SCATTER FROM (5451278672384933888) TO (5451278672384933888): dial tcp 10.128.0.23:26257: connect: connection refused
		: signal: killed
	test.go:606,cluster.go:1441,kv.go:195: unexpected node event: 1: dead

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/a4950e75f6592b8cbea217cf1392d8ef8961130f

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
make stressrace TESTS=kv/splits/nodes=3/quiesce=false PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=985074&tab=buildLog

The test failed on master:
	test.go:639,cluster.go:1110,kv.go:192,cluster.go:1432,errgroup.go:58: /home/agent/work/.go/bin/roachprod run teamcity-985074-kv-splits-nodes-3-quiesce-false:4 -- ./workload run kv --init --max-ops=1 --concurrency=192 --splits=500000 {pgurl:1-3} returned:
		stderr:
		
		stdout:
		d.go:463  finished 193000 of 500000 splits
		I181025 17:39:59.673130 630 workload/workload.go:463  finished 194000 of 500000 splits
		I181025 17:50:15.913372 630 workload/workload.go:463  finished 195000 of 500000 splits
		I181025 17:59:36.551811 630 workload/workload.go:463  finished 196000 of 500000 splits
		I181025 18:10:34.616813 630 workload/workload.go:463  finished 197000 of 500000 splits
		I181025 18:23:53.716501 630 workload/workload.go:463  finished 198000 of 500000 splits
		I181025 18:36:52.615923 630 workload/workload.go:463  finished 199000 of 500000 splits
		I181025 18:45:42.393345 630 workload/workload.go:463  finished 200000 of 500000 splits
		I181025 18:55:50.257909 630 workload/workload.go:463  finished 201000 of 500000 splits
		I181025 19:13:03.347573 630 workload/workload.go:463  finished 202000 of 500000 splits
		Error: ALTER TABLE kv SPLIT AT VALUES (7935349154697844736): pq: split at key /Table/53/1/7935349154697844736 failed: aborted during DistSender.Send: context deadline exceeded
		Error:  exit status 1
		: exit status 1
	test.go:639,cluster.go:1453,kv.go:195: Goexit() was called

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/bbc646fc6de90b59c0253fd682667715959fb657

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
make stressrace TESTS=kv/splits/nodes=3/quiesce=false PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=993605&tab=buildLog

The test failed on master:
	test.go:639,cluster.go:1110,kv.go:192,cluster.go:1432,errgroup.go:58: /home/agent/work/.go/bin/roachprod run teamcity-993605-kv-splits-nodes-3-quiesce-false:4 -- ./workload run kv --init --max-ops=1 --concurrency=192 --splits=500000 {pgurl:1-3} returned:
		stderr:
		
		stdout:
		358 624 workload/workload.go:463  finished 197000 of 500000 splits
		I181030 21:27:49.786764 624 workload/workload.go:463  finished 198000 of 500000 splits
		I181030 21:32:52.077938 624 workload/workload.go:463  finished 199000 of 500000 splits
		I181030 21:38:02.939369 624 workload/workload.go:463  finished 200000 of 500000 splits
		I181030 21:42:50.621072 624 workload/workload.go:463  finished 201000 of 500000 splits
		I181030 21:47:35.693962 624 workload/workload.go:463  finished 202000 of 500000 splits
		I181030 21:52:20.087867 624 workload/workload.go:463  finished 203000 of 500000 splits
		I181030 21:56:59.079518 624 workload/workload.go:463  finished 204000 of 500000 splits
		I181030 22:02:11.824956 624 workload/workload.go:463  finished 205000 of 500000 splits
		I181030 22:07:03.557543 624 workload/workload.go:463  finished 206000 of 500000 splits
		I181030 22:12:23.196524 624 workload/workload.go:463  finished 207000 of 500000 splits
		I181030 22:17:30.329566 624 workload/workload.go:463  finished 208000 of 500000 splits
		: signal: killed
	test.go:639,cluster.go:1453,kv.go:195: Goexit() was called

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/5b5738084b8cdc769d5e7973921de5cae4457380

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
make stressrace TESTS=kv/splits/nodes=3/quiesce=false PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=995412&tab=buildLog

The test failed on master:
	test.go:639,cluster.go:1118,kv.go:192,cluster.go:1440,errgroup.go:58: /home/agent/work/.go/bin/roachprod run teamcity-995412-kv-splits-nodes-3-quiesce-false:4 -- ./workload run kv --init --max-ops=1 --concurrency=192 --splits=500000 {pgurl:1-3} returned:
		stderr:
		
		stdout:
		14:10:04.480921 612 workload/workload.go:463  finished 77000 of 500000 splits
		I181031 14:11:12.255938 612 workload/workload.go:463  finished 78000 of 500000 splits
		I181031 14:12:21.888991 612 workload/workload.go:463  finished 79000 of 500000 splits
		I181031 14:13:33.087480 612 workload/workload.go:463  finished 80000 of 500000 splits
		I181031 14:14:40.497472 612 workload/workload.go:463  finished 81000 of 500000 splits
		I181031 14:15:52.769111 612 workload/workload.go:463  finished 82000 of 500000 splits
		I181031 14:17:04.532139 612 workload/workload.go:463  finished 83000 of 500000 splits
		I181031 14:18:20.864020 612 workload/workload.go:463  finished 84000 of 500000 splits
		I181031 14:19:38.910141 612 workload/workload.go:463  finished 85000 of 500000 splits
		I181031 14:20:59.837932 612 workload/workload.go:463  finished 86000 of 500000 splits
		I181031 14:22:24.861909 612 workload/workload.go:463  finished 87000 of 500000 splits
		I181031 14:23:48.822468 612 workload/workload.go:463  finished 88000 of 500000 splits
		: signal: interrupt
	test.go:639,cluster.go:1461,kv.go:195: Goexit() was called

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/c196a52e8ec61851e32a062c13135bbb4111eb96

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
make stressrace TESTS=kv/splits/nodes=3/quiesce=false PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=997430&tab=buildLog

The test failed on master:
	test.go:639,cluster.go:1118,kv.go:192,cluster.go:1440,errgroup.go:58: /home/agent/work/.go/bin/roachprod run teamcity-997430-kv-splits-nodes-3-quiesce-false:4 -- ./workload run kv --init --max-ops=1 --concurrency=192 --splits=500000 {pgurl:1-3} returned:
		stderr:
		
		stdout:
		514 630 workload/workload.go:463  finished 209000 of 500000 splits
		I181101 21:09:35.951275 630 workload/workload.go:463  finished 210000 of 500000 splits
		I181101 21:16:02.111640 630 workload/workload.go:463  finished 211000 of 500000 splits
		I181101 21:21:48.229355 630 workload/workload.go:463  finished 212000 of 500000 splits
		I181101 21:27:39.739471 630 workload/workload.go:463  finished 213000 of 500000 splits
		I181101 21:33:50.393498 630 workload/workload.go:463  finished 214000 of 500000 splits
		I181101 21:39:11.030864 630 workload/workload.go:463  finished 215000 of 500000 splits
		I181101 21:44:17.881253 630 workload/workload.go:463  finished 216000 of 500000 splits
		I181101 21:49:22.739587 630 workload/workload.go:463  finished 217000 of 500000 splits
		I181101 21:54:06.301112 630 workload/workload.go:463  finished 218000 of 500000 splits
		I181101 21:59:05.189231 630 workload/workload.go:463  finished 219000 of 500000 splits
		I181101 22:03:45.464244 630 workload/workload.go:463  finished 220000 of 500000 splits
		: signal: killed
	test.go:639,cluster.go:1461,kv.go:195: Goexit() was called

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/06d2222fd9010f01a8cdf6a6c24597bbed181f36

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=kv/splits/nodes=3/quiesce=false PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1044826&tab=buildLog

The test failed on master:
	test.go:630,test.go:642,cluster.go:1065,kv.go:178: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod start --env=COCKROACH_MEMPROF_INTERVAL=1m --env=COCKROACH_DISABLE_QUIESCENCE=true --args=--cache=256MiB teamcity-1044826-kv-splits-nodes-3-quiesce-false:1-3 --encrypt returned:
		stderr:
		
		stdout:
		export ROACHPROD=3 && COCKROACH_DISABLE_QUIESCENCE=true ./cockroach start --insecure --store=path=/mnt/data1/cockroach --log-dir=${HOME}/logs --background --cache=25% --max-sql-memory=25% --port=26257 --http-port=26258 --locality=cloud=gce,region=us-central1,zone=us-central1-b --join=35.225.198.216:26257 --enterprise-encryption=path=/mnt/data1/cockroach,key=/mnt/data1/cockroach/aes-128.key,old-key=plain --cache=256MiB >> ${HOME}/logs/cockroach.stdout 2>> ${HOME}/logs/cockroach.stderr || (x=$?; cat ${HOME}/logs/cockroach.stderr; exit $x)
		
		github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install.Cockroach.Start.func7
			/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install/cockroach.go:399
		github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install.(*SyncedCluster).Parallel.func1.1
			/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install/cluster_synced.go:1155
		runtime.goexit
			/usr/local/go/src/runtime/asm_amd64.s:2361: 
		2018/12/07 13:36:51 command failed
		: exit status 1

@tbg
Copy link
Member

tbg commented Dec 11, 2018

#30832

@tbg tbg closed this as completed Dec 11, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-kv-client Relating to the KV client and the KV interface. C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot.
Projects
None yet
Development

No branches or pull requests

2 participants