roachtest: clearrange/checks=false failed #38095

Closed · cockroach-teamcity opened this issue Jun 7, 2019 · 14 comments · Fixed by #38632
@cockroach-teamcity (Member) commented Jun 7, 2019

SHA: https://github.com/cockroachdb/cockroach/commits/b83798cadfee6447d565688b58657843741f8a45

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=clearrange/checks=false PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1328389&tab=buildLog

The test failed on branch=master, cloud=gce:
	cluster.go:1513,clearrange.go:59,clearrange.go:34,test.go:1248: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1328389-clearrange-checks-false:1 -- ./cockroach workload fixtures import bank --payload-bytes=10240 --ranges=0 --rows=65104166 --seed=4 --db=bigbank returned:
		stderr:
		
		stdout:
		I190607 06:56:29.891228 1 ccl/workloadccl/cliccl/fixtures.go:324  starting import of 1 tables
		Error: importing fixture: importing table bank: pq: communication error: rpc error: code = Canceled desc = context canceled
		Error:  ssh verbose log retained in /root/.roachprod/debug/ssh_35.243.244.95_2019-06-07T06:56:29Z: exit status 1
		: exit status 1

cockroach-teamcity added this to the 19.2 milestone Jun 7, 2019
cockroach-teamcity added the C-test-failure (Broken test, automatically or manually discovered), O-roachtest, and O-robot (Originated from a bot) labels Jun 7, 2019
@tbg (Member) commented Jun 19, 2019

Import failure. cc @dt

tbg closed this as completed Jun 19, 2019
@tbg (Member) commented Jun 19, 2019

(Closing because I'm pretty sure nothing will be learned from this error message.)

@tbg (Member) commented Jun 19, 2019

Oh, actually this is in the logs for n9:

W190607 07:21:23.398171 2482 storage/engine/rocksdb.go:119  [rocksdb] [/go/src/github.com/cockroachdb/cockroach/c-deps/rocksdb/db/column_family.cc:779] [default] Stalling writes because we have 3 immutable memtables (waiting for flush), max_write_buffer_number is set to 4 rate 16777216
W190607 07:21:23.902685 2482 storage/engine/rocksdb.go:119  [rocksdb] [/go/src/github.com/cockroachdb/cockroach/c-deps/rocksdb/db/column_family.cc:779] [default] Stalling writes because we have 3 immutable memtables (waiting for flush), max_write_buffer_number is set to 4 rate 16777216
W190607 07:21:24.124494 2482 storage/engine/rocksdb.go:119  [rocksdb] [/go/src/github.com/cockroachdb/cockroach/c-deps/rocksdb/db/column_family.cc:779] [default] Stalling writes because we have 3 immutable memtables (waiting for flush), max_write_buffer_number is set to 4 rate 16777216
W190607 07:21:24.152291 2482 storage/engine/rocksdb.go:119  [rocksdb] [/go/src/github.com/cockroachdb/cockroach/c-deps/rocksdb/db/column_family.cc:779] [default] Stalling writes because we have 3 immutable memtables (waiting for flush), max_write_buffer_number is set to 4 rate 16777216
W190607 07:21:24.978756 2482 storage/engine/rocksdb.go:119  [rocksdb] [/go/src/github.com/cockroachdb/cockroach/c-deps/rocksdb/db/column_family.cc:779] [default] Stalling writes because we have 3 immutable memtables (waiting for flush), max_write_buffer_number is set to 4 rate 16777216
W190607 07:21:25.641529 2426 storage/engine/rocksdb.go:119  [rocksdb] [/go/src/github.com/cockroachdb/cockroach/c-deps/rocksdb/db/column_family.cc:779] [default] Stalling writes because we have 3 immutable memtables (waiting for flush), max_write_buffer_number is set to 4 rate 16777216
W190607 07:21:25.706968 2482 storage/engine/rocksdb.go:119  [rocksdb] [/go/src/github.com/cockroachdb/cockroach/c-deps/rocksdb/db/column_family.cc:779] [default] Stalling writes because we have 3 immutable memtables (waiting for flush), max_write_buffer_number is set to 4 rate 16777216

and there are plenty of underreplicated ranges. No node died, though.

tbg reopened this Jun 19, 2019
tbg assigned dt and ajkr and unassigned tbg Jun 19, 2019
jeffrey-xiao self-assigned this Jun 26, 2019
craig bot pushed a commit that referenced this issue Jun 26, 2019
38433: sql: fix swallowing Unsplit error r=jeffrey-xiao a=jeffrey-xiao

Realized this condition should be negated.

Fixes #38131, #38095.

Release note: None

Co-authored-by: Jeffrey Xiao <jeffrey.xiao1998@gmail.com>
@ajkr (Contributor) commented Jun 27, 2019

Oh, a slow flush. That reminds me that the temp store and the main store are still sharing a single flush thread.

@ajkr (Contributor) commented Jun 27, 2019

Actually, all the nodes' logs are filled with write-stall messages. I guess we are writing very fast to the temp store. We recently changed the temp store to never sync the WAL, so writes are expected to have less natural backpressure. Instead, they'll eventually hit RocksDB's limits and get artificially backpressured (which happens to print a warning, unlike natural backpressure).

Anyway, I expect separating the thread pools is still worthwhile, but it won't prevent the warning from being printed. I'm also not sure the original "context canceled" error message is related.
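
Below is a toy Go sketch (my own illustration, not CockroachDB or RocksDB code) of the dynamic described above: with WAL syncs removed there is no natural backpressure on the writer, so the only throttle left is the memtable limit (`max_write_buffer_number` in RocksDB terms), which stalls writes and prints a warning once too many immutable memtables pile up behind a single, slow flush worker.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

const maxWriteBuffers = 4 // analogous to max_write_buffer_number

func main() {
	var mu sync.Mutex
	cond := sync.NewCond(&mu)
	immutable := 0 // memtables waiting for flush

	// Single flush worker (shared by the "temp store" and "main store" in the
	// scenario above); flushing is slow relative to writing.
	go func() {
		for {
			mu.Lock()
			for immutable == 0 {
				cond.Wait()
			}
			mu.Unlock()
			time.Sleep(100 * time.Millisecond) // slow flush
			mu.Lock()
			immutable--
			cond.Broadcast()
			mu.Unlock()
		}
	}()

	// Fast writer: each "memtable" fills almost instantly because nothing
	// waits on a WAL sync (no natural backpressure).
	for i := 0; i < 20; i++ {
		mu.Lock()
		for immutable >= maxWriteBuffers-1 {
			fmt.Printf("Stalling writes because we have %d immutable memtables (waiting for flush)\n", immutable)
			cond.Wait() // artificial backpressure, with a warning
		}
		immutable++ // memtable filled and rotated to immutable
		cond.Broadcast()
		mu.Unlock()
		time.Sleep(5 * time.Millisecond)
	}
	fmt.Println("writer done")
}
```

In the real system, syncing the WAL on each commit would slow the writer down before it ever hit this limit, which is the "natural backpressure" mentioned above.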

@cockroach-teamcity (Member Author)

SHA: https://github.com/cockroachdb/cockroach/commits/90841a6559df9d9a4724e1d30490951bbdb811b4

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=clearrange/checks=false PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1364443&tab=buildLog

The test failed on branch=provisional_201906271846_v19.2.0-alpha.20190701, cloud=gce:
	test.go:1235: test timed out (6h30m0s)
	cluster.go:1511,clearrange.go:53,clearrange.go:32,test.go:1249: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1364443-clearrange-checks-false:1 -- ./cockroach workload fixtures import bank --payload-bytes=10240 --ranges=10 --rows=65104166 --seed=4 --db=bigbank returned:
		stderr:
		
		stdout:
		I190627 22:50:28.767869 1 ccl/workloadccl/cliccl/fixtures.go:324  starting import of 1 tables
		: signal: killed

@cockroach-teamcity (Member Author)

SHA: https://github.com/cockroachdb/cockroach/commits/537767ac9daa52b0026bb957d7010e3b88b61071

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=clearrange/checks=false PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1364821&tab=buildLog

The test failed on branch=master, cloud=gce:
	test.go:1235: test timed out (6h30m0s)
	cluster.go:1511,clearrange.go:53,clearrange.go:32,test.go:1249: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1364821-clearrange-checks-false:1 -- ./cockroach workload fixtures import bank --payload-bytes=10240 --ranges=10 --rows=65104166 --seed=4 --db=bigbank returned:
		stderr:
		
		stdout:
		I190628 06:36:52.614033 1 ccl/workloadccl/cliccl/fixtures.go:324  starting import of 1 tables
		: signal: killed

@cockroach-teamcity (Member Author)

SHA: https://github.com/cockroachdb/cockroach/commits/86154ae6ae36e286883d8a6c9a4111966198201d

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=clearrange/checks=false PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1367379&tab=buildLog

The test failed on branch=master, cloud=gce:
	test.go:1235: test timed out (6h30m0s)
	cluster.go:1511,clearrange.go:53,clearrange.go:32,test.go:1249: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1367379-clearrange-checks-false:1 -- ./cockroach workload fixtures import bank --payload-bytes=10240 --ranges=10 --rows=65104166 --seed=4 --db=bigbank returned:
		stderr:
		
		stdout:
		I190630 06:38:02.838072 1 ccl/workloadccl/cliccl/fixtures.go:324  starting import of 1 tables
		: signal: killed

@cockroach-teamcity (Member Author)

SHA: https://github.com/cockroachdb/cockroach/commits/ca1ef4d4f8296b213c0b2b140f16e4a97931e6e7

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=clearrange/checks=false PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1368144&tab=buildLog

The test failed on branch=master, cloud=gce:
	test.go:1235: test timed out (6h30m0s)
	cluster.go:1511,clearrange.go:53,clearrange.go:32,test.go:1249: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1368144-clearrange-checks-false:1 -- ./cockroach workload fixtures import bank --payload-bytes=10240 --ranges=10 --rows=65104166 --seed=4 --db=bigbank returned:
		stderr:
		
		stdout:
		I190701 06:37:20.135282 1 ccl/workloadccl/cliccl/fixtures.go:324  starting import of 1 tables
		: signal: killed

@tbg (Member) commented Jul 2, 2019

Just looked into one failure. Everything looked pretty quiet, but requests got stuck, first on the quota pool and then on latches (probably because a prior request got stuck on the quota pool):

https://teamcity.cockroachdb.com/viewLog.html?buildId=1368144&buildTypeId=Cockroach_Nightlies_WorkloadNightly&tab=artifacts#%2Fclearrange%2Fchecks%3Dfalse%2Fdebug.zip!%2Fdebug%2Fnodes

I think @nvanbenschoten is looking into this general issue, which is thought to be fallout from #38343. Nathan, do you think this is the same thing?

@nvanbenschoten (Member)

Yes, the request getting stuck on the quota pool is what I'm looking into now. I think I have a fix, but it is taking some time to confirm.

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Jul 3, 2019
Fixes cockroachdb#34180.
Fixes cockroachdb#35493.
Fixes cockroachdb#36983.
Fixes cockroachdb#37108.
Fixes cockroachdb#37371.
Fixes cockroachdb#37384.
Fixes cockroachdb#37551.
Fixes cockroachdb#37879.
Fixes cockroachdb#38095.
Fixes cockroachdb#38131.
Fixes cockroachdb#38136.
Fixes cockroachdb#38549.
Fixes cockroachdb#38552.
Fixes cockroachdb#38555.
Fixes cockroachdb#38560.
Fixes cockroachdb#38562.
Fixes cockroachdb#38563.
Fixes cockroachdb#38569.
Fixes cockroachdb#38578.
Fixes cockroachdb#38600.

_A lot of the early issues fixed by this had previous failures, but nothing
very recent or actionable. I think it's worth closing them now that they
should be fixed in the short term._

This fixes a bug introduced in 1ff3556 where Raft proposal quota is
no longer released when Replica.propose fails. This used to happen
[here](cockroachdb@1ff3556#diff-4315c7ebf8b8bf7bda469e1e7be82690L316),
but that code was accidentally lost in the rewrite.

I tracked this down by running a series of `import/tpch/nodes=4` and
`scrub/all-checks/tpcc/w=100` roachtests. About half the time, the
import would stall after a few hours and the roachtest health reports
would start logging lines like: `n1/s1  2.00  metrics  requests.slow.latch`.
I tracked the stalled latch acquisition to a stalled proposal quota acquisition
by a conflicting command. The range debug page showed the following:

<image>

We see that the leaseholder of the Range has no pending commands
but also no available proposal quota. This indicates a proposal
quota leak, which led to me finding the lost release in this
error case.

The (now confirmed) theory for what went wrong in these roachtests is that
they are performing imports, which generate a large number of AddSSTRequests.
These requests are typically larger than the available proposal quota
for a range, meaning that they request all of its available quota. The
effect of this is that if even a single byte of quota is leaked, the entire
range will seize up and stall when an AddSSTRequest is issued.
Instrumentation revealed that a ChangeReplicas request with a quota size
equal to the leaked amount was failing due to the error:
```
received invalid ChangeReplicasTrigger REMOVE_REPLICA((n3,s3):3): updated=[(n1,s1):1 (n4,s4):2 (n2,s2):4] next=5 to remove self (leaseholder)
```
Because of the missing error handling, this quota was not being released back
into the pool, causing future requests to get stuck indefinitely waiting for
leaked quota, stalling the entire import.

Release note: None
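
As a simplified illustration of the pattern this commit restores (the names `Pool`, `propose`, and `submit` below are made up for the sketch and are not CockroachDB's actual quota pool or `Replica.propose` API), the key point is that quota acquired for a proposal has to be handed back on every failure path:

```go
package main

import (
	"errors"
	"fmt"
)

// Pool is a minimal stand-in for a Raft proposal quota pool.
type Pool struct{ available int64 }

// Acquire takes quota out of the pool. A real pool would block until quota is
// available; here we fail fast to keep the sketch short.
func (p *Pool) Acquire(n int64) error {
	if n > p.available {
		return errors.New("insufficient quota")
	}
	p.available -= n
	return nil
}

// Release returns quota to the pool.
func (p *Pool) Release(n int64) { p.available += n }

// propose acquires quota for a command and hands it to submit. On any failure
// after Acquire, the quota must be released; otherwise it leaks, and later
// proposals (e.g. large AddSSTable requests that need the whole pool) wait
// forever.
func propose(p *Pool, size int64, submit func() error) error {
	if err := p.Acquire(size); err != nil {
		return err
	}
	if err := submit(); err != nil {
		p.Release(size) // the release that was accidentally dropped in the rewrite
		return err
	}
	// On success, the quota is released later, once the entry is applied.
	return nil
}

func main() {
	p := &Pool{available: 1 << 20}
	err := propose(p, 1<<20, func() error {
		return errors.New("received invalid ChangeReplicasTrigger")
	})
	fmt.Println("propose error:", err, "; quota available again:", p.available)
}
```

With the release in place, the failed proposal in this demo returns its quota to the pool instead of leaking it and stalling every later request on the range.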
@cockroach-teamcity (Member Author)

SHA: https://github.com/cockroachdb/cockroach/commits/2c865eeb3e3b244468ffc509a62778bd1f46740f

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=clearrange/checks=false PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1370685&tab=buildLog

The test failed on branch=provisional_201907021644_v19.2.0-alpha.20190701, cloud=gce:
	test.go:1235: test timed out (6h30m0s)
	cluster.go:1587,cluster.go:1606,cluster.go:1710,clearrange.go:107,clearrange.go:156,cluster.go:1849,errgroup.go:57: context canceled
	cluster.go:1870,clearrange.go:184,clearrange.go:32,test.go:1249: Goexit() was called

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Jul 3, 2019
Fixes cockroachdb#34180.
Fixes cockroachdb#35493.
Fixes cockroachdb#36983.
Fixes cockroachdb#37108.
Fixes cockroachdb#37371.
Fixes cockroachdb#37384.
Fixes cockroachdb#37551.
Fixes cockroachdb#37879.
Fixes cockroachdb#38095.
Fixes cockroachdb#38131.
Fixes cockroachdb#38136.
Fixes cockroachdb#38549.
Fixes cockroachdb#38552.
Fixes cockroachdb#38555.
Fixes cockroachdb#38560.
Fixes cockroachdb#38562.
Fixes cockroachdb#38563.
Fixes cockroachdb#38569.
Fixes cockroachdb#38578.
Fixes cockroachdb#38600.

_A lot of the early issues fixed by this had previous failures, but nothing
very recent or actionable. I think it's worth closing them now that they
should be fixed in the short term._

This fixes a bug introduced in 1ff3556 where Raft proposal quota is
no longer released when Replica.propose fails. This used to happen
[here](cockroachdb@1ff3556#diff-4315c7ebf8b8bf7bda469e1e7be82690L316),
but that code was accidentally lost in the rewrite.

I tracked this down by running a series of `import/tpch/nodes=4` and
`scrub/all-checks/tpcc/w=100` roachtests. About half the time, the
import would stall after a few hours and the roachtest health reports
would start logging lines like: `n1/s1  2.00  metrics  requests.slow.latch`.
I tracked the stalled latch acquisition to a stalled proposal quota acquisition
by a conflicting command. The range debug page showed the following:

<image>

We see that the leaseholder of the Range has no pending commands
but also no available proposal quota. This indicates a proposal
quota leak, which led to me finding the lost release in this
error case.

The (now confirmed) theory for what went wrong in these roachtests is that
they are performing imports, which generate a large number of AddSSTRequests.
These requests are typically larger than the available proposal quota
for a range, meaning that they request all of its available quota. The
effect of this is that if even a single byte of quota is leaked, the entire
range will seize up and stall when an AddSSTRequest is issued.
Instrumentation revealed that a ChangeReplicas request with a quota size
equal to the leaked amount was failing due to the error:
```
received invalid ChangeReplicasTrigger REMOVE_REPLICA((n3,s3):3): updated=[(n1,s1):1 (n4,s4):2 (n2,s2):4] next=5 to remove self (leaseholder)
```
Because of the missing error handling, this quota was not being released back
into the pool, causing future requests to get stuck indefinitely waiting for
leaked quota, stalling the entire import.

Release note: None
craig bot pushed a commit that referenced this issue Jul 3, 2019
38632: storage: release quota on failed Raft proposals r=tbg a=nvanbenschoten

Fixes #34180.
Fixes #35493.
Fixes #36983.
Fixes #37108.
Fixes #37371.
Fixes #37384.
Fixes #37551.
Fixes #37879.
Fixes #38095.
Fixes #38131.
Fixes #38136.
Fixes #38549.
Fixes #38552.
Fixes #38555.
Fixes #38560.
Fixes #38562.
Fixes #38563.
Fixes #38569.
Fixes #38578.
Fixes #38600.

_A lot of the early issues fixed by this had previous failures, but nothing very recent or actionable. I think it's worth closing them now that they should be fixed in the short term._

This fixes a bug introduced in 1ff3556 where Raft proposal quota is no longer released when `Replica.propose` fails. This used to happen [here](1ff3556#diff-4315c7ebf8b8bf7bda469e1e7be82690L316), but that code was accidentally lost in the rewrite.

I tracked this down by running a series of `import/tpch/nodes=4` and `scrub/all-checks/tpcc/w=100` roachtests. About half the time, the import would stall after a few hours and the roachtest health reports would start logging lines like: `n1/s1  2.00  metrics  requests.slow.latch`. I tracked the stalled latch acquisition to a stalled proposal quota acquisition by a conflicting command. The range debug page showed the following:

![Screenshot_2019-07-01 r56 Range Debug Cockroach Console](https://user-images.githubusercontent.com/5438456/60554197-8519c780-9d04-11e9-8cf5-6c46ffbcf820.png)

We see that the Leaseholder of the Range has no pending commands but also no available proposal quota. This indicates a proposal quota leak, which led to me finding the lost release in this error case.

The (now confirmed) theory for what went wrong in these roachtests is that they are performing imports, which generate a large number of AddSSTRequests. These requests are typically larger than the available proposal quota for a range, meaning that they request all of its available quota. The effect of this is that if even a single byte of quota is leaked, the entire range will seize up and stall when an AddSSTRequest is issued. Instrumentation revealed that a ChangeReplicas request with a quota size equal to the leaked amount was failing due to the error:
```
received invalid ChangeReplicasTrigger REMOVE_REPLICA((n3,s3):3): updated=[(n1,s1):1 (n4,s4):2 (n2,s2):4] next=5 to remove self (leaseholder)
```
Because of the missing error handling, this quota was not being released back into the pool, causing future requests to get stuck indefinitely waiting for leaked quota, stalling the entire import.

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
craig bot closed this as completed in #38632 Jul 3, 2019