
roachtest: tpchbench/tpch/nodes=3/cpu=4/sf=1 failed #38578

Closed
cockroach-teamcity opened this issue Jun 30, 2019 · 1 comment · Fixed by #38632
Labels
C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot.
Milestone
19.2

Comments

@cockroach-teamcity
Member

SHA: https://github.com/cockroachdb/cockroach/commits/5968fa55d48966866eac1cb744c61d0aa6284ae7

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=tpchbench/tpch/nodes=3/cpu=4/sf=1 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1367354&tab=buildLog

The test failed on branch=provisional_201906292257_v19.2.0-alpha.20190701, cloud=gce:
	tpchbench.go:119,cluster.go:1849,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1367354-tpchbench-tpch-nodes-3-cpu-4-sf-1:4 -- ./workload run querybench --db=tpch --concurrency=1 --query-file=tpch --num-runs=3 --max-ops=63 --vectorized=false {pgurl:1-3} --histograms=logs/stats.json --histograms-max-latency=8m20s returned:
		stderr:
		
		stdout:
		s = 'F' AND l1.l_receiptDATE > l1.l_commitdate AND EXISTS ( SELECT * FROM lineitem l2 WHERE l2.l_orderkey = l1.l_orderkey AND l2.l_suppkey <> l1.l_suppkey) AND NOT EXISTS ( SELECT * FROM lineitem l3 WHERE l3.l_orderkey = l1.l_orderkey AND l3.l_suppkey <> l1.l_suppkey AND l3.l_receiptDATE > l3.l_commitdate) AND s_nationkey = n_nationkey AND n_name = 'SAUDI ARABIA' GROUP BY s_name ORDER BY numwait DESC, s_name LIMIT 100
		  39m41s        0            0.0            0.0      0.0      0.0      0.0      0.0 21: SELECT cntrycode, count(*) AS numcust, sum(c_acctbal) AS totacctbal FROM ( SELECT substring(c_phone FROM 1 FOR 2) AS cntrycode, c_acctbal FROM customer WHERE substring(c_phone FROM 1 FOR 2) in ('13', '31', '23', '29', '30', '18', '17') AND c_acctbal > ( SELECT avg(c_acctbal) FROM customer WHERE c_acctbal > 0.00 AND substring(c_phone FROM 1 FOR 2) in ('13', '31', '23', '29', '30', '18', '17')) AND NOT EXISTS ( SELECT * FROM orders WHERE o_custkey = c_custkey)) AS custsale GROUP BY cntrycode ORDER BY cntrycode
		: signal: killed
	cluster.go:1870,tpchbench.go:123,tpchbench.go:244,test.go:1249: unexpected node event: 1: dead
	cluster.go:1033,context.go:122,cluster.go:1022,panic.go:406,test.go:783,test.go:769,cluster.go:1870,tpchbench.go:123,tpchbench.go:244,test.go:1249: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-1367354-tpchbench-tpch-nodes-3-cpu-4-sf-1 --oneshot --ignore-empty-nodes: exit status 1 4: skipped
		2: 3901
		3: 4036
		1: dead
		Error:  1: dead

@cockroach-teamcity cockroach-teamcity added this to the 19.2 milestone Jun 30, 2019
@cockroach-teamcity cockroach-teamcity added C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. labels Jun 30, 2019
@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/86154ae6ae36e286883d8a6c9a4111966198201d

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=tpchbench/tpch/nodes=3/cpu=4/sf=1 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1367379&tab=buildLog

The test failed on branch=master, cloud=gce:
	tpchbench.go:119,cluster.go:1849,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1367379-tpchbench-tpch-nodes-3-cpu-4-sf-1:4 -- ./workload run querybench --db=tpch --concurrency=1 --query-file=tpch --num-runs=3 --max-ops=63 --vectorized=false {pgurl:1-3} --histograms=logs/stats.json --histograms-max-latency=8m20s returned:
		stderr:
		
		stdout:
		n_name = 'FRANCE' AND n2.n_name = 'GERMANY') or (n1.n_name = 'GERMANY' AND n2.n_name = 'FRANCE')) AND l_shipdate BETWEEN DATE '1995-01-01' AND DATE '1996-12-31') AS shipping GROUP BY supp_nation, cust_nation, l_year ORDER BY supp_nation, cust_nation, l_year
		   5m34s        0            0.0            0.0      0.0      0.0      0.0      0.0  8: SELECT o_year, sum(CASE WHEN nation = 'BRAZIL' THEN volume ELSE 0 END) / sum(volume) AS mkt_share FROM ( SELECT EXTRACT(year FROM o_orderdate) AS o_year, l_extendedprice * (1 - l_discount) AS volume, n2.n_name AS nation FROM part, supplier, lineitem, orders, customer, nation n1, nation n2, region WHERE p_partkey = l_partkey AND s_suppkey = l_suppkey AND l_orderkey = o_orderkey AND o_custkey = c_custkey AND c_nationkey = n1.n_nationkey AND n1.n_regionkey = r_regionkey AND r_name = 'AMERICA' AND s_nationkey = n2.n_nationkey AND o_orderdate BETWEEN DATE '1995-01-01' AND DATE '1996-12-31' AND p_type = 'ECONOMY ANODIZED STEEL') AS all_nations GROUP BY o_year ORDER BY o_year
		: signal: interrupt
	cluster.go:1870,tpchbench.go:123,tpchbench.go:244,test.go:1249: signal: interrupt
	cluster.go:1587,cluster.go:1606,cluster.go:1710,cluster.go:1093,context.go:122,cluster.go:1090,panic.go:406,test.go:783,test.go:769,cluster.go:1870,tpchbench.go:123,tpchbench.go:244,test.go:1249: context canceled

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Jul 3, 2019
Fixes cockroachdb#34180.
Fixes cockroachdb#35493.
Fixes cockroachdb#36983.
Fixes cockroachdb#37108.
Fixes cockroachdb#37371.
Fixes cockroachdb#37384.
Fixes cockroachdb#37551.
Fixes cockroachdb#37879.
Fixes cockroachdb#38095.
Fixes cockroachdb#38131.
Fixes cockroachdb#38136.
Fixes cockroachdb#38549.
Fixes cockroachdb#38552.
Fixes cockroachdb#38555.
Fixes cockroachdb#38560.
Fixes cockroachdb#38562.
Fixes cockroachdb#38563.
Fixes cockroachdb#38569.
Fixes cockroachdb#38578.
Fixes cockroachdb#38600.

_A lot of the early issues fixed by this had previous failures, but nothing
very recent or actionable. I think it's worth closing them now that they
should be fixed in the short term._

This fixes a bug introduced in 1ff3556 where Raft proposal quota is
no longer released when Replica.propose fails. This used to happen
[here](cockroachdb@1ff3556#diff-4315c7ebf8b8bf7bda469e1e7be82690L316),
but that code was accidentally lost in the rewrite.
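
As an illustration of the pattern described above, here is a minimal Go sketch of "acquire quota, propose, release on failure". The names (`replica`, `acquireQuota`, `releaseQuota`, `proposeToRaft`) are hypothetical stand-ins, not CockroachDB's actual code; the point is only where the missing release belongs when the proposal fails.

```go
package main

import (
	"errors"
	"fmt"
)

// replica is a hypothetical stand-in carrying the range's proposal quota.
type replica struct {
	quotaAvailable int64
}

func (r *replica) acquireQuota(n int64) error {
	if r.quotaAvailable < n {
		return errors.New("insufficient proposal quota")
	}
	r.quotaAvailable -= n
	return nil
}

func (r *replica) releaseQuota(n int64) { r.quotaAvailable += n }

// proposeToRaft stands in for the Raft proposal step; here it always fails,
// e.g. with an invalid ChangeReplicasTrigger as in the error quoted below.
func (r *replica) proposeToRaft(cmdSize int64) error {
	return errors.New("proposal rejected")
}

// propose acquires quota up front and, crucially, releases it again when the
// proposal fails. Dropping the releaseQuota call in the error branch is the
// kind of leak described above: those bytes never return to the pool.
func (r *replica) propose(cmdSize int64) error {
	if err := r.acquireQuota(cmdSize); err != nil {
		return err
	}
	if err := r.proposeToRaft(cmdSize); err != nil {
		r.releaseQuota(cmdSize) // without this line, the quota leaks
		return err
	}
	return nil
}

func main() {
	r := &replica{quotaAvailable: 1 << 20}
	err := r.propose(4096)
	fmt.Println("propose error:", err, "| quota remaining:", r.quotaAvailable)
}
```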

I tracked this down by running a series of `import/tpch/nodes=4` and
`scrub/all-checks/tpcc/w=100` roachtests. About half the time, the
import would stall after a few hours and the roachtest health reports
would start logging lines like: `n1/s1  2.00  metrics  requests.slow.latch`.
I tracked the stalled latch acquisition to a stalled proposal quota acquisition
by a conflicting command. The range debug page showed the following:

![Screenshot_2019-07-01 r56 Range Debug Cockroach Console](https://user-images.githubusercontent.com/5438456/60554197-8519c780-9d04-11e9-8cf5-6c46ffbcf820.png)

We see that the leaseholder of the Range has no pending commands
but also no available proposal quota. This indicates a proposal
quota leak, which led me to find the lost release in this
error case.

The (now confirmed) theory for what went wrong in these roachtests is that
they are performing imports, which generate a large number of AddSSTRequests.
These requests are typically larger than the available proposal quota
for a range, meaning that they request all of its available quota. The
effect of this is that if even a single byte of quota is leaked, the entire
range will seize up and stall when an AddSSTRequest is issued.
Instrumentation revealed that a ChangeReplicas request with a quota size
equal to the leaked amount was failing due to the error:
```
received invalid ChangeReplicasTrigger REMOVE_REPLICA((n3,s3):3): updated=[(n1,s1):1 (n4,s4):2 (n2,s2):4] next=5 to remove self (leaseholder)
```
Because of the missing error handling, this quota was not being released back
into the pool, causing future requests to get stuck indefinitely waiting for
leaked quota, stalling the entire import.

Release note: None
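
To make the stall mechanism described in that commit message concrete, here is a toy blocking quota pool. It is a sketch under assumed semantics, not CockroachDB's real quota pool: once a single byte has been acquired and never released, a request that needs the full capacity (as an AddSSTable-sized proposal effectively does) can never be satisfied.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// quotaPool is a toy blocking pool: acquire waits until n bytes are free.
type quotaPool struct {
	mu        sync.Mutex
	cond      *sync.Cond
	available int64
}

func newQuotaPool(capacity int64) *quotaPool {
	qp := &quotaPool{available: capacity}
	qp.cond = sync.NewCond(&qp.mu)
	return qp
}

func (qp *quotaPool) acquire(n int64) {
	qp.mu.Lock()
	defer qp.mu.Unlock()
	for qp.available < n {
		qp.cond.Wait()
	}
	qp.available -= n
}

func (qp *quotaPool) release(n int64) {
	qp.mu.Lock()
	qp.available += n
	qp.cond.Broadcast()
	qp.mu.Unlock()
}

func main() {
	const capacity = 1 << 20 // pretend the range has 1 MiB of proposal quota
	qp := newQuotaPool(capacity)

	qp.acquire(1) // the leak: one byte acquired by a failed proposal, never released

	// An AddSSTable-style request is larger than the pool, so it asks for the
	// full capacity. With one byte leaked it can never proceed.
	done := make(chan struct{})
	go func() {
		qp.acquire(capacity)
		close(done)
	}()

	select {
	case <-done:
		fmt.Println("acquired full quota")
	case <-time.After(time.Second):
		fmt.Println("request for the full quota is stuck behind the leaked byte")
	}
}
```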
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Jul 3, 2019
craig bot pushed a commit that referenced this issue Jul 3, 2019
38632: storage: release quota on failed Raft proposals r=tbg a=nvanbenschoten

Fixes #34180.
Fixes #35493.
Fixes #36983.
Fixes #37108.
Fixes #37371.
Fixes #37384.
Fixes #37551.
Fixes #37879.
Fixes #38095.
Fixes #38131.
Fixes #38136.
Fixes #38549.
Fixes #38552.
Fixes #38555.
Fixes #38560.
Fixes #38562.
Fixes #38563.
Fixes #38569.
Fixes #38578.
Fixes #38600.

_A lot of the early issues fixed by this had previous failures, but nothing very recent or actionable. I think it's worth closing them now that they should be fixed in the short term._

This fixes a bug introduced in 1ff3556 where Raft proposal quota is no longer released when `Replica.propose` fails. This used to happen [here](1ff3556#diff-4315c7ebf8b8bf7bda469e1e7be82690L316), but that code was accidentally lost in the rewrite.

I tracked this down by running a series of `import/tpch/nodes=4` and `scrub/all-checks/tpcc/w=100` roachtests. About half the time, the import would stall after a few hours and the roachtest health reports would start logging lines like: `n1/s1  2.00  metrics  requests.slow.latch`. I tracked the stalled latch acquisition to a stalled proposal quota acquisition by a conflicting command. The range debug page showed the following:

![Screenshot_2019-07-01 r56 Range Debug Cockroach Console](https://user-images.githubusercontent.com/5438456/60554197-8519c780-9d04-11e9-8cf5-6c46ffbcf820.png)

We see that the leaseholder of the Range has no pending commands but also no available proposal quota. This indicates a proposal quota leak, which led me to find the lost release in this error case.

The (now confirmed) theory for what went wrong in these roachtests is that they are performing imports, which generate a large number of AddSSTRequests. These requests are typically larger than the available proposal quota for a range, meaning that they request all of its available quota. The effect of this is that if even a single byte of quota is leaked, the entire range will seize up and stall when an AddSSTRequest is issued. Instrumentation revealed that a ChangeReplicas request with a quota size equal to the leaked amount was failing due to the error:
```
received invalid ChangeReplicasTrigger REMOVE_REPLICA((n3,s3):3): updated=[(n1,s1):1 (n4,s4):2 (n2,s2):4] next=5 to remove self (leaseholder)
```
Because of the missing error handling, this quota was not being released back into the pool, causing future requests to get stuck indefinitely waiting for leaked quota, stalling the entire import.

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
@craig craig bot closed this as completed in #38632 Jul 3, 2019