
roachtest: tpcc/nodes=3/w=max failed #35337

Closed
cockroach-teamcity opened this issue Mar 2, 2019 · 61 comments

Labels: C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot.
@cockroach-teamcity
Member

SHA: https://github.com/cockroachdb/cockroach/commits/032c4980720abc1bdd71e4428e4111e6e6383297

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=tpcc/nodes=3/w=max PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1158877&tab=buildLog

The test failed on master:
	cluster.go:1226,tpcc.go:126,cluster.go:1564,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1158877-tpcc-nodes-3-w-max:4 -- ./workload run tpcc --warehouses=1350 --histograms=logs/stats.json  --ramp=5m0s --duration=2h0m0s {pgurl:1-3} returned:
		stderr:
		
		stdout:
		 0           20.1           28.8    369.1    536.9    536.9    536.9 delivery
		1h54m59s        0          294.5          287.1    302.0    536.9    637.5    671.1 newOrder
		1h54m59s        0           24.1           28.8     28.3     79.7     96.5     96.5 orderStatus
		1h54m59s        0          283.4          287.6    234.9    436.2    520.1   1073.7 payment
		1h54m59s        0           27.1           28.8    151.0    335.5    604.0    604.0 stockLevel
		 1h55m0s        0           37.0           28.8    243.3    402.7    486.5    486.5 delivery
		 1h55m0s        0          290.2          287.1    159.4    419.4    486.5    536.9 newOrder
		 1h55m0s        0           25.0           28.8     22.0     48.2     62.9     62.9 orderStatus
		 1h55m0s        0          321.3          287.6    125.8    318.8    369.1    503.3 payment
		 1h55m0s        0           27.0           28.8    184.5    251.7    436.2    436.2 stockLevel
		Error: error in delivery: ERROR: no inbound stream connection (SQLSTATE XX000)
		Error:  exit status 1
		: exit status 1
	cluster.go:1585,tpcc.go:136,tpcc.go:160,test.go:1211: Goexit() was called

@cockroach-teamcity cockroach-teamcity added C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. labels Mar 2, 2019
@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/de1793532332fb64fca27cafe92d2481d900a5a0

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=tpcc/nodes=3/w=max PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1160398&tab=buildLog

The test failed on master:
	cluster.go:1226,tpcc.go:115,tpcc.go:117,tpcc.go:160,test.go:1211: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1160398-tpcc-nodes-3-w-max:4 -- ./workload fixtures import tpcc --checks=true --warehouses=2300 {pgurl:1} returned:
		stderr:
		I190304 11:04:30.352403 1 ccl/workloadccl/cliccl/fixtures.go:321  starting import of 9 tables
		I190304 11:04:31.022917 29 ccl/workloadccl/fixture.go:498  imported warehouse (1s, 2300 rows, 0 index entries, 122 KiB)
		I190304 11:04:31.391171 30 ccl/workloadccl/fixture.go:498  imported district (1s, 23000 rows, 0 index entries, 2.3 MiB)
		I190304 11:04:31.919052 83 ccl/workloadccl/fixture.go:498  imported item (2s, 100000 rows, 0 index entries, 7.8 MiB)
		I190304 11:05:19.412668 82 ccl/workloadccl/fixture.go:498  imported new_order (49s, 20700000 rows, 0 index entries, 293 MiB)
		Error:  exit status 255
		
		stdout:
		: exit status 1

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/f53d12936efd36ea51eab6f191725d1dca2ceff3

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=tpcc/nodes=3/w=max PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1161673&tab=buildLog

The test failed on release-2.1:
	cluster.go:1244,tpcc.go:126,cluster.go:1582,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1161673-tpcc-nodes-3-w-max:4 -- ./workload run tpcc --warehouses=1350 --histograms=logs/stats.json  --ramp=5m0s --duration=2h0m0s {pgurl:1-3} returned:
		stderr:
		
		stdout:
		     0            0.0           19.7      0.0      0.0      0.0      0.0 delivery
		    7m9s        0            3.0          187.9  34359.7  36507.2  36507.2  36507.2 newOrder
		    7m9s        0            0.0           19.6      0.0      0.0      0.0      0.0 orderStatus
		    7m9s        0            0.0          195.6      0.0      0.0      0.0      0.0 payment
		    7m9s        0            0.0           19.6      0.0      0.0      0.0      0.0 stockLevel
		   7m10s        0            0.0           19.6      0.0      0.0      0.0      0.0 delivery
		   7m10s        0            0.0          187.5      0.0      0.0      0.0      0.0 newOrder
		   7m10s        0            0.0           19.5      0.0      0.0      0.0      0.0 orderStatus
		   7m10s        0            1.0          195.2  49392.1  49392.1  49392.1  49392.1 payment
		   7m10s        0            0.0           19.6      0.0      0.0      0.0      0.0 stockLevel
		Error: error in payment: ERROR: context deadline exceeded (SQLSTATE XX000)
		Error:  exit status 1
		: exit status 1
	cluster.go:1603,tpcc.go:136,tpcc.go:160,test.go:1214: Goexit() was called

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/959dcf7de0f94cfcfa0062387b109adebd1f11da

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=tpcc/nodes=3/w=max PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1163706&tab=buildLog

The test failed on master:
	cluster.go:1244,tpcc.go:115,tpcc.go:117,tpcc.go:160,test.go:1214: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1163706-tpcc-nodes-3-w-max:4 -- ./workload fixtures import tpcc --checks=true --warehouses=2300 {pgurl:1} returned:
		stderr:
		I190306 10:54:33.368894 1 ccl/workloadccl/cliccl/fixtures.go:321  starting import of 9 tables
		I190306 10:54:35.071672 14 ccl/workloadccl/fixture.go:498  imported district (2s, 23000 rows, 0 index entries, 2.3 MiB)
		I190306 10:54:35.303740 13 ccl/workloadccl/fixture.go:498  imported warehouse (2s, 2300 rows, 0 index entries, 122 KiB)
		I190306 10:54:35.914278 68 ccl/workloadccl/fixture.go:498  imported item (3s, 100000 rows, 0 index entries, 7.8 MiB)
		I190306 10:56:01.380210 67 ccl/workloadccl/fixture.go:498  imported new_order (1m28s, 20700000 rows, 0 index entries, 293 MiB)
		Error:  exit status 255
		
		stdout:
		: exit status 1

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/959dcf7de0f94cfcfa0062387b109adebd1f11da

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=tpcc/nodes=3/w=max PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1163702&tab=buildLog

The test failed on master:
	cluster.go:1244,tpcc.go:126,cluster.go:1582,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1163702-tpcc-nodes-3-w-max:4 -- ./workload run tpcc --warehouses=1350 --histograms=logs/stats.json  --ramp=5m0s --duration=2h0m0s {pgurl:1-3} returned:
		stderr:
		
		stdout:
		79.2 payment
		  20m26s        0           20.0           25.7    167.8   1476.4   2550.1   2550.1 stockLevel
		  20m27s        0           30.9           25.7   1140.9   2281.7   2952.8   2952.8 delivery
		  20m27s        0          325.3          254.8   3087.0   4831.8  17179.9  28991.0 newOrder
		  20m27s        0           29.9           25.8     30.4    201.3    218.1    218.1 orderStatus
		  20m27s        0          227.5          257.0   1342.2   4026.5   5100.3   5905.6 payment
		  20m27s        0           25.9           25.7    159.4   1208.0   2013.3   2013.3 stockLevel
		Error: error in payment: ERROR: TransactionStatusError: transaction deadline exceeded by 1m14.36455315s (1551890589.961905598,1 > 1551890515.597352448,0), original timestamp 3m46.961282435s ago (1551890363.000623163,0): id=c1fae5ef key=/Table/60/1/3/0 rw=true pri=0.00444511 stat=PENDING epo=0 ts=1551890589.961905598,1 orig=1551890363.000623163,0 max=1551890363.002041569,0 wto=false seq=7 (REASON_UNKNOWN) (SQLSTATE XX000)
		Error:  exit status 1
		: exit status 1
	cluster.go:1603,tpcc.go:136,tpcc.go:160,test.go:1214: Goexit() was called

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/e6f0a720a98c60ed0545e7cbb4b2f224b463cae9

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=tpcc/nodes=3/w=max PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1165134&tab=buildLog

The test failed on master:
	cluster.go:1244,tpcc.go:115,tpcc.go:117,tpcc.go:160,test.go:1214: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1165134-tpcc-nodes-3-w-max:4 -- ./workload fixtures import tpcc --checks=true --warehouses=2300 {pgurl:1} returned:
		stderr:
		I190307 11:08:03.683573 1 ccl/workloadccl/cliccl/fixtures.go:321  starting import of 9 tables
		I190307 11:08:05.848617 37 ccl/workloadccl/fixture.go:498  imported warehouse (2s, 2300 rows, 0 index entries, 122 KiB)
		I190307 11:08:06.001058 38 ccl/workloadccl/fixture.go:498  imported district (2s, 23000 rows, 0 index entries, 2.3 MiB)
		I190307 11:08:06.360473 70 ccl/workloadccl/fixture.go:498  imported item (3s, 100000 rows, 0 index entries, 7.8 MiB)
		I190307 11:09:37.353019 69 ccl/workloadccl/fixture.go:498  imported new_order (1m34s, 20700000 rows, 0 index entries, 293 MiB)
		Error:  exit status 255
		
		stdout:
		: exit status 1

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/f55596ea8c2bca016d036ec9399b80c17e7cfe93

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=tpcc/nodes=3/w=max PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1165148&tab=buildLog

The test failed on release-2.1:
	cluster.go:1244,tpcc.go:126,cluster.go:1582,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1165148-tpcc-nodes-3-w-max:4 -- ./workload run tpcc --warehouses=1350 --histograms=logs/stats.json  --ramp=5m0s --duration=2h0m0s {pgurl:1-3} returned:
		stderr:
		
		stdout:
		 8m56s        0            2.0          177.8  36507.2  42949.7  42949.7  42949.7 newOrder
		   8m56s        0            0.0           18.5      0.0      0.0      0.0      0.0 orderStatus
		   8m56s        0            0.0          183.7      0.0      0.0      0.0      0.0 payment
		   8m56s        0            0.0           18.6      0.0      0.0      0.0      0.0 stockLevel
		_elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
		   8m57s        0            0.0           18.6      0.0      0.0      0.0      0.0 delivery
		   8m57s        0            1.0          177.5  38654.7  38654.7  38654.7  38654.7 newOrder
		   8m57s        0            0.0           18.4      0.0      0.0      0.0      0.0 orderStatus
		   8m57s        0            0.0          183.3      0.0      0.0      0.0      0.0 payment
		   8m57s        0            0.0           18.6      0.0      0.0      0.0      0.0 stockLevel
		Error: error in payment: ERROR: context deadline exceeded (SQLSTATE XX000)
		Error:  exit status 1
		: exit status 1
	cluster.go:1603,tpcc.go:136,tpcc.go:160,test.go:1214: Goexit() was called

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/38bb1e7905b89f911bd74be4f5830217ffb7b343

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=tpcc/nodes=3/w=max PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1168756&tab=buildLog

The test failed on master:
	cluster.go:1244,tpcc.go:115,tpcc.go:117,tpcc.go:160,test.go:1214: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1168756-tpcc-nodes-3-w-max:4 -- ./workload fixtures import tpcc --checks=true --warehouses=2300 {pgurl:1} returned:
		stderr:
		I190309 11:10:30.475977 1 ccl/workloadccl/cliccl/fixtures.go:321  starting import of 9 tables
		I190309 11:10:32.545696 66 ccl/workloadccl/fixture.go:498  imported warehouse (2s, 2300 rows, 0 index entries, 122 KiB)
		I190309 11:10:32.866376 67 ccl/workloadccl/fixture.go:498  imported district (2s, 23000 rows, 0 index entries, 2.3 MiB)
		I190309 11:10:33.634453 72 ccl/workloadccl/fixture.go:498  imported item (3s, 100000 rows, 0 index entries, 7.8 MiB)
		I190309 11:11:58.599997 71 ccl/workloadccl/fixture.go:498  imported new_order (1m28s, 20700000 rows, 0 index entries, 293 MiB)
		Error:  exit status 255
		
		stdout:
		: exit status 1

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/5f7de3c35348e847dc0d4ab1ba8d76574e1706c2

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=tpcc/nodes=3/w=max PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1169984&tab=buildLog

The test failed on master:
	cluster.go:1244,tpcc.go:115,tpcc.go:117,tpcc.go:160,test.go:1214: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1169984-tpcc-nodes-3-w-max:4 -- ./workload fixtures import tpcc --checks=true --warehouses=2300 {pgurl:1} returned:
		stderr:
		I190310 11:02:12.928315 1 ccl/workloadccl/cliccl/fixtures.go:321  starting import of 9 tables
		I190310 11:02:14.630207 32 ccl/workloadccl/fixture.go:498  imported district (2s, 23000 rows, 0 index entries, 2.3 MiB)
		I190310 11:02:14.750809 31 ccl/workloadccl/fixture.go:498  imported warehouse (2s, 2300 rows, 0 index entries, 122 KiB)
		I190310 11:02:15.671051 85 ccl/workloadccl/fixture.go:498  imported item (3s, 100000 rows, 0 index entries, 7.8 MiB)
		I190310 11:04:16.261771 84 ccl/workloadccl/fixture.go:498  imported new_order (2m3s, 20700000 rows, 0 index entries, 293 MiB)
		Error:  exit status 255
		
		stdout:
		: exit status 1

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/a119a3a158725c9e3f9b8084d9398601c0e67007

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=tpcc/nodes=3/w=max PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1170799&tab=buildLog

The test failed on master:
	cluster.go:1244,tpcc.go:115,tpcc.go:117,tpcc.go:160,test.go:1214: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1170799-tpcc-nodes-3-w-max:4 -- ./workload fixtures import tpcc --checks=true --warehouses=2300 {pgurl:1} returned:
		stderr:
		I190311 10:06:33.256032 1 ccl/workloadccl/cliccl/fixtures.go:321  starting import of 9 tables
		I190311 10:06:34.468420 37 ccl/workloadccl/fixture.go:498  imported warehouse (1s, 2300 rows, 0 index entries, 122 KiB)
		I190311 10:06:34.765400 38 ccl/workloadccl/fixture.go:498  imported district (2s, 23000 rows, 0 index entries, 2.3 MiB)
		I190311 10:06:35.532289 43 ccl/workloadccl/fixture.go:498  imported item (2s, 100000 rows, 0 index entries, 7.8 MiB)
		I190311 10:07:58.700229 42 ccl/workloadccl/fixture.go:498  imported new_order (1m25s, 20700000 rows, 0 index entries, 293 MiB)
		I190311 10:18:08.915495 41 ccl/workloadccl/fixture.go:498  imported order (11m36s, 69000000 rows, 138000000 index entries, 4.1 GiB)
		I190311 10:22:41.219039 40 ccl/workloadccl/fixture.go:498  imported history (16m8s, 69000000 rows, 138000000 index entries, 8.8 GiB)
		I190311 10:31:22.495562 39 ccl/workloadccl/fixture.go:498  imported customer (24m49s, 69000000 rows, 69000000 index entries, 40 GiB)
		Error:  exit status 255
		
		stdout:
		: exit status 1

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/a119a3a158725c9e3f9b8084d9398601c0e67007

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=tpcc/nodes=3/w=max PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1170795&tab=buildLog

The test failed on master:
	cluster.go:1244,tpcc.go:126,cluster.go:1582,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1170795-tpcc-nodes-3-w-max:4 -- ./workload run tpcc --warehouses=1350 --histograms=logs/stats.json  --ramp=5m0s --duration=2h0m0s {pgurl:1-3} returned:
		stderr:
		
		stdout:
		 payment
		1h23m13s        0           33.0           26.2    520.1   2080.4   3355.4   3355.4 stockLevel
		1h23m14s        0           33.9           26.2   1476.4   2684.4   3221.2   3221.2 delivery
		1h23m14s        0          210.5          261.3   2080.4   5100.3   8321.5  81604.4 newOrder
		1h23m14s        0           21.9           26.2     67.1    369.1    570.4    570.4 orderStatus
		1h23m14s        0          282.3          261.9   2147.5  10737.4  23622.3 103079.2 payment
		1h23m14s        0           15.0           26.2    352.3   2281.7   2550.1   2550.1 stockLevel
		Error: error in payment: ERROR: TransactionStatusError: transaction deadline exceeded by 11.905747885s (1552330202.689705987,50 > 1552330190.783958102,0), original timestamp 2m31.954435703s ago (1552330050.735270284,0): id=bcc8b0f4 key=/Table/57/1/751/0 rw=true pri=0.04905430 stat=PENDING epo=0 ts=1552330202.689705987,50 orig=1552330050.735270284,0 max=1552330050.793185125,78 wto=false seq=7 (REASON_UNKNOWN) (SQLSTATE XX000)
		Error:  exit status 1
		: exit status 1
	cluster.go:1603,tpcc.go:136,tpcc.go:160,test.go:1214: Goexit() was called

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/57e825a7940495b67e0cc7213a5fabc24e12be0e

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=tpcc/nodes=3/w=max PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1176952&tab=buildLog

The test failed on master:
	cluster.go:1251,tpcc.go:115,tpcc.go:117,tpcc.go:160,test.go:1214: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1176952-tpcc-nodes-3-w-max:4 -- ./workload fixtures import tpcc --checks=true --warehouses=2300 {pgurl:1} returned:
		stderr:
		
		stdout:
		I190314 08:38:09.637423 1 ccl/workloadccl/cliccl/fixtures.go:321  starting import of 9 tables
		I190314 08:38:11.030368 58 ccl/workloadccl/fixture.go:498  imported warehouse (1s, 2300 rows, 0 index entries, 122 KiB)
		I190314 08:38:11.195722 59 ccl/workloadccl/fixture.go:498  imported district (2s, 23000 rows, 0 index entries, 2.3 MiB)
		I190314 08:38:11.708401 64 ccl/workloadccl/fixture.go:498  imported item (2s, 100000 rows, 0 index entries, 7.8 MiB)
		I190314 08:39:48.950202 63 ccl/workloadccl/fixture.go:498  imported new_order (1m39s, 20700000 rows, 0 index entries, 293 MiB)
		I190314 08:49:37.148307 62 ccl/workloadccl/fixture.go:498  imported order (11m28s, 69000000 rows, 138000000 index entries, 4.1 GiB)
		I190314 08:53:33.859434 61 ccl/workloadccl/fixture.go:498  imported history (15m24s, 69000000 rows, 138000000 index entries, 8.8 GiB)
		I190314 09:03:15.788454 60 ccl/workloadccl/fixture.go:498  imported customer (25m6s, 69000000 rows, 69000000 index entries, 40 GiB)
		Error:  exit status 255
		: exit status 1

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/57e825a7940495b67e0cc7213a5fabc24e12be0e

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=tpcc/nodes=3/w=max PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1176948&tab=buildLog

The test failed on master:
	cluster.go:1251,tpcc.go:126,cluster.go:1589,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1176948-tpcc-nodes-3-w-max:4 -- ./workload run tpcc --warehouses=1350 --histograms=logs/stats.json  --ramp=5m0s --duration=2h0m0s {pgurl:1-3} returned:
		stderr:
		
		stdout:
		 243.7          253.9   3489.7  10737.4  28991.0  49392.1 newOrder
		   43m5s        0           25.0           25.6   1208.0   3221.2   3221.2   3221.2 orderStatus
		   43m5s        0          314.6          255.0   3623.9  27917.3  77309.4 103079.2 payment
		   43m5s        0           23.0           25.5   2080.4   3489.7  11274.3  11274.3 stockLevel
		   43m6s        0           15.0           25.5   4563.4  17179.9  34359.7  34359.7 delivery
		   43m6s        0          248.3          253.9   3221.2  13421.8  47244.6 103079.2 newOrder
		   43m6s        0           23.0           25.6    805.3   2952.8   2952.8   2952.8 orderStatus
		   43m6s        0          180.2          255.0   2818.6  15032.4  40802.2 103079.2 payment
		   43m6s        0           22.0           25.5    771.8   3087.0   3087.0   3087.0 stockLevel
		Error: error in newOrder: ERROR: foreign key violation: value [914 3 3050] not found in order@primary [o_w_id o_d_id o_id] (txn=f995c83d-a90d-4856-9b61-72713ca22db5) (SQLSTATE 23503)
		Error:  exit status 1
		: exit status 1
	cluster.go:1610,tpcc.go:136,tpcc.go:160,test.go:1214: Goexit() was called

@tbg
Member

tbg commented Mar 15, 2019

Error: error in newOrder: ERROR: foreign key violation: value [914 3 3050] not found in order@primary [o_w_id o_d_id o_id] (txn=f995c83d-a90d-4856-9b61-72713ca22db5) (SQLSTATE 23503)
Error: exit status 1
: exit status 1

uhm what? This sounds alarming -- @jordanlewis

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/70e3468e7ed5fb30b10eaaf972acbb0f16099d01

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=tpcc/nodes=3/w=max PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1178894&tab=buildLog

The test failed on master:
	cluster.go:1267,tpcc.go:119,tpcc.go:121,tpcc.go:164,test.go:1214: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1178894-tpcc-nodes-3-w-max:4 -- ./workload fixtures import tpcc --checks=true --warehouses=2300 {pgurl:1} returned:
		stderr:
		
		stdout:
		I190315 09:13:39.998601 1 ccl/workloadccl/cliccl/fixtures.go:321  starting import of 9 tables
		I190315 09:13:41.422463 58 ccl/workloadccl/fixture.go:498  imported warehouse (1s, 2300 rows, 0 index entries, 122 KiB)
		I190315 09:13:41.751366 59 ccl/workloadccl/fixture.go:498  imported district (2s, 23000 rows, 0 index entries, 2.3 MiB)
		I190315 09:13:41.915999 64 ccl/workloadccl/fixture.go:498  imported item (2s, 100000 rows, 0 index entries, 7.8 MiB)
		I190315 09:15:09.129827 63 ccl/workloadccl/fixture.go:498  imported new_order (1m29s, 20700000 rows, 0 index entries, 293 MiB)
		I190315 09:24:56.788535 62 ccl/workloadccl/fixture.go:498  imported order (11m17s, 69000000 rows, 138000000 index entries, 4.1 GiB)
		I190315 09:29:25.294492 61 ccl/workloadccl/fixture.go:498  imported history (15m45s, 69000000 rows, 138000000 index entries, 8.8 GiB)
		I190315 09:37:58.645737 60 ccl/workloadccl/fixture.go:498  imported customer (24m19s, 69000000 rows, 69000000 index entries, 40 GiB)
		Error:  exit status 255
		: exit status 1

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/70e3468e7ed5fb30b10eaaf972acbb0f16099d01

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=tpcc/nodes=3/w=max PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1178890&tab=buildLog

The test failed on master:
	cluster.go:1267,tpcc.go:130,cluster.go:1605,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1178890-tpcc-nodes-3-w-max:4 -- ./workload run tpcc --warehouses=1350 --histograms=logs/stats.json  --ramp=5m0s --duration=2h0m0s {pgurl:1-3} returned:
		stderr:
		
		stdout:
		12884.9 stockLevel
		_elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
		  1h6m9s        0           26.0           26.2   2147.5  21474.8  34359.7  34359.7 delivery
		  1h6m9s        0          306.9          261.5   3221.2  11811.2  47244.6 103079.2 newOrder
		  1h6m9s        0           18.0           26.3     35.7    100.7    142.6    142.6 orderStatus
		  1h6m9s        0          267.9          262.4   1208.0  10737.4  24696.1  49392.1 payment
		  1h6m9s        0           28.0           26.3    318.8   5368.7  40802.2  40802.2 stockLevel
		Error: error in payment: ERROR: TransactionStatusError: transaction deadline exceeded by 23.938195581s (1552672841.710262803,149 > 1552672817.772067222,0), original timestamp 1m37.576848872s ago (1552672744.133413931,0): id=92bfa49e key=/Table/61/1/500/0 rw=true pri=0.05055075 stat=PENDING epo=0 ts=1552672841.710262803,149 orig=1552672744.133413931,0 max=1552672744.133413931,1 wto=false seq=7 (REASON_UNKNOWN) (SQLSTATE XX000)
		Error:  exit status 1
		: exit status 1
	cluster.go:1626,tpcc.go:140,tpcc.go:164,test.go:1214: Goexit() was called

@nvanbenschoten
Member

I pulled #35337 (comment) into a separate tracking issue: #35812.

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/04ef15974085e14f758b20c552a84052eac9fa2b

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=tpcc/nodes=3/w=max PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1180757&tab=buildLog

The test failed on master:
	cluster.go:1267,tpcc.go:119,tpcc.go:121,tpcc.go:164,test.go:1214: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1180757-tpcc-nodes-3-w-max:4 -- ./workload fixtures import tpcc --checks=true --warehouses=2300 {pgurl:1} returned:
		stderr:
		
		stdout:
		I190316 10:22:48.817302 1 ccl/workloadccl/cliccl/fixtures.go:321  starting import of 9 tables
		I190316 10:22:50.035024 14 ccl/workloadccl/fixture.go:498  imported warehouse (1s, 2300 rows, 0 index entries, 122 KiB)
		I190316 10:22:50.092109 15 ccl/workloadccl/fixture.go:498  imported district (1s, 23000 rows, 0 index entries, 2.3 MiB)
		I190316 10:22:51.208586 85 ccl/workloadccl/fixture.go:498  imported item (2s, 100000 rows, 0 index entries, 7.8 MiB)
		I190316 10:25:00.533833 84 ccl/workloadccl/fixture.go:498  imported new_order (2m12s, 20700000 rows, 0 index entries, 293 MiB)
		Error:  exit status 255
		: exit status 1

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/c2939ec9a4f15b7fb8683a5805deeb241953e7aa

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=tpcc/nodes=3/w=max PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1182995&tab=buildLog

The test failed on master:
	cluster.go:1267,tpcc.go:119,tpcc.go:121,tpcc.go:164,test.go:1214: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1182995-tpcc-nodes-3-w-max:4 -- ./workload fixtures import tpcc --checks=true --warehouses=2300 {pgurl:1} returned:
		stderr:
		
		stdout:
		I190318 08:53:14.533056 1 ccl/workloadccl/cliccl/fixtures.go:321  starting import of 9 tables
		I190318 08:53:15.870735 16 ccl/workloadccl/fixture.go:498  imported warehouse (1s, 2300 rows, 0 index entries, 122 KiB)
		I190318 08:53:16.322067 66 ccl/workloadccl/fixture.go:498  imported district (2s, 23000 rows, 0 index entries, 2.3 MiB)
		I190318 08:53:17.088452 71 ccl/workloadccl/fixture.go:498  imported item (3s, 100000 rows, 0 index entries, 7.8 MiB)
		I190318 08:55:05.423908 70 ccl/workloadccl/fixture.go:498  imported new_order (1m51s, 20700000 rows, 0 index entries, 293 MiB)
		Error:  exit status 255
		: exit status 1

@tbg
Member

tbg commented Mar 18, 2019

Ok, this test should really pass if it doesn't highlight specific failures. I'm digging into these weird failure modes now.

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/a512e390f7f2f2629f3f811bab5866c46e3e5713

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=tpcc/nodes=3/w=max PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1183682&tab=buildLog

The test failed on provisional_201903122203_v19.1.0-beta.20190318:
	cluster.go:1267,tpcc.go:119,tpcc.go:121,tpcc.go:164,test.go:1214: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1183682-tpcc-nodes-3-w-max:4 -- ./workload fixtures import tpcc --checks=true --warehouses=2300 {pgurl:1} returned:
		stderr:
		
		stdout:
		I190318 15:20:02.727743 1 ccl/workloadccl/cliccl/fixtures.go:321  starting import of 9 tables
		I190318 15:20:03.967507 14 ccl/workloadccl/fixture.go:498  imported warehouse (1s, 2300 rows, 0 index entries, 122 KiB)
		I190318 15:20:04.267273 15 ccl/workloadccl/fixture.go:498  imported district (2s, 23000 rows, 0 index entries, 2.3 MiB)
		I190318 15:20:04.842390 69 ccl/workloadccl/fixture.go:498  imported item (2s, 100000 rows, 0 index entries, 7.8 MiB)
		I190318 15:22:13.928330 68 ccl/workloadccl/fixture.go:498  imported new_order (2m11s, 20700000 rows, 0 index entries, 293 MiB)
		I190318 15:31:48.497016 67 ccl/workloadccl/fixture.go:498  imported order (11m46s, 69000000 rows, 138000000 index entries, 4.1 GiB)
		I190318 15:37:21.814380 66 ccl/workloadccl/fixture.go:498  imported history (17m19s, 69000000 rows, 138000000 index entries, 8.8 GiB)
		Error:  exit status 255
		: exit status 1

@tbg
Member

tbg commented Mar 18, 2019

One issue is definitely the lease imbalance here:

[screenshot: leaseholder distribution across the three nodes, showing the imbalance]

Test has been running for 45 minutes and order_line claims it needs another 2h to finish (the rest is done). Maybe this triggers some timeout somewhere? I don't see one on the TPCC test itself. It uses a 2h duration, but that's just how long the workload runs for, and the workload never even gets to run here because the import doesn't finish.

My run (from which the above screenshot was captured) has definitely already made it past the failures above, but we can also see that the import makes some progress but never gets order_line in. Now just have to figure out what the timeout is. Are we perhaps idling the ssh or sql session and getting disconnected?

@tbg
Member

tbg commented Mar 18, 2019

This is my ssh invocation running the import on node four:

tschott+ 10025  0.0  0.0  46836  5848 pts/1    S+   16:57   0:00 ssh ubuntu@52.15.81.199 -q -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -i /home/tschottdorf/.ssh/id_rsa -i /home/tschottdorf/.ssh/google_compute_engine export ROACHPROD=4 GOTRACEBACK=crash && bash -c "./workload fixtures import tpcc --checks=true --warehouses=2300 'postgres://root@172.31.42.122:26257?sslmode=disable'"

Doesn't look like keep-alives are on; they're off by default. Might be worth a shot.
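
For illustration, here is roughly what turning keep-alives on for that session could look like with plain OpenSSH client options; the exact flags roachprod ends up passing are an assumption here, not taken from the referenced commit:

# Hedged sketch: add application-level and TCP keep-alives to the ssh invocation
# above. ServerAliveInterval sends an ssh-level ping every 60s; the connection is
# only dropped after ServerAliveCountMax unanswered pings. <node-ip> and <pgurl>
# are placeholders.
ssh -q \
  -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no \
  -o ServerAliveInterval=60 -o ServerAliveCountMax=5 -o TCPKeepAlive=yes \
  ubuntu@<node-ip> \
  './workload fixtures import tpcc --checks=true --warehouses=2300 <pgurl>'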

tbg added a commit to tbg/cockroach that referenced this issue Mar 18, 2019
This can't hurt, and could explain

cockroachdb#35337 (comment)

Release note: None
@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/c59f5347d5424edb90575fb0fd50bad677953752

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=tpcc/nodes=3/w=max PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1195736&tab=buildLog

The test failed on release-19.1:
	cluster.go:1267,tpcc.go:130,cluster.go:1605,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1195736-tpcc-nodes-3-w-max:4 -- ./workload run tpcc --warehouses=2300 --histograms=logs/stats.json  --ramp=5m0s --duration=2h0m0s {pgurl:1-3} returned:
		stderr:
		
		stdout:
		 0            0.0           48.2      0.0      0.0      0.0      0.0 delivery
		  1h2m6s        0            0.0          479.9      0.0      0.0      0.0      0.0 newOrder
		  1h2m6s        0            0.0           48.3      0.0      0.0      0.0      0.0 orderStatus
		  1h2m6s        0            0.0          481.4      0.0      0.0      0.0      0.0 payment
		  1h2m6s        0            0.0           48.2      0.0      0.0      0.0      0.0 stockLevel
		  1h2m7s        0            0.0           48.2      0.0      0.0      0.0      0.0 delivery
		  1h2m7s        0            0.0          479.7      0.0      0.0      0.0      0.0 newOrder
		  1h2m7s        0            0.0           48.2      0.0      0.0      0.0      0.0 orderStatus
		  1h2m7s        0            0.0          481.2      0.0      0.0      0.0      0.0 payment
		  1h2m7s        0            0.0           48.2      0.0      0.0      0.0      0.0 stockLevel
		Error: error in delivery: ERROR: no inbound stream connection (SQLSTATE XXUUU)
		Error:  exit status 1
		: exit status 1
	cluster.go:1626,tpcc.go:140,tpcc.go:165,test.go:1214: Goexit() was called

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/5a746073c3f8ede851f37dd895cf1a91d6dcc3cf

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=tpcc/nodes=3/w=max PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1195714&tab=buildLog

The test failed on master:
	cluster.go:1267,tpcc.go:130,cluster.go:1605,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1195714-tpcc-nodes-3-w-max:4 -- ./workload run tpcc --warehouses=1350 --histograms=logs/stats.json  --ramp=5m0s --duration=2h0m0s {pgurl:1-3} returned:
		stderr:
		
		stdout:
		            0.0           25.3      0.0      0.0      0.0      0.0 delivery
		   31m7s        0            0.0          250.7      0.0      0.0      0.0      0.0 newOrder
		   31m7s        0            0.0           25.3      0.0      0.0      0.0      0.0 orderStatus
		   31m7s        0            0.0          252.6      0.0      0.0      0.0      0.0 payment
		   31m7s        0            0.0           25.2      0.0      0.0      0.0      0.0 stockLevel
		   31m8s        0            0.0           25.3      0.0      0.0      0.0      0.0 delivery
		   31m8s        0            0.0          250.6      0.0      0.0      0.0      0.0 newOrder
		   31m8s        0            0.0           25.3      0.0      0.0      0.0      0.0 orderStatus
		   31m8s        0            0.0          252.5      0.0      0.0      0.0      0.0 payment
		   31m8s        0            0.0           25.2      0.0      0.0      0.0      0.0 stockLevel
		Error: error in stockLevel: ERROR: no inbound stream connection (SQLSTATE XXUUU)
		Error:  exit status 1
		: exit status 1
	cluster.go:1626,tpcc.go:140,tpcc.go:165,test.go:1214: Goexit() was called

@asubiotto
Contributor

So far I have not hit a reproduction of the no inbound stream connection error (since yesterday morning). Note that the latest failure includes @jordanlewis's fixes for the deadlock issue with flow cancellation #35931.

Looking at the logs from the failures, it seems like the cluster is overloaded, given the slow liveness heartbeats and slow handleRaftReady warnings. This could also cause SetupFlow requests to take long enough that streams fail to connect back within the 10s window. We've seen this mostly with memory pressure, so I'm attempting to reproduce with GODEBUG=gctrace=1.
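
For reference, a minimal sketch of what running a node with GC tracing enabled looks like; the store path and flags below are illustrative, not the exact roachprod invocation used for the repro:

# Hedged sketch: GODEBUG=gctrace=1 makes the Go runtime print one line per GC
# (heap sizes, pause times) to stderr, which can then be correlated with the
# slow heartbeat / inbound stream errors. Paths and flags are placeholders.
GODEBUG=gctrace=1 ./cockroach start --insecure --store=/mnt/data1/cockroach 2> gctrace.log &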

Will look at the goroutine dumps to see if there's anything interesting.

I've also been wondering if:

  1. We should increase the inbound stream timeout (a sketch follows after this list). With a customer's workload that would overload the cluster, we saw this limit hit very easily.
  2. We might want to make no inbound stream connection retriable.
    cc @jordanlewis @andreimatei for these thoughts
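
On point 1, a minimal sketch of how the 10s window could be widened for an experiment, assuming the timeout is exposed as the sql.distsql.flow_stream_timeout cluster setting in this build (setting name and default would need to be verified):

# Hedged sketch: bump the time an incoming stream waits for its flow to be set
# up before failing with "no inbound stream connection". Assumes the setting
# exists under this name; run against any node of the cluster.
./cockroach sql --insecure -e "SET CLUSTER SETTING sql.distsql.flow_stream_timeout = '30s'"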

@tbg I don't think I would classify this failure, in the context of this test, as a release blocker, but let me know if you think otherwise. It's essentially a hard-to-hit query timeout.

@tbg
Member

tbg commented Mar 27, 2019

@tbg I don't think I would classify this failure in the context of this test as a release blocker, but let me know if you think otherwise. It's essentially a hard to hit query timeout.

I'll let Dan chime in too, but this test currently constitutes our fledgling release whitelist, meaning that if this test fails, we don't release. Perhaps the conclusion can be that we're pushing w=max too far and our "true max" is lower, judging from the most recent repro above.

The logs of n2 and n3 do have a few slow heartbeats before the no inbound stream error, but generally look good until they don't:

I190324 17:27:52.755513 359 server/status/runtime.go:500  [n2] runtime stats: 6.5 GiB RSS, 4337 goroutines, 853 MiB/315 MiB/1.8 GiB GO alloc/idle/total, 3.8 GiB/4.7 GiB CGO alloc/total, 209724.6 CGO/sec, 1070.7/127.1 %(u/s)time, 0.1 %gc (6x), 89 MiB/99 MiB (r/w)net
W190324 17:27:58.277953 366 storage/node_liveness.go:518  [n2,hb] slow heartbeat took 1.1s
I190324 17:28:02.795266 359 server/status/runtime.go:500  [n2] runtime stats: 6.5 GiB RSS, 3654 goroutines, 956 MiB/326 MiB/1.8 GiB GO alloc/idle/total, 3.8 GiB/4.7 GiB CGO alloc/total, 150268.6 CGO/sec, 981.3/134.1 %(u/s)time, 0.1 %gc (6x), 90 MiB/93 MiB (r/w)net
W190324 17:28:07.750580 366 storage/node_liveness.go:518  [n2,hb] slow heartbeat took 1.5s
I190324 17:28:12.832913 359 server/status/runtime.go:500  [n2] runtime stats: 6.6 GiB RSS, 4544 goroutines, 991 MiB/299 MiB/1.8 GiB GO alloc/idle/total, 3.8 GiB/4.8 GiB CGO alloc/total, 212396.5 CGO/sec, 1093.9/131.6 %(u/s)time, 0.1 %gc (6x), 95 MiB/106 MiB (r/w)net
I190324 17:28:22.877750 359 server/status/runtime.go:500  [n2] runtime stats: 6.6 GiB RSS, 4048 goroutines, 1.3 GiB/236 MiB/1.8 GiB GO alloc/idle/total, 3.8 GiB/4.7 GiB CGO alloc/total, 144819.3 CGO/sec, 1084.7/131.5 %(u/s)time, 0.1 %gc (7x), 98 MiB/105 MiB (r/w)net
I190324 17:28:32.911487 359 server/status/runtime.go:500  [n2] runtime stats: 6.5 GiB RSS, 3925 goroutines, 1.0 GiB/346 MiB/1.8 GiB GO alloc/idle/total, 3.8 GiB/4.7 GiB CGO alloc/total, 102940.2 CGO/sec, 383.3/67.6 %(u/s)time, 0.0 %gc (1x), 31 MiB/31 MiB (r/w)net
I190324 17:28:42.934982 359 server/status/runtime.go:500  [n2] runtime stats: 6.5 GiB RSS, 3949 goroutines, 833 MiB/338 MiB/1.8 GiB GO alloc/idle/total, 3.8 GiB/4.7 GiB CGO alloc/total, 86940.8 CGO/sec, 663.5/87.7 %(u/s)time, 0.1 %gc (4x), 53 MiB/55 MiB (r/w)net
I190324 17:28:43.711137 355 gossip/gossip.go:557  [n2] gossip status (ok, 3 nodes)
gossip client (1/3 cur/max conns)
  1: 35.196.9.240:26257 (2h49m0s: infos 22874/37130 sent/received, bytes 4721137B/8351985B sent/received)
gossip server (0/3 cur/max conns, infos 22874/37130 sent/received, bytes 4721137B/8351985B sent/received)
E190324 17:28:46.881349 18677122 sql/distsqlrun/flow_registry.go:224  [intExec=get-table-statistics] flow id:83d925ee-2057-4906-a863-7eeb55f64cc3 : 1 inbound streams timed out after 10s; propagated error throughout flow
E190324 17:28:46.886394 18677195 sql/distsqlrun/flow_registry.go:224  [intExec=get-table-statistics] flow id:0afe19f6-e269-4d67-a2d9-621a0971ab30 : 1 inbound streams timed out after 10s; propagated error throughout flow
E190324 17:28:46.886396 18677285 sql/distsqlrun/flow_registry.go:224  [intExec=get-table-statistics] flow id:11aaa2ea-7f88-405e-9bb8-4cfa49310dc9 : 1 inbound streams timed out after 10s; propagated error throughout flow
E190324 17:28:46.886516 18677286 sql/distsqlrun/flow_registry.go:224  [intExec=get-table-statistics] flow id:c33a41e0-6688-42f9-802f-4f828a54128b : 1 inbound streams timed out after 10s; propagated error throughout flow
E190324 17:28:46.886540 18677287 sql/distsqlrun/flow_registry.go:224  [intExec=get-table-statistics] flow id:cdbd667d-f4d9-4f5c-8624-f966d615c5a0 : 1 inbound streams timed out after 10s; propagated error throughout flow
E190324 17:28:46.886564 18677288 sql/distsqlrun/flow_registry.go:224  [intExec=get-table-statistics] flow id:26ba0a9f-88a4-4ff3-ac20-dab772a958b1 : 1 inbound streams timed out after 10s; propagated error throughout flow
E190324 17:28:46.886602 18677196 sql/distsqlrun/flow_registry.go:224  [intExec=get-table-statistics] flow id:73cf62fa-17d0-41d1-9138-b5ecbbe4ef2c : 1 inbound streams timed out after 10s; propagated error throughout flow
E190324 17:28:46.886615 18677143 sql/distsqlrun/flow_registry.go:224  [intExec=get-table-statistics] flow id:e1f8bee6-cc6c-4e23-9eda-51dd3d13f836 : 1 inbound streams timed out after 10s; propagated error throughout flow
E190324 17:28:46.886618 18677214 sql/distsqlrun/flow_registry.go:224  [intExec=get-table-statistics] flow id:0d0f37bc-05a0-4ea2-8fb3-ebf833452dd1 : 1 inbound streams timed out after 10s; propagated error throughout flow
E190324 17:28:46.886629 18677142 sql/distsqlrun/flow_registry.go:224  [intExec=get-table-statistics] flow id:8ad243e3-13c4-482f-a80c-34a674e45601 : 1 inbound streams timed out after 10s; propagated error throughout flow
E190324 17:28:46.886650 18677105 sql/distsqlrun/flow_registry.go:224  [intExec=get-table-statistics] flow id:84ec6b8d-4628-4123-be61-38d7efa17d5b : 1 inbound streams timed out after 10s; propagated error throughout flow

i.e., there's nothing going on except a system hard at work, and then boom, suddenly the inbound stream errors pop up. But when we look at n1 it's another story: that node does look legitimately in trouble. One indication of this is the goroutine dump taken at 17:15 (the inbound stream errors are at 17:28). This is during the splits, so I'm willing to say it's just overloaded during the import and the load hasn't started yet, but let's look anyway. There's contention on lots of locks, notably the stopper's (optimizing that might give a perf boost at some point), and on the histogram computation in handleRaftReady (cc @ajwerner). This could all just be fallout of the system being overloaded. (PS: I got this from the goroutine_dumper in the logs, which is already proving a good investment.)

panicparsed.log

The log of n1 around the time of the inbound stream problems indicates that the node was under heavy stress there as well. It hovers at around 7k goroutines (other nodes have roughly half that), and various handleRaftReady calls take 0.6-0.9s. These are highcpu GCE machines (16 CPUs), and 13.5 of them are maxed out if the stats are to be believed.

I190324 17:28:00.616989 279 server/status/runtime.go:500 [n1] runtime stats: 6.8 GiB RSS, 7241 goroutines, 1.1 GiB/416 MiB/2.1 GiB GO alloc/idle/total, 3.8 GiB/4.7 GiB CGO alloc/total, 235101.6 CGO/sec, 1220.6/138.8 %(u/s)time, 0.2 %gc (6x), 96 MiB/108 MiB (r/w)net

@ajwerner I'd appreciate it if you gave this last repro a good look and maybe ran this yourself, watching out for signs of overload. It's hard for me to judge whether this is all just reaching its natural breaking point or whether (which I don't assume) something stupid is going on. The memory pressure doesn't seem too high, though (if I'm reading the 0.2% gc right).

BTW, can someone help me explain this beauty (probably @vivekmenezes)?

I190324 16:11:29.281718 769 sql/lease.go:307  publish (1 leases): desc=[{stock 61 1}]
I190324 16:11:29.462325 769 sql/lease.go:307  publish (1 leases): desc=[{stock 61 1}]
I190324 16:11:29.654777 769 sql/lease.go:307  publish (1 leases): desc=[{stock 61 1}]
I190324 16:11:29.846804 769 sql/lease.go:307  publish (1 leases): desc=[{stock 61 1}]
I190324 16:11:30.042346 769 sql/lease.go:307  publish (1 leases): desc=[{stock 61 1}]
I190324 16:11:30.234288 769 sql/lease.go:307  publish (1 leases): desc=[{stock 61 1}]
I190324 16:11:30.412164 769 sql/lease.go:307  publish (1 leases): desc=[{stock 61 1}]
I190324 16:11:30.616656 769 sql/lease.go:307  publish (1 leases): desc=[{stock 61 1}]
I190324 16:11:30.850010 769 sql/lease.go:307  publish (1 leases): desc=[{stock 61 1}]
I190324 16:11:31.032439 769 sql/lease.go:307  publish (1 leases): desc=[{stock 61 1}]
I190324 16:11:31.263257 769 sql/lease.go:307  publish (1 leases): desc=[{stock 61 1}]

These lines repeat at roughly this cadence for around four minutes. Looks like something needs to be fixed there.

log.Infof(context.TODO(), "publish (%d leases): desc=%v", count, tables)

@tbg
Member

tbg commented Mar 27, 2019

It's also unclear why n1 gets slammed so much more than the other nodes. Is the balance maybe just off?

@tbg
Member

tbg commented Mar 27, 2019

@bdarnell to look into this failure mode with @petermattis as a second pair of 👀

@andreimatei
Contributor

We might want to make no inbound stream connection retriable.
cc @jordanlewis @andreimatei for these thoughts

It's as retriable as other RPC errors.

@ajwerner
Contributor

I verified that saturating network links does not seem to be the problem. On the hunch that 96 MiB/108 MiB (r/w) net might be coming close to saturating a 1 Gbit network link, I went and measured the network throughput between hosts of this machine type, and I can report that it is not the problem: iperf gets 8.57 Gbits/sec over a single TCP connection.
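
For reference, a rough sketch of that kind of measurement (hosts and durations below are illustrative, not the exact commands run):

# Hedged sketch: single-stream TCP throughput check between two of the hosts.
# On one node, run the iperf server:
iperf -s
# On another node, drive one TCP stream at it for 10 seconds:
iperf -c <server-internal-ip> -t 10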

@tbg
Member

tbg commented Mar 27, 2019 via email

@ajwerner
Contributor

I just blindly assumed that was a per-second number, given the others were. I prefer bytes to bits even though it's networking; I'd prefer to never have to deal in bits. I just got thrown by the time unit.

@asubiotto
Contributor

asubiotto commented Mar 27, 2019

By the way, alfonso-test is still up with tpcc 2k restored. Should I extend it for anyone who wants to take a look? Given that we've seen something fishy during import it might make sense to just start with a new cluster.

One other thing I've tried is manually triggering a stats job for order_line given that a lot of the failed flows seemed to be statistics jobs, however this did not produce the failure.

@bdarnell
Contributor

It's also unclear why n1 gets slammed so much more than the other nodes. Is the balance maybe just off?

Maybe it's the load-based-rebalancing death spiral seen in #34590 ?

@bdarnell
Contributor

The exit status 255 failure is coming from ssh. With -v added to the ssh command line, we see that it's a "broken pipe":

--- FAIL: tpcc/nodes=3/w=max (5070.79s)
	cluster.go:1293,tpcc.go:119,tpcc.go:121,tpcc.go:165,test.go:1223: /Users/bdarnell/cockroach/bin/roachprod run ben-1553980169-tpcc-nodes-3-w-max:4 -- ./workload fixtures load tpcc --checks=true --warehouses=2000 {pgurl:1} returned:
		stderr:
		ebug1: Authentication succeeded (publickey).
		Authenticated to 35.185.49.255 ([35.185.49.255]:22).
		debug1: channel 0: new [client-session]
		debug1: Requesting no-more-sessions@openssh.com
		debug1: Entering interactive session.
		debug1: pledge: network
		debug1: client_input_global_request: rtype hostkeys-00@openssh.com want_reply 0
		debug1: Sending environment.
		debug1: Sending env LANG = en_US.UTF-8
		debug1: Sending command: export ROACHPROD=4 GOTRACEBACK=crash && bash -c "./workload fixtures load tpcc --checks=true --warehouses=2000 'postgres://root@10.142.0.32:26257?sslmode=disable'"
		I190330 21:11:51.954354 1 ccl/workloadccl/cliccl/fixtures.go:296  starting load of 9 tables
		I190330 21:14:36.156765 150 ccl/workloadccl/fixture.go:600  loaded district (2m44s, 20000 rows, 0 index entries, 2.0 MiB)
		I190330 21:17:24.120448 149 ccl/workloadccl/fixture.go:600  loaded warehouse (5m32s, 2000 rows, 0 index entries, 106 KiB)
		packet_write_wait: Connection to 35.185.49.255 port 22: Broken pipe
		Error:  exit status 255
		
		stdout:
		: exit status 1
FAIL

This is the error that is expected if an intermediary network device forgets about the connection due to inactivity. However, we're running with ServerAliveInterval=60 to prevent the connection from being completely idle (unless there's something in the path with a 60s timeout and we'd need to send our pings every 55s to reliably keep the connection alive, but 60s would be very short for a timeout at this level). We see this in a variety of network conditions, including TeamCity, where everything should be staying within the GCP network.

I would expect problems on the host we're sshing into to result in "connection reset" errors, not "broken pipe", but I can't rule that out; there may be something that could cause this failure mode instead.

Within minutes of first boot we're getting (and rejecting) brute force ssh attacks. Shouldn't make a difference; just an observation:

Mar 31 14:13:01 ben-1554041301-tpcc-nodes-3-w-max-0004 sshd[3435]: error: maximum authentication attempts exceeded for root from 42.61.24.202 port 35112 ssh2 [preauth]
Mar 31 14:13:01 ben-1554041301-tpcc-nodes-3-w-max-0004 sshd[3435]: Disconnecting: Too many authentication failures [preauth]

In any case, the cockroach processes seem fine even after the test has failed with exit status 255. They don't crash, and they keep processing the import. So this subset of failures appears to just be an issue of test automation and doesn't indicate a problem in cockroach.
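
One quick sanity check in that situation is the same monitor command the harness itself uses (cluster name elided here):

# confirm that the cockroach processes are still up after the ssh failure
roachprod monitor <cluster> --oneshot --ignore-empty-nodes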

@tbg
Member

tbg commented Mar 31, 2019

^- On the off chance that it might help, should we try lowering ServerAliveInterval further, to something that's definitely good enough, like 10s?
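
For illustration, the settings in question as plain ssh options; the user/host and the ServerAliveCountMax value below are assumptions, not necessarily what roachprod passes:

# send a keepalive probe every 10s; give up after 3 unanswered probes
ssh -o ServerAliveInterval=10 -o ServerAliveCountMax=3 <user>@<node-ip>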

@bdarnell
Contributor

Seems unlikely to help, but couldn't hurt.

I assume in the CI environment we've sorted out all the versioning issues and are using the right build of roachprod? That's something that still often trips me up running locally.

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/a039a93a5cc6eb3f395ceb6f7dc8030985dccc29

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=tpcc/nodes=3/w=max PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1212269&tab=buildLog

The test failed on release-19.1:
	cluster.go:1293,tpcc.go:130,cluster.go:1631,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1212269-tpcc-nodes-3-w-max:4 -- ./workload run tpcc --warehouses=1350 --histograms=logs/stats.json  --ramp=5m0s --duration=2h0m0s {pgurl:1-3} returned:
		stderr:
		
		stdout:
		 0.0      0.0      0.0 stockLevel
		   1m22s        0            1.0            2.5   4563.4   4563.4   4563.4   4563.4 delivery
		   1m22s        0            7.0           12.2   2281.7   3087.0   3087.0   3087.0 newOrder
		   1m22s        0            0.0            3.1      0.0      0.0      0.0      0.0 orderStatus
		   1m22s        0            6.0           19.4   1275.1   1610.6   1610.6   1610.6 payment
		   1m22s        0            0.0            0.3      0.0      0.0      0.0      0.0 stockLevel
		   1m23s        0            2.0            2.5   2684.4   2818.6   2818.6   2818.6 delivery
		   1m23s        0            0.0           12.1      0.0      0.0      0.0      0.0 newOrder
		   1m23s        0            0.0            3.0      0.0      0.0      0.0      0.0 orderStatus
		   1m23s        0            3.0           19.2   1476.4   1543.5   1543.5   1543.5 payment
		   1m23s        0            0.0            0.3      0.0      0.0      0.0      0.0 stockLevel
		Error: error in newOrder: EOF
		Error:  exit status 1
		: exit status 1
	cluster.go:1652,tpcc.go:140,tpcc.go:165,test.go:1223: unexpected node event: 2: dead
	cluster.go:953,context.go:90,cluster.go:942,asm_amd64.s:522,panic.go:397,test.go:774,test.go:760,cluster.go:1652,tpcc.go:140,tpcc.go:165,test.go:1223: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-1212269-tpcc-nodes-3-w-max --oneshot --ignore-empty-nodes: exit status 1 4: skipped
		2: dead
		3: 4054
		1: 4436
		Error:  2: dead

@bdarnell
Contributor

bdarnell commented Apr 1, 2019

That last failure is an OOM on node 2. Memory and goroutine counts were stable until they spiked in the two minutes before the crash.

I190401 16:42:03.817147 46 server/status/runtime.go:500  [n2] runtime stats: 5.6 GiB RSS, 319 goroutines, 234 MiB/884 MiB/1.2 GiB GO alloc/idle/total, 3.8 GiB/4.9 GiB CGO alloc/total, 69249.9 CGO/sec, 112.1/6.7 %(u/s)time, 0.0 %gc (7x), 304 KiB/204 KiB (r/w)net
I190401 16:42:13.840686 46 server/status/runtime.go:500  [n2] runtime stats: 5.7 GiB RSS, 319 goroutines, 217 MiB/901 MiB/1.2 GiB GO alloc/idle/total, 3.8 GiB/5.0 GiB CGO alloc/total, 63389.0 CGO/sec, 109.6/6.0 %(u/s)time, 0.0 %gc (7x), 302 KiB/297 KiB (r/w)net
I190401 16:42:23.868594 46 server/status/runtime.go:500  [n2] runtime stats: 5.8 GiB RSS, 2099 goroutines, 332 MiB/788 MiB/1.2 GiB GO alloc/idle/total, 3.8 GiB/5.0 GiB CGO alloc/total, 56935.5 CGO/sec, 273.1/13.4 %(u/s)time, 0.1 %gc (18x), 3.5 MiB/75 MiB (r/w)net
I190401 16:42:34.041438 46 server/status/runtime.go:500  [n2] runtime stats: 5.8 GiB RSS, 2095 goroutines, 325 MiB/798 MiB/1.2 GiB GO alloc/idle/total, 3.8 GiB/5.0 GiB CGO alloc/total, 51073.9 CGO/sec, 139.1/11.6 %(u/s)time, 0.0 %gc (6x), 3.4 MiB/2.3 MiB (r/w)net
I190401 16:42:44.065249 46 server/status/runtime.go:500  [n2] runtime stats: 5.8 GiB RSS, 2095 goroutines, 294 MiB/860 MiB/1.2 GiB GO alloc/idle/total, 3.8 GiB/5.0 GiB CGO alloc/total, 67992.8 CGO/sec, 135.4/14.0 %(u/s)time, 0.0 %gc (5x), 2.3 MiB/1.3 MiB (r/w)net
I190401 16:42:54.339793 46 server/status/runtime.go:500  [n2] runtime stats: 5.8 GiB RSS, 2126 goroutines, 638 MiB/514 MiB/1.2 GiB GO alloc/idle/total, 3.8 GiB/5.0 GiB CGO alloc/total, 13975.6 CGO/sec, 559.9/22.7 %(u/s)time, 0.1 %gc (10x), 134 MiB/106 MiB (r/w)net
I190401 16:43:04.395685 46 server/status/runtime.go:500  [n2] runtime stats: 6.8 GiB RSS, 2214 goroutines, 1.4 GiB/243 MiB/1.8 GiB GO alloc/idle/total, 3.8 GiB/5.0 GiB CGO alloc/total, 97970.7 CGO/sec, 1483.7/20.2 %(u/s)time, 0.1 %gc (8x), 480 MiB/108 MiB (r/w)net
I190401 16:43:14.636629 46 server/status/runtime.go:500  [n2] runtime stats: 7.7 GiB RSS, 2333 goroutines, 2.4 GiB/164 MiB/2.7 GiB GO alloc/idle/total, 3.8 GiB/4.9 GiB CGO alloc/total, 70716.1 CGO/sec, 1502.4/19.4 %(u/s)time, 0.0 %gc (4x), 382 MiB/148 MiB (r/w)net
I190401 16:43:24.822567 46 server/status/runtime.go:500  [n2] runtime stats: 8.9 GiB RSS, 2540 goroutines, 3.6 GiB/43 MiB/3.9 GiB GO alloc/idle/total, 3.8 GiB/4.9 GiB CGO alloc/total, 20889.6 CGO/sec, 1534.6/23.5 %(u/s)time, 0.4 %gc (4x), 628 MiB/209 MiB (r/w)net
I190401 16:43:34.929251 46 server/status/runtime.go:500  [n2] runtime stats: 10 GiB RSS, 2823 goroutines, 4.8 GiB/377 MiB/5.4 GiB GO alloc/idle/total, 3.8 GiB/4.9 GiB CGO alloc/total, 72971.7 CGO/sec, 1532.3/28.6 %(u/s)time, 0.0 %gc (2x), 491 MiB/126 MiB (r/w)net
I190401 16:43:45.214057 46 server/status/runtime.go:500  [n2] runtime stats: 11 GiB RSS, 2867 goroutines, 6.0 GiB/210 MiB/6.5 GiB GO alloc/idle/total, 3.8 GiB/5.0 GiB CGO alloc/total, 24142.6 CGO/sec, 1508.9/21.0 %(u/s)time, 0.0 %gc (2x), 310 MiB/82 MiB (r/w)net
I190401 16:43:55.408768 46 server/status/runtime.go:500  [n2] runtime stats: 12 GiB RSS, 3187 goroutines, 6.4 GiB/385 MiB/7.2 GiB GO alloc/idle/total, 3.9 GiB/5.0 GiB CGO alloc/total, 67605.4 CGO/sec, 1539.0/21.0 %(u/s)time, 0.0 %gc (1x), 256 MiB/78 MiB (r/w)net
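
(A quick way to see that spike at a glance, assuming the node's log is at hand; the file name below is a placeholder:)

# print timestamp and RSS from each runtime-stats line
grep 'runtime stats' cockroach.log | awk '{print $2, $8, $9}'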

Right before the crash there were a lot of slow handleRaftReadies:

W190401 16:43:50.072708 236 storage/store.go:3658  [n2,s2,r1103/2:/Table/59/3/3{05/7/1…-34/7/2…}] handle raft ready: 0.5s [processed=0]
W190401 16:43:50.072834 256 storage/store.go:3658  [n2,s2,r1106/2:/Table/59/3/1{137/1/…-200/1/…}] handle raft ready: 0.5s [processed=0]
W190401 16:43:50.075383 254 storage/store.go:3658  [n2,s2,r5554/2:/Table/58/1/61{4/918…-5/511…}] handle raft ready: 0.5s [processed=0]
W190401 16:43:50.075550 227 storage/store.go:3658  [n2,s2,r5509/2:/Table/58/1/61{3/879…-4/918…}] handle raft ready: 0.5s [processed=0]
W190401 16:43:50.075787 267 storage/store.go:3658  [n2,s2,r4885/2:/Table/58/1/59{6/695…-7/735…}] handle raft ready: 0.5s [processed=0]
...
W190401 16:43:51.376175 197 storage/store.go:3658  [n2,s2,r1123/2:/Table/59/3/3{34/7/2…-97/7/2…}] handle raft ready: 0.6s [processed=0]
W190401 16:43:51.376296 227 storage/store.go:3658  [n2,s2,r1097/2:/Table/59/3/6{49/7/6…-96/2/1…}] handle raft ready: 0.6s [processed=0]
W190401 16:43:51.376419 255 storage/store.go:3658  [n2,s2,r6469/2:/Table/58/1/8{59/126…-60/166…}] handle raft ready: 0.6s [processed=0]
W190401 16:43:51.376503 241 storage/store.go:3658  [n2,s2,r6461/2:/Table/58/1/85{8/871…-9/126…}] handle raft ready: 0.6s [processed=0]
W190401 16:43:51.389505 279 storage/store.go:3658  [n2,s2,r1094/2:/Table/59/3/{241/9/…-305/7/…}] handle raft ready: 0.6s [processed=0]
W190401 16:43:51.389596 248 storage/store.go:3658  [n2,s2,r966/2:/Table/59/3/{948/1/…-1011/1…}] handle raft ready: 0.6s [processed=0]
W190401 16:43:51.395516 246 storage/store.go:3658  [n2,s2,r1107/2:/Table/59/3/{696/2/…-759/2/…}] handle raft ready: 0.6s [processed=0]
W190401 16:43:51.400614 224 storage/store.go:3658  [n2,s2,r5681/2:/Table/58/1/61{8/379…-9/419…}] handle raft ready: 0.6s [processed=0]
...
W190401 16:44:01.525343 284 storage/store.go:3658  [n2,s2,r5818/2:/Table/58/1/74{4/760…-5/800…}] handle raft ready: 0.7s [processed=0]
W190401 16:44:01.525463 201 storage/store.go:3658  [n2,s2,r7560/2:/Table/58/1/111{3/680…-4/720…}] handle raft ready: 0.7s [processed=0]
W190401 16:44:01.525520 213 storage/store.go:3658  [n2,s2,r7551/2:/Table/58/1/111{2/640…-3/680…}] handle raft ready: 0.7s [processed=0]
W190401 16:44:01.525587 182 storage/store.go:3658  [n2,s2,r7584/2:/Table/58/1/9{89/403…-90/442…}] handle raft ready: 0.7s [processed=0]
W190401 16:44:01.525646 310 storage/store.go:3658  [n2,s2,r7754/2:/Table/58/1/112{2/638…-3/678…}] handle raft ready: 0.7s [processed=0]
W190401 16:44:01.525709 252 storage/store.go:3658  [n2,s2,r7745/2:/Table/58/1/112{1/829…-2/638…}] handle raft ready: 0.7s [processed=0]

Maybe the disk just got slow and the lack of admission control let it blow up? The goroutine stacks look fairly normal (although more than usual are shown as "runnable", even those inside Mutex.Lock calls).
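
One way to check the slow-disk theory on a live node is stock iostat; a sketch, with device names varying by machine:

# extended per-device stats every second, skipping idle devices
iostat -xz 1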

@tbg
Member

tbg commented Apr 1, 2019

Nothing in logs/heap_profiler?

@bdarnell
Contributor

bdarnell commented Apr 1, 2019

Yes, there's a heap profile.

[heap profile image]

@bdarnell
Contributor

bdarnell commented Apr 1, 2019

I believe all of these failure modes (except "exit status 255") may just be the ones that are possible when a cluster is under sustained overload. According to tpccbench we can't reliably hit 1350 warehouses on this hardware any more (on one run I got 1395, and on a second run of the same commit I got 1315). I think we should reduce the thresholds here and see if that resolves the flakiness.

Use #36097 for discussing why things have gotten slower and what we can do about it.

bdarnell added a commit to bdarnell/cockroach that referenced this issue Apr 1, 2019
Make large cuts to deflake the test until we can get enough data to
put a tighter bound on it. The AWS case does not appear to have passed
since its introduction.

Closes cockroachdb#35337
Updates cockroachdb#36097

Release note: None
craig bot pushed a commit that referenced this issue Apr 2, 2019
36401: roachtest: Reduce tpcc/w=max targets r=tbg a=bdarnell

Make large cuts to deflake the test until we can get enough data to
put a tighter bound on it. The AWS case does not appear to have passed
since its introduction.

Closes #35337
Updates #36097

Release note: None

Co-authored-by: Ben Darnell <ben@bendarnell.com>
bdarnell added a commit to bdarnell/cockroach that referenced this issue Apr 2, 2019
Make large cuts to deflake the test until we can get enough data to
put a tighter bound on it. The AWS case does not appear to have passed
since its introduction.

Closes cockroachdb#35337
Updates cockroachdb#36097

Release note: None
@craig craig bot closed this as completed in #36401 Apr 2, 2019