Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

roachtest: cdc/initial-scan/rangefeed=true failed #35327

Closed
cockroach-teamcity opened this issue Mar 2, 2019 · 29 comments
Closed

roachtest: cdc/initial-scan/rangefeed=true failed #35327

cockroach-teamcity opened this issue Mar 2, 2019 · 29 comments
Assignees
Labels
C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot.

Comments

@cockroach-teamcity
Copy link
Member

SHA: https://github.com/cockroachdb/cockroach/commits/032c4980720abc1bdd71e4428e4111e6e6383297

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=cdc/initial-scan/rangefeed=true PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1158877&tab=buildLog

The test failed on master:
	cluster.go:1226,cdc.go:612,cdc.go:124,cluster.go:1564,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1158877-cdc-initial-scan-rangefeed-true:4 -- ./workload run tpcc --warehouses=100 --duration=30m --tolerate-errors {pgurl:1-3}  returned:
		stderr:
		
		stdout:
		 1.2      0.0      0.0      0.0      0.0 delivery
		  27m38s      214            0.0           11.9      0.0      0.0      0.0      0.0 newOrder
		  27m38s      214            0.0            1.2      0.0      0.0      0.0      0.0 orderStatus
		  27m38s      214            0.0           11.9      0.0      0.0      0.0      0.0 payment
		  27m38s      214            0.0            1.1      0.0      0.0      0.0      0.0 stockLevel
		E190302 07:48:57.133553 1 workload/cli/run.go:419  error in orderStatus: dial tcp 10.142.0.32:26257: connect: connection refused
		  27m39s      231            0.0            1.2      0.0      0.0      0.0      0.0 delivery
		  27m39s      231            0.0           11.9      0.0      0.0      0.0      0.0 newOrder
		  27m39s      231            0.0            1.2      0.0      0.0      0.0      0.0 orderStatus
		  27m39s      231            0.0           11.9      0.0      0.0      0.0      0.0 payment
		  27m39s      231            0.0            1.1      0.0      0.0      0.0      0.0 stockLevel
		: signal: killed
	cluster.go:1585,cdc.go:212,cdc.go:382,test.go:1211: dial tcp 35.231.15.42:26257: connect: connection refused

@cockroach-teamcity cockroach-teamcity added C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. labels Mar 2, 2019
@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/d888b76df319571afe4d5816f1a1f0f53905653f

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=cdc/initial-scan/rangefeed=true PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1159641&tab=buildLog

The test failed on master:
	cluster.go:1226,cdc.go:612,cdc.go:124,cluster.go:1564,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1159641-cdc-initial-scan-rangefeed-true:4 -- ./workload run tpcc --warehouses=100 --duration=30m --tolerate-errors {pgurl:1-3}  returned:
		stderr:
		
		stdout:
		   0.0 newOrder
		  13m39s      193            0.0            1.7      0.0      0.0      0.0      0.0 orderStatus
		  13m39s      193            0.0           17.1      0.0      0.0      0.0      0.0 payment
		  13m39s      193            0.0            1.5      0.0      0.0      0.0      0.0 stockLevel
		E190303 07:35:23.555181 1 workload/cli/run.go:419  error in orderStatus: dial tcp 10.142.0.123:26257: connect: connection refused
		  13m40s      206            1.0            1.7 103079.2 103079.2 103079.2 103079.2 delivery
		  13m40s      206            0.0           17.0      0.0      0.0      0.0      0.0 newOrder
		  13m40s      206            0.0            1.7      0.0      0.0      0.0      0.0 orderStatus
		  13m40s      206            0.0           17.0      0.0      0.0      0.0      0.0 payment
		  13m40s      206            0.0            1.5      0.0      0.0      0.0      0.0 stockLevel
		E190303 07:35:24.555268 1 workload/cli/run.go:419  error in payment: dial tcp 10.142.0.123:26257: connect: connection refused
		: signal: killed
	cluster.go:1585,cdc.go:212,cdc.go:382,test.go:1211: unexpected node event: 3: dead

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/de1793532332fb64fca27cafe92d2481d900a5a0

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=cdc/initial-scan/rangefeed=true PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1160394&tab=buildLog

The test failed on master:
	cluster.go:1226,cdc.go:612,cdc.go:124,cluster.go:1564,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1160394-cdc-initial-scan-rangefeed-true:4 -- ./workload run tpcc --warehouses=100 --duration=30m --tolerate-errors {pgurl:1-3}  returned:
		stderr:
		
		stdout:
		     0.0 newOrder
		  27m51s      153            0.0            1.7      0.0      0.0      0.0      0.0 orderStatus
		  27m51s      153            0.0           16.7      0.0      0.0      0.0      0.0 payment
		  27m51s      153            0.0            1.6      0.0      0.0      0.0      0.0 stockLevel
		E190304 07:50:09.812300 1 workload/cli/run.go:419  error in orderStatus: dial tcp 10.142.0.45:26257: connect: connection refused
		  27m52s      170            0.0            1.7      0.0      0.0      0.0      0.0 delivery
		  27m52s      170            0.0           16.8      0.0      0.0      0.0      0.0 newOrder
		  27m52s      170            1.0            1.7 103079.2 103079.2 103079.2 103079.2 orderStatus
		  27m52s      170            0.0           16.7      0.0      0.0      0.0      0.0 payment
		  27m52s      170            0.0            1.6      0.0      0.0      0.0      0.0 stockLevel
		E190304 07:50:10.812308 1 workload/cli/run.go:419  error in payment: dial tcp 10.142.0.45:26257: connect: connection refused
		: signal: killed
	cluster.go:1585,cdc.go:212,cdc.go:382,test.go:1211: dial tcp 35.231.18.91:26257: connect: connection refused

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/959dcf7de0f94cfcfa0062387b109adebd1f11da

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=cdc/initial-scan/rangefeed=true PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1163702&tab=buildLog

The test failed on master:
	cluster.go:1244,cdc.go:600,cdc.go:117,cdc.go:382,test.go:1214: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1163702-cdc-initial-scan-rangefeed-true:4 -- ./workload fixtures load tpcc --warehouses=100 --checks=false {pgurl:1} returned:
		stderr:
		
		stdout:
		06 07:23:52.798603 119 ccl/workloadccl/fixture.go:527  loaded warehouse (2m55s, 100 rows, 0 index entries, 5.1 KiB)
		I190306 07:24:38.053411 120 ccl/workloadccl/fixture.go:527  loaded district (3m40s, 1000 rows, 0 index entries, 99 KiB)
		I190306 07:25:18.729035 123 ccl/workloadccl/fixture.go:527  loaded order (4m21s, 3000000 rows, 6000000 index entries, 166 MiB)
		I190306 07:26:16.260390 122 ccl/workloadccl/fixture.go:527  loaded history (5m18s, 3000000 rows, 6000000 index entries, 380 MiB)
		I190306 07:26:25.145795 127 ccl/workloadccl/fixture.go:527  loaded order_line (5m27s, 29995875 rows, 29995875 index entries, 2.0 GiB)
		I190306 07:26:25.512091 126 ccl/workloadccl/fixture.go:527  loaded stock (5m28s, 10000000 rows, 10000000 index entries, 3.1 GiB)
		I190306 07:26:25.512487 121 ccl/workloadccl/fixture.go:527  loaded customer (5m28s, 3000000 rows, 3000000 index entries, 1.7 GiB)
		Error: restoring fixture: PostLoad hook: pq: count-leases: AS OF SYSTEM TIME: cannot specify timestamp in the future
		Error:  exit status 1
		: exit status 1

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/e6f0a720a98c60ed0545e7cbb4b2f224b463cae9

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=cdc/initial-scan/rangefeed=true PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1165130&tab=buildLog

The test failed on master:
	cluster.go:1244,cdc.go:612,cdc.go:124,cluster.go:1582,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1165130-cdc-initial-scan-rangefeed-true:4 -- ./workload run tpcc --warehouses=100 --duration=30m --tolerate-errors {pgurl:1-3}  returned:
		stderr:
		
		stdout:
		1.5      0.0      0.0      0.0      0.0 delivery
		  17m38s      219            0.0           15.2      0.0      0.0      0.0      0.0 newOrder
		  17m38s      219            0.0            1.5      0.0      0.0      0.0      0.0 orderStatus
		  17m38s      219            0.0           15.1      0.0      0.0      0.0      0.0 payment
		  17m38s      219            0.0            1.3      0.0      0.0      0.0      0.0 stockLevel
		E190307 07:44:33.180664 1 workload/cli/run.go:419  error in orderStatus: dial tcp 10.142.0.159:26257: connect: connection refused
		  17m39s      245            0.0            1.5      0.0      0.0      0.0      0.0 delivery
		  17m39s      245            0.0           15.2      0.0      0.0      0.0      0.0 newOrder
		  17m39s      245            0.0            1.5      0.0      0.0      0.0      0.0 orderStatus
		  17m39s      245            0.0           15.0      0.0      0.0      0.0      0.0 payment
		  17m39s      245            0.0            1.3      0.0      0.0      0.0      0.0 stockLevel
		: signal: killed
	cluster.go:1603,cdc.go:212,cdc.go:382,test.go:1214: dial tcp 35.243.183.26:26257: connect: connection refused

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/5f7de3c35348e847dc0d4ab1ba8d76574e1706c2

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=cdc/initial-scan/rangefeed=true PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1169980&tab=buildLog

The test failed on master:
	cluster.go:1244,cdc.go:612,cdc.go:124,cluster.go:1582,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1169980-cdc-initial-scan-rangefeed-true:4 -- ./workload run tpcc --warehouses=100 --duration=30m --tolerate-errors {pgurl:1-3}  returned:
		stderr:
		
		stdout:
		     1.3      0.0      0.0      0.0      0.0 delivery
		   29m9s      284            5.0           13.4 103079.2 103079.2 103079.2 103079.2 newOrder
		   29m9s      284            0.0            1.3      0.0      0.0      0.0      0.0 orderStatus
		   29m9s      284            5.0           13.3 103079.2 103079.2 103079.2 103079.2 payment
		   29m9s      284            0.0            1.2      0.0      0.0      0.0      0.0 stockLevel
		E190310 07:50:00.754133 1 workload/cli/run.go:419  error in payment: dial tcp 10.142.0.62:26257: connect: connection refused
		  29m10s      311            0.0            1.3      0.0      0.0      0.0      0.0 delivery
		  29m10s      311            6.0           13.4 103079.2 103079.2 103079.2 103079.2 newOrder
		  29m10s      311            0.0            1.3      0.0      0.0      0.0      0.0 orderStatus
		  29m10s      311            4.0           13.3 103079.2 103079.2 103079.2 103079.2 payment
		  29m10s      311            0.0            1.2      0.0      0.0      0.0      0.0 stockLevel
		: signal: killed
	cluster.go:1603,cdc.go:212,cdc.go:382,test.go:1214: dial tcp 35.231.151.90:26257: connect: connection refused

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/a119a3a158725c9e3f9b8084d9398601c0e67007

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=cdc/initial-scan/rangefeed=true PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1170795&tab=buildLog

The test failed on master:
	cluster.go:1244,cdc.go:612,cdc.go:124,cluster.go:1582,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1170795-cdc-initial-scan-rangefeed-true:4 -- ./workload run tpcc --warehouses=100 --duration=30m --tolerate-errors {pgurl:1-3}  returned:
		stderr:
		
		stdout:
		.0           15.2      0.0      0.0      0.0      0.0 newOrder
		  23m12s      223            0.0            1.5      0.0      0.0      0.0      0.0 orderStatus
		  23m12s      223            0.0           15.2      0.0      0.0      0.0      0.0 payment
		  23m12s      223            0.0            1.4      0.0      0.0      0.0      0.0 stockLevel
		E190311 06:48:00.167665 1 workload/cli/run.go:419  error in payment: dial tcp 10.142.0.69:26257: connect: connection refused
		_elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
		  23m13s      235            0.0            1.5      0.0      0.0      0.0      0.0 delivery
		  23m13s      235            0.0           15.2      0.0      0.0      0.0      0.0 newOrder
		  23m13s      235            0.0            1.5      0.0      0.0      0.0      0.0 orderStatus
		  23m13s      235            0.0           15.2      0.0      0.0      0.0      0.0 payment
		  23m13s      235            0.0            1.4      0.0      0.0      0.0      0.0 stockLevel
		: signal: killed
	cluster.go:1603,cdc.go:212,cdc.go:382,test.go:1214: dial tcp 35.231.151.174:26257: connect: connection refused

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/9d058d53c8a82fceb2205f1827c26f1bf36c32ba

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=cdc/initial-scan/rangefeed=true PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1172386&tab=buildLog

The test failed on master:
	cluster.go:1244,cdc.go:613,cdc.go:124,cluster.go:1582,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1172386-cdc-initial-scan-rangefeed-true:4 -- ./workload run tpcc --warehouses=100 --duration=30m --tolerate-errors {pgurl:1-3}  returned:
		stderr:
		
		stdout:
		.0           13.8 103079.2 103079.2 103079.2 103079.2 newOrder
		  28m12s      301            0.0            1.4      0.0      0.0      0.0      0.0 orderStatus
		  28m12s      301            1.0           13.8 103079.2 103079.2 103079.2 103079.2 payment
		  28m12s      301            0.0            1.2      0.0      0.0      0.0      0.0 stockLevel
		E190312 06:43:43.035924 1 workload/cli/run.go:419  error in payment: dial tcp 10.142.0.72:26257: connect: connection refused
		_elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
		  28m13s      376            0.0            1.4      0.0      0.0      0.0      0.0 delivery
		  28m13s      376            0.0           13.8      0.0      0.0      0.0      0.0 newOrder
		  28m13s      376            0.0            1.4      0.0      0.0      0.0      0.0 orderStatus
		  28m13s      376            1.0           13.8 103079.2 103079.2 103079.2 103079.2 payment
		  28m13s      376            0.0            1.2      0.0      0.0      0.0      0.0 stockLevel
		: signal: killed
	cluster.go:1603,cdc.go:212,cdc.go:382,test.go:1214: unexpected node event: 2: dead

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/57e825a7940495b67e0cc7213a5fabc24e12be0e

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=cdc/initial-scan/rangefeed=true PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1176948&tab=buildLog

The test failed on master:
	cluster.go:1251,cdc.go:625,cdc.go:125,cluster.go:1589,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1176948-cdc-initial-scan-rangefeed-true:4 -- ./workload run tpcc --warehouses=100 --duration=30m --tolerate-errors {pgurl:1-3}  returned:
		stderr:
		
		stdout:
		0      0.0      0.0 orderStatus
		  25m20s      209            0.0           16.3      0.0      0.0      0.0      0.0 payment
		  25m20s      209            0.0            1.5      0.0      0.0      0.0      0.0 stockLevel
		E190314 06:45:16.134437 1 workload/cli/run.go:419  error in payment: dial tcp 10.142.0.125:26257: connect: connection refused
		_elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
		  25m21s      230            0.0            1.6      0.0      0.0      0.0      0.0 delivery
		  25m21s      230            0.0           16.3      0.0      0.0      0.0      0.0 newOrder
		  25m21s      230            0.0            1.6      0.0      0.0      0.0      0.0 orderStatus
		  25m21s      230            0.0           16.2      0.0      0.0      0.0      0.0 payment
		  25m21s      230            0.0            1.5      0.0      0.0      0.0      0.0 stockLevel
		E190314 06:45:17.575477 1 workload/cli/run.go:419  error in payment: dial tcp 10.142.0.125:26257: connect: connection refused
		: signal: killed
	cluster.go:1610,cdc.go:213,cdc.go:383,test.go:1214: dial tcp 35.231.194.107:26257: connect: connection refused

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/70e3468e7ed5fb30b10eaaf972acbb0f16099d01

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=cdc/initial-scan/rangefeed=true PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1178890&tab=buildLog

The test failed on master:
	cluster.go:1267,cdc.go:625,cdc.go:125,cluster.go:1605,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1178890-cdc-initial-scan-rangefeed-true:4 -- ./workload run tpcc --warehouses=100 --duration=30m --tolerate-errors {pgurl:1-3}  returned:
		stderr:
		
		stdout:
		   0.0 newOrder
		  13m37s      219            0.0            0.6      0.0      0.0      0.0      0.0 orderStatus
		  13m37s      219            0.0            5.8      0.0      0.0      0.0      0.0 payment
		  13m37s      219            0.0            0.4      0.0      0.0      0.0      0.0 stockLevel
		E190315 06:31:52.083973 1 workload/cli/run.go:420  error in orderStatus: dial tcp 10.142.0.123:26257: connect: connection refused
		  13m38s      235            0.0            0.6      0.0      0.0      0.0      0.0 delivery
		  13m38s      235            0.0            5.7      0.0      0.0      0.0      0.0 newOrder
		  13m38s      235            0.0            0.6      0.0      0.0      0.0      0.0 orderStatus
		  13m38s      235            0.0            5.8      0.0      0.0      0.0      0.0 payment
		  13m38s      235            0.0            0.4      0.0      0.0      0.0      0.0 stockLevel
		E190315 06:31:53.083978 1 workload/cli/run.go:420  error in payment: dial tcp 10.142.0.123:26257: connect: connection refused
		: signal: killed
	cluster.go:1626,cdc.go:213,cdc.go:383,test.go:1214: unexpected node event: 2: dead

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/c2939ec9a4f15b7fb8683a5805deeb241953e7aa

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=cdc/initial-scan/rangefeed=true PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1182991&tab=buildLog

The test failed on master:
	cluster.go:1267,cdc.go:612,cdc.go:118,cdc.go:383,test.go:1214: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1182991-cdc-initial-scan-rangefeed-true:4 -- ./workload fixtures load tpcc --warehouses=100 --checks=false {pgurl:1} returned:
		stderr:
		
		stdout:
		I190318 06:22:28.696417 1 ccl/workloadccl/cliccl/fixtures.go:292  starting load of 9 tables
		Error:  exit status 255
		: exit status 1

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/f95d45653df6be5587fb9887de241f50b6932000

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=cdc/initial-scan/rangefeed=true PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1185396&tab=buildLog

The test failed on master:
	cluster.go:1267,cdc.go:625,cdc.go:125,cluster.go:1605,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1185396-cdc-initial-scan-rangefeed-true:4 -- ./workload run tpcc --warehouses=100 --duration=30m --tolerate-errors {pgurl:1-3}  returned:
		stderr:
		
		stdout:
		   0.0 newOrder
		  17m22s      219            0.0            0.7      0.0      0.0      0.0      0.0 orderStatus
		  17m22s      219            0.0            7.3      0.0      0.0      0.0      0.0 payment
		  17m22s      219            0.0            0.5      0.0      0.0      0.0      0.0 stockLevel
		E190319 06:41:03.956955 1 workload/cli/run.go:420  error in orderStatus: dial tcp 10.142.0.176:26257: connect: connection refused
		  17m23s      251            0.0            0.7      0.0      0.0      0.0      0.0 delivery
		  17m23s      251            0.0            7.5      0.0      0.0      0.0      0.0 newOrder
		  17m23s      251            0.0            0.7      0.0      0.0      0.0      0.0 orderStatus
		  17m23s      251            0.0            7.3      0.0      0.0      0.0      0.0 payment
		  17m23s      251            0.0            0.5      0.0      0.0      0.0      0.0 stockLevel
		E190319 06:41:04.957052 1 workload/cli/run.go:420  error in payment: dial tcp 10.142.0.176:26257: connect: connection refused
		: signal: killed
	cluster.go:1626,cdc.go:213,cdc.go:383,test.go:1214: dial tcp 35.196.204.231:26257: connect: connection refused

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/3a7ea2d8c9d4a3e0d97f8f106fcf95b3f03765ec

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=cdc/initial-scan/rangefeed=true PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1187480&tab=buildLog

The test failed on master:
	cluster.go:1267,cdc.go:625,cdc.go:125,cluster.go:1605,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1187480-cdc-initial-scan-rangefeed-true:4 -- ./workload run tpcc --warehouses=100 --duration=30m --tolerate-errors {pgurl:1-3}  returned:
		stderr:
		
		stdout:
		.7      0.0      0.0      0.0      0.0 delivery
		   16m2s      195            0.0            7.7      0.0      0.0      0.0      0.0 newOrder
		   16m2s      195            0.0            0.8      0.0      0.0      0.0      0.0 orderStatus
		   16m2s      195            0.0            7.5      0.0      0.0      0.0      0.0 payment
		   16m2s      195            0.0            0.6      0.0      0.0      0.0      0.0 stockLevel
		E190320 06:37:14.809395 1 workload/cli/run.go:420  error in orderStatus: dial tcp 10.142.15.206:26257: connect: connection refused
		   16m3s      215            0.0            0.7      0.0      0.0      0.0      0.0 delivery
		   16m3s      215            0.0            7.7      0.0      0.0      0.0      0.0 newOrder
		   16m3s      215            0.0            0.8      0.0      0.0      0.0      0.0 orderStatus
		   16m3s      215            0.0            7.5      0.0      0.0      0.0      0.0 payment
		   16m3s      215            0.0            0.6      0.0      0.0      0.0      0.0 stockLevel
		: signal: killed
	cluster.go:1626,cdc.go:213,cdc.go:383,test.go:1214: dial tcp 35.231.191.122:26257: connect: connection refused

@tbg
Copy link
Member

tbg commented Mar 20, 2019

^- I190320 06:30:56.262630 161 server/status/runtime.go:500 [n3] runtime stats: 13 GiB RSS, 973 goroutines, 8.0 GiB/711 MiB/9.2 GiB GO alloc/idle/total, 3.5 GiB/4.1 GiB CGO alloc/total, 2762.1 CGO/sec, 1551.0/4.3 %(u/s)time, 0.1 %gc (1x), 179 MiB/13 MiB (r/w)net
fatal error: runtime: out of memory

@danhhz
Copy link
Contributor

danhhz commented Mar 20, 2019

looks like both node 1 and 3 had high memory usage. i downloaded the memprof dumps in the artifacts but nothing jumped out at me.

node 1 taken at 06_29_32 when memory usage was around

I190320 06:29:32.973572 343 server/status/runtime.go:500  [n1] runtime stats: 12 GiB RSS, 967 goroutines, 7.0 GiB/688 MiB/8.1 GiB GO alloc/idle/total, 3.5 GiB/4.1 GiB CGO alloc/total, 936.6 CGO/sec, 1588.6/2.2 %(u/s)time, 0.0 %gc (1x), 23 MiB/12 MiB (r/w)net
(pprof) top10
Showing nodes accounting for 3849.51MB, 97.14% of 3962.90MB total
Dropped 449 nodes (cum <= 19.81MB)
Showing top 10 nodes out of 116
      flat  flat%   sum%        cum   cum%
 2361.49MB 59.59% 59.59%  2361.49MB 59.59%  github.com/cockroachdb/cockroach/pkg/roachpb.(*ScanResponse).Unmarshal
  823.12MB 20.77% 80.36%   823.12MB 20.77%  github.com/cockroachdb/cockroach/pkg/storage/engine.copyFromSliceVector
  207.48MB  5.24% 85.60%   207.48MB  5.24%  github.com/cockroachdb/cockroach/vendor/github.com/golang/leveldb/memfs.(*file).Write
  144.61MB  3.65% 89.25%   144.61MB  3.65%  github.com/cockroachdb/cockroach/pkg/storage/engine.gobytes
  115.01MB  2.90% 92.15%   115.01MB  2.90%  time.NewTimer
      96MB  2.42% 94.57%       96MB  2.42%  github.com/cockroachdb/cockroach/vendor/github.com/andy-kimball/arenaskl.NewArena
   62.87MB  1.59% 96.16%    62.87MB  1.59%  github.com/cockroachdb/cockroach/pkg/roachpb.(*ExportResponse_File).Unmarshal
   37.94MB  0.96% 97.11%    37.94MB  0.96%  github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc.(*parser).recvMsg
    0.50MB 0.013% 97.13%    31.62MB   0.8%  github.com/cockroachdb/cockroach/pkg/sql/distsqlrun.(*joinReader).performLookup
    0.50MB 0.013% 97.14%    22.04MB  0.56%  github.com/cockroachdb/cockroach/pkg/sql/row.(*Fetcher).StartScan

node 3 is similar. i'm not sure why the 3962.90MB covered by the memprof doesn't line up with the log line. ran the test locally and it worked like a charm

@jordanlewis
Copy link
Member

@danhhz I've just talked with some people about this issue (@lucy-zhang, @dt, some people on the Gophers slack). My current working theory is that the alloc memory in the Go mem stats is up to date, but the sum of the memory in a heap profile is only up to date as of the most recent full garbage collection. If you're currently looking at this, could you try collecting the heap profile with the argument to the pprof webserver that forces a GC first? I'll get to trying this test eventually as well, but if you're deep in it, just a suggestion.

@danhhz
Copy link
Contributor

danhhz commented Mar 20, 2019

The ones I was referencing were collected for me, but I have a copy of the test running right now if you want me to verify the theory. Do you know the argument offhand or should I go read some docs?

@danhhz
Copy link
Contributor

danhhz commented Mar 20, 2019

Oh, it's just ?gc or something like that. Update: it's ?gc=1

@danhhz
Copy link
Contributor

danhhz commented Mar 20, 2019

@jordanlewis maybe? running with ?gc=1 does seem to drop the number in the log line, but still doesn't seem to make pprof match. can't tell if it's just because a bunch of allocs happen between pprof returning and the log line printing

@danhhz
Copy link
Contributor

danhhz commented Mar 20, 2019

My method was tailing the log and repeatedly running

go tool pprof -top 10 "http://dan-cdc-0001.roachprod.crdb.io:26258/debug/pprof/heap?gc=1" | head

@jordanlewis
Copy link
Member

Hm, okay, too bad. Maybe the inuse_space profile was never designed to be a completely accurage picture?

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/dfa23c01e4ea39b19ca8b2e5c8a4e7cf9b9445f4

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=cdc/initial-scan/rangefeed=true PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1189954&tab=buildLog

The test failed on master:
	cluster.go:1267,cdc.go:625,cdc.go:125,cluster.go:1605,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1189954-cdc-initial-scan-rangefeed-true:4 -- ./workload run tpcc --warehouses=100 --duration=30m --tolerate-errors {pgurl:1-3}  returned:
		stderr:
		
		stdout:
		.0            7.0      0.0      0.0      0.0      0.0 newOrder
		  14m16s      275            0.0            0.7      0.0      0.0      0.0      0.0 orderStatus
		  14m16s      275            0.0            7.0      0.0      0.0      0.0      0.0 payment
		  14m16s      275            0.0            0.4      0.0      0.0      0.0      0.0 stockLevel
		E190321 06:33:30.772800 1 workload/cli/run.go:420  error in payment: dial tcp 10.142.0.74:26257: connect: connection refused
		_elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
		  14m17s      298            0.0            0.7      0.0      0.0      0.0      0.0 delivery
		  14m17s      298            0.0            7.0      0.0      0.0      0.0      0.0 newOrder
		  14m17s      298            0.0            0.7      0.0      0.0      0.0      0.0 orderStatus
		  14m17s      298            0.0            6.9      0.0      0.0      0.0      0.0 payment
		  14m17s      298            0.0            0.4      0.0      0.0      0.0      0.0 stockLevel
		: signal: killed
	cluster.go:1626,cdc.go:213,cdc.go:383,test.go:1214: unexpected node event: 2: dead

@danhhz
Copy link
Contributor

danhhz commented Mar 21, 2019

Seems like a similar problem as yesterday. I don't see anything new and helpful in the logs. Gonna try and repro with additional logging again.

@danhhz
Copy link
Contributor

danhhz commented Mar 21, 2019

After some digging, I feel pretty confidant that the last two failures are the same issue as the most recent comments in #35947. This potentially affects any tpcc cluster with stats, so I'm going to send a PR to disable auto stats in the changefeed tests until that gets sorted out. Every night's run is important data right now.

craig bot pushed a commit that referenced this issue Mar 21, 2019
36034: roachtest: temporarily disable auto stats collection for cdc tests r=tbg a=danhhz

Workaround for #35947. The optimizer currently plans a bad query for
TPCC when it has stats, so disable stats for now.

Touches #35327 where local tests saw this happen. Perhaps it's also been the cause of the last two nightly run failures.

Release note: None

Co-authored-by: Daniel Harrison <daniel.harrison@gmail.com>
@danhhz danhhz self-assigned this Mar 22, 2019
@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/1a5eabad4511a3371a6b2809d2bfc29e8aff66a6

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=cdc/initial-scan/rangefeed=true PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1224702&tab=buildLog

The test failed on master:
	cdc.go:176,cluster.go:1667,errgroup.go:57: read tcp 172.17.0.2:50972->35.237.239.129:26257: read: connection reset by peer
	cluster.go:1329,cdc.go:748,cdc.go:135,cluster.go:1667,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1224702-cdc-initial-scan-rangefeed-true:4 -- ./workload run tpcc --warehouses=100 --duration=30m --tolerate-errors {pgurl:1-3}  returned:
		stderr:
		
		stdout:
		: signal: killed
	cluster.go:1688,cdc.go:223,cdc.go:496,test.go:1228: Goexit() was called

@danhhz
Copy link
Contributor

danhhz commented Apr 8, 2019

@tbg do you know what to make of this ^^? I don't see any failures in logs or dmesg

@danhhz
Copy link
Contributor

danhhz commented Apr 8, 2019

Looked at the debug.zip too and all I saw was a bunch of ranges marked as raft_log_too_large. Dunno what to make of that though

@tbg
Copy link
Member

tbg commented Apr 9, 2019

The error originates here:

if _, err := db.Exec(
`SET CLUSTER SETTING kv.closed_timestamp.target_duration='10s'`,
); err != nil {
t.Fatal(err)
}

This is talking to the first node. The first node seems to be running just fine.

https://cockroachlabs.slack.com/archives/C4A9ALLRL/p1553980575046300 and recent discussions with @nvanbenschoten come to mind (though I think that latter one had to do with preparing a gazillion statements over a wide area, but I'm not sure the specific failure mode there was understood).

Either way, I would be mildly surprised if this were a CDC bug. But there's definitely some bug here, either in the networking setup (in roachtest or in CRDB) that sometimes creates obscure errors. Also cc @bdarnell who was also looking at similar issues which I can't find any more.

@bdarnell
Copy link
Contributor

bdarnell commented Apr 9, 2019

I was looking at "exit status 255" in #35337 (comment)

The most expected cause of "connection reset by peer" is if the server process you're talking to crashes (or a NAT/load balancer/proxy, although I don't think any of those are in play here). I'm not sure what could cause this if there's nothing in the logs or dmesg.

@danhhz
Copy link
Contributor

danhhz commented Apr 16, 2019

(Besides this last networking issue), this test hasn't failed since the last round of fixes went in. Gonna go ahead and finally close this. Any new failures are likely to be a new(ly discovered) issue at this point

@danhhz danhhz closed this as completed Apr 16, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot.
Projects
None yet
Development

No branches or pull requests

5 participants