Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

roachtest: tpcc/headroom/n4cpu16 failed #37163

Closed
cockroach-teamcity opened this issue Apr 26, 2019 · 35 comments
Closed

roachtest: tpcc/headroom/n4cpu16 failed #37163

cockroach-teamcity opened this issue Apr 26, 2019 · 35 comments
Assignees
Labels
C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot.
Milestone

Comments

@cockroach-teamcity
Copy link
Member

SHA: https://github.com/cockroachdb/cockroach/commits/a53852a8f6c02ca5573a22abc03c790326ef69ba

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=tpcc/headroom/n4cpu16 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1262103&tab=buildLog

The test failed on branch=release-19.1, cloud=gce:
	cluster.go:1445,tpcc.go:168,cluster.go:1783,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1262103-tpcc-headroom-n4cpu16:4 -- ./workload run tpcc --warehouses=875 --histograms=logs/stats.json  --ramp=5m0s --duration=2h0m0s {pgurl:1-3} returned:
		stderr:
		
		stdout:
		  0.0      0.0      0.0      0.0 newOrder
		    3m0s        0            0.0            1.9      0.0      0.0      0.0      0.0 orderStatus
		    3m0s        0            0.0           21.0      0.0      0.0      0.0      0.0 payment
		    3m0s        0            0.0            0.1      0.0      0.0      0.0      0.0 stockLevel
		_elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
		    3m1s        0            1.0            1.9  31138.5  31138.5  31138.5  31138.5 delivery
		    3m1s        0            4.0           15.3  38654.7  94489.3  94489.3  94489.3 newOrder
		    3m1s        0            0.0            1.9      0.0      0.0      0.0      0.0 orderStatus
		    3m1s        0            0.0           20.9      0.0      0.0      0.0      0.0 payment
		    3m1s        0            0.0            0.1      0.0      0.0      0.0      0.0 stockLevel
		Error: error in stockLevel: ERROR: communication error: rpc error: code = Canceled desc = context canceled (SQLSTATE 08006)
		Error:  exit status 1
		: exit status 1
	cluster.go:1804,tpcc.go:178,tpcc.go:243,test.go:1250: Goexit() was called

@cockroach-teamcity cockroach-teamcity added this to the 19.2 milestone Apr 26, 2019
@cockroach-teamcity cockroach-teamcity added C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. labels Apr 26, 2019
@andreimatei
Copy link
Contributor

Looks like towards the end of the test the QPS for some types of transactions dropped to 0. And the we also failed to download the debug.zip.
@nvanbenschoten have you seen this before?

@nvanbenschoten
Copy link
Member

No I haven't, but I wouldn't be surprised if this is related to what we're seeing in #37199.

@cockroach-teamcity

This comment has been minimized.

@cockroach-teamcity

This comment has been minimized.

@cockroach-teamcity

This comment has been minimized.

@nvanbenschoten
Copy link
Member

The failures over the past 3 days are because of #37590.

@cockroach-teamcity

This comment has been minimized.

@cockroach-teamcity

This comment has been minimized.

@cockroach-teamcity

This comment has been minimized.

@cockroach-teamcity

This comment has been minimized.

@nvanbenschoten
Copy link
Member

Previous three issues addressed by #37701.

@cockroach-teamcity

This comment has been minimized.

@nvanbenschoten
Copy link
Member

Latest failure addressed by #37726.

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/699f675c73f8420802f92e46f65e6dce52abc12f

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=tpcc/headroom/n4cpu16 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1306272&tab=buildLog

The test failed on branch=release-19.1, cloud=aws:
	cluster.go:1516,tpcc.go:157,tpcc.go:159,tpcc.go:243,test.go:1251: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1306272-tpcc-headroom-n4cpu16:4 -- ./workload fixtures import tpcc --warehouses=1470  {pgurl:1} returned:
		stderr:
		
		stdout:
		I190524 07:27:23.869728 1 ccl/workloadccl/cliccl/fixtures.go:331  starting import of 9 tables
		Error: importing fixture: importing table warehouse: pq: expected tpcc version "2.1.0" but got "2.0.1"
		Error:  ssh verbose log retained in /root/.roachprod/debug/ssh_18.223.122.242_2019-05-24T07:27:22Z: exit status 1
		: exit status 1

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/db98d5fb943e0a45b3878bdf042838408e9aee40

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=tpcc/headroom/n4cpu16 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1308285&tab=buildLog

The test failed on branch=release-19.1, cloud=aws:
	cluster.go:1516,tpcc.go:157,tpcc.go:159,tpcc.go:243,test.go:1251: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1308285-tpcc-headroom-n4cpu16:4 -- ./workload fixtures import tpcc --warehouses=1470  {pgurl:1} returned:
		stderr:
		
		stdout:
		I190525 07:44:32.482117 1 ccl/workloadccl/cliccl/fixtures.go:331  starting import of 9 tables
		Error: importing fixture: importing table history: pq: expected tpcc version "2.1.0" but got "2.0.1"
		Error:  ssh verbose log retained in /root/.roachprod/debug/ssh_18.216.136.34_2019-05-25T07:44:31Z: exit status 1
		: exit status 1

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/db98d5fb943e0a45b3878bdf042838408e9aee40

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=tpcc/headroom/n4cpu16 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1308281&tab=buildLog

The test failed on branch=release-19.1, cloud=gce:
	cluster.go:1516,tpcc.go:168,cluster.go:1854,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1308281-tpcc-headroom-n4cpu16:4 -- ./workload run tpcc --warehouses=875 --histograms=logs/stats.json  --ramp=5m0s --duration=2h0m0s {pgurl:1-3} returned:
		stderr:
		
		stdout:
		      0.0      0.0      0.0      0.0 orderStatus
		   3m56s        0            0.0            9.7      0.0      0.0      0.0      0.0 payment
		   3m56s        0            0.0            0.1      0.0      0.0      0.0      0.0 stockLevel
		_elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
		   3m57s        0            0.0            1.1      0.0      0.0      0.0      0.0 delivery
		   3m57s        0            0.0            8.1      0.0      0.0      0.0      0.0 newOrder
		   3m57s        0            0.0            1.3      0.0      0.0      0.0      0.0 orderStatus
		   3m57s        0            0.0            9.7      0.0      0.0      0.0      0.0 payment
		   3m57s        0            0.0            0.1      0.0      0.0      0.0      0.0 stockLevel
		Error: error in stockLevel: ERROR: communication error: rpc error: code = Canceled desc = context canceled (SQLSTATE 08006)
		Error:  ssh verbose log retained in /root/.roachprod/debug/ssh_34.74.0.183_2019-05-25T15:02:25Z: exit status 1
		: exit status 1
	cluster.go:1875,tpcc.go:178,tpcc.go:243,test.go:1251: Goexit() was called

@cockroach-teamcity

This comment has been minimized.

@cockroach-teamcity

This comment has been minimized.

@cockroach-teamcity

This comment has been minimized.

@tbg
Copy link
Member

tbg commented Jun 4, 2019

Minimized comments above addressed by #38022.

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/5a88de2233e1405c0553f2d5380fd24218fac3d2

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=tpcc/headroom/n4cpu16 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1324173&tab=buildLog

The test failed on branch=release-19.1, cloud=aws:
	cluster.go:1513,tpcc.go:154,tpcc.go:156,tpcc.go:240,test.go:1248: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1324173-tpcc-headroom-n4cpu16:4 -- ./workload fixtures import tpcc --warehouses=1470  {pgurl:1} returned:
		stderr:
		
		stdout:
		I190605 08:07:39.588147 1 ccl/workloadccl/cliccl/fixtures.go:324  starting import of 9 tables
		Error: importing fixture: importing table district: pq: expected tpcc version "2.1.0" but got "2.0.1"
		Error:  ssh verbose log retained in /root/.roachprod/debug/ssh_18.222.30.141_2019-06-05T08:07:38Z: exit status 1
		: exit status 1

tbg added a commit to tbg/cockroach that referenced this issue Jun 5, 2019
craig bot pushed a commit that referenced this issue Jun 5, 2019
38022: roachtest: use 'fixtures import' from 'cockroach workload' r=danhhz a=tbg

This was missed during #37726.

Closes #37488.
Touches #37163.

Release note: None

Co-authored-by: Tobias Schottdorf <tobias.schottdorf@gmail.com>
@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/cbd571c7bf2ffad3514334a410fa7a728b1f5bf0

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=tpcc/headroom/n4cpu16 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1330352&tab=buildLog

The test failed on branch=release-19.1, cloud=gce:
	cluster.go:1513,tpcc.go:169,cluster.go:1851,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1330352-tpcc-headroom-n4cpu16:4 -- ./workload run tpcc --warehouses=875 --histograms=logs/stats.json  --ramp=5m0s --duration=2h0m0s {pgurl:1-3} returned:
		stderr:
		
		stdout:
		  8.8      0.0      0.0      0.0      0.0 newOrder
		   3m23s        0            0.0            1.4      0.0      0.0      0.0      0.0 orderStatus
		   3m23s        0            0.0           10.7      0.0      0.0      0.0      0.0 payment
		   3m23s        0            0.0            0.1      0.0      0.0      0.0      0.0 stockLevel
		   3m24s        0            0.0            1.1      0.0      0.0      0.0      0.0 delivery
		   3m24s        0            0.0            8.7      0.0      0.0      0.0      0.0 newOrder
		   3m24s        0            0.0            1.4      0.0      0.0      0.0      0.0 orderStatus
		   3m24s        0            0.0           10.6      0.0      0.0      0.0      0.0 payment
		   3m24s        0            0.0            0.1      0.0      0.0      0.0      0.0 stockLevel
		Error: error in delivery: ERROR: rpc error: code = Unavailable desc = transport is closing (SQLSTATE XX000)
		Error:  ssh verbose log retained in /root/.roachprod/debug/ssh_34.73.154.246_2019-06-08T15:59:59Z: exit status 1
		: exit status 1
	cluster.go:1872,tpcc.go:179,tpcc.go:244,test.go:1248: Goexit() was called

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/cbd571c7bf2ffad3514334a410fa7a728b1f5bf0

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=tpcc/headroom/n4cpu16 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1335643&tab=buildLog

The test failed on branch=provisional_201906071718_v19.1.2, cloud=gce:
	cluster.go:1513,tpcc.go:169,cluster.go:1851,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1335643-tpcc-headroom-n4cpu16:4 -- ./workload run tpcc --warehouses=875 --histograms=logs/stats.json  --ramp=5m0s --duration=2h0m0s {pgurl:1-3} returned:
		stderr:
		
		stdout:
		 9.2      0.0      0.0      0.0      0.0 newOrder
		   3m22s        0            0.0            1.6      0.0      0.0      0.0      0.0 orderStatus
		   3m22s        0            0.0           12.5      0.0      0.0      0.0      0.0 payment
		   3m22s        0            0.0            0.1      0.0      0.0      0.0      0.0 stockLevel
		   3m23s        0            0.0            1.4      0.0      0.0      0.0      0.0 delivery
		   3m23s        0            0.0            9.2      0.0      0.0      0.0      0.0 newOrder
		   3m23s        0            0.0            1.6      0.0      0.0      0.0      0.0 orderStatus
		   3m23s        0            0.0           12.5      0.0      0.0      0.0      0.0 payment
		   3m23s        0            0.0            0.1      0.0      0.0      0.0      0.0 stockLevel
		Error: error in stockLevel: ERROR: rpc error: code = Unavailable desc = transport is closing (SQLSTATE XX000)
		Error:  ssh verbose log retained in /root/.roachprod/debug/ssh_35.232.32.46_2019-06-12T06:00:25Z: exit status 1
		: exit status 1
	cluster.go:1872,tpcc.go:179,tpcc.go:244,test.go:1248: Goexit() was called

@andreimatei
Copy link
Contributor

The last two failures (on release-19.1 both) show the transport is closing error.
I've looked into the last one. It looks like the cluster became very unhappy at the end. Node 1 has this:

W190612 06:03:38.732203 158527 storage/intentresolver/intent_resolver.go:822  [n1,s1,r51/1:/Table/60{-/1/405/5}] failed to gc transaction record: could not GC completed transaction anchored at /Table/60/1/80/8/0: context canceled

and then nothing for 6 minutes and then

W190612 06:09:03.748990 374 storage/store.go:3704  [n1,s1,r1114/1:/Table/59/1/59{6/7/4…-8/6/1…}] handle raft ready: 3.0s [processed=0]
W190612 06:09:04.121102 159584 storage/replica_write.go:199  [n1,s1,r1428/1:/Table/54/1/31{8/449…-9/489…}] have been waiting 321.54s for proposing command ResolveIntent [/Table/54/1/319/32646/0,/Min),...
This range is likely unavailable.

And then tones of ctx canceled and network errors.

The other nodes have network errors starting around 06:03.

@tbg
Copy link
Member

tbg commented Jun 18, 2019

From the 6:01 goroutine dump at n1

$ panicparse -aggressive dump | grep -A 5 -E '^[0-9]{3,5}:'
202: select [Created by pgwire.(*conn).processCommandsAsync @ conn.go:515]
    intentresolver  intent_resolver.go:919                  (*IntentResolver).ResolveIntents(#1390, #132, *, *, *, *, *, *, 0, 0, ...)
    intentresolver  intent_resolver.go:808                  (*IntentResolver).cleanupFinishedTxnIntents(#1390, #3, *, *, *, *, *, *, *, 0, ...)
    intentresolver  intent_resolver.go:610                  (*IntentResolver).CleanupTxnIntentsAsync.func1(#3, *)
    intentresolver  intent_resolver.go:488                  (*IntentResolver).runAsyncTask(#1390, #3, *, 0x1, *, *, 0x4)
    intentresolver  intent_resolver.go:603                  (*IntentResolver).CleanupTxnIntentsAsync(#1390, #3, *, *, *, 0x1, 0x1, 0x1, 0, *)
--
370: select [Created by grpc.(*Server).serveStreams.func1 @ server.go:678]
    intentresolver  intent_resolver.go:919                  (*IntentResolver).ResolveIntents(#1390, #132, *, *, *, *, *, *, 0, 0, ...)
    intentresolver  intent_resolver.go:808                  (*IntentResolver).cleanupFinishedTxnIntents(#1390, #3, *, *, *, *, *, *, *, 0, ...)
    intentresolver  intent_resolver.go:610                  (*IntentResolver).CleanupTxnIntentsAsync.func1(#3, *)
    intentresolver  intent_resolver.go:488                  (*IntentResolver).runAsyncTask(#1390, #3, *, 0x1, *, *, 0x4)
    intentresolver  intent_resolver.go:603                  (*IntentResolver).CleanupTxnIntentsAsync(#1390, #3, *, *, *, 0x1, 0x1, 0x1, 0, *)
--
552: IO wait [Created by netutil.(*Server).ServeWith @ net.go:136]
    poll            netpoll.go:173                          runtime_pollWait(*, 0x72, *)
    poll            fd_poll_runtime.go:85                   (*pollDesc).wait(*, 0x72, 0xffffffffffffff00, #113, #196)
    poll            fd_poll_runtime.go:90                   (*pollDesc).waitRead(*, *, 0x1000, 0x1000)
    poll            fd_unix.go:169                          (*FD).Read(*, *, 0x1000, 0x1000, 0, 0, 0)
    net             fd_unix.go:202                          (*netFD).Read(*, *, 0x1000, 0x1000, 0, *, 0x895c88)
--
291: sync.Cond.Wait [0~1 minutes] [Created by pgwire.(*conn).processCommandsAsync @ conn.go:515]
    sync            sema.go:510                             runtime_notifyListWait(*, *)
    sync            cond.go:56                              (*Cond).Wait(*)
    sql             conn_io.go:429                          (*StmtBuf).curCmd(*, 0, 0, 0, 0, 0)
    sql             conn_executor.go:1153                   (*connExecutor).run(*, #132, *, #277, 0x5400, 0x15000, #278, *, 0, 0)
    sql             conn_executor.go:436                    (*Server).ServeConn(#1220, #132, *, *, 0x5400, 0x15000, #278, *, 0, 0)
--
998: select [Created by stop.(*Stopper).RunLimitedAsyncTask @ stopper.go:379]
    intentresolver  intent_resolver.go:919                  (*IntentResolver).ResolveIntents(#1390, #132, *, *, *, *, *, *, 0, 0, ...)
    intentresolver  intent_resolver.go:808                  (*IntentResolver).cleanupFinishedTxnIntents(#1390, #133, #266, *, *, *, *, *, *, 0, ...)
    intentresolver  intent_resolver.go:610                  (*IntentResolver).CleanupTxnIntentsAsync.func1(#133, #266)
    stop            stopper.go:385                          (*Stopper).RunLimitedAsyncTask.func1(#4, #133, #266, #88, 0x2a, #1339, 0, 0, *)

There's also a heap profile (on n1) at 6:03 that shows a lots of memory tied up in ScanResponses:

image

This line

have been waiting 321.54s for proposing command ResolveIntent

is actually pretty interesting because it's supposed to only ever print times that are close to ~10s:

case <-slowTimer.C:
slowTimer.Read = true
log.Warningf(ctx, `have been waiting %.2fs for proposing command %s.

slowTimer := timeutil.NewTimer()
defer slowTimer.Stop()
slowTimer.Reset(base.SlowRequestThreshold)
tBegin := timeutil.Now()

The fact that it took more than 30x that does suggest that something horrible happened to the machine. The CPU seems to be working pretty hard in the minutes leading up to the failure

12 06:03:32.819717 472 server/status/runtime.go:500 [n1] runtime stats: 13 GiB RSS, 4220 goroutines, 7.6 GiB/223 MiB/8.3 GiB GO alloc/idle/total, 3.6 GiB/4.3 GiB CGO alloc/total, 10351.9 CGO/sec, 1513.2/12.8 %(u/s)time, 0.0 %gc (2x), 396 MiB/38 MiB (r/w)net

"1513.2% utime" basically means the 16 cpus are maxing out. But still, I think this is already part of the downfall here, and I can't imagine overloading a machine to the point where lots of goroutines don't get scheduled for 300s. There's nothing in dmesg, but I found this in sysctl:

Jun 12 06:05:28 teamcity-1335643-tpcc-headroom-n4cpu16-0001 systemd-udevd[659]: seq 5381 '/kernel/slab/radix_tree_node/cgroup/radix_tree_node(214:init.scope)' is taking a long time
Jun 12 06:07:27 teamcity-1335643-tpcc-headroom-n4cpu16-0001 systemd-udevd[659]: seq 5381 '/kernel/slab/radix_tree_node/cgroup/radix_tree_node(214:init.scope)' killed
Jun 12 06:07:33 teamcity-1335643-tpcc-headroom-n4cpu16-0001 systemd-udevd[659]: worker [4630] terminated by signal 9 (Killed)
Jun 12 06:07:34 teamcity-1335643-tpcc-headroom-n4cpu16-0001 systemd-udevd[659]: worker [4630] failed while handling '/kernel/slab/radix_tree_node/cgroup/radix_tree_node(214:init.scope)'

I tried googling for that message but with zero success. The kernel here is

Jun 12 05:05:24 ubuntu kernel: Linux version 4.15.0-1026-gcp (buildd@lgw01-amd64-037) (gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.10)) #27~16.04.1-Ubuntu SMP Fri Dec 7 09:59:47 UTC 2018 (Ubuntu 4.15.0-1026.27~16.04.1-gcp 4.15.18)

@tbg
Copy link
Member

tbg commented Jun 18, 2019

^- anyway, this udev message doesn't seem like something we could trigger with a bug in crdb. Maybe the vm was "live" migrated?

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/fb50abbb49dc22d404100c925512364f330fb89b

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=tpcc/headroom/n4cpu16 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1344398&tab=buildLog

The test failed on branch=release-19.1, cloud=gce:
	cluster.go:1513,tpcc.go:169,cluster.go:1851,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1344398-tpcc-headroom-n4cpu16:4 -- ./workload run tpcc --warehouses=875 --histograms=logs/stats.json  --ramp=5m0s --duration=2h0m0s {pgurl:1-3} returned:
		stderr:
		
		stdout:
		  17.7      0.0      0.0      0.0      0.0 newOrder
		   1m42s        0            1.0            3.0    906.0    906.0    906.0    906.0 orderStatus
		   1m42s        0           16.0           23.2  10737.4  12884.9  32212.3  32212.3 payment
		   1m42s        0            0.0            0.3      0.0      0.0      0.0      0.0 stockLevel
		   1m43s        0            0.0            2.2      0.0      0.0      0.0      0.0 delivery
		   1m43s        0            0.0           17.6      0.0      0.0      0.0      0.0 newOrder
		   1m43s        0            0.0            2.9      0.0      0.0      0.0      0.0 orderStatus
		   1m43s        0           11.0           23.0  10737.4  11811.2  11811.2  11811.2 payment
		   1m43s        0            0.0            0.3      0.0      0.0      0.0      0.0 stockLevel
		Error: error in delivery: ERROR: rpc error: code = Unavailable desc = transport is closing (SQLSTATE XX000)
		Error:  ssh verbose log retained in /root/.roachprod/debug/ssh_35.229.32.40_2019-06-18T16:03:13Z: exit status 1
		: exit status 1
	cluster.go:1872,tpcc.go:179,tpcc.go:244,test.go:1248: Goexit() was called

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/f1c9693da739fa5fc2c94d4d978fadd6710d17da

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=tpcc/headroom/n4cpu16 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1371441&tab=buildLog

The test failed on branch=release-19.1, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20190703-1371441/tpcc/headroom/n4cpu16/run_1
	cluster.go:1724,tpcc.go:173,cluster.go:2067,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1562134406-13-n4cpu16:4 -- ./workload run tpcc --warehouses=875 --histograms=logs/stats.json  --ramp=5m0s --duration=2h0m0s {pgurl:1-3} returned:
		stderr:
		
		stdout:
		    0.0      0.0      0.0 newOrder
		   1m54s        0            0.0            2.8      0.0      0.0      0.0      0.0 orderStatus
		   1m54s        0            0.0           25.9      0.0      0.0      0.0      0.0 payment
		   1m54s        0            0.0            0.3      0.0      0.0      0.0      0.0 stockLevel
		   1m55s        0            0.0            2.4      0.0      0.0      0.0      0.0 delivery
		   1m55s        0            0.0           14.9      0.0      0.0      0.0      0.0 newOrder
		   1m55s        0            0.0            2.8      0.0      0.0      0.0      0.0 orderStatus
		   1m55s        0            0.0           25.7      0.0      0.0      0.0      0.0 payment
		   1m55s        0            0.0            0.3      0.0      0.0      0.0      0.0 stockLevel
		Error: error in stockLevel: ERROR: communication error: rpc error: code = Canceled desc = context canceled (SQLSTATE 08006)
		Error:  ssh verbose log retained in /root/.roachprod/debug/ssh_35.237.239.12_2019-07-03T10:24:22Z: exit status 1
		: exit status 1
	cluster.go:2088,tpcc.go:183,tpcc.go:248,test_runner.go:680: Goexit() was called

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/1ca35fc4a0e2665e7f6efd945e65a0db97984fa7

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=tpcc/headroom/n4cpu16 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1396096&tab=buildLog

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20190719-1396096/tpcc/headroom/n4cpu16/run_1
	cluster.go:1726,tpcc.go:173,cluster.go:2069,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1563517204-14-n4cpu16:4 -- ./workload run tpcc --warehouses=875 --histograms=perf/stats.json  --ramp=5m0s --duration=2h0m0s {pgurl:1-3} returned:
		stderr:
		
		stdout:
		0.3      0.0      0.0      0.0      0.0 delivery
		  27m33s        0            0.0            2.6      0.0      0.0      0.0      0.0 newOrder
		  27m33s        0            0.0            0.3      0.0      0.0      0.0      0.0 orderStatus
		  27m33s        0            0.0            2.8      0.0      0.0      0.0      0.0 payment
		  27m33s        0            0.0            0.1      0.0      0.0      0.0      0.0 stockLevel
		  27m34s        0            0.0            0.3      0.0      0.0      0.0      0.0 delivery
		  27m34s        0            0.0            2.6      0.0      0.0      0.0      0.0 newOrder
		  27m34s        0            0.0            0.3      0.0      0.0      0.0      0.0 orderStatus
		  27m34s        0            0.0            2.8      0.0      0.0      0.0      0.0 payment
		  27m34s        0            0.0            0.1      0.0      0.0      0.0      0.0 stockLevel
		Error: error in delivery: ERROR: rpc error: code = Unavailable desc = transport is closing (SQLSTATE XXUUU)
		Error:  exit status 1
		: exit status 1
	cluster.go:2090,tpcc.go:183,tpcc.go:248,test_runner.go:691: Goexit() was called

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/1ad0ecc8cbddf82c9fedb5a5c5e533e72a657ff7

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=tpcc/headroom/n4cpu16 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1399000&tab=buildLog

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20190722-1399000/tpcc/headroom/n4cpu16/run_1
	cluster.go:1726,tpcc.go:173,cluster.go:2069,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1563776264-12-n4cpu16:4 -- ./workload run tpcc --warehouses=875 --histograms=perf/stats.json  --ramp=5m0s --duration=2h0m0s {pgurl:1-3} returned:
		stderr:
		
		stdout:
		   0.0      0.0      0.0      0.0 delivery
		  25m16s        0            1.0            3.3 103079.2 103079.2 103079.2 103079.2 newOrder
		  25m16s        0            2.0            0.3 103079.2 103079.2 103079.2 103079.2 orderStatus
		  25m16s        0            2.0            3.3 103079.2 103079.2 103079.2 103079.2 payment
		  25m16s        0            0.0            0.1      0.0      0.0      0.0      0.0 stockLevel
		_elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
		  25m17s        0            0.0            0.3      0.0      0.0      0.0      0.0 delivery
		  25m17s        0            5.0            3.3 103079.2 103079.2 103079.2 103079.2 newOrder
		  25m17s        0            1.0            0.3 103079.2 103079.2 103079.2 103079.2 orderStatus
		  25m17s        0            3.0            3.3 103079.2 103079.2 103079.2 103079.2 payment
		  25m17s        0            0.0            0.1      0.0      0.0      0.0      0.0 stockLevel
		Error: error in newOrder: EOF
		Error:  exit status 1
		: exit status 1
	cluster.go:2090,tpcc.go:183,tpcc.go:248,test_runner.go:691: unexpected node event: 3: dead

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/7111a67b2ea3a19c2f312f8d214b8823f431cac0

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=tpcc/headroom/n4cpu16 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1400942&tab=buildLog

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20190723-1400942/tpcc/headroom/n4cpu16/run_1
	cluster.go:1726,tpcc.go:173,cluster.go:2069,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1563862417-14-n4cpu16:4 -- ./workload run tpcc --warehouses=875 --histograms=perf/stats.json  --ramp=5m0s --duration=2h0m0s {pgurl:1-3} returned:
		stderr:
		
		stdout:
		0.3      0.0      0.0      0.0      0.0 delivery
		  25m14s        0            1.0            2.8 103079.2 103079.2 103079.2 103079.2 newOrder
		  25m14s        0            0.0            0.3      0.0      0.0      0.0      0.0 orderStatus
		  25m14s        0            0.0            2.8      0.0      0.0      0.0      0.0 payment
		  25m14s        0            0.0            0.1      0.0      0.0      0.0      0.0 stockLevel
		  25m15s        0            0.0            0.3      0.0      0.0      0.0      0.0 delivery
		  25m15s        0            0.0            2.8      0.0      0.0      0.0      0.0 newOrder
		  25m15s        0            0.0            0.3      0.0      0.0      0.0      0.0 orderStatus
		  25m15s        0            1.0            2.8 103079.2 103079.2 103079.2 103079.2 payment
		  25m15s        0            2.0            0.1 103079.2 103079.2 103079.2 103079.2 stockLevel
		Error: error in delivery: ERROR: rpc error: code = Unavailable desc = transport is closing (SQLSTATE XXUUU)
		Error:  exit status 1
		: exit status 1
	cluster.go:2090,tpcc.go:183,tpcc.go:248,test_runner.go:691: Goexit() was called

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/86eab2ff0a1a4c2d9b5f7e7a45deda74c98c6c37

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=tpcc/headroom/n4cpu16 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1402541&tab=buildLog

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20190724-1402541/tpcc/headroom/n4cpu16/run_1
	cluster.go:1726,tpcc.go:173,cluster.go:2069,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1563949032-14-n4cpu16:4 -- ./workload run tpcc --warehouses=875 --histograms=perf/stats.json  --ramp=5m0s --duration=2h0m0s {pgurl:1-3} returned:
		stderr:
		
		stdout:
		  0.0      0.0      0.0 delivery
		  25m39s        0            0.0            3.3      0.0      0.0      0.0      0.0 newOrder
		  25m39s        0            0.0            0.4      0.0      0.0      0.0      0.0 orderStatus
		  25m39s        0            0.0            3.3      0.0      0.0      0.0      0.0 payment
		  25m39s        0            0.0            0.1      0.0      0.0      0.0      0.0 stockLevel
		  25m40s        0            0.0            0.3      0.0      0.0      0.0      0.0 delivery
		  25m40s        0            0.0            3.3      0.0      0.0      0.0      0.0 newOrder
		  25m40s        0            0.0            0.4      0.0      0.0      0.0      0.0 orderStatus
		  25m40s        0            0.0            3.3      0.0      0.0      0.0      0.0 payment
		  25m40s        0            0.0            0.1      0.0      0.0      0.0      0.0 stockLevel
		Error: error in stockLevel: ERROR: communication error: rpc error: code = Canceled desc = context canceled (SQLSTATE 08006)
		Error:  exit status 1
		: exit status 1
	cluster.go:2090,tpcc.go:183,tpcc.go:248,test_runner.go:691: Goexit() was called

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/26edea51118a0e16b61748c08068bfa6f76543ca

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=tpcc/headroom/n4cpu16 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1404886&tab=buildLog

The test failed on branch=provisional_201907241708_v19.2.0-alpha.20190729, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20190725-1404886/tpcc/headroom/n4cpu16/run_1
	cluster.go:1726,tpcc.go:173,cluster.go:2069,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1564034590-14-n4cpu16:4 -- ./workload run tpcc --warehouses=875 --histograms=perf/stats.json  --ramp=5m0s --duration=2h0m0s {pgurl:1-3} returned:
		stderr:
		
		stdout:
		0.3      0.0      0.0      0.0      0.0 delivery
		   27m2s        0            0.0            3.1      0.0      0.0      0.0      0.0 newOrder
		   27m2s        0            0.0            0.3      0.0      0.0      0.0      0.0 orderStatus
		   27m2s        0            0.0            3.3      0.0      0.0      0.0      0.0 payment
		   27m2s        0            0.0            0.1      0.0      0.0      0.0      0.0 stockLevel
		   27m3s        0            0.0            0.3      0.0      0.0      0.0      0.0 delivery
		   27m3s        0            0.0            3.1      0.0      0.0      0.0      0.0 newOrder
		   27m3s        0            0.0            0.3      0.0      0.0      0.0      0.0 orderStatus
		   27m3s        0            0.0            3.3      0.0      0.0      0.0      0.0 payment
		   27m3s        0            0.0            0.1      0.0      0.0      0.0      0.0 stockLevel
		Error: error in delivery: ERROR: rpc error: code = Unavailable desc = transport is closing (SQLSTATE XXUUU)
		Error:  exit status 1
		: exit status 1
	cluster.go:2090,tpcc.go:183,tpcc.go:248,test_runner.go:691: Goexit() was called

@nvanbenschoten
Copy link
Member

These recent failures have been identified as #39103.

@nvanbenschoten
Copy link
Member

Closing because the last chunk of failures were fixed by #39106 (see #39103). It's possible that a lot more of these were due to the same issue, as their logs do seem similar.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot.
Projects
None yet
Development

No branches or pull requests

4 participants