teamcity: failed test: TestImportCockroachDump #38242

Closed

cockroach-teamcity opened this issue Jun 17, 2019 · 4 comments
@cockroach-teamcity (Member) commented Jun 17, 2019:

The following tests appear to have failed on master (testrace): TestImportCockroachDump

You may want to check for open issues.

#1344165:

TestImportCockroachDump
...store.go:3740  [n2,s2,r2/3:/System/NodeLiveness{-Max}] handle raft ready: 180.7s [processed=0]
W190617 21:13:54.358821 146519 gossip/client.go:120  [n1] failed to start gossip client to 127.0.0.1:36969: initial connection heartbeat failed: operation "rpc heartbeat" timed out after 6s
W190617 21:13:54.677255 144210 storage/node_liveness.go:486  [n3,liveness-hb] failed node liveness heartbeat: operation "node liveness heartbeat" timed out after 4.5s
I190617 21:13:54.745865 143815 gossip/gossip.go:561  [n2] gossip status (stalled, 3 nodes)
gossip client (2/3 cur/max conns)
  0: 127.0.0.1:43859 (58s: infos 0/0 sent/received, bytes 0B/0B sent/received)
  0: 127.0.0.1:37663 (49s: infos 0/0 sent/received, bytes 0B/0B sent/received)
gossip server (0/3 cur/max conns, infos 12/63 sent/received, bytes 2374B/19936B sent/received)
W190617 21:13:54.750188 143464 server/node.go:816  [n2,summaries] health alerts detected: {Alerts:[{StoreID:0 Category:METRICS Description:liveness.heartbeatfailures Value:1 XXX_NoUnkeyedLiteral:{} XXX_sizecache:0} {StoreID:2 Category:METRICS Description:requests.slow.raft Value:1 XXX_NoUnkeyedLiteral:{} XXX_sizecache:0} {StoreID:2 Category:METRICS Description:requests.slow.lease Value:2 XXX_NoUnkeyedLiteral:{} XXX_sizecache:0}] XXX_NoUnkeyedLiteral:{} XXX_sizecache:0}
I190617 21:13:54.800042 144201 gossip/gossip.go:561  [n3] gossip status (stalled, 3 nodes)
gossip client (1/3 cur/max conns)
  0: 127.0.0.1:37663 (50s: infos 0/0 sent/received, bytes 0B/0B sent/received)
gossip server (0/3 cur/max conns, infos 13/58 sent/received, bytes 2620B/18730B sent/received)
W190617 21:13:54.813079 144283 server/node.go:816  [n3,summaries] health alerts detected: {Alerts:[{StoreID:3 Category:METRICS Description:requests.slow.raft Value:1 XXX_NoUnkeyedLiteral:{} XXX_sizecache:0} {StoreID:3 Category:METRICS Description:requests.slow.lease Value:2 XXX_NoUnkeyedLiteral:{} XXX_sizecache:0} {StoreID:0 Category:METRICS Description:liveness.heartbeatfailures Value:1 XXX_NoUnkeyedLiteral:{} XXX_sizecache:0}] XXX_NoUnkeyedLiteral:{} XXX_sizecache:0}
W190617 21:13:54.858122 143824 storage/node_liveness.go:486  [n2,liveness-hb] failed node liveness heartbeat: operation "node liveness heartbeat" timed out after 4.5s
W190617 21:13:54.910818 143058 storage/store_rebalancer.go:221  [n1,s1,store-rebalancer] StorePool missing descriptor for local store
I190617 21:13:54.932765 142848 gossip/gossip.go:561  [n1] gossip status (stalled, 3 nodes)
gossip client (0/3 cur/max conns)
gossip server (0/3 cur/max conns, infos 121/25 sent/received, bytes 38666B/5036B sent/received)
    testcluster.go:607: condition failed to evaluate within 45s: rpc error: code = Unavailable desc = transport is closing
        goroutine 142934 [running]:
        runtime/debug.Stack(0xa7a358200, 0xc009a96a80, 0x51c2ba0)
        	/usr/local/go/src/runtime/debug/stack.go:24 +0xab
        github.com/cockroachdb/cockroach/pkg/testutils.SucceedsSoon(0x528d0e0, 0xc00b334f00, 0xc007120660)
        	/go/src/github.com/cockroachdb/cockroach/pkg/testutils/soon.go:47 +0x181
        github.com/cockroachdb/cockroach/pkg/testutils/testcluster.(*TestCluster).WaitForNodeStatuses(0xc002533c70, 0x528d0e0, 0xc00b334f00)
        	/go/src/github.com/cockroachdb/cockroach/pkg/testutils/testcluster/testcluster.go:607 +0xa7
        github.com/cockroachdb/cockroach/pkg/testutils/testcluster.StartTestCluster(0x528d0e0, 0xc00b334f00, 0x3, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
        	/go/src/github.com/cockroachdb/cockroach/pkg/testutils/testcluster/testcluster.go:185 +0xbb7
        github.com/cockroachdb/cockroach/pkg/ccl/importccl.TestImportCockroachDump(0xc00b334f00)
        	/go/src/github.com/cockroachdb/cockroach/pkg/ccl/importccl/import_stmt_test.go:2346 +0x1cf
        testing.tRunner(0xc00b334f00, 0x46d6768)
        	/usr/local/go/src/testing/testing.go:865 +0x164
        created by testing.(*T).Run
        	/usr/local/go/src/testing/testing.go:916 +0x65b




Please assign, take a look and update the issue accordingly.
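
For context on the testcluster.go:607 failure above: the stack trace shows WaitForNodeStatuses giving up inside a SucceedsSoon-style wait, i.e. a condition retried with backoff until it succeeds or the 45s budget runs out. A minimal sketch of that retry shape (simplified, not the actual testutils implementation):

```go
package main

import (
	"fmt"
	"time"
)

// succeedsSoon retries cond with exponential backoff until it returns nil
// or the timeout elapses. The real testutils.SucceedsSoon differs in detail
// (it uses the retry package and dumps a goroutine stack on failure).
func succeedsSoon(timeout time.Duration, cond func() error) error {
	deadline := time.Now().Add(timeout)
	wait := 5 * time.Millisecond
	for {
		err := cond()
		if err == nil {
			return nil
		}
		if time.Now().After(deadline) {
			return fmt.Errorf("condition failed to evaluate within %s: %s", timeout, err)
		}
		time.Sleep(wait)
		if wait < time.Second {
			wait *= 2
		}
	}
}

func main() {
	// A condition that never succeeds, standing in for WaitForNodeStatuses
	// polling a cluster whose gossip network is stalled.
	err := succeedsSoon(100*time.Millisecond, func() error {
		return fmt.Errorf("rpc error: code = Unavailable desc = transport is closing")
	})
	fmt.Println(err)
}
```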

cockroach-teamcity added this to the 19.2 milestone Jun 17, 2019
cockroach-teamcity added the C-test-failure (Broken test, automatically or manually discovered) and O-robot (Originated from a bot) labels Jun 17, 2019
@tbg (Member) commented Jun 19, 2019:

Hmm. Gossip is stalled, and lots of things are failing, but this stands out:

W190617 21:13:54.351319 144146 storage/store.go:3740 [n3,s3,r2/2:/System/NodeLiveness{-Max}] handle raft ready: 179.8s [processed=0]

That's brutal. Perhaps similar to the failure mode in #38246? Or even #37163 (comment) where nothing happens on one node for a full six minutes.

@tbg (Member) commented Jun 19, 2019:

OK, in the log we actually do see long stalls like that. Look at these runtime stats (that's all of them, even though the test ran for minutes and they ought to be logged every 10s by each node):

I190617 21:07:54.793034 142851 server/status/runtime.go:500  [n1] runtime stats: 2.5 GiB RSS, 646 goroutines, 348 MiB/67 MiB/471 MiB GO alloc/idle/total(stale), 83 MiB/122 MiB CGO alloc/total, 0.0 CGO/sec, 0.0/0.0 %(u/s)time, 0.0 %gc (168x), 2.0 MiB/2.0 MiB (r/w)net
I190617 21:10:51.677343 143818 server/status/runtime.go:500  [n2] runtime stats: 2.5 GiB RSS, 684 goroutines, 0 B/0 B/0 B GO alloc/idle/total(stale), 75 MiB/118 MiB CGO alloc/total, 0.0 CGO/sec, 0.0/0.0 %(u/s)time, 0.0 %gc (168x), 1.6 MiB/1.6 MiB (r/w)net
I190617 21:10:51.677548 144204 server/status/runtime.go:500  [n3] runtime stats: 2.5 GiB RSS, 656 goroutines, 0 B/0 B/0 B GO alloc/idle/total(stale), 75 MiB/118 MiB CGO alloc/total, 0.0 CGO/sec, 0.0/0.0 %(u/s)time, 0.0 %gc (168x), 1.4 MiB/1.4 MiB (r/w)net
I190617 21:10:51.983534 142851 server/status/runtime.go:500  [n1] runtime stats: 2.5 GiB RSS, 599 goroutines, 348 MiB/67 MiB/471 MiB GO alloc/idle/total(stale), 75 MiB/118 MiB CGO alloc/total, 3.9 CGO/sec, 34.4/5.0 %(u/s)time, 0.0 %gc (0x), 109 KiB/109 KiB (r/w)net
I190617 21:13:54.075773 143818 server/status/runtime.go:500  [n2] runtime stats: 2.5 GiB RSS, 542 goroutines, 377 MiB/44 MiB/471 MiB GO alloc/idle/total(stale), 76 MiB/121 MiB CGO alloc/total, 5.9 CGO/sec, 2.3/0.8 %(u/s)time, 0.3 %gc (1x), 17 KiB/17 KiB (r/w)net
I190617 21:13:54.168200 142851 server/status/runtime.go:500  [n1] runtime stats: 2.5 GiB RSS, 567 goroutines, 377 MiB/44 MiB/471 MiB GO alloc/idle/total(stale), 76 MiB/121 MiB CGO alloc/total, 7.3 CGO/sec, 7.3/1.6 %(u/s)time, 0.3 %gc (1x), 18 KiB/18 KiB (r/w)net
I190617 21:13:54.168826 144204 server/status/runtime.go:500  [n3] runtime stats: 2.5 GiB RSS, 533 goroutines, 377 MiB/44 MiB/471 MiB GO alloc/idle/total(stale), 76 MiB/121 MiB CGO alloc/total, 5.9 CGO/sec, 2.3/0.8 %(u/s)time, 0.3 %gc (1x), 17 KiB/17 KiB (r/w)net

I wonder what we're dealing with here. Taking into account these other issues I'm starting to suspect something fundamental. Perhaps the go1.12 bump or some Rocks patch that we applied? cc @ajkr / @petermattis for ideas on what to look for here and in the issues linked above.
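
The stalls tbg describes are visible as multi-minute gaps between consecutive runtime stats lines for the same node, where the expected cadence is 10s. A throwaway helper along these lines (hypothetical, not part of the CockroachDB tree) could be fed the test log on stdin to surface those gaps per node:

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"regexp"
	"time"
)

// Matches lines such as:
// I190617 21:07:54.793034 142851 server/status/runtime.go:500  [n1] runtime stats: ...
var statsRE = regexp.MustCompile(
	`^I(\d{6} \d{2}:\d{2}:\d{2}\.\d{6}) \d+ server/status/runtime\.go:\d+\s+\[(n\d+)\] runtime stats:`)

func main() {
	last := map[string]time.Time{} // node -> timestamp of its previous sample
	sc := bufio.NewScanner(os.Stdin)
	sc.Buffer(make([]byte, 1024*1024), 1024*1024) // log lines can be long
	for sc.Scan() {
		m := statsRE.FindStringSubmatch(sc.Text())
		if m == nil {
			continue
		}
		ts, err := time.Parse("060102 15:04:05.000000", m[1])
		if err != nil {
			continue
		}
		node := m[2]
		if prev, ok := last[node]; ok {
			fmt.Printf("%s: %v between runtime stats samples\n", node, ts.Sub(prev))
		}
		last[node] = ts
	}
}
```

Run against the excerpt above, it would report roughly a 3-minute gap for n2 and n3 between 21:10:51 and 21:13:54, matching the claim that entire nodes went quiet.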

@petermattis (Collaborator) commented:

Taking into account these other issues I'm starting to suspect something fundamental. Perhaps the go1.12 bump or some Rocks patch that we applied?

Nothing immediate comes to mind. Certainly, a Go version bump is risky, which is why we did it before the end of the cycle, but minutes of inactivity on a node is not likely to be a Go bug. Perhaps we're pushing up against a memory limit. 2.5 GiB RSS seems pretty sizable for a test.

@tbg (Member) commented Jun 19, 2019:

but minutes of inactivity on a node is not likely to be a Go bug.

We're talking minutes of inactivity across all goroutines in the running test binary. PS the 2.5GiB is process-wide. Have you taken a look at #37163 (comment)?

tbg closed this as completed Dec 9, 2019