roachtest: clearrange/checks=true/rangeTs=false failed #104698

Closed
cockroach-teamcity opened this issue Jun 10, 2023 · 2 comments · Fixed by #104699
Labels
A-kv-replication Relating to Raft, consensus, and coordination. branch-master Failures and bugs on the master branch. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-kv KV Team

Comments

@cockroach-teamcity

cockroach-teamcity commented Jun 10, 2023

roachtest.clearrange/checks=true/rangeTs=false failed with artifacts on master @ 61806f42e4c833b863ca9c2f62ce34918f1c8277:

test artifacts and logs in: /artifacts/clearrange/checks=true/rangeTs=false/run_1
(cluster.go:2176).Start: ~ COCKROACH_CONNECT_TIMEOUT=1200 ./cockroach sql --url 'postgres://root@localhost:26257?sslmode=disable' -e "CREATE SCHEDULE IF NOT EXISTS test_only_backup FOR BACKUP INTO 'gs://cockroachdb-backup-testing/roachprod-scheduled-backups/teamcity-10477887-1686375972-52-n10cpu16/1686398519467235930?AUTH=implicit' RECURRING '*/15 * * * *' FULL BACKUP '@hourly' WITH SCHEDULE OPTIONS first_run = 'now'"
ERROR: cannot dial server.
Is the server running?
If the server is running, check --host client-side and --advertise server-side.
timeout: context deadline exceeded
Failed running "sql": COMMAND_PROBLEM: exit status 1
(test_runner.go:1154).func1: 3 dead node(s) detected

Parameters: ROACHTEST_arch=amd64 , ROACHTEST_cloud=gce , ROACHTEST_cpu=16 , ROACHTEST_encrypted=false , ROACHTEST_fs=ext4 , ROACHTEST_localSSD=true , ROACHTEST_ssd=0

/cc @cockroachdb/storage

Jira issue: CRDB-28676

@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Jun 10, 2023
@cockroach-teamcity cockroach-teamcity added this to the 23.1 milestone Jun 10, 2023
@blathers-crl blathers-crl bot added the T-storage Storage Team label Jun 10, 2023
@irfansharif

W230610 12:01:38.384294 414 kv/kvserver/raft_transport.go:943 ⋮ [T1,n2] 42  while processing outgoing Raft queue to node 1: send msg error: ‹EOF›:
F230610 12:01:38.379474 645 kv/kvserver/raft_transport.go:911 ⋮ [T1,n2] 39  queue for n3 does not exist
F230610 12:01:38.379474 645 kv/kvserver/raft_transport.go:911 ⋮ [T1,n2] 39 !goroutine 645 [running]:
F230610 12:01:38.379474 645 kv/kvserver/raft_transport.go:911 ⋮ [T1,n2] 39 !github.com/cockroachdb/cockroach/pkg/util/allstacks.GetWithBuf({0x0?, 0xc001076000?, 0x47203f?})
F230610 12:01:38.379474 645 kv/kvserver/raft_transport.go:911 ⋮ [T1,n2] 39 !	github.com/cockroachdb/cockroach/pkg/util/allstacks/allstacks.go:32 +0x85
F230610 12:01:38.379474 645 kv/kvserver/raft_transport.go:911 ⋮ [T1,n2] 39 !github.com/cockroachdb/cockroach/pkg/util/allstacks.Get(...)
F230610 12:01:38.379474 645 kv/kvserver/raft_transport.go:911 ⋮ [T1,n2] 39 !	github.com/cockroachdb/cockroach/pkg/util/allstacks/allstacks.go:19
F230610 12:01:38.379474 645 kv/kvserver/raft_transport.go:911 ⋮ [T1,n2] 39 !github.com/cockroachdb/cockroach/pkg/util/log.(*loggerT).outputLogEntry(0xc001374fa0, {{{0xc003336ae0, 0x24}, {0x6082869, 0x1}, {0x6082868, 0x1}, {0x6082869, 0x1}}, 0x17674a7fd5a92531, ...})
F230610 12:01:38.379474 645 kv/kvserver/raft_transport.go:911 ⋮ [T1,n2] 39 !	github.com/cockroachdb/cockroach/pkg/util/log/clog.go:263 +0x99
F230610 12:01:38.379474 645 kv/kvserver/raft_transport.go:911 ⋮ [T1,n2] 39 !github.com/cockroachdb/cockroach/pkg/util/log.logfDepthInternal({0x7633950, 0xc00583b1d0}, 0x2, 0x4, 0x0, 0x0?, {0x604631c, 0x1c}, {0xc0032dfd30, 0x1, ...})
F230610 12:01:38.379474 645 kv/kvserver/raft_transport.go:911 ⋮ [T1,n2] 39 !	github.com/cockroachdb/cockroach/pkg/util/log/channels.go:106 +0x645
F230610 12:01:38.379474 645 kv/kvserver/raft_transport.go:911 ⋮ [T1,n2] 39 !github.com/cockroachdb/cockroach/pkg/util/log.logfDepth(...)
F230610 12:01:38.379474 645 kv/kvserver/raft_transport.go:911 ⋮ [T1,n2] 39 !	github.com/cockroachdb/cockroach/pkg/util/log/channels.go:39
F230610 12:01:38.379474 645 kv/kvserver/raft_transport.go:911 ⋮ [T1,n2] 39 !github.com/cockroachdb/cockroach/pkg/util/log.Fatalf(...)
F230610 12:01:38.379474 645 kv/kvserver/raft_transport.go:911 ⋮ [T1,n2] 39 !	github.com/cockroachdb/cockroach/bazel-out/k8-opt/bin/pkg/util/log/log_channels_generated.go:848
F230610 12:01:38.379474 645 kv/kvserver/raft_transport.go:911 ⋮ [T1,n2] 39 !github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*RaftTransport).startProcessNewQueue.func2({0x7633950, 0xc00583b1d0})
F230610 12:01:38.379474 645 kv/kvserver/raft_transport.go:911 ⋮ [T1,n2] 39 !	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/raft_transport.go:911 +0x10d
F230610 12:01:38.379474 645 kv/kvserver/raft_transport.go:911 ⋮ [T1,n2] 39 !runtime/pprof.Do({0x7633950?, 0xc00583b170?}, {{0xc00543a040?, 0xc0034f5eb8?, 0xc61866?}}, 0xc00543a000)
F230610 12:01:38.379474 645 kv/kvserver/raft_transport.go:911 ⋮ [T1,n2] 39 !	GOROOT/src/runtime/pprof/runtime.go:40 +0xa3
F230610 12:01:38.379474 645 kv/kvserver/raft_transport.go:911 ⋮ [T1,n2] 39 !github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*RaftTransport).startProcessNewQueue.func3({0x7633950, 0xc00583b170})

Same as #104696 (comment).

@irfansharif irfansharif self-assigned this Jun 10, 2023
@blathers-crl blathers-crl bot added the T-kv KV Team label Jun 10, 2023
@irfansharif irfansharif removed the T-storage Storage Team label Jun 10, 2023
craig bot pushed a commit that referenced this issue Jun 10, 2023
104699: kvserver: fix clearrange/* tests r=irfansharif a=irfansharif

Fixes #104696.
Fixes #104697.
Fixes #104698.
Part of #98703.

In 072c16d (added as part of #95637) we reworked the locking structure around the RaftTransport's per-RPC class level send queues. When new send queues are instantiated or old ones deleted, we now also maintain the kvflowcontrol connection tracker, so such maintenance needs to happen while holding a kvflowcontrol mutex. When rebasing #95637 onto master, we accidentally included earlier queue deletion code that ran without holding the appropriate mutex. As a result, queue deletions happened twice, which made it possible to hit a RaftTransport assertion that expects the right send queue to already exist.

Specifically, the following sequence was possible:
- `(*RaftTransport).SendAsync` is invoked, observes no queue for `<nodeid,class>`, creates it, and tracks it in the queues map.
  - It invokes an async worker W1 to process that send queue through `(*RaftTransport).startProcessNewQueue`. The async worker is responsible for clearing the tracked queue in the queues map once done.
- W1 expects to find the tracked queue in the queues map, finds it, proceeds.
- W1 is done processing. On its way out, W1 clears `<nodeid,class>` from the queues map the first time.
- `(*RaftTransport).SendAsync` is invoked by another goroutine, observes no queue for `<nodeid,class>`, creates it, and tracks it in the queues map.
  - It invokes an async worker W2 to process that send queue through `(*RaftTransport).startProcessNewQueue`. The async worker is responsible for clearing the tracked queue in the queues map once done.
- W1 blindly clears the `<nodeid,class>` raft send queue the second time.
- W2 expects to find the queue in the queues map, but doesn't, and fatals.
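
The interleaving above can be replayed deterministically in a toy model. The sketch below is only an illustration under stated assumptions: `transport`, `queueKey`, `getOrCreate`, `remove`, `lookup`, and `replay` are hypothetical names, not the actual `RaftTransport` code, and the steps run sequentially on one goroutine rather than as real async workers.

```go
package main

import (
	"fmt"
	"sync"
)

// queueKey is a hypothetical stand-in for the <nodeid,class> pair.
type queueKey struct {
	nodeID int
	class  int
}

// transport is a toy registry of send queues; it is NOT the real RaftTransport.
type transport struct {
	mu     sync.Mutex
	queues map[queueKey]chan string
}

func newTransport() *transport {
	return &transport{queues: make(map[queueKey]chan string)}
}

// getOrCreate models SendAsync observing no queue for k and creating one.
// It reports whether a new queue was created.
func (t *transport) getOrCreate(k queueKey) (chan string, bool) {
	t.mu.Lock()
	defer t.mu.Unlock()
	if q, ok := t.queues[k]; ok {
		return q, false
	}
	q := make(chan string, 1)
	t.queues[k] = q
	return q, true
}

// remove models a worker clearing the tracked queue on its way out.
func (t *transport) remove(k queueKey) {
	t.mu.Lock()
	defer t.mu.Unlock()
	delete(t.queues, k)
}

// lookup models a worker expecting to find its queue in the map.
func (t *transport) lookup(k queueKey) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	_, ok := t.queues[k]
	return ok
}

// replay walks through the sequence from the list above and reports whether
// W2 finds its queue. doubleDelete toggles W1's second, stale deletion.
func replay(doubleDelete bool) bool {
	t := newTransport()
	k := queueKey{nodeID: 3, class: 0}

	t.getOrCreate(k) // SendAsync creates the queue and starts W1.
	if !t.lookup(k) {
		panic("W1 should find its own queue")
	}
	t.remove(k)      // W1 clears the queue the first time.
	t.getOrCreate(k) // A second SendAsync creates a fresh queue and starts W2.
	if doubleDelete {
		t.remove(k) // W1 blindly clears the entry again, removing W2's queue.
	}
	return t.lookup(k) // W2 checks for its queue.
}

func main() {
	fmt.Println("single delete, W2 finds queue:", replay(false))
	fmt.Println("double delete, W2 finds queue:", replay(true))
}
```

In this model each individual map operation is mutex-protected, yet the second deletion still removes a queue that belongs to a different worker. The missing lookup in the `doubleDelete` case corresponds to the point where W2 fatals.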

Release note: None

Co-authored-by: irfan sharif <irfanmahmoudsharif@gmail.com>
@craig craig bot closed this as completed in 154b8d2 Jun 10, 2023
@pav-kv pav-kv added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-kv-replication Relating to Raft, consensus, and coordination. labels Jun 13, 2023
@blathers-crl
blathers-crl bot commented Jun 13, 2023

cc @cockroachdb/replication
