-
Notifications
You must be signed in to change notification settings - Fork 3.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
roachtest: clearrange/checks=true/rangeTs=false failed #104698
Labels
A-kv-replication
Relating to Raft, consensus, and coordination.
branch-master
Failures and bugs on the master branch.
C-bug
Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
C-test-failure
Broken test (automatically or manually discovered).
O-roachtest
O-robot
Originated from a bot.
release-blocker
Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked.
T-kv
KV Team
Milestone
Comments
Same as #104696 (comment). |
craig bot
pushed a commit
that referenced
this issue
Jun 10, 2023
104699: kvserver: fix clearrange/* tests r=irfansharif a=irfansharif Fixes #104696. Fixes #104697. Fixes #104698. Part of #98703. In 072c16d (added as part of #95637) we re-worked the locking structure around the RaftTransport's per-RPC class level send queues. When new send queues are instantiated or old ones deleted, we now also maintain the kvflowcontrol connection tracker, so such maintenance now needs to happen while holding a kvflowcontrol mutex. When rebasing \#95637 onto master, we accidentally included earlier queue deletion code without holding the appropriate mutex. Queue deletions now happened twice which made it possible to hit a RaftTransport assertion about expecting the right send queue to already exist. Specifically, the following sequence was possible: - `(*RaftTransport).SendAsync` is invoked, observes no queue for `<nodeid,class>`, creates it, and tracks it in the queues map. - It invokes an async worker W1 to process that send queue through `(*RaftTransport).startProcessNewQueue`. The async worker is responsible for clearing the tracked queue in the queues map once done. - W1 expects to find the tracked queue in the queues map, finds it, proceeds. - W1 is done processing. On its way out, W1 clears `<nodeid,class>` from the queues map the first time. - `(*RaftTransport).SendAsync` is invoked by another goroutine, observes no queue for <nodeid,class>, creates it, and tracks it in the queues map. - It invokes an async worker W2 to process that send queue through `(*RaftTransport).startProcessNewQueue`. The async worker is responsible for clearing the tracked queue in the queues map once done. - W1 blindly clears the `<nodeid,class>` raft send queue the second time. - W2 expects to find the queue in the queues map, but doesn't, and fatals. Release note: None Co-authored-by: irfan sharif <irfanmahmoudsharif@gmail.com>
cc @cockroachdb/replication |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Labels
A-kv-replication
Relating to Raft, consensus, and coordination.
branch-master
Failures and bugs on the master branch.
C-bug
Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
C-test-failure
Broken test (automatically or manually discovered).
O-roachtest
O-robot
Originated from a bot.
release-blocker
Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked.
T-kv
KV Team
roachtest.clearrange/checks=true/rangeTs=false failed with artifacts on master @ 61806f42e4c833b863ca9c2f62ce34918f1c8277:
Parameters:
ROACHTEST_arch=amd64
,ROACHTEST_cloud=gce
,ROACHTEST_cpu=16
,ROACHTEST_encrypted=false
,ROACHTEST_fs=ext4
,ROACHTEST_localSSD=true
,ROACHTEST_ssd=0
Help
See: roachtest README
See: How To Investigate (internal)
This test on roachdash | Improve this report!
Jira issue: CRDB-28676
The text was updated successfully, but these errors were encountered: