roachtest: restore2TB/nodes=10 failed #60479
Per @nvanbenschoten, this looks like a deadlock in kv, potentially related to #59423 or #59086. Additional context in Slack here: https://cockroachlabs.slack.com/archives/C0KB9Q03D/p1613076549170200?thread_ts=1613072464.164800&cid=C0KB9Q03D
Concretely, the issue here is ad78116, which introduced a lock ordering bug. The locking notes here indicate that the store mutex should be locked before the replica mutex, but that commit is doing the opposite. We see the resulting deadlock with these two goroutines:
The first goroutine locks:
The second goroutine locks:
I'm drafting a fix now.
(roachtest).restore2TB/nodes=10 failed on master@9969862335cf501c1aa67a4aaf018ad5dc8e85e8:
60504: kv: fix lock ordering in Replica.applySnapshot r=tbg a=nvanbenschoten

Fixes #60479.

ad78116 was a nice improvement that avoided a number of tricky questions about exposing an inconsistent in-memory replica state during a snapshot. However, it appears to have introduced a deadlock due to lock ordering issues (see referenced issue). This commit fixes that issue by locking the Store mutex before the Replica mutex in `Replica.applySnapshot`'s combined critical section.

Release note: None

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
(roachtest).restore2TB/nodes=10 failed on master@57de93a997db31d97cd7e23fb866fe0e44f90de2:
Artifacts: /restore2TB/nodes=10
See this test on roachdash