kv: dropping latches after failed raft reproposal is unsafe #115020
Comments
cc @cockroachdb/replication
We don't have a reason to believe the other case is safe either. If the snapshot jumps over LAIs that were still pending, we don't have a way of telling whether the snapshot contains them as applied or skipped. The comment above the check says as much. In fact, I believe this other case is a more plausible cause of the failure we're seeing. Agree that this code needs a clean-up either way.
I think we do. If the snapshot jumps over LAIs that were still pending, we don't have a way of telling whether the snapshot contains them as applied or if the proposals were skipped. However, because of the […]
Hi @erikgrinaker, please add branch-* labels to identify which branch(es) this release-blocker affects. 🦉 Hoot! I am Blathers, a bot for CockroachDB. My owner is dev-inf.
It's unclear whether this is a pre-existing bug or was introduced with the reproposals refactor (we suspect the former), so tentatively marking this as a GA blocker until we confirm.
This same logic existed in […]. Regarding a fix, I think the solution will involve the combination of: […]
@pavelkalinnikov Can you confirm @nvanbenschoten's comment above, and remove the GA-blocker label if appropriate?
Yeah, the diff for the […]. I'm looking at #106750 to see if there are some other changes that could cause this. In particular, there are some comments about poisoning and double-calling a command done. I don't know if it's old or new.
Agree. Even though we might be safe here (for instance, one of the error conditions in […]), […]. In the replica destruction case, for example, we could delegate dropping these proposals to the destruction procedure itself. I'm less clear about the cases when the error is for different reasons.
Maybe it's safest to not close proposals from this […].
So I'm leaning towards having a clearer design, like this: […]

Update: forked design improvements to #116020.
Pre-existing issue, removing the GA blocker.
The following logic rejects a raft proposal with an AmbiguousResultError if an attempt to repropose it fails:
cockroach/pkg/kv/kvserver/replica_raft.go, lines 1491 to 1495 in 59fb4ec
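Since the embedded snippet isn't reproduced in this capture, here is a minimal, self-contained sketch of the pattern being described. The names here (`reinsert`, the toy `proposal`, the error text) are illustrative stand-ins, not CockroachDB APIs; the actual code lives at the permalink above.

```go
// A toy model of the rejection path: on a failed reproposal, the request is
// finished with an ambiguous result and its latches are dropped.
package main

import (
	"errors"
	"fmt"
)

// proposal models an inflight raft proposal whose request holds latches.
type proposal struct {
	latchesHeld bool
	result      error
}

// finishApplication signals a result to the waiting client and releases its
// latches, mirroring what the issue says the real finishApplication does.
func (p *proposal) finishApplication(err error) {
	p.result = err
	p.latchesHeld = false
}

// reinsert stands in for re-adding the proposal to the proposal buffer; it
// always fails here so that the error path is exercised.
func reinsert(p *proposal) error {
	return errors.New("proposal buffer flush failed")
}

func main() {
	p := &proposal{latchesHeld: true}

	// The problematic pattern: reject the request with an ambiguous result and
	// drop its latches, even though the original proposal might still apply.
	if err := reinsert(p); err != nil {
		p.finishApplication(fmt.Errorf("result is ambiguous: %v", err))
	}

	fmt.Println("latches held:", p.latchesHeld, "result:", p.result)
}
```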
In doing so, it calls finishApplication, which releases latches and cleans up the request. It's not clear how this is safe. I don't think it is. Unlike the other case where we reject requests during a raft reproposal attempt (here), on this path we have no strong reason to believe that the original proposal won't eventually succeed. If it could eventually succeed, then dropping latches is unsafe, as it could allow conflicting requests to proceed and evaluate before the original request applies, only for the original request to later apply. This kind of race could lead to any number of issues, including stats inconsistencies and lost updates due to clobbered writes.
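To make the race concrete, here is a minimal sketch of the interleaving, using toy types rather than CockroachDB code: once request A's latches are released, a conflicting request B is admitted, and A's still-pending proposal applies afterwards and clobbers B's write.

```go
// A toy model of the lost-update race: dropping request A's latches before its
// proposal's fate is known lets request B slip in, only for A to apply later.
package main

import "fmt"

type proposal struct {
	key, val string
}

func main() {
	kv := map[string]string{}

	// Request A holds latches on "k", evaluates, and submits a proposal.
	propA := proposal{key: "k", val: "from-A"}

	// A reproposal attempt for A fails; the error path rejects A with an
	// ambiguous result and releases A's latches, even though proposal A may
	// still be applied by raft later.

	// With A's latches gone, conflicting request B acquires them, evaluates,
	// and applies its write.
	kv["k"] = "from-B"

	// Later, the original proposal A still applies, clobbering B's write even
	// though B observed A as "finished".
	kv[propA.key] = propA.val

	fmt.Println(kv["k"]) // "from-A": B's update has been lost.
}
```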
I think we want one of two things here.
One option is to ignore the error from ReinsertLocked and not reject the proposal, allowing it to be reproposed again later. This may lead to requests like lease acquisitions getting stuck indefinitely in the proposals map, so we'd need to be careful.

The other option is to signal a result to the proposal without dropping latches. This is what we (correctly) do when poisoning requests:
cockroach/pkg/kv/kvserver/replica_raft.go, lines 1511 to 1514 in 59fb4ec
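As a rough illustration of that second option, modeled loosely on the poisoning path rather than copied from it (the types below are hypothetical stand-ins): the waiting client is unblocked with an ambiguous-result error, but the latches stay held until the command's fate is actually known.

```go
// A sketch of "signal a result without dropping latches": the client waiting
// on the proposal is unblocked immediately, while latch release is deferred
// until the command either applies or provably never will.
package main

import (
	"errors"
	"fmt"
)

type latchSet struct{ held bool }

func (l *latchSet) release() { l.held = false }

type proposal struct {
	doneCh  chan error // signals the waiting client
	latches *latchSet  // released only once the command's fate is known
}

// signalResult unblocks the client without touching latches, analogous in
// spirit to how the poisoning path signals a result.
func (p *proposal) signalResult(err error) {
	p.doneCh <- err
}

// finish releases latches; it is only called once the proposal has applied or
// can never apply (e.g. the replica was destroyed).
func (p *proposal) finish() {
	p.latches.release()
}

func main() {
	p := &proposal{doneCh: make(chan error, 1), latches: &latchSet{held: true}}

	// Reproposal failed: tell the client the result is ambiguous...
	p.signalResult(errors.New("result is ambiguous: reproposal failed"))
	fmt.Println("client sees:", <-p.doneCh)

	// ...but keep holding latches so conflicting requests cannot interleave.
	fmt.Println("latches still held:", p.latches.held) // true

	// Later, once the command's fate is known, release the latches.
	p.finish()
	fmt.Println("latches still held:", p.latches.held) // false
}
```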
It's also possible that we never actually hit this error in practice and that the code is effectively dead. There are very few cases where ReinsertLocked returns an error: it only does so when the replica is destroyed (at which point all proposals are already rejected) or, in rare cases, when the propBuf is full and flushing it returns an error. So I might be making a big deal about a non-issue. Either way, we should fix the code to not look so error-prone.

Original Slack discussion: https://cockroachlabs.slack.com/archives/C0KB9Q03D/p1700688914732519?thread_ts=1700675982.566959&cid=C0KB9Q03D
Jira issue: CRDB-33844
Epic: CRDB-37617