kvserver: outstanding reproposal at a higher max lease index #97102
Comments
cc @cockroachdb/replication
Possibly completely unrelated, but we just saw a foreign key violation in #97141; similar issues have in the past been traced back to raft-level issues (losing writes or dirty writes).
I was hoping this would shake out just by having reproposals in the system, #97173, but no - this has been running for some time. What I'm doing is similar, except that you have 100 splits and much higher concurrency, as well as 1kb writes. I'll add these in one by one to see what sticks.
Reproed both this and closed timestamp regressions. Going to poke at them more tomorrow.
This closed timestamp regression? #70894
Poking at a repro now. Idx 380281 triggered the assertion. I'm injecting LAI 1234 for a proportion of commands that would correctly have needed a much larger LAI. Notably, the previous command already carried the same LAI. It's thus very likely that 380281 shouldn't have applied (it should have been rejected). So why are we hitting an assertion that says that it was "finished"? What do we know about the proposal at this point?
We know that it
I'm thinking we are maybe erroneously hitting the assertion in the case in which we
and that the problem didn't occur until I added write overload (which means reproposals are more likely to fail, the usual reason being that the closed timestamp has advanced past the proposal's write timestamp).
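To illustrate the kind of injection described above, here is a tiny self-contained sketch of the idea; none of the names below come from the kvserver code, it just models "assign a stale LAI to a fraction of commands and have them rejected at apply time, forcing reproposals":

```go
// Toy model of the experiment: some commands get a stale max lease index
// (LAI 1234) injected at proposal time and are rejected at apply time,
// which is what forces a reproposal. Not kvserver code; names are invented.
package main

import (
	"fmt"
	"math/rand"
)

type proposal struct {
	id            int
	maxLeaseIndex uint64 // highest lease applied index under which this command may still apply
}

type replica struct {
	leaseAppliedIndex uint64 // LAI after the last successfully applied command
}

// apply mimics the below-raft check: a command whose MaxLeaseIndex does not
// exceed the replica's current LeaseAppliedIndex is rejected and has to be
// reproposed under a higher max lease index.
func (r *replica) apply(p proposal) bool {
	if p.maxLeaseIndex <= r.leaseAppliedIndex {
		return false // rejected: needs a reproposal
	}
	r.leaseAppliedIndex = p.maxLeaseIndex
	return true
}

func main() {
	r := &replica{leaseAppliedIndex: 5000}
	rng := rand.New(rand.NewSource(1))
	for i := 0; i < 10; i++ {
		p := proposal{id: i, maxLeaseIndex: r.leaseAppliedIndex + 1}
		if rng.Float64() < 0.3 {
			p.maxLeaseIndex = 1234 // injected stale LAI, far below what the command needs
		}
		if r.apply(p) {
			fmt.Printf("cmd %d applied, LAI now %d\n", p.id, r.leaseAppliedIndex)
		} else {
			fmt.Printf("cmd %d rejected (MLI %d <= LAI %d): would be reproposed\n",
				p.id, p.maxLeaseIndex, r.leaseAppliedIndex)
		}
	}
}
```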
Hoping to shake out something relevant for cockroachdb#97102 or cockroachdb#97141. Not intended for merge. Can maybe turn this into a metamorphic var down the road. For now running manually via `./experiment.sh`. To date, it hasn't produced anything. For the first runs, I ~immediately got a closed timestamp regression. This was puzzling, but then also disappeared, so I think my OSX clock might have been adjusting over those few minutes.

Epic: none
Release note: None
I'm also hitting this assertion now, presumably since I'm starting to artificially introduce reproposal failures.
Finally got a repro of "outstanding reproposal at a higher max lease index" with decent logging (this was unexpectedly painful because the pretty printer choked on a bunch of nested fields, there were data races, etc.). Full info here: https://gist.github.com/f22d177c6465959f8976801bc332caa5

The log index that triggers the assertion is at idx=0x54e5c=347740, MLI 0x4d2 (= 1234, i.e. the injected LAI). It's applying with a forced error, a NotLeaseholderError. This means the lease changed. So we didn't try to repropose this command at all; it's a

I think I understand how we're triggering the assertion unintentionally here. Let's say we have this history:
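Something along these lines (entries, indices, and MLI values are illustrative, not taken from the repro):

```
idx 100: cmd X proposed with MLI m   -> fails the lease index check at apply time,
                                        must be reproposed at a higher MLI
idx 101: other commands
```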
The command at 100 will get reproposed with a new lease index. But now the lease might also change hands, and another "old" copy of the command might slip in:
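Again illustrative:

```
idx 100: cmd X (MLI m)            -> rejected (lease index check), reproposal queued
idx 101: lease changes hands
idx 102: stale copy of cmd X      -> applies with a forced NotLeaseHolderError
idx 103: reproposal of cmd X at MLI m' > m, still outstanding when 102 applies
```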
When we apply idx 102, we don't remove the command from

The entry at idx 102 catches the forced error we see in the assertion: the lease was invalid. Nothing here has gone wrong except that the assertion fired: the fact that the lease is invalid implies that the reproposal (idx 103) will similarly fail with an error. (It would be a problem if it were allowed to mutate the state machine, since at idx 102 we're telling the client that their command is definitely not going to have an effect.)

The basic problem is that there is "drift" between the assumptions made at the log position from which the reproposal (idx 103) is spawned (idx 100) and the state at which we see another stale copy of idx 100 (at idx 102): we assume the lease hasn't changed, but as we see here, it can. I have a strong urge to make all this code more obvious, but I need to stew a little bit over how exactly to accomplish that.
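To make the benign case concrete, here's a toy, self-contained model (all names invented; this is neither the actual kvserver check nor the eventual fix) of an assertion like the one that fires, and why a stale copy that applies with a forced error shouldn't trip it:

```go
// Toy model of the scenario above: a stale copy of a command (idx 102)
// applies with a forced error while a reproposal at a higher max lease
// index (idx 103) is still outstanding. All names are invented.
package main

import (
	"errors"
	"fmt"
)

type proposal struct {
	maxLeaseIndex uint64 // bumped every time the command is reproposed
}

type appliedEntry struct {
	idx           uint64
	propID        string
	maxLeaseIndex uint64 // the MLI this particular copy was proposed under
	forcedErr     error  // non-nil if the entry applies as a rejection, e.g. the lease changed
}

// check models an "outstanding reproposal at a higher max lease index"
// assertion. The benign case: the applying copy is already being rejected
// via a forced error because the lease changed, which implies the
// outstanding reproposal cannot apply either.
func check(proposals map[string]*proposal, e appliedEntry) error {
	p, ok := proposals[e.propID]
	if !ok || p.maxLeaseIndex <= e.maxLeaseIndex {
		return nil // no reproposal outstanding at a higher MLI
	}
	if e.forcedErr != nil {
		// Stale copy rejected under an invalid lease; the reproposal at the
		// higher MLI will fail for the same reason, so nothing went wrong.
		return nil
	}
	return fmt.Errorf("idx %d: outstanding reproposal at a higher max lease index", e.idx)
}

func main() {
	proposals := map[string]*proposal{
		"cmdX": {maxLeaseIndex: 60}, // already reproposed (at idx 103) with MLI 60
	}
	stale := appliedEntry{
		idx: 102, propID: "cmdX", maxLeaseIndex: 50,
		forcedErr: errors.New("NotLeaseHolderError"),
	}
	fmt.Println("assertion error:", check(proposals, stale)) // nil: the benign case
}
```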
With a fix added to the assertion, I've been running it for the last couple of hours on the gceworker. It eventually crashed as the disk filled up. Will polish that fix and validate further.
97353: kvserver: downgrade and improve a log message r=pavelkalinnikov a=tbg

It's not Warning-worthy since this is hit in production, often under some kind of overload. And if we're logging this, might as well include enough information to identify the log entry for which a reproposal was attempted. This is helpful for investigations such as #70894, #97287, #97102, etc.

Epic: none
Release note: None

Co-authored-by: Tobias Grieger <tobias.b.grieger@gmail.com>
Describe the problem
Fatal error on `n1` (prob. leaseholder) on this line: `cockroach/pkg/kv/kvserver/replica_application_state_machine.go`, line 238 in 866d58a.
Could be related to recent changes, e.g. #94633.
To Reproduce
I caught this while running the following on a GCE worker (with a master build at 9c41c48):
Jira issue: CRDB-24511