
kvserver: outstanding reproposal at a higher max lease index #97102

Closed
pav-kv opened this issue Feb 14, 2023 · 11 comments · Fixed by #97564
Labels: C-bug (Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.)


pav-kv commented Feb 14, 2023

Describe the problem

Fatal error on n1 (probably the leaseholder) on this line:

log.Fatalf(ctx, "finishing proposal with outstanding reproposal at a higher max lease index: %+v", cmd)

F230214 09:41:18.245640 428 kv/kvserver/replica_application_state_machine.go:238 ⋮ [T1,n1,s1,r155/1:‹/Table/106/1/-7{76224…-57960…}›,raft] 13077  finishing proposal with outstanding reproposal at a higher max lease index: ‹Term:9 Index:13210 Type:EntryNormal Data:"\000dX\340\332\006\356\361\3140\007j\nB\000\222\001\005`\212\002h\002r\227\001\n\224\001\000\000\000\000\000\000\000\000\001\000\000\000\001&\001k\022\362\211\200\224\270\274P{\232h\373\210\000\001txn-\236\023\372a\221\014H\210\204}\017h\202*\217\"\000_\022\004\010\000\020\000\030\000 \000(\0002QF\273B\344\003\n>\n\020\236\023\372a\221\014H\210\204}\017h\202*\217\"\032\014\362\211\200\224\270\274P{\232h\373\210*\n\010\375\320\302\316\276\360\351\241\0270\341\300\016J\n\010\375\320\302\316\276\360\351\241\027P\001*\n\010\260\207\324\253\302\360\351\241\027 \277c\212\001\n\010\263\365\340\216\266\360\351\241\027"›

Could be related to recent changes, e.g. #94633.

To Reproduce

I caught this while running the following on a GCE worker (with a build of master at 9c41c48):

roachprod create -n3 local
roachprod put local ./cockroach
roachprod start --env "COCKROACH_ROCKSDB_CONCURRENCY=5" --env "COCKROACH_LOG_SST_INFO_TICKS_INTERVAL=6" local

workload run kv --init --histograms=perf/stats.json --concurrency=8196 --splits=100 --duration=30m0s --read-percent=0 --min-block-bytes=1024 --max-block-bytes=1024 --tolerate-errors=true 'postgres://root@127.0.0.1:26257?sslmode=disable'

Jira issue: CRDB-24511

pav-kv added the C-bug label Feb 14, 2023
pav-kv commented Feb 14, 2023

@tbg

tbg self-assigned this Feb 14, 2023
blathers-crl bot commented Feb 14, 2023

cc @cockroachdb/replication

pav-kv commented Feb 14, 2023

F230214 09:41:18.245640 428 kv/kvserver/replica_application_state_machine.go:238 ⋮ [T1,n1,s1,r155/1:‹/Table/106/1/-7{76224…-57960…}›,raft] 13077 !goroutine 428 [running]:
!github.com/cockroachdb/cockroach/pkg/util/log.getStacks(0x1)
!       github.com/cockroachdb/cockroach/pkg/util/log/get_stacks.go:25 +0x89
!github.com/cockroachdb/cockroach/pkg/util/log.(*loggerT).outputLogEntry(0xc000d1b160, {{{0xc00221e240, 0x24}, {0x0, 0x0}, {0x58d794e, 0x1}, {0x58d794e, 0x1}}, 0x1743a78586c71809, ...})
!       github.com/cockroachdb/cockroach/pkg/util/log/clog.go:262 +0x97
!github.com/cockroachdb/cockroach/pkg/util/log.logfDepthInternal({0x6cefe88, 0xc07f7268a0}, 0x2, 0x4, 0x0, 0x0?, {0x5a01d0d, 0x4f}, {0xc0115ae1a8, 0x1, ...})
!       github.com/cockroachdb/cockroach/pkg/util/log/channels.go:106 +0x645
!github.com/cockroachdb/cockroach/pkg/util/log.logfDepth(...)
!       github.com/cockroachdb/cockroach/pkg/util/log/channels.go:39
!github.com/cockroachdb/cockroach/pkg/util/log.Fatalf(...)
!       github.com/cockroachdb/cockroach/bazel-out/k8-fastbuild/bin/pkg/util/log/log_channels_generated.go:848
!github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*replicaStateMachine).ApplySideEffects(0xc00741a100, {0x6cefe88, 0xc07f7268a0}, {0x6d305c8?, 0xc002c1a508})
!       github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_application_state_machine.go:238 +0x6e5
!github.com/cockroachdb/cockroach/pkg/kv/kvserver/apply.mapCheckedCmdIter({0x7f87c86c4318?, 0xc00741a4d0}, 0xc0115ae7d0)
!       github.com/cockroachdb/cockroach/pkg/kv/kvserver/apply/cmd.go:210 +0x158
!github.com/cockroachdb/cockroach/pkg/kv/kvserver/apply.(*Task).applyOneBatch(0xc0115aecc8, {0x6cefe88, 0xc00741e060}, {0x6d20758, 0xc00741a470})
!       github.com/cockroachdb/cockroach/pkg/kv/kvserver/apply/task.go:310 +0x1f8
!github.com/cockroachdb/cockroach/pkg/kv/kvserver/apply.(*Task).ApplyCommittedEntries(0xc0115aecc8, {0x6cefe88, 0xc00741e060})
!       github.com/cockroachdb/cockroach/pkg/kv/kvserver/apply/task.go:251 +0x9a
!github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).handleRaftReadyRaftMuLocked(_, {_, _}, {{0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...}, ...})
!       github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_raft.go:1003 +0x13b9
!github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).handleRaftReady(_, {_, _}, {{0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...}, ...})
!       github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_raft.go:698 +0x17b
!github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Store).processReady(0xc000dae000, 0x6cefe88?)
!       github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/store_raft.go:643 +0x145
!github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*raftScheduler).worker(0xc00378a900, {0x6cefe88, 0xc0038abbc0})
!       github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/scheduler.go:333 +0x235
!github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx.func2()
!       github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:470 +0x146
!created by github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx
!       github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:461 +0x43b

tbg commented Feb 14, 2023

Possibly completely unrelated, but we just saw a foreign key violation in #97141; similar issues have in the past been traced back to raft-level issues (losing writes or dirty writes).

tbg commented Feb 15, 2023

I was hoping this would shake out just by having reproposals in the system (#97173), but no: that has been running for some time without hitting this.

What I'm doing is similar, except that your run has 100 splits, much higher concurrency, and 1 KiB writes. I'll add these in one by one to see what sticks.

tbg commented Feb 15, 2023

Reproed both this and closed timestamp regressions. Going to poke at them more tomorrow.

pav-kv commented Feb 15, 2023

This closed timestamp regression? #70894

tbg commented Feb 16, 2023

Poking at a repro now. Idx 380281 triggered the assertion. I'm injecting LAI 1234 for a proportion of commands that would correctly have needed a much larger LAI. Notably, the previous command already used the same LAI, so it's very likely that 380281 shouldn't have applied (it should have been rejected). So why are we hitting an assertion that says it was "finished"? What do we know about the proposal at this point?

    raft_log_debug_test.go:47: idx 380280: 15583bac3c1540c1: LAI 1234, CT 1676475378.493749807,0
    raft_log_debug_test.go:47: idx 380281: 0aac68f647254dba: LAI 1234, CT 1676475378.502348475,0
    raft_log_debug_test.go:47: idx 380282: 336f27e9c58b2e2e: LAI 1234, CT 1676475378.499693666,0

We know that it IsLocal(), and we know that the entry from the raft log has a smaller LAI than the proposal, which we interpret as meaning that a new copy of the proposal should be in the log. Fine, we can check that. The CmdID is 0aac68f647254dba, so do we see it again in the log? We find one other copy, but it's in the past (380228 < 380281), so that one isn't it; there is no future copy.

    raft_log_debug_test.go:47: idx 380228: 0aac68f647254dba: LAI 1234, CT 1676475378.502348475,0
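For illustration only, here is a minimal sketch of the kind of scan just described, looking for other copies of a command ID around the index that tripped the assertion. The logEntry type and findCopies helper are hypothetical stand-ins, not the harness behind raft_log_debug_test.go:

package main

import "fmt"

// logEntry is a simplified, hypothetical view of a decoded raft log entry:
// its index, the proposal's command ID, and its max lease index (LAI).
type logEntry struct {
    index uint64
    cmdID string
    lai   uint64
}

// findCopies splits all other entries carrying the given command ID into
// copies before and after the pivot index, so a stale past copy can be told
// apart from an outstanding future reproposal.
func findCopies(entries []logEntry, cmdID string, pivot uint64) (past, future []logEntry) {
    for _, e := range entries {
        if e.cmdID != cmdID || e.index == pivot {
            continue
        }
        if e.index < pivot {
            past = append(past, e)
        } else {
            future = append(future, e)
        }
    }
    return past, future
}

func main() {
    entries := []logEntry{
        {index: 380228, cmdID: "0aac68f647254dba", lai: 1234},
        {index: 380280, cmdID: "15583bac3c1540c1", lai: 1234},
        {index: 380281, cmdID: "0aac68f647254dba", lai: 1234},
    }
    past, future := findCopies(entries, "0aac68f647254dba", 380281)
    fmt.Printf("past copies: %d, future copies: %d\n", len(past), len(future))
    // Prints "past copies: 1, future copies: 0", matching the situation above.
}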

I'm thinking we may be erroneously hitting the assertion in the case where we call tryReproposeWithNewLeaseIndex but it fails (for example due to the closed timestamp). Supporting evidence is that we see lots of these in the logs:

failed to repropose with new lease index

and that the problem didn't occur until I added write overload (which makes reproposals more likely to fail; the usual reason is that their closed timestamp is now above the proposal's write timestamp).
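(As an aside, the LAI injection mentioned at the top of this comment can be pictured roughly as in the sketch below. The hook, constants, and selection probability are assumptions for illustration; they are not the knob actually used in this run.)

package laiinject

import "math/rand"

// maybeInjectStaleLAI stands in for a testing hook that replaces the max
// lease index of a small fraction of proposals with a deliberately stale
// value, forcing those proposals to be rejected below raft and reproposed.
func maybeInjectStaleLAI(lai uint64, rng *rand.Rand) uint64 {
    const (
        staleLAI    = 1234 // far below any LAI the range has actually reached
        injectEvery = 100  // roughly 1% of proposals get the stale value
    )
    if rng.Intn(injectEvery) == 0 {
        return staleLAI
    }
    return lai
}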

tbg added a commit to tbg/cockroach that referenced this issue Feb 16, 2023
Hoping to shake out something relevant for
cockroachdb#97102 or
cockroachdb#97141.

Not intended for merge. Can maybe turn this into a metamorphic var down
the road. For now running manually via `./experiment.sh`.

To date, it hasn't produced anything. For the first runs, I ~immediately
got a closed timestamp regression. This was puzzling, but then also
disappeared, so I think my OSX clock might have been adjusting over
those few minutes.

Epic: none
Release note: None
tbg commented Feb 16, 2023

I'm also hitting this assertion now, presumably since I'm starting to artificially introduce reproposal failures.

F230216 09:12:21.773058 296 kv/kvserver/replica_application_result.go:137 ⋮ [T1,n1,s1,r60/1:‹/Table/106{-/1/-461…}›,raft] 11315 failed reproposal unexpectedly in proposals map:

tbg commented Feb 16, 2023

Finally got a repro of "outstanding reproposal at a higher max lease index" with decent logging (this was unexpectedly painful because the pretty-printer choked on a bunch of nested fields, there were data races, etc.).

Full info below.

https://gist.github.com/f22d177c6465959f8976801bc332caa5

The log index that triggers the assertion is at idx=0x54e5c=347740, MLI 0x4d2. It's applying with a forced error, a NotLeaseHolderError. This means the lease changed [1]. So we didn't try to repropose this command at all; it's a ProposalRejectionPermanent error.

I think I understand how we're triggering the assertion unintentionally here. Let's say we have this history:

[lease seq is 1]
idx 99: unrelated cmd at LAI 10000, lease seq = 1
idx 100: cmd X at LAI 10000, lease seq = 1

The command at 100 will get reproposed with a new lease index. But now the lease might also change hands, and another "old" copy of the command might slip in:

[lease seq is 1]
idx 99: unrelated cmd at LAI 10000, lease seq = 1
idx 100: cmd X at LAI 10000, lease seq = 1
idx 101: new lease seq=2
idx 102: cmd X at LAI 10000, lease seq = 1
idx 103: cmd X at LAI 20000, lease seq = 1

When we apply idx 102, we don't remove the command from r.mu.proposals in retrieveLocalProposals [2] because it is "superseded": the in-memory LAI is 20000, indicating there's a new version of the proposal in flight already (we know it's idx 103, but the code doesn't).

The entry at idx 102 catches the forced error we see in the assertion: the lease was invalid.

Nothing here has gone wrong except that the assertion fired; the fact that the lease is invalid implies that the reproposal (idx 103) will similarly fail with an error. (It would be a problem if it were allowed to mutate the state machine, since at idx 102 we're telling the client that their command is definitely not going to have an effect.)

The basic problem is that assumptions "drift" between the log position at which the reproposal (idx 103) is spawned (from idx 100) and the position at which we see another stale copy of the command (at idx 102): we assume the lease hasn't changed, but as we see here, it can.
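To make that concrete, here is a heavily simplified model of the two pieces involved. The types and function names are illustrative only and do not correspond to the real kvserver code:

package reproposal

import "errors"

// proposal models the in-memory proposal tracked in the proposals map; its
// max lease index is bumped each time the command is reproposed.
type proposal struct {
    maxLeaseIndex uint64
}

// entry models one raft log entry carrying a copy of that proposal.
type entry struct {
    lai         uint64 // the LAI baked into this copy when it was proposed
    forcedError bool   // e.g. rejected at apply time because the lease changed
}

// superseded mirrors the idea behind the check in retrieveLocalProposals: if
// the in-memory LAI is higher than the entry's, a newer copy is assumed to be
// in flight, so the proposal is not unbound from the map when this stale copy
// applies.
func superseded(p *proposal, e entry) bool {
    return p.maxLeaseIndex > e.lai
}

// finish models what happens when the entry applies and the proposal is
// finished. The assertion discussed in this issue fires whenever a superseded
// proposal is finished, regardless of e.forcedError.
func finish(p *proposal, e entry) error {
    if superseded(p, e) {
        return errors.New("finishing proposal with outstanding reproposal at a higher max lease index")
    }
    return nil
}

In this model, the scenario above reaches finish with superseded(p, e) == true and e.forcedError == true (idx 102), which is exactly the benign combination the assertion does not account for.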

I have a strong urge to make all this code more obvious but I need to stew a little bit over how exactly to accomplish that.

Footnotes

  1. We also use an NLHE when we bump into the closed timestamp in tryReproposeWithNewLeaseIndex, but we don't set it as a forced error, and it has a different message than the one here.

  2. https://github.com/cockroachdb/cockroach/blob/866d58a62d792f963ff963235406b53b4ccf597f/pkg/kv/kvserver/replica_application_decoder.go#L100-L114

tbg commented Feb 16, 2023

With a fix added to the assertion, I've been running it for the last couple of hours on the GCE worker. It eventually crashed as the disk filled up. Will polish that fix and validate further.

tbg added a commit to tbg/cockroach that referenced this issue Feb 20, 2023
It's not Warning-worthy since this is hit in production, often under
some kind of overload.

And if we're logging this, might as well include enough information
to identify the log entry for which a reproposal was attempted.

This is helpful for investigations such as cockroachdb#70894, cockroachdb#97287, cockroachdb#97102, etc.

Epic: none
Release note: None
craig bot pushed a commit that referenced this issue Feb 23, 2023
97353: kvserver: downgrade and improve a log message r=pavelkalinnikov a=tbg

It's not Warning-worthy since this is hit in production, often under
some kind of overload.

And if we're logging this, might as well include enough information
to identify the log entry for which a reproposal was attempted.

This is helpful for investigations such as #70894, #97287, #97102, etc.

Epic: none
Release note: None


Co-authored-by: Tobias Grieger <tobias.b.grieger@gmail.com>
craig bot closed this as completed in 13c58f6 Feb 26, 2023