Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kvserver: v22.1.4: nil pointer in poisonInflightLatches #86547

Closed
cockroach-teamcity opened this issue Aug 22, 2022 · 11 comments
Closed

kvserver: v22.1.4: nil pointer in poisonInflightLatches #86547

cockroach-teamcity opened this issue Aug 22, 2022 · 11 comments
Labels
C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. O-sentry Originated from an in-the-wild panic report. T-kv KV Team

Comments

@cockroach-teamcity
Copy link
Member

cockroach-teamcity commented Aug 22, 2022

This issue was autofiled by Sentry. It represents a crash or reported error on a live cluster with telemetry enabled.

Sentry link: https://sentry.io/organizations/cockroach-labs/issues/3521479942/?referrer=webhooks_plugin

Panic message:

panic.go:1038: runtime error: invalid memory address or nil pointer dereference
(1) attached stack trace
-- stack trace:
| runtime.gopanic
| GOROOT/src/runtime/panic.go:1038
| runtime.panicmem
| GOROOT/src/runtime/panic.go:221
| runtime.sigpanic
| GOROOT/src/runtime/signal_unix.go:735
| github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).poisonInflightLatches
| github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_raft.go:1384
| github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*replicaCircuitBreaker).asyncProbe.func1
| github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_circuit_breaker.go:218
| github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx.func2
| github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:494
| runtime.goexit
| GOROOT/src/runtime/asm_amd64.s:1581
Wraps: (2) runtime error: invalid memory address or nil pointer dereference
Error types: (1) *withstack.withStack (2) runtime.errorString
-- report composition:
runtime.errorString
panic.go:1038: *withstack.withStack (top exception)

Stacktrace (expand for inline code snippets):

GOROOT/src/runtime/panic.go#L1037-L1039 in runtime.gopanic
GOROOT/src/runtime/panic.go#L220-L222 in runtime.panicmem
GOROOT/src/runtime/signal_unix.go#L734-L736 in runtime.sigpanic
https://github.com/cockroachdb/cockroach/blob/3c6c8933f578a7fd140e24a603d6ec64c6b7a834/pkg/kv/kvserver/pkg/kv/kvserver/replica_raft.go#L1383-L1385 in pkg/kv/kvserver.(*Replica).poisonInflightLatches
https://github.com/cockroachdb/cockroach/blob/3c6c8933f578a7fd140e24a603d6ec64c6b7a834/pkg/kv/kvserver/pkg/kv/kvserver/replica_circuit_breaker.go#L217-L219 in pkg/kv/kvserver.(*replicaCircuitBreaker).asyncProbe.func1

f(ctx)
}()
in pkg/util/stop.(*Stopper).RunAsyncTaskEx.func2
GOROOT/src/runtime/asm_amd64.s#L1580-L1582 in runtime.goexit

GOROOT/src/runtime/panic.go in runtime.gopanic at line 1038
GOROOT/src/runtime/panic.go in runtime.panicmem at line 221
GOROOT/src/runtime/signal_unix.go in runtime.sigpanic at line 735
pkg/kv/kvserver/pkg/kv/kvserver/replica_raft.go in pkg/kv/kvserver.(*Replica).poisonInflightLatches at line 1384
pkg/kv/kvserver/pkg/kv/kvserver/replica_circuit_breaker.go in pkg/kv/kvserver.(*replicaCircuitBreaker).asyncProbe.func1 at line 218
pkg/util/stop/stopper.go in pkg/util/stop.(*Stopper).RunAsyncTaskEx.func2 at line 494
GOROOT/src/runtime/asm_amd64.s in runtime.goexit at line 1581
Tag Value
Cockroach Release v22.1.4
Cockroach SHA: 3c6c893
Platform linux amd64
Distribution CCL
Environment v22.1.4
Command server
Go Version ``
# of CPUs
# of Goroutines

Jira issue: CRDB-18809

Epic CRDB-39898

@cockroach-teamcity cockroach-teamcity added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. O-sentry Originated from an in-the-wild panic report. labels Aug 22, 2022
@yuzefovich yuzefovich changed the title sentry: panic.go:1038: runtime error: invalid memory address or nil pointer dereference (1) attached stack trace -- stack trace: | runtime.gopanic | GOROOT/src/runtime/panic.go:1038 | runtime.panicme... kvserver: v22.1.4: nil pointer in poisonInflightLatches Aug 29, 2022
@blathers-crl blathers-crl bot added the T-kv KV Team label Aug 29, 2022
@exalate-issue-sync exalate-issue-sync bot added T-kv-replication and removed T-kv KV Team labels Sep 19, 2022
@blathers-crl
Copy link

blathers-crl bot commented Sep 19, 2022

cc @cockroachdb/replication

@erikgrinaker
Copy link
Contributor

Can you have a look @tbg?

@tbg
Copy link
Member

tbg commented Sep 20, 2022

This panics here:

if p.ec.g.Req.PoisonPolicy == poison.Policy_Error {

At this point we know that p.ec is not nil. Since .Req.PoisonPolicy all sits on *Guard, it must be the case that p.ec.g is nil:

type endCmds struct {
repl *Replica
g *concurrency.Guard
st kvserverpb.LeaseStatus // empty for follower reads
}

I poked around a bit and I'm not aware that we should ever have a proposal with a nil *Guard in the proposals map. It is possible for SequenceReq to return a nil *Guard, but this is the case in which it directly returns a request, and such proposals ought not to end up below raft.

@nvanbenschoten do you expect to see a nil Guard here?

@tbg
Copy link
Member

tbg commented Sep 26, 2022

@nvanbenschoten via DM

I did find that in endCmds.poison, we have an // Already cleared. branch. That indicates to me that it is possible to have a proposal with a nil *Guard in the proposals map.

What's implicit here is that Nathan does not expect endCmds to be created and put into the map with a nil Guard, it must've been zeroed out later, probably as part of ec.done().

@tbg
Copy link
Member

tbg commented Oct 10, 2022

Looked at this again a bit. Still haven't sussed it all out but I think it's reasonable to assume that some proposals in the map can be finished (so their endCmds are cleared) without strict sync with the proposals map. It's cleared only from p.finishApplication but that method has a few callers including from the proposal buf, which also has to deal with reproposals, etc.
Will scrutinize all of this code and probably send a PR that makes it a bit clearer on master. On the older releases (22.1 etc) we should just backport the nil check for p.ec.g.

@msbutler
Copy link
Collaborator

msbutler commented Dec 19, 2022

a heads up that this failure surfaced in a nightly run of backupccl's TestDataDriven/restore-schema-only-multiregion here. Unfortunately, we were not quick enough to triage before artifact deletion. Stressing this test on master has lead to a different failure I've outlined here related to server startup.

@stevendanna
Copy link
Collaborator

We've seen a similar looking failure again on TestDataDriven. We have aftifacts this time if they are helpful: https://teamcity.cockroachdb.com/repository/download/Cockroach_Nightlies_StressBazel/8179790:id/tmp/_tmp/95e138e66d69292427dfb9528cf06d04/logTestDataDriven4271694788/backupccltest.log

E230104 08:37:44.513251 500410339 1@util/log/logcrash/crash_reporting.go:188 ⋮ [n1,s1,r49/1:‹/Table/4{7-8}›] 62627  a panic has occurred!
E230104 08:37:44.513251 500410339 1@util/log/logcrash/crash_reporting.go:188 ⋮ [n1,s1,r49/1:‹/Table/4{7-8}›] 62627 +runtime error: invalid memory address or nil pointer dereference
E230104 08:37:44.513251 500410339 1@util/log/logcrash/crash_reporting.go:188 ⋮ [n1,s1,r49/1:‹/Table/4{7-8}›] 62627 +(1) attached stack trace
E230104 08:37:44.513251 500410339 1@util/log/logcrash/crash_reporting.go:188 ⋮ [n1,s1,r49/1:‹/Table/4{7-8}›] 62627 +  -- stack trace:
E230104 08:37:44.513251 500410339 1@util/log/logcrash/crash_reporting.go:188 ⋮ [n1,s1,r49/1:‹/Table/4{7-8}›] 62627 +  | runtime.gopanic
E230104 08:37:44.513251 500410339 1@util/log/logcrash/crash_reporting.go:188 ⋮ [n1,s1,r49/1:‹/Table/4{7-8}›] 62627 +  | 	GOROOT/src/runtime/panic.go:884
E230104 08:37:44.513251 500410339 1@util/log/logcrash/crash_reporting.go:188 ⋮ [n1,s1,r49/1:‹/Table/4{7-8}›] 62627 +  | runtime.panicmem
E230104 08:37:44.513251 500410339 1@util/log/logcrash/crash_reporting.go:188 ⋮ [n1,s1,r49/1:‹/Table/4{7-8}›] 62627 +  | 	GOROOT/src/runtime/panic.go:260
E230104 08:37:44.513251 500410339 1@util/log/logcrash/crash_reporting.go:188 ⋮ [n1,s1,r49/1:‹/Table/4{7-8}›] 62627 +  | runtime.sigpanic
E230104 08:37:44.513251 500410339 1@util/log/logcrash/crash_reporting.go:188 ⋮ [n1,s1,r49/1:‹/Table/4{7-8}›] 62627 +  | 	GOROOT/src/runtime/signal_unix.go:835
E230104 08:37:44.513251 500410339 1@util/log/logcrash/crash_reporting.go:188 ⋮ [n1,s1,r49/1:‹/Table/4{7-8}›] 62627 +  | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).poisonInflightLatches
E230104 08:37:44.513251 500410339 1@util/log/logcrash/crash_reporting.go:188 ⋮ [n1,s1,r49/1:‹/Table/4{7-8}›] 62627 +  | 	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_raft.go:1363
E230104 08:37:44.513251 500410339 1@util/log/logcrash/crash_reporting.go:188 ⋮ [n1,s1,r49/1:‹/Table/4{7-8}›] 62627 +  | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*replicaCircuitBreaker).asyncProbe.func1
E230104 08:37:44.513251 500410339 1@util/log/logcrash/crash_reporting.go:188 ⋮ [n1,s1,r49/1:‹/Table/4{7-8}›] 62627 +  | 	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_circuit_breaker.go:200
E230104 08:37:44.513251 500410339 1@util/log/logcrash/crash_reporting.go:188 ⋮ [n1,s1,r49/1:‹/Table/4{7-8}›] 62627 +  | github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx.func2
E230104 08:37:44.513251 500410339 1@util/log/logcrash/crash_reporting.go:188 ⋮ [n1,s1,r49/1:‹/Table/4{7-8}›] 62627 +  | 	github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:470
E230104 08:37:44.513251 500410339 1@util/log/logcrash/crash_reporting.go:188 ⋮ [n1,s1,r49/1:‹/Table/4{7-8}›] 62627 +  | runtime.goexit
E230104 08:37:44.513251 500410339 1@util/log/logcrash/crash_reporting.go:188 ⋮ [n1,s1,r49/1:‹/Table/4{7-8}›] 62627 +  | 	GOROOT/src/runtime/asm_amd64.s:1594

@tbg
Copy link
Member

tbg commented Jan 6, 2023

I think it will be easy to "fix" the bug - just add a nil check - but I think there is something to be understood yet. I am still not seeing how we can end up with a proposal in r.mu.proposals that is finished.

I went through all of the callers to endCmds.done below and verified that they all make sure that the proposal is not in the map by the time .done() is called.

image

Most of them "obviously" do (for some of them it's clear that the proposal is never added to the map in the first place).
There are two exceptions:

AckOutcomeAndFinish

At the beginning of command application, a replicaAppBatch will be created and will grab1 the proposals from the map (removing them). However, it doesn't do this for all proposals, only for those that aren't superseded by a reproposal with a higher LeaseAppliedIndex2. The superseded ones could be of interested - they are still in the map during application, so perhaps we're signalling them? There are in fact potential problems here.

The first is in the code below, which will signal all proposals, including superseded ones, meaning they will be signaled & stay in the map:

if err := t.applyOneBatch(ctx, iter); err != nil {
// If the batch threw an error, reject all remaining commands in the
// iterator to avoid leaking resources or leaving a proposer hanging.
//
// NOTE: forEachCmdIter closes iter.
if rejectErr := forEachCmdIter(ctx, iter, func(cmd Command, ctx context.Context) error {
return cmd.AckErrAndFinish(ctx, err)
}); rejectErr != nil {
return rejectErr
}
return err

The one error we expect to see from applyOneBatch (thus triggering this code path) is ErrRemoved. It's unlikely that we saw this one in this instance, since the system was basically defunct for over a minute and so replicaGC is not too likely, plus we don't see log messages that should have occurred prior to the crash. But I'll make a note that something is wrong here.

I also considered whether we might be signaling proposals that are in the map in the error path here:

if pErr != nil {
log.Warningf(ctx, "failed to repropose with new lease index: %s", pErr)
cmd.response.Err = pErr
} else {
// Unbind the entry's local proposal because we just succeeded
// in reproposing it and we don't want to acknowledge the client
// yet.
cmd.proposal = nil
return
}

but I had recently convinced myself2 that a proposal that hits this path was necessarily removed from the map already, in retrieveLocalProposals1.

refreshProposalsLocked

This one is more interesting. Even though that method holds the appropriate locks throughout and does remove commands from the map prior to finishing them (in the few cases in which it does do that), what it does do in the common case is hand proposals that are in the map back to the proposal buffer (without finishing them):

for _, p := range reproposals {
log.Eventf(p.ctx, "re-submitting command %x to Raft: %s", p.idKey, reason)
if err := r.mu.proposalBuf.ReinsertLocked(ctx, p); err != nil {

This is possibly problematic - we now have a (yet unfinished, presumably) proposal in both the map and the proposal buffer. Could the proposal now be applied, finished, removed from the map but then be reinserted3 due to being present in the proposal buffer? I'm actually not convinced this can happen, because the proposal buffer is flushed right at the beginning of raft processing (which includes entry application), meaning that by the time we might be applying the proposal the proposal buffer has already been emptied out4 and won't re-insert into the map.

err := r.withRaftGroupLocked(true, func(raftGroup *raft.RawNode) (bool, error) {
numFlushed, err := r.mu.proposalBuf.FlushLockedWithRaftGroup(ctx, raftGroup)


Next steps

The lifecycle of a proposal is pretty ad-hoc but there is clearly an invariant I am assuming should hold - that there isn't ever a finished proposal in the proposal map. We are not checking this invariant, but we should. We add to the map in a single place only3 and should verify this here. If we then still see the crash but not the assertion - we know the proposal was finished while remaining in the map, still in violation of the invariant.

As outlined above, we know of at least one invariant violation (the snippet in the AckOutcomeAndFinish section) but maybe I missed another one. Actually, revisiting the log of the failing test above, maybe it is replicaGC-related after all, since this is a multi-region test and I see the replicaGCQueue repeatedly run into the circuit breaker. But I'm still unsure why there isn't any logging5 then.

Footnotes

  1. https://github.com/cockroachdb/cockroach/blob/a94858bff9a53450e0c76ff8ed8757fd3d18a264/pkg/kv/kvserver/replica_application_decoder.go#L100-L114 2

  2. better docs on this stuff incoming in https://github.com/cockroachdb/cockroach/pull/94633 2

  3. https://github.com/cockroachdb/cockroach/blob/a94858bff9a53450e0c76ff8ed8757fd3d18a264/pkg/kv/kvserver/replica_proposal_buf.go#L1183-L1191 2

  4. https://github.com/cockroachdb/cockroach/blob/0703807d018fb1c0f5352e8d83315b463c90d1e9/pkg/kv/kvserver/replica_destroy.go#L191-L216

  5. https://github.com/cockroachdb/cockroach/blob/35e0f4a6a13e6777e8d848100f4fa6311800c43f/pkg/kv/kvserver/store_remove_replica.go#L150

tbg added a commit to tbg/cockroach that referenced this issue Jan 10, 2023
The conjecture in cockroachdb#86547 is that a finished proposal somehow makes its
way into the proposal map, most likely by never being removed prior to
being finished.

This commit adds an assertion that we're never outright *inserting*
a finished proposals, and better documents the places in which we're
running a risk of violating the invariant.

It also clarifies the handling of proposals in an apply batch when a
replication change that removes the replica is encountered. I suspected
that this could lead to a case in which proposals would be finished
despite remaining in the proposals map. Upon inspection this turned out
to be incorrect - the map (at least by code inspection) is empty at that
point, so the invariant holds trivially.

Unfortunately, that leaves me without an explanation for cockroachdb#86547, but the
newly added invariants may prove helpful.

Touches cockroachdb#86547.
@msbutler
Copy link
Collaborator

@tbg, this unrelated test has been frequently failing due to this error. If you think the review process for #94825 will take some time, I may skip the test to avoid CI flakiness.

tbg added a commit to tbg/cockroach that referenced this issue Jan 24, 2023
I still don't understand how we can get a finished endCmds here, but
while I scratch my head we don't need to be collecting CI failures.

Touches cockroachdb#86547.

Epic: none
Release note: None
craig bot pushed a commit that referenced this issue Jan 24, 2023
95744: kvserver: don't NPE in poisonInflightLatches r=erikgrinaker a=tbg

I still don't understand how we can get a finished endCmds here, but
while I scratch my head we don't need to be collecting CI failures.

Touches #86547.
Closes #94209.

Epic: none
Release note: None


Co-authored-by: Tobias Grieger <tobias.b.grieger@gmail.com>
tbg added a commit to tbg/cockroach that referenced this issue Feb 6, 2023
The conjecture in cockroachdb#86547 is that a finished proposal somehow makes its
way into the proposal map, most likely by never being removed prior to
being finished.

This commit adds an assertion that we're never outright *inserting*
a finished proposals, and better documents the places in which we're
running a risk of violating the invariant.

It also clarifies the handling of proposals in an apply batch when a
replication change that removes the replica is encountered. I suspected
that this could lead to a case in which proposals would be finished
despite remaining in the proposals map. Upon inspection this turned out
to be incorrect - the map (at least by code inspection) is empty at that
point, so the invariant holds trivially.

Unfortunately, that leaves me without an explanation for cockroachdb#86547, but the
newly added invariants may prove helpful.

Touches cockroachdb#86547.
craig bot pushed a commit that referenced this issue Feb 9, 2023
94825: kvserver: prevent finished proposal from being present in proposals map r=nvanbenschoten a=tbg

The conjecture in #86547 is that a finished proposal somehow makes its
way into the proposal map, most likely by never being removed prior to
being finished.

This commit adds an assertion that we're never outright *inserting*
a finished proposals, and better documents the places in which we're
running a risk of violating the invariant.

It also clarifies the handling of proposals in an apply batch when a
replication change that removes the replica is encountered. I suspected
that this could lead to a case in which proposals would be finished
despite remaining in the proposals map. Upon inspection this turned out
to be incorrect - the map (at least by code inspection) is empty at that
point, so the invariant holds trivially.

Unfortunately, that leaves me without an explanation for #86547, but the
newly added invariants may prove helpful.

Touches #86547.

Epic: None
Release note: None

Co-authored-by: Tobias Grieger <tobias.b.grieger@gmail.com>
@tbg
Copy link
Member

tbg commented Mar 15, 2023

Unassigning since I'm no longer working on this and we've made the code path resilient to avoid the crash. It's likely that Epic CRDB-25287 will solve the underlying problem, which likely had to do with a finished proposal being re-inserted into the proposals map, which is something we know is possible at the moment. (This is not thought to cause double-application but can cause crashes in code that assumes it's not possible).

@tbg
Copy link
Member

tbg commented Jul 18, 2023

The reproposal path has been simplified quite a bit as a result of CRDB-25287 and the problem is more or less understood:

https://github.com/cockroachdb/cockroach/pull/106750/files#diff-58a18041db5281b4abe232561907fa9683d5c58c6704679e82dc8964aab4ecfaR1304-R1316

The work-around is still in place above and below:

func (r *Replica) poisonInflightLatches(err error) {
r.raftMu.Lock()
defer r.raftMu.Unlock()
r.mu.Lock()
defer r.mu.Unlock()
for _, p := range r.mu.proposals {
p.ec.poison()
// TODO(tbg): find out how `p.ec.done()` can have been called at this point,
// See: https://github.com/cockroachdb/cockroach/issues/86547
if p.ec.g != nil && p.ec.g.Req.PoisonPolicy == poison.Policy_Error {
aErr := kvpb.NewAmbiguousResultError(err)
// NB: this does not release the request's latches. It's important that
// the latches stay in place, since the command could still apply.
p.signalProposalResult(makeProposalResultErr(aErr))
}
}
}

but armed with this comment

https://github.com/cockroachdb/cockroach/pull/106750/files#diff-488a090afc4b6eaf56cd6d13b347bac67cb3313ce11c49df9ee8cd95fd73b3e8R57-R73

we should feel comfortable going through another round of assertions (best using must1). Either way, no user should currently be impacted by this. I'll keep it open, though.

Footnotes

  1. https://github.com/cockroachdb/cockroach/pull/106508

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. O-sentry Originated from an in-the-wild panic report. T-kv KV Team
Projects
No open projects
Status: Incoming
Development

No branches or pull requests

5 participants