storage: avoid RaftGroupDeletedError from RHS in splitTrigger #39571
Conversation
The right hand side of a split can be re-added before the split trigger fires, in which case the split trigger fails. See [bug description]. I [suggested] a test to reproduce this bug "properly", so we should look into that. In the meantime, it'll be good to see that this passes tests.

I verified manually that setting `minReplicaID` to some large number before the call to `rightRng.initRaftMuLockedReplicaMuLocked` reproduces the symptoms prior to this commit, but that doesn't come as a surprise, nor does it prove that the fix works flawlessly.

[bug description]: cockroachdb#21146 (comment)
[suggested]: cockroachdb#39034 (comment)

Fixes cockroachdb#21146.

Release note (bug fix): Fixed a rare panic (message: "raft group deleted") that could occur during splits.
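The race can be sketched in isolation. Below is a minimal, hypothetical Go model (none of these types or functions are the real CockroachDB ones) of why a RHS that was removed and re-added under a higher replica ID makes a stale split trigger fail, and how comparing the split's replica ID against the store's minimum replica ID for the range lets the trigger back off cleanly instead:

```go
package main

import "fmt"

// replica is a hypothetical stand-in for the real storage type; the
// actual code paths (Replica, getOrCreateReplica, tombstones) are far
// more involved.
type replica struct {
	rangeID      int
	minReplicaID int // lowest replica ID this store may still use for the range
}

// canInitFromSplit models the guard the fix needs: if the RHS was
// removed and re-added before the split trigger fired, the store's
// minimum replica ID has moved past the ID the split carries, and
// initializing the RHS with that stale ID must be refused rather than
// failing deep inside raft with "raft group deleted".
func canInitFromSplit(rhs *replica, splitReplicaID int) bool {
	return splitReplicaID >= rhs.minReplicaID
}

func main() {
	rhs := &replica{rangeID: 2, minReplicaID: 1}
	fmt.Println(canInitFromSplit(rhs, 3)) // true: normal case, split wins the race

	// Simulate the bug report: the RHS was re-added, bumping the minimum
	// replica ID past what the split trigger carries.
	rhs.minReplicaID = 100
	fmt.Println(canInitFromSplit(rhs, 3)) // false: stale trigger backs off
}
```

This mirrors the manual reproduction described above: forcing `minReplicaID` to a large number makes the stale-trigger branch fire.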
Reviewed 1 of 1 files at r1.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @danhhz)
bors r=bdarnell

Build failed

acceptance/gossip-peerings failed with #39604

bors r=bdarnell

Build failed

It failed again, except this time it got stuck during fetching a

bors r=bdarnell

@jeffrey-xiao I'm going to send a PR to revert b320ff5.
39571: storage: avoid RaftGroupDeletedError from RHS in splitTrigger r=bdarnell a=tbg

The right hand side of a split can be re-added before the split trigger fires, in which case the split trigger fails. See [bug description]. I [suggested] a test to reproduce this bug "properly", so we should look into that. In the meantime, it'll be good to see that this passes tests.

I verified manually that setting `minReplicaID` to some large number before the call to `rightRng.initRaftMuLockedReplicaMuLocked` reproduces the symptoms prior to this commit, but that doesn't come as a surprise, nor does it prove that the fix works flawlessly.

[bug description]: #21146 (comment)
[suggested]: #39034 (comment)

Fixes #21146.

Release note (bug fix): Fixed a rare panic (message: "raft group deleted") that could occur during splits.

Co-authored-by: Tobias Schottdorf <tobias.schottdorf@gmail.com>
Build succeeded
39658: storage: avoid crash in splitPostApply r=danhhz a=tbg

See #39571. The above PR was incomplete because we also need to wind back the replicaID. I was able to confirm that this *actually* works by running tests with this diff, which simulates an incoming raft message reflecting a high replicaID just before the split gets applied.

```diff
diff --git a/pkg/storage/replica_raft.go b/pkg/storage/replica_raft.go
index 6d5a044657..981f3e1714 100644
--- a/pkg/storage/replica_raft.go
+++ b/pkg/storage/replica_raft.go
@@ -1525,7 +1525,7 @@ func (r *Replica) maybeAcquireSplitMergeLock(
 func (r *Replica) acquireSplitLock(
 	ctx context.Context, split *roachpb.SplitTrigger,
 ) (func(), error) {
-	rightRng, _, err := r.store.getOrCreateReplica(ctx, split.RightDesc.RangeID, 0, nil)
+	rightRng, _, err := r.store.getOrCreateReplica(ctx, split.RightDesc.RangeID, 100, nil)
 	if err != nil {
 		return nil, err
 	}
```

@danhhz is working on a real test for this, so we'll soon have end-to-end coverage.

Fixes #39651.

Release note: None

Co-authored-by: Tobias Schottdorf <tobias.schottdorf@gmail.com>
39936: storage: add (default-off) atomic replication changes r=nvanbenschoten a=tbg

This PR contains a series of commits that first pave the way for and ultimately allow carrying out atomic replication changes via Raft joint consensus. Atomic replication changes are required to avoid entering unsafe configurations during lateral data movement. See #12768 for details; this is a problem we want to address in 19.2.

Before merging this we'll need to sort out an upstream change in Raft which has made a bug in our code related to learner snapshots much more likely; the offending upstream commit is patched out of the vendored etcd bump in this PR at the time of writing.

An antichronological listing of the individual commits follows. They should be reviewed individually, though it may be helpful to look at the overall diff for overall context. A modest amount of churn may exist between the commits, though a good deal of effort went into avoiding this.

storage: allow atomic replication changes in ChangeReplicas

They default to OFF. This needs a lot more tests, which will be added separately in the course of switching the default to ON and will focus on the interactions of joint states with everything else in the system. We'll also need another audit of consumers of the replica descriptors to make sure nothing was missed in the first pass.

Release note: None

storage: fix replicaGCQueue addition on removal trigger

Once we enter joint changes, the replica to be removed will show up in `crt.Removed()` when the joint state is entered, but it only becomes eligible for actual removal when we leave the joint state later. The new code triggers at the right time, namely when the replica is no longer in the descriptor.

Release note: None

storage: let execChangeReplicasTxn construct the descriptor

Prior to this commit, the method took both an old and a new desc *plus* slices of added and removed replicas.
This had grown organically, wasn't an easily understood interface, led to repetitive and tricky code at the callers, and most importantly isn't adequate any more in a world with atomic replication changes, where execChangeReplicasTxn, in constructing the ChangeReplicasTrigger, is essentially deciding whether a joint configuration needs to be entered (which in turn determines what the descriptor needs to look like in the first place).

To start solving this, let execChangeReplicasTxn create (and on success return) the new descriptor. Callers instead pass in what they want to be done, which is accomplished via an []internalReplicationChange slice.

Release note: None

roachpb: auto-assign ReplicaID during AddReplica

This is a cleanup leading up to a larger refactor of the contract around `execChangeReplicasTxn`.

Release note: None

storage: emit ConfChangeV2 from ChangeReplicasTrigger where appropriate

This prepares the trigger -> raft translation code to properly handle atomic replication changes. This carries out a lot of validation to give us confidence that any unusual transitions would be caught quickly.

This change also establishes more clearly which added and removed replicas are to be passed into the trigger when transitioning into a joint configuration. For example, when adding a voter, one technically replaces a Learner with a VoterIncoming, and so the question is which type the replica in the `added` slice should have. Picking the Learner would give the trigger the most power to validate the input, but it's annoying to have divergent descriptors floating around, so by convention we say that it is always the updated version of the descriptor (i.e. for fully removed replicas, just whatever it was before it disappeared). I spent more time on this than I'm willing to admit, in particular looking at removing the redundancy here, but it made things more awkward than was worth it.
Release note: None

storage: push replication change unrolling into ChangeReplicas

There are various callers to ChangeReplicas, so it makes more sense to unroll at that level. The code was updated to - in principle - do the right thing when atomic replication changes are requested, except that they are still unimplemented and a fatal error will serve as a reminder of that. Of course nothing issues them yet.

Release note: None

storage: skip ApplyConfChange on rejected entry

When in a joint configuration, passing an empty conf change to ApplyConfChange doesn't do the right thing any more: it tells Raft that we're leaving the joint config. It's not a good idea to try to tell Raft anything about a ConfChange that got rejected. Raft internally knows that we handled it because it knows the applied index. This also adds a case match for ConfChangeV2 which is necessary to route atomic replication changes (ConfChangeV2). See etcd-io/etcd#11046

Release note: None

storage: un-embed decodedConfChange

I ate a number of NPEs during development because nullable embedded fields are tricky; they hide the pointer derefs that often need a nil check. We'll embed the fields of decodedConfChange instead, which works out better. This commit also adds the unmarshaling code necessary for ConfChangeV2, needed once we issue atomic replication changes.

Release note: None

storage: add learners one by one

Doing more than one change at once is going to force us into an atomic replication change. This isn't crazy, but seems unnecessary at this point, so just add the learners one by one.

Release note: None

storage: add fatals where atomic conf changes are unsupported

These will be upgraded with proper handling when atomic replication changes are actually introduced, but for now it's convenient to stub out some code that will need to handle them and to make sure we won't forget to do so later.
Release note: None

storage: add atomic replication changes cluster setting

This defaults to false, and won't have an effect unless the newly introduced cluster version is also active.

Release note: None

roachpb: support zero-change ChangeReplicasTrigger

We will use a ChangeReplicasTrigger without additions and removals when transitioning out of a joint configuration, so make sure it supports this properly.

Release note: None

roachpb: return "desired" voters from ReplicaDescriptors.Voters

Previous commits introduced (yet unused) voter types to encode joint consensus configurations, which occur during atomic replication changes. Access to the slice of replicas is unfortunately common, though at least it's compartmentalized via the getters Voters() and Learners(). The main problem solved in this commit is figuring out what should be returned from Voters(): is it all VoterX types, or only voters in one of the two majority configs part of a joint quorum? The useful answer is returning the set of voters corresponding to what the config will be once the joint state is exited; this happens to be what most callers care about.

Incoming and full voters are really the same thing in our code; we just need to distinguish them from outgoing voters to correctly maintain the quorum sizes. Of course there are some callers that do care about quorum sizes, and a number of cleanups were made for them.

This commit also adds a ReplicaDescriptors.ConfState helper which is then used in all of the places that were previously cobbling together a ConfState manually.

Release note: None

roachpb: add ReplicaType_Voter{Incoming,Outgoing}

These are required for atomic replication changes to describe joint configurations, i.e. configurations consisting of two sets of replicas which both need to reach quorum to make replication decisions.

An audit of existing consumers of this enum will follow.
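The quorum arithmetic that makes the incoming/outgoing distinction matter can be sketched as follows. This is a simplified model of joint consensus, not the real cockroachdb or etcd implementation: while in a joint configuration, a decision commits only if a majority of the incoming config AND a majority of the outgoing config both agree.

```go
package main

import "fmt"

// majority reports whether the acked voter IDs form a majority of cfg.
func majority(cfg []int, acked map[int]bool) bool {
	n := 0
	for _, id := range cfg {
		if acked[id] {
			n++
		}
	}
	return n > len(cfg)/2
}

// jointQuorum models joint consensus: outside a joint state only the
// single config matters; inside it, BOTH majority configs must
// individually reach a majority. Simplified sketch, not the real code.
func jointQuorum(incoming, outgoing []int, acked map[int]bool) bool {
	if len(outgoing) == 0 { // not in a joint state: plain majority
		return majority(incoming, acked)
	}
	return majority(incoming, acked) && majority(outgoing, acked)
}

func main() {
	// Atomically replacing replica 3 with replica 4.
	incoming := []int{1, 2, 4}
	outgoing := []int{1, 2, 3}

	acked := map[int]bool{1: true, 4: true}
	// Majority of incoming (1,4 of {1,2,4}) but only 1 of outgoing {1,2,3}:
	fmt.Println(jointQuorum(incoming, outgoing, acked)) // false

	acked[2] = true // replica 2 is in both configs
	fmt.Println(jointQuorum(incoming, outgoing, acked)) // true
}
```

This is also why Voters() returning the *desired* (post-joint) set is the useful default: most callers only care about where the config is headed, while the few quorum-size-sensitive callers must see both halves.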
Release note: None

roachpb: rename ReplicaType variants

The current naming is idiomatic for proto enums, but atypical for its usage in Go code. There is no `(gogoproto.customname)` that can fix this, and we're about to add more replica types that would require awkward names such as `roachpb.ReplicaType_VOTER_OUTGOING`. Switch to a Go-friendly naming scheme instead.

Release note: None

batcheval: generalize checkNotLearnerReplica

This now errors out whenever the replica is not a voter, which is more robust as new replica types are introduced (which generally should not automatically become eligible to receive leases).

Release note: None

roachpb: improve RangeDescriptor.Validate

Make sure there isn't more than one replica per store.

Release note: None

roachpb: generalize ReplicaDescriptor.String()

The new code will generalize to new replica types.

Release note: None

[dnm] vendor: bump raft

This picks up upstream fixes related to atomic membership changes. I had to smuggle in a small hack because we're picking up etcd-io/etcd#11037, which makes a race between the snapshot queue and the proactive learner snapshot much more likely, and this in turn makes tests quite flaky because it turns out that if the learner snap loses, it can actually error out.

Release note: None

storage: avoid fatal error from splitPostApply

This is the next band-aid on top of #39658 and #39571. The descriptor lookup I added sometimes fails because replicas can process a split trigger in which they're not a member of the range:

> F190821 15:14:28.241623 312191 storage/store.go:2172
> [n2,s2,r21/3:/{Table/54-Max}] replica descriptor of local store not
> found in right hand side of split

I saw this randomly in `make test PKG=./pkg/ccl/partitionccl`.

Release note: None

40221: cli: Add default locality settings for multi node demo clusters r=jordanlewis a=rohany

Addresses part of #39938.

Release note (cli change): Default cluster locality topologies for multi-node cockroach demo clusters.
Co-authored-by: Tobias Schottdorf <tobias.schottdorf@gmail.com> Co-authored-by: Rohan Yadav <rohany@alumni.cmu.edu>
This is the next band-aid on top of cockroachdb#39658 and cockroachdb#39571. The descriptor lookup I added sometimes fails because replicas can process a split trigger in which they're not a member of the range:

> F190821 15:14:28.241623 312191 storage/store.go:2172
> [n2,s2,r21/3:/{Table/54-Max}] replica descriptor of local store not
> found in right hand side of split

I saw this randomly in `make test PKG=./pkg/ccl/partitionccl`.

There is one more bug here that is not addressed by this change, namely the fact that there might be a tombstone for the right hand side. This continues to be tracked in cockroachdb#40470.

See also cockroachdb#40367 (comment) for more work we need to do to establish sanity.

Release note: None
39796: storage: avoid (one) fatal error from splitPostApply r=ajwerner a=tbg

This is the next band-aid on top of #39658 and #39571. The descriptor lookup I added sometimes fails because replicas can process a split trigger in which they're not a member of the range:

> F190821 15:14:28.241623 312191 storage/store.go:2172
> [n2,s2,r21/3:/{Table/54-Max}] replica descriptor of local store not
> found in right hand side of split

I saw this randomly in `make test PKG=./pkg/ccl/partitionccl`.

@danhhz could you take a stab at replicating this in the test you added recently? It's probably too ambitious, but it'd be nice to verify this is actually fixed (I didn't). I haven't seen this fail on master (though I also didn't look very hard), so maybe this is rare (i.e. we have time), but then I just got it doing nothing related in particular.

Release note: None

Co-authored-by: Tobias Schottdorf <tobias.schottdorf@gmail.com>