-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
storage: add (default-off) atomic replication changes #39936
Conversation
@nvanbenschoten sorry to drop that kind of review bomb on you, but hope the individual commits make it more manageable. If you spot anything that you'd like to have pulled out and reviewed separately, I'm happy to do so. At the same time, as you review, please point out commits that you're 100% ok with -- my plan for when this review hits substantial comments is to pull out and land everything that's uncontroversial and repeat that until nothing's left. |
Wow--looks like a lot of hard work went into this. Excited to test it out and see it in action. |
6b3f036
to
66fd3af
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've made it up through d658d21 (r10) so far. I'm ignoring storage: avoid fatal error from splitPostApply
(#39796) since it looks like it's being discussed elsewhere.
Commits that LGTM:
- 7a07ce0:
roachpb: generalize ReplicaDescriptor.String()
(mod nits) - eb7d146:
roachpb: improve RangeDescriptor.Validate
(mod nits) - 31fbc37:
batcheval: generalize checkNotLearnerReplica
- 334cf0d:
roachpb: rename ReplicaType variants
(once we're ok on naming) - 2af6f9e:
roachpb: add ReplicaType_Voter{Incoming,Outgoing}
I don't see any real blockers in the other commits, but they need one more round of reviews.
I'm continuing to review but wanted to let you pipeline some of the work.
Reviewed 1 of 1 files at r1, 3 of 3 files at r2, 1 of 1 files at r3, 1 of 1 files at r4, 3 of 3 files at r5, 18 of 18 files at r6, 16 of 16 files at r7, 7 of 7 files at r8, 4 of 4 files at r9, 2 of 2 files at r10.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten and @tbg)
Gopkg.toml, line 42 at r2 (raw file):
name = "go.etcd.io/etcd" # branch = "master" branch = "crl-bump"
Is the need for this fixed by #39796? Are they related?
pkg/roachpb/data.go, line 1347 at r9 (raw file):
// configuration. func (crt ChangeReplicasTrigger) LeaveJoint() bool { return len(crt.Added())+len(crt.Removed()) == 0
Is it worth making this implicit instead of adding a new field to ChangeReplicasTrigger
?
pkg/roachpb/data.go, line 1369 at r9 (raw file):
// TODO(tbg): could list the replicas that will actually leave the // voter set. fmt.Fprintf(&chgS, "LEAVE_JOINT")
Are you missing a space at the end of this?
EDIT: nope. // NB: not missing a space.
might actually be useful.
pkg/roachpb/data.proto, line 163 at r9 (raw file):
// methods return the replicas being added and removed, respectively. If more than // one change is specified (i.e. len(Added())+len(Removed()) exceeds one), this // initiates an atomic replication change in which the "removed" replicas are
Same question about whether it's worth introducing the term "atomic replication change" in code when "joint replication change" is already all over.
Maybe you're saying that they aren't synonymous and that "atomic replication change" is a collection of individual replication changes that are made atomic through joint consensus?
pkg/roachpb/metadata.go, line 282 at r3 (raw file):
} if typ := r.GetType(); typ != ReplicaType_VOTER { buf.WriteString(strings.ToUpper(typ.String()))
Do we need the strings.ToUpper
? It looks like this should already be correctly cased.
EDIT: never mind, addressed later.
pkg/roachpb/metadata.go, line 232 at r4 (raw file):
} seen[rep.ReplicaID] = struct{}{} if rep.ReplicaID >= r.NextReplicaID {
tiny nit: move this check up so the ReplicaID
re-use check is next to the StoreID
re-use check.
pkg/roachpb/metadata.proto, line 39 at r6 (raw file):
// ReplicaType identifies which raft activities a replica participates in. enum ReplicaType {
Did you consider doing something with option (gogoproto.goproto_enum_prefix) = false;
instead? This does seem to be going against the convention in our code.
pkg/roachpb/metadata.proto, line 53 at r7 (raw file):
// to replicas 1 and 2, while 3 is VoterOutgoing and 4 is VoterIncoming, then // the two sets over which quorums need to be achieved are {1,2,3} and {1,2,4}. // Thus, {1,2} is a quorum of both, {1,3} is a quorum of the first but not the
nit: spell out the last part for clarity "but not the second, {1,4} is a quorum of the second but not the first, and"
pkg/roachpb/metadata.proto, line 76 at r7 (raw file):
// CockroachDB are a short-term transient state: a replica being added and on // its way to being a VOTER. Learner = 1;
We don't need a LearnerIncoming
state to mirror Config.LearnersNext
because there's no need to distinguish the two in Cockroach code? That's true even though raft.ConfState
splits them out?
Never mind - answered later on.
pkg/roachpb/metadata_replicas.go, line 271 at r8 (raw file):
} // InAtomicReplicationChange returns true if the descriptor is in the middle of
Do you think it's worth introducing "atomic replication change" as a term synonymous with "joint replication change"? Is one more descriptive than the other? Is it worth the confusion?
pkg/roachpb/metadata_replicas.go, line 321 at r8 (raw file):
// replication layer. This is more complicated than just counting the number // of replicas due to the existence of joint quorums. func (d ReplicaDescriptors) CanMakeProgress(liveFunc func(descriptor ReplicaDescriptor) bool) bool {
I think this deserves a test.
pkg/roachpb/metadata_replicas.go, line 325 at r8 (raw file):
var c int if n := len(d.wrapped); len(voters) == n { // Fast path when there are only full voters, i.e. the common case.
It took me a little to figure out why this fast-path wasn't the same as a check for !d.InAtomicReplicationChange()
. Mind spelling out the two cases it can be hit in.
- non-joint quorum
- joint quorum with only replica addition
pkg/roachpb/metadata_replicas.go, line 345 at r8 (raw file):
votes := make(map[uint64]bool, len(d.wrapped)) for _, rDesc := range d.wrapped { votes[uint64(rDesc.ReplicaID)] = true
Should this be votes[uint64(rDesc.ReplicaID)] = liveFunc(rDesc)
?
pkg/roachpb/metadata_replicas_test.go, line 160 at r8 (raw file):
// the config exactly as described by the descriptor so we don't try. { []ReplicaDescriptor{rd(l, 1), rd(v0, 2), rd(vi, 3)},
Given the description, should this be rd(v, 2)?
pkg/roachpb/metadata_replicas_test.go, line 166 at r8 (raw file):
// set of voters n2, n4 (plus learner n1 before and after). { []ReplicaDescriptor{rd(l, 1), rd(v0, 2), rd(vi, 3), rd(vo, 4)},
Given the description, should this also be rd(v, 2)
? I must be missing something.
pkg/settings/cluster/cockroach_versions.go, line 525 at r10 (raw file):
}, { // VersionAtomicChangeReplicas is TODO(tbg)
Reminder to address this TODO.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Commits that LGTM:
- efda358:
storage: add learners one by one
- c9c39d3:
storage: un-embed decodedConfChange
- 1fbdcc1:
storage: skip ApplyConfChange on rejected entry
- 5ebb3e1:
storage: push replication change unrolling into ChangeReplicas
(mod note about unused args) - 27bd59d:
storage: fix replicaGCQueue addition on removal trigger
Thanks for making this digestible.
Also, I think you're missing a change to handle raftpb.EntryConfChangeV2
in pkg/storage/debug_print.go
.
Reviewed 3 of 3 files at r11, 1 of 1 files at r12, 2 of 2 files at r13, 1 of 1 files at r14, 2 of 2 files at r15, 5 of 5 files at r16, 3 of 3 files at r17, 1 of 1 files at r18, 1 of 1 files at r19, 5 of 5 files at r20.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @tbg)
pkg/roachpb/data.go, line 1440 at r16 (raw file):
case ReplicaType_VoterFull, ReplicaType_Learner: // A learner or full voter can't be in the desc after. for _, rd := range replicas {
nit: consider pushing this into a checkNotExists
helper for symmetry.
pkg/roachpb/data.go, line 1450 at r16 (raw file):
sl = append(sl, raftpb.ConfChangeSingle{ NodeID: uint64(rDesc.ReplicaID), Type: raftpb.ConfChangeRemoveNode,
nit: keep these in the same order as above.
pkg/roachpb/data.go, line 1466 at r16 (raw file):
} } wantLeaveJoint := len(added)+len(removed) == 0
Didn't we already define these somewhere (InAtomicReplicationChange
, LeaveJoint
)? Is this what you meant by having trouble avoiding all duplication?
pkg/roachpb/data.go, line 1532 at r16 (raw file):
fmt.Fprintf(&chgS, "<malformed ChangeReplicasTrigger: %s>", err) } else { if cc.AsV2().LeaveJoint() {
Do we need ChangeReplicasTrigger.EnterJoint
and ChangeReplicasTrigger.LeaveJoint
anymore?
pkg/roachpb/data_test.go, line 1730 at r16 (raw file):
}{ // A replica of type VoterOutgoing being added makes no sense. {crt: mk(false, vo1, nil, vo1), err: "can't add replica in state VoterOutgoing"},
nit: these would be I little easier to read if you defined:
v1 := false
v2 := true
Or define a config struct instead of the mk
method. I found myself jumping back and forth when reading these cases. This would be clearer as:
{
v2: false,
add: vo1,
del: nil,
repls: vo1,
err: ...
}
pkg/roachpb/data_test.go, line 1732 at r16 (raw file):
{crt: mk(false, vo1, nil, vo1), err: "can't add replica in state VoterOutgoing"}, // But an incoming one can be added, and the result must be a joint change. {crt: mk(false, vi1, nil, sl(ReplicaType_VoterIncoming, 1)), exp: raftpb.ConfChangeV2{
s/sl(ReplicaType_VoterIncoming, 1)/vi1/
?
And below?
pkg/roachpb/metadata.go, line 145 at r17 (raw file):
) (ReplicaDescriptor, bool) { for i := range r.InternalReplicas { if r.InternalReplicas[i].StoreID == storeID && r.InternalReplicas[i].NodeID == nodeID {
nit:
desc := &r.InternalReplicas[i]
if desc.NodeID == nodeID && desc.StoreID == storeID {
desc.Type = &typ
return *desc, true
}
pkg/storage/allocator.go, line 345 at r11 (raw file):
return AllocatorRemoveLearner, removeLearnerReplicaPriority } if desc.Replicas().InAtomicReplicationChange() {
Does this need to go above the if learners := desc.Replicas().Learners(); len(learners) > 0 {
case, given their relative priorities?
pkg/storage/client_atomic_membership_change_test.go, line 91 at r20 (raw file):
return errors.Errorf("diff(want, have):\n%s", strings.Join(diff, "\n")) } // Check that conf state is up to date. This can fail even though
Strange tab.
pkg/storage/replica.go, line 1332 at r15 (raw file):
case *roachpb.AdminChangeReplicasRequest: chgs := tArgs.Changes() desc, err := r.ChangeReplicas(ctx, &tArgs.ExpDesc, SnapshotRequest_REBALANCE, storagepb.ReasonAdminRequest, "", chgs)
Should we return AdminChangeReplicasResponse, roachpb.Error
from this function now, like we do with all of the other admin request handlers?
pkg/storage/replica_command.go, line 924 at r15 (raw file):
} // Atomic replication change. return r.changeReplicasImpl(ctx, desc, SnapshotRequest_REBALANCE, storagepb.ReasonAdminRequest, "", chgs)
We're just ignoring the provided reason, priority, and details?
pkg/storage/replica_command.go, line 954 at r18 (raw file):
return nil, errors.Errorf("need exactly one change, got %+v", chgs) } if chgs[0].ChangeType == roachpb.ADD_REPLICA {
I know this isn't being touched here, but it would be more immediately obvious what was going on if you did:
switch chgs[0].ChangeType {
case roachpb.ADD_REPLICA:
...
case roachpb.REMOVE_REPLICA:
...
default:
...
}
pkg/storage/replica_command.go, line 1252 at r18 (raw file):
const ( internalChangeTypeAddVoterViaPreemptiveSnap = iota internalChangeTypeAddLearner = iota
Remove the repeated iotas
.
pkg/storage/replica_command.go, line 1257 at r18 (raw file):
) type internalReplicationChange struct {
Leave a small comment about why this is different than all the other enums we have floating around.
pkg/storage/replica_command.go, line 1268 at r18 (raw file):
reason storagepb.RangeLogEventReason, details string, chgs []internalReplicationChange,
This is a nice change.
pkg/storage/replica_command.go, line 582 at r20 (raw file):
return errors.Errorf("ranges are not adjacent; %s != %s", origLeftDesc.EndKey, rightDesc.StartKey) } // For simplicity, don't handle learner replicas or joing states, expect
joint
pkg/storage/replica_command.go, line 987 at r20 (raw file):
} func (r *Replica) maybeLeaveAtomicChangeReplicas(
Give this a comment.
pkg/storage/replica_command.go, line 1004 at r20 (raw file):
} return execChangeReplicasTxn( ctx, r.store, desc, storagepb.ReasonUnknown /* unused */, "", nil, /* iChgs */
No reason, or detail?
pkg/storage/replica_command.go, line 1309 at r20 (raw file):
for _, rDesc := range desc.Replicas().All() { switch rDesc.GetType() { case roachpb.ReplicaType_VoterIncoming, roachpb.ReplicaType_VoterOutgoing:
Don't we have a method for this?
pkg/storage/replica_command.go, line 1318 at r20 (raw file):
var added, removed []roachpb.ReplicaDescriptor useJoint := len(chgs) > 1
nit: pull this entire loop into the if len(chgs) > 0 {
condition and then make the if len(chgs) == 0 {
and else
case so that the structure of this function is more clear (i.e. it either does it change if any are provided or it transitions out of the joint state if no changes are provided).
pkg/storage/replica_proposal_buf.go, line 447 at r16 (raw file):
} cc, err := crt.ConfChange(encodedCtx)
Nice!
pkg/storage/replica_proposal_buf.go, line 453 at r16 (raw file):
} if err := raftGroup.ProposeConfChange(
Mind simplifying this to:
if err := raftGroup.ProposeConfChange(cc); err == raft.ErrProposalDropped {
// Silently ignore dropped proposals (they were always silently
// ignored prior to the introduction of ErrProposalDropped).
// TODO(bdarnell): Handle ErrProposalDropped better.
// https://github.com/cockroachdb/cockroach/issues/21849
} else if err != nil {
firstErr = err
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, I think you're missing a change to handle raftpb.EntryConfChangeV2 in pkg/storage/debug_print.go.
Done.
Thanks for the quick review. I addressed all of the "small" comments. It's not worth coming back to this review yet (if you had gotten all the way through), I'll ping when it's ready.
I'll avoid rebasing this or squashing anything to make this less messy. Catching up should be, hopefully, a breeze.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten and @tbg)
Gopkg.toml, line 42 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Is the need for this fixed by #39796? Are they related?
No, they're independent. I need to open an issue about this one. Basically the upstream commit that lets the leader reach out to newly added followers right away instead of waiting a heartbeat makes a race between raft snapshots and learner snapshots much more likely, and tests get flaky as a result. Unfortunately we don't have a good minimal fix in our codebase yet, so I'm just going to avoid pulling in the upstream commit for now (will bump etcd separately before this PR merges)
edit: we settled (out of band) on basically the above, but making sure that we only use "atomic replication change" and "joint configuration" (i.e. no "joint replication change")
Haven't made edits yet.
pkg/roachpb/data.go, line 1347 at r9 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Is it worth making this implicit instead of adding a new field to
ChangeReplicasTrigger
?
Yeah, this got lost in the churn. They're unused, removed.
pkg/roachpb/data.go, line 1466 at r16 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Didn't we already define these somewhere (
InAtomicReplicationChange
,LeaveJoint
)? Is this what you meant by having trouble avoiding all duplication?
There's a little bit of duplication because the code that constructs the descriptor also needs to know whether it wants to go through a joint state. The code here needs to reverse engineer a bit of that, but it's also trying to make sure the caller didn't mess anything up. Note that the trigger is essentially validated before we put it into raft by having it construct the ConfState once.
pkg/roachpb/data.proto, line 163 at r9 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Same question about whether it's worth introducing the term "atomic replication change" in code when "joint replication change" is already all over.
Maybe you're saying that they aren't synonymous and that "atomic replication change" is a collection of individual replication changes that are made atomic through joint consensus?
Yeah, I don't think they're synonymous though I may not have been consistent and should go back and fix things up (will do once we agree)
In my view the "atomic replication change" is what takes you from your "before" descriptor to the desired "final" descriptor. Internally that change is not "atomic"; it takes you via the joint configuration, which I thus have to name, and if I have to name it, it's probably best to stick to the name that it has in Raft (plus the name is pretty good at describing what's going on). LMK what you think.
pkg/roachpb/data_test.go, line 1730 at r16 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
nit: these would be I little easier to read if you defined:
v1 := false v2 := true
Or define a config struct instead of the
mk
method. I found myself jumping back and forth when reading these cases. This would be clearer as:{ v2: false, add: vo1, del: nil, repls: vo1, err: ... }
Used the struct, much better.
pkg/roachpb/metadata.proto, line 39 at r6 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Did you consider doing something with
option (gogoproto.goproto_enum_prefix) = false;
instead? This does seem to be going against the convention in our code.
Oh, I missed that option. Sounds reasonable, I will defer to the end though because that's not going to be a change that will want to freely rebase where it needs to go.
pkg/storage/allocator.go, line 345 at r11 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Does this need to go above the
if learners := desc.Replicas().Learners(); len(learners) > 0 {
case, given their relative priorities?
Fixed and reminder to add test added to #12768 (comment)
pkg/storage/replica.go, line 1332 at r15 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Should we return
AdminChangeReplicasResponse, roachpb.Error
from this function now, like we do with all of the other admin request handlers?
I wouldn't change this just for the sake of consistency, there isn't a single pErr->err conversion in ChangeReplicas and all of the usages except this one are pretty happy that they're getting a vanilla error
. If anything I'd try to go the other way but there are good reasons for the others to return *roachpb.Error
. All in all a yak for another day.
pkg/storage/replica_command.go, line 924 at r15 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
We're just ignoring the provided reason, priority, and details?
Oh, not intentional. Fixed.
pkg/storage/replica_command.go, line 1252 at r18 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Remove the repeated
iotas
.
Heh, funny this even worked.
pkg/storage/replica_proposal_buf.go, line 453 at r16 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Mind simplifying this to:
if err := raftGroup.ProposeConfChange(cc); err == raft.ErrProposalDropped { // Silently ignore dropped proposals (they were always silently // ignored prior to the introduction of ErrProposalDropped). // TODO(bdarnell): Handle ErrProposalDropped better. // https://github.com/cockroachdb/cockroach/issues/21849 } else if err != nil { firstErr = err }
I wrote this but unclear to me whether this is better since it diverges from the firstErr; continue
pattern above. I think you have a point but I'll leave as is, I bet the linters will complain about the empty branch anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now it's ready for a look. I think I addressed all of the comments. Still haven't rebased or squashed to keep things simple.
Dismissed @nvanbenschoten from a discussion.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten)
pkg/roachpb/metadata_replicas.go, line 325 at r8 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
It took me a little to figure out why this fast-path wasn't the same as a check for
!d.InAtomicReplicationChange()
. Mind spelling out the two cases it can be hit in.
- non-joint quorum
- joint quorum with only replica addition
Done.
pkg/roachpb/metadata_replicas.go, line 345 at r8 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Should this be
votes[uint64(rDesc.ReplicaID)] = liveFunc(rDesc)
?
🤦♂️ this is now caught in the newly added tests.
pkg/roachpb/metadata_replicas_test.go, line 160 at r8 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Given the description, should this be rd(v, 2)?
v0 is just nil, which is treated like v. There's a test case just above your comment for that but I'm just going to use v here and also rename v0 to vn, maybe you're mistaking it for vo and I can't blame you.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 27 of 27 files at r21.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @tbg)
pkg/storage/debug_print.go, line 168 at r21 (raw file):
return "", err } if ent.Type == raftpb.EntryNormal {
nit: might as well use a switch-case with a default arm that throws an error to catch this kind of stuff in the future.
pkg/storage/replica_command.go, line 1003 at r21 (raw file):
for _, rDesc := range desc.Replicas().All() { switch rDesc.GetType() { case roachpb.VOTER_OUTGOING, roachpb.VOTER_INCOMING:
Shouldn't desc.InAtomicReplicationChange
be able to tell you this?
pkg/storage/replica_proposal_buf.go, line 453 at r16 (raw file):
Previously, tbg (Tobias Grieger) wrote…
I wrote this but unclear to me whether this is better since it diverges from the
firstErr; continue
pattern above. I think you have a point but I'll leave as is, I bet the linters will complain about the empty branch anyway.
We have the same style with the empty branch below. It's also fine to use the firstErr = err; continue
pattern here, it's just not needed.
f59a8e4
to
3bb8d6b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @nvanbenschoten)
pkg/storage/debug_print.go, line 168 at r21 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
nit: might as well use a switch-case with a default arm that throws an error to catch this kind of stuff in the future.
It does that today-- the cleanup project here is to actually write proper tests for this. It does bother me a bit to walk away from this but I'm still going to do it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 15 of 15 files at r22.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @nvanbenschoten)
b580635
to
db2e192
Compare
bors r=nvanbenschoten |
This picks up improvements related to joint consensus. We're pinning a SHA that is close to master but not quite there to avoid tickling to a bug in CRDB: cockroachdb#40207 Release note: None
The new code will generalize to new replica types. Release note: None
Make sure there isn't more than one replica per store. Release note: None
This now errors out whenever the replica is not a voter, which is more robust as new replica types are introduced (which generally should not automatically become eligible to receive leases). Release note: None
The current naming is idiomatic for proto enums, but atypical for its usage in Go code. There is no `(gogoproto.customname)` that can fix this, and we're about to add more replica types that would require awkward names such as `roachpb.ReplicaType_VOTER_OUTGOING`. Switch to a Go-friendly naming scheme instead. Release note: None
These are required for atomic replication changes to describe joint configurations, i.e. configurations consisting of two sets of replica which both need to reach quorum to make replication decisions. An audit of existing consumers of this enum will follow. Release note: None
Previous commits introduced (yet unused) voter types to encode joint consensus configurations which occur during atomic replication changes. Access to the slice of replicas is unfortunately common, though at least it's compartmentalized via the getters Voters() and Learners(). The main problem solved in this commit is figuring out what should be returned from Voters(): is it all VoterX types, or only voters in one of the two majority configs part of a joint quorum? The useful answer is returning the set of voters corresponding to what the config will be once the joint state is exited; this happens to be what most callers care about. Incoming and full voters are really the same thing in our code; we just need to distinguish them from outgoing voters to correctly maintain the quorum sizes. Of course there are some callers that do care about quorum sizes, and a number of cleanups were made for them. This commit also adds a ReplicaDescriptors.ConfState helper which is then used in all of the places that were previously cobbling together a ConfState manually. Release note: None
We will use a ChangeReplicasTrigger without additions and removals when transitioning out of a joint configuration, so make sure it supports this properly. Release note: None
This defaults to false, and won't have an effect unless the newly introduced cluster version is also active. Release note: None
These will be upgraded with proper handling when atomic replication changes are actually introduced, but for now it's convenient to stub out some code that will need to handle them and to make sure we won't forget to do so later. Release note: None
Doing more than one change at once is going to force us into an atomic replication change. This isn't crazy, but seems unnecessary at this point, so just add the learners one by one. Release note: None
I ate a number of NPEs during development because nullable embedded fields are tricky; they hide the pointer derefs that often need a nil check. We'll embed the fields of decodedConfChange instead which works out better. This commit also adds the unmarshaling code necessary for ConfChangeV2 needed once we issue atomic replication changes. Release note: None
When in a joint configuration, passing an empty conf change to ApplyConfChange doesn't do the right thing any more: it tells Raft that we're leaving the joint config. It's not a good idea to try to tell Raft anything about a ConfChange that got rejected. Raft internally knows that we handled it because it knows the applied index. This also adds a case match for ConfChangeV2 which is necessary to route atomic replication changes (ConfChangeV2). See etcd-io/etcd#11046 Release note: None
There are various callers to ChangeReplicas, so it makes more sense to unroll at that level. The code was updated to - in principle - do the right thing when atomic replication changes are requested, except that they are still unimplemented and a fatal error will serve as a reminder of that. Of course nothing issues them yet. Release note: None
This prepares the trigger -> raft translation code to properly handle atomic replication changes. This carries out a lot of validation to give us confidence that any unusual transitions would be caught quickly. This change also establishes more clearly which added and removed replicas are to be passed into the trigger when transitioning into a joint configuration. For example, when adding a voter, one technically replaces a Learner with a VoterIncoming and so the question is which type the replica in the `added` slice should have. Picking the Learner would give the trigger the most power to validate the input, but it's annoying to have divergent descriptors floating around, so by convention we say that it is always the updated version of the descriptor (i.e. for fully removed replicas, just whatever it was before it disappeared). I spent more time on this than I'm willing to admit, in particular looking removing the redundancy here, but it made things more awkward than was worth it. Release note: None
This is a cleanup leading up to a larger refactor of the contract around `execChangeReplicasTxn`. Release note: None
Prior to this commit, the method took both an old and a new desc *plus* slices of added and removed replicas. This had grown organically, wasn't an easily understood interface, led to repetitive and tricky code at the callers, and most importantly isn't adequate any more in a world with atomic replication changes, where execChangeReplicasTxn in constructing the ChangeReplicasTrigger is essentially deciding whether a joint configuration needs to be entered (which in turn determines what the descriptor needs to look like in the first place). To start solving this, let execChangeReplicasTxn create (and on success return) the new descriptor. Callers instead pass in what they want to be done, which is accomplished via an []internalReplicationChange slice. Release note: None
Once we enter joint changes, the replica to be removed will show up in `crt.Removed()` when the joint state is entered, but it only becomes eligible for actual removal when we leave the joint state later. The new code triggers at the right time, namely when the replica is no longer in the descriptor. Release note: None
They default to OFF. This needs a lot more tests which will be added separately in the course of switching the default to ON and will focus on the interactions of joint states with everything else in the system. We'll also need another audit of consumers of the replica descriptors to make sure nothing was missed in the first pass. Release note: None
Release note: None
Release note: None
Release note: None
db2e192
to
371e289
Compare
Canceled |
bors r=nvanbenschoten Had messed up Gopkg.lock |
39936: storage: add (default-off) atomic replication changes r=nvanbenschoten a=tbg This PR contains a series of commits that first pave for the way and ultimately allow carrying out atomic replication changes via Raft joint consensus. Atomic replication changes are required to avoid entering unsafe configurations during lateral data movement. See #12768 for details; this is a problem we want to address in 19.2. Before merging this we'll need to sort out an upstream change in Raft which has made a bug in our code related to learner snapshots much more likely; the offending upstream commit is patched out of the vendored etcd bump in this PR at the time of writing. An antichronological listing of the individual commits follows. They should be reviewed individually, though it may be helpful to look at the overall diff for overall context. A modest amount of churn may exist between the commits, though a good deal of effort went into avoiding this. storage: allow atomic replication changes in ChangeReplicas They default to OFF. This needs a lot more tests which will be added separately in the course of switching the default to ON and will focus on the interactions of joint states with everything else in the system. We'll also need another audit of consumers of the replica descriptors to make sure nothing was missed in the first pass. Release note: None storage: fix replicaGCQueue addition on removal trigger Once we enter joint changes, the replica to be removed will show up in `crt.Removed()` when the joint state is entered, but it only becomes eligible for actual removal when we leave the joint state later. The new code triggers at the right time, namely when the replica is no longer in the descriptor. Release note: None storage: let execChangeReplicasTxn construct the descriptor Prior to this commit, the method took both an old and a new desc *plus* slices of added and removed replicas. This had grown organically, wasn't an easily understood interface, led to repetitive and tricky code at the callers, and most importantly isn't adequate any more in a world with atomic replication changes, where execChangeReplicasTxn in constructing the ChangeReplicasTrigger is essentially deciding whether a joint configuration needs to be entered (which in turn determines what the descriptor needs to look like in the first place). To start solving this, let execChangeReplicasTxn create (and on success return) the new descriptor. Callers instead pass in what they want to be done, which is accomplished via an []internalReplicationChange slice. Release note: None roachpb: auto-assign ReplicaID during AddReplica This is a cleanup leading up to a larger refactor of the contract around `execChangeReplicasTxn`. Release note: None storage: emit ConfChangeV2 from ChangeReplicasTrigger where appropriate This prepares the trigger -> raft translation code to properly handle atomic replication changes. This carries out a lot of validation to give us confidence that any unusual transitions would be caught quickly. This change also establishes more clearly which added and removed replicas are to be passed into the trigger when transitioning into a joint configuration. For example, when adding a voter, one technically replaces a Learner with a VoterIncoming and so the question is which type the replica in the `added` slice should have. Picking the Learner would give the trigger the most power to validate the input, but it's annoying to have divergent descriptors floating around, so by convention we say that it is always the updated version of the descriptor (i.e. for fully removed replicas, just whatever it was before it disappeared). I spent more time on this than I'm willing to admit, in particular looking removing the redundancy here, but it made things more awkward than was worth it. Release note: None storage: push replication change unrolling into ChangeReplicas There are various callers to ChangeReplicas, so it makes more sense to unroll at that level. The code was updated to - in principle - do the right thing when atomic replication changes are requested, except that they are still unimplemented and a fatal error will serve as a reminder of that. Of course nothing issues them yet. Release note: None storage: skip ApplyConfChange on rejected entry When in a joint configuration, passing an empty conf change to ApplyConfChange doesn't do the right thing any more: it tells Raft that we're leaving the joint config. It's not a good idea to try to tell Raft anything about a ConfChange that got rejected. Raft internally knows that we handled it because it knows the applied index. This also adds a case match for ConfChangeV2 which is necessary to route atomic replication changes (ConfChangeV2). See etcd-io/etcd#11046 Release note: None storage: un-embed decodedConfChange I ate a number of NPEs during development because nullable embedded fields are tricky; they hide the pointer derefs that often need a nil check. We'll embed the fields of decodedConfChange instead which works out better. This commit also adds the unmarshaling code necessary for ConfChangeV2 needed once we issue atomic replication changes. Release note: None storage: add learners one by one Doing more than one change at once is going to force us into an atomic replication change. This isn't crazy, but seems unnecessary at this point, so just add the learners one by one. Release note: None storage: add fatals where atomic conf changes are unsupported These will be upgraded with proper handling when atomic replication changes are actually introduced, but for now it's convenient to stub out some code that will need to handle them and to make sure we won't forget to do so later. Release note: None storage: add atomic replication changes cluster setting This defaults to false, and won't have an effect unless the newly introduced cluster version is also active. Release note: None roachpb: support zero-change ChangeReplicasTrigger We will use a ChangeReplicasTrigger without additions and removals when transitioning out of a joint configuration, so make sure it supports this properly. Release note: None roachpb: return "desired" voters from ReplicaDescriptors.Voters Previous commits introduced (yet unused) voter types to encode joint consensus configurations which occur during atomic replication changes. Access to the slice of replicas is unfortunately common, though at least it's compartmentalized via the getters Voters() and Learners(). The main problem solved in this commit is figuring out what should be returned from Voters(): is it all VoterX types, or only voters in one of the two majority configs part of a joint quorum? The useful answer is returning the set of voters corresponding to what the config will be once the joint state is exited; this happens to be what most callers care about. Incoming and full voters are really the same thing in our code; we just need to distinguish them from outgoing voters to correctly maintain the quorum sizes. Of course there are some callers that do care about quorum sizes, and a number of cleanups were made for them. This commit also adds a ReplicaDescriptors.ConfState helper which is then used in all of the places that were previously cobbling together a ConfState manually. Release note: None roachpb: add ReplicaType_Voter{Incoming,Outgoing} These are required for atomic replication changes to describe joint configurations, i.e. configurations consisting of two sets of replica which both need to reach quorum to make replication decisions. An audit of existing consumers of this enum will follow. Release note: None roachpb: rename ReplicaType variants The current naming is idiomatic for proto enums, but atypical for its usage in Go code. There is no `(gogoproto.customname)` that can fix this, and we're about to add more replica types that would require awkward names such as `roachpb.ReplicaType_VOTER_OUTGOING`. Switch to a Go-friendly naming scheme instead. Release note: None batcheval: generalize checkNotLearnerReplica This now errors out whenever the replica is not a voter, which is more robust as new replica types are introduced (which generally should not automatically become eligible to receive leases). Release note: None roachpb: improve RangeDescriptor.Validate Make sure there isn't more than one replica per store. Release note: None roachpb: generalize ReplicaDescriptor.String() The new code will generalize to new replica types. Release note: None [dnm] vendor: bump raft This picks up upstream fixes related to atomic membership changes. I had to smuggle in a small hack because we're picking up etcd-io/etcd#11037 which makes a race between the snapshot queue and the proactive learner snapshot much more likely, and this in turn makes tests quite flaky because it turns out that if the learner snap loses, it can actually error out. Release note: None storage: avoid fatal error from splitPostApply This is the next band-aid on top of #39658 and #39571. The descriptor lookup I added sometimes fails because replicas can process a split trigger in which they're not a member of the range: > F190821 15:14:28.241623 312191 storage/store.go:2172 > [n2,s2,r21/3:/{Table/54-Max}] replica descriptor of local store not > found in right hand side of split I saw this randomly in `make test PKG=./pkg/ccl/partitionccl`. Release note: None 40221: cli: Add default locality settings for multi node demo clusters r=jordanlewis a=rohany Addresses part of #39938. Release note (cli change): Default cluster locality topologies for multi-node cockroach demo clusters. Co-authored-by: Tobias Schottdorf <tobias.schottdorf@gmail.com> Co-authored-by: Rohan Yadav <rohany@alumni.cmu.edu>
Build succeeded |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @nvanbenschoten and @tbg)
pkg/storage/replica_metrics.go, line 172 at r8 (raw file):
if rangeCounter { unavailable = !desc.Replicas().CanMakeProgress(func(rDesc roachpb.ReplicaDescriptor) bool { _, live := livenessMap[rDesc.NodeID]
Is this right here? The maps has nodes in it that are not live.
… On Fri, Feb 14, 2020 at 5:22 PM Andrei Matei ***@***.***> wrote:
***@***.**** commented on this pull request.
*Reviewable <https://reviewable.io/reviews/cockroachdb/cockroach/39936>*
status: [image: ] complete! 0 of 0 LGTMs obtained (and 1 stale)
(waiting on @nvanbenschoten <https://github.com/nvanbenschoten> and @tbg
<https://github.com/tbg>)
------------------------------
*pkg/storage/replica_metrics.go, line 172 at r8
<https://reviewable.io/reviews/cockroachdb/cockroach/39936#-M03VNWm1oh1clsHRjgD:-M03VNWm1oh1clsHRjgE:bym03bn>
(raw file
<https://github.com/cockroachdb/cockroach/blob/0314069b4345b50b13611906948ce8359d97b1bd/pkg/storage/replica_metrics.go#L172>):*
if rangeCounter {
unavailable = !desc.Replicas().CanMakeProgress(func(rDesc roachpb.ReplicaDescriptor) bool {
_, live := livenessMap[rDesc.NodeID]
Is this right here? The maps has nodes in it that are not live.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#39936?email_source=notifications&email_token=ABGXPZB2MWIDRVBF2LYFXPTRC3AMDA5CNFSM4IO7PMGKYY3PNVWWK3TUL52HS4DFWFIHK3DMKJSXC5LFON2FEZLWNFSXPKTDN5WW2ZLOORPWSZGOCVTNBLI#pullrequestreview-359059629>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/ABGXPZF3CFKAG6AXAULZHKTRC3AMDANCNFSM4IO7PMGA>
.
|
Found by @andreimatei in cockroachdb#39936 (review) Release note (bug fix): a bug in the range metrics collection could fail to correctly identify a range that had lost quorum, causing it to not be reported via the "unavailable ranges" metric. This is now fixed.
Found by @andreimatei in cockroachdb#39936 (review) Release note (bug fix): a bug in the range metrics collection could fail to correctly identify a range that had lost quorum, causing it to not be reported via the "unavailable ranges" metric. This is now fixed.
45253: storage: fix bug in calcRangeCounter r=andreimatei a=tbg Found by @andreimatei in #39936 (review) Release note (bug fix): a bug in the range metrics collection could fail to correctly identify a range that had lost quorum, causing it to not be reported via the "unavailable ranges" metric. This is now fixed. 45534: coldata: change Batch interface to operate with int length r=yuzefovich a=yuzefovich This commit changes Batch interface to operate with `int` length because it unifies the way we store batches - now we will have a single implementation `MemBatch` (with an exception for `coldata.ZeroBatch`). This allows us to remove `bufferedBatch`. Release note: None 45554: kv: add a comment r=andreimatei a=andreimatei Explain the surprising fact that we're telling then inflight-writes btree to not reuse its nodes. Release note: None 45580: storage: improve "range unavailable" log message r=nvanbenschoten a=tbg I've had to look at this message on users clusters recently and I found that it could be more actionable: - adjusted the wording to steer enterprise users towards enterprise support - print the range descriptor - print which replicas we think are live - log this message as an error instead of a warning (this message is always a serious problem that demands immediate attention). Release note (general change): Improved a debug message that is printed when a range is unavailable (i.e. unable to accept writes). 45585: roachtest: fix copy/bank roachtest to deal with smaller ranges r=ajwerner a=ajwerner In #45451 we made the expected number of ranges reflect the current zone config. The problem with this change is that it used a factor of 2x the minimum range size to estimate the upper bound on number of ranges. There was no principled reason to make this assumption. We should instead just assert that the ranges are larger than the minimum range size on average. Release note: None Co-authored-by: Tobias Schottdorf <tobias.schottdorf@gmail.com> Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com> Co-authored-by: Andrei Matei <andrei@cockroachlabs.com> Co-authored-by: Andrew Werner <ajwerner@cockroachlabs.com>
This PR contains a series of commits that first pave for the way and ultimately
allow carrying out atomic replication changes via Raft joint consensus.
Atomic replication changes are required to avoid entering unsafe configurations
during lateral data movement. See #12768 for details; this is a problem we want
to address in 19.2.
Before merging this we'll need to sort out an upstream change in Raft which
has made a bug in our code related to learner snapshots much more likely; the
offending upstream commit is patched out of the vendored etcd bump in this PR
at the time of writing.
An antichronological listing of the individual commits follows. They should be
reviewed individually, though it may be helpful to look at the overall diff for
overall context. A modest amount of churn may exist between the commits, though
a good deal of effort went into avoiding this.