PlannedReparentShard: Fix more known-recoverable problems. #5376
Conversation
go/vt/wrangler/reparent.go
Outdated
// catch up before we time out. This assumes a replica can work through
// its backlog at approximately the same rate that the transactions
// happened on the master.
if float64(status.SecondsBehindMaster) >= waitReplicasTimeout.Seconds() {
One concern with piggy-backing the replication lag timeout on waitReplicasTimeout is that it doesn't allow people to opt out of this check. We are arguing that this is OK because a replica with lag > waitReplicasTimeout is unlikely to catch up during that time.
SecondsBehindMaster is in fact not reliable: there are cases where MySQL doesn't update it often enough, so it's reported as much higher than the real lag. That seems to happen in precisely the situation we are creating here, where there are no more writes to the master.
See last comment in #5000 (comment)
We should consider adding another flag (-skip_replication_lag_check) to the PRS command.
Good catch. What about making a flag to customize the SecondsBehindMaster threshold, instead of an on/off thing?
That is what I had originally planned to implement for #4700, so I'll vote yes on that.
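For context, a minimal sketch of how such a threshold flag might be wired up. The flag name, default, and surrounding code are assumptions for illustration only; as the thread below shows, the lag check was ultimately removed entirely.

package main

import (
	"flag"
	"fmt"
	"time"
)

// Hypothetical flag; the name and default here are assumptions, not what
// the PR ultimately shipped.
var masterElectLagThreshold = flag.Duration("master_elect_lag_threshold", 10*time.Second,
	"maximum replication lag allowed on the master-elect before PlannedReparentShard refuses to proceed")

func main() {
	flag.Parse()
	// Stand-in for the SecondsBehindMaster value that the real check
	// reads from the master-elect's replication status.
	secondsBehindMaster := uint32(30)
	if float64(secondsBehindMaster) > masterElectLagThreshold.Seconds() {
		fmt.Println("refusing to reparent: lag exceeds -master_elect_lag_threshold; retry with a higher value")
	}
}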
Force-pushed from 39fcd75 to 5e39eef.
Nice! I just have a nit in one of the error messages.
go/vt/wrangler/reparent.go
Outdated
}
// Check if it's behind by a small enough amount.
if float64(status.SecondsBehindMaster) > masterElectLagThreshold.Seconds() {
	return vterrors.Errorf(vtrpcpb.Code_FAILED_PRECONDITION, "replication lag on master-elect %v (%v seconds) is greater than the specified lag threshold (%v); let replication catch up first or try again with a higher threshold", masterElectTabletAliasStr, status.SecondsBehindMaster, masterElectLagThreshold)
lag threshold -> lag_threshold, to be consistent with the user-visible flag.
I removed this flag entirely since we don't check lag anymore, as discussed offline.
if topoproto.TabletAliasEqual(shardInfo.MasterAlias, masterElectTabletAlias) {
	// If the master is already the one we want, we just try to fix replicas (below).
	rp, err := wr.tmc.MasterPosition(remoteCtx, masterElectTabletInfo.Tablet)

if currentMaster == nil {
Since PRS now handles the no-master and multi-master cases, situations that require ERS should become rare to nonexistent. Let's make sure to document that in the eventual PR for merging upstream.
Good point. To summarize, my goal for PRS is that eventually it should be able to fix almost any problem as long as:
- All tablets are reachable, so we can check global state.
AND - The global replication state (relative positions) is consistent and compatible with making the chosen tablet the master.
You should then only need ERS in the following cases:
- The current master is unreachable.
OR - The relative replication positions have become inconsistent (e.g. alternative futures).
OR - It's unclear who the current master is, and some tablets are unreachable, which means we can't be sure if the global state is consistent.
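To make the case analysis concrete, here is a rough sketch (not the actual wrangler code) of the branches PRS now takes, mirroring the diff context above; the function and its boolean inputs are simplifications for illustration.

package reparent // hypothetical package for this sketch

// reparentCase summarizes which path PlannedReparentShard takes under
// this PR's new case analysis, given assumed boolean inputs.
func reparentCase(masterElectIsCurrentMaster, currentMasterKnown bool) string {
	switch {
	case masterElectIsCurrentMaster:
		// The master-elect is already the master: just fix replication
		// on the replicas.
		return "fix replicas"
	case !currentMasterKnown:
		// No tablet currently claims to be master: promote the
		// master-elect, provided replication positions are consistent.
		return "promote master-elect"
	default:
		// Normal planned flow: demote the current master, then promote
		// the master-elect.
		return "demote current master, then promote master-elect"
	}
}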
Force-pushed from e9195e1 to 65c43c3.
Force-pushed from 65c43c3 to df16897.
@@ -184,7 +184,7 @@ type TabletManagerClient interface {
 	// SetMaster tells a tablet to make itself a slave to the
 	// passed in master tablet alias, and wait for the row in the
 	// reparent_journal table (if timeCreatedNS is non-zero).
-	SetMaster(ctx context.Context, tablet *topodatapb.Tablet, parent *topodatapb.TabletAlias, timeCreatedNS int64, forceStartSlave bool) error
+	SetMaster(ctx context.Context, tablet *topodatapb.Tablet, parent *topodatapb.TabletAlias, timeCreatedNS int64, waitPosition string, forceStartSlave bool) error
Doesn't changing the interface here cause problems during upgrade?
Old vtctld's wrangler will call the old version of SetMaster, which won't work on an already upgraded vttablet.
When the call crosses process boundaries, it gets encoded as protobuf on the wire. The protobuf level is thus where we need to ensure compatibility when changing existing RPCs.
Adding a new, optional field to the Request struct like this should be safe. The old vtctld won't try to use the new field because it doesn't know about it, and the new vttablet will simply receive a Request protobuf with the new field unset, so it's left at its zero value.
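To illustrate the compatibility argument, a hedged sketch: it assumes the generated Go types for tabletmanagerdata as of this PR (field names are inferred from the diff; treat them as assumptions). Decoding bytes from an old sender into the newer message type leaves the new field at its zero value.

package main

import (
	"fmt"

	"github.com/golang/protobuf/proto"
	tabletmanagerdatapb "vitess.io/vitess/go/vt/proto/tabletmanagerdata"
)

func main() {
	// Simulate an old vtctld: it never sets the new WaitPosition field,
	// because its copy of the schema doesn't have it.
	old := &tabletmanagerdatapb.SetMasterRequest{
		TimeCreatedNs:   12345,
		ForceStartSlave: true,
	}
	wire, err := proto.Marshal(old)
	if err != nil {
		panic(err)
	}

	// Simulate a new vttablet: decoding the same bytes with the newer
	// schema leaves WaitPosition at its zero value (empty string), so
	// old callers get the old behavior.
	var got tabletmanagerdatapb.SetMasterRequest
	if err := proto.Unmarshal(wire, &got); err != nil {
		panic(err)
	}
	fmt.Printf("wait_position=%q force_start_slave=%v\n", got.WaitPosition, got.ForceStartSlave)
}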
LGTM
PlannedReparentShard should be able to fix replication as long as all
tablets are reachable and all replication positions are in a
mutually-consistent state.
PRS also no longer trusts that the shard record contains up-to-date
information on the master, because we update that record asynchronously
now. Instead, it looks at MasterTermStartTime values stored in each
master tablet's record, so it makes the same choice of master as
vtgates.
Signed-off-by: Anthony Yeh <enisoc@planetscale.com>
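To illustrate the MasterTermStartTime-based choice the commit message describes, a minimal sketch (not the wrangler implementation): among tablets of type MASTER, pick the one with the most recent MasterTermStartTime, matching vtgate's rule. The GetMasterTermStartTime helper and package layout are assumed from the topo package of this era.

package reparent // hypothetical package for this sketch

import (
	topodatapb "vitess.io/vitess/go/vt/proto/topodata"
	"vitess.io/vitess/go/vt/topo"
)

// pickCurrentMaster returns the tablet vtgate would consider the current
// master: the MASTER-type tablet with the latest term start time.
func pickCurrentMaster(tablets []*topo.TabletInfo) *topo.TabletInfo {
	var current *topo.TabletInfo
	for _, ti := range tablets {
		if ti.Type != topodatapb.TabletType_MASTER {
			continue
		}
		if current == nil || ti.GetMasterTermStartTime().After(current.GetMasterTermStartTime()) {
			current = ti
		}
	}
	return current
}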