PlannedReparentShard: Fix more known-recoverable problems. #5376
Conversation
go/vt/wrangler/reparent.go
Outdated
// catch up before we time out. This assumes a replica can work through
// its backlog at approximately the same rate that the transactions
// happened on the master.
if float64(status.SecondsBehindMaster) >= waitReplicasTimeout.Seconds() {
One concern with piggy-backing the replication lag timeout on waitReplicasTimeout is that it doesn't allow people to opt out of this check. We are arguing that this is OK because a replica with lag > waitReplicasTimeout is unlikely to catch up during that time.
SecondsBehindMaster is in fact not reliable: there are cases where MySQL doesn't update it often enough, so it's reported as much higher than the real lag. That seems to happen in precisely the situation we are creating here, where there are no more writes to the master.
See last comment in #5000 (comment)
We should consider adding another flag (-skip_replication_lag_check) to the PRS command.
Good catch. What about making a flag to customize the SecondsBehindMaster threshold, instead of an on/off thing?
That is what I had originally planned to implement for #4700, so I'll vote yes on that.
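For context, a minimal sketch of how such a threshold flag might be wired up. The flag name, default, and surrounding code are assumptions for illustration only; as the thread below shows, the lag check was ultimately removed entirely.

package main

import (
	"flag"
	"fmt"
	"time"
)

// Hypothetical flag; the name and default here are assumptions, not what
// the PR ultimately shipped.
var masterElectLagThreshold = flag.Duration("master_elect_lag_threshold", 10*time.Second,
	"maximum replication lag allowed on the master-elect before PlannedReparentShard refuses to proceed")

func main() {
	flag.Parse()
	// Stand-in for the SecondsBehindMaster value that the real check
	// reads from the master-elect's replication status.
	secondsBehindMaster := uint32(30)
	if float64(secondsBehindMaster) > masterElectLagThreshold.Seconds() {
		fmt.Println("refusing to reparent: lag exceeds -master_elect_lag_threshold; retry with a higher value")
	}
}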
Force-pushed from 39fcd75 to 5e39eef.
Nice! I just have a nit in one of the error messages.
go/vt/wrangler/reparent.go
Outdated
}
// Check if it's behind by a small enough amount.
if float64(status.SecondsBehindMaster) > masterElectLagThreshold.Seconds() {
	return vterrors.Errorf(vtrpcpb.Code_FAILED_PRECONDITION, "replication lag on master-elect %v (%v seconds) is greater than the specified lag threshold (%v); let replication catch up first or try again with a higher threshold", masterElectTabletAliasStr, status.SecondsBehindMaster, masterElectLagThreshold)
lag threshold -> lag_threshold, to be consistent with the user-visible flag.
I removed this flag entirely since we don't check lag anymore, as discussed offline.
if topoproto.TabletAliasEqual(shardInfo.MasterAlias, masterElectTabletAlias) {
	// If the master is already the one we want, we just try to fix replicas (below).
	rp, err := wr.tmc.MasterPosition(remoteCtx, masterElectTabletInfo.Tablet)

if currentMaster == nil {
Since PRS now handles the no-master and multi-master cases, situations that require ERS should become rare to nonexistent. Let's make sure to document that in the eventual PR for merging upstream.
Good point. To summarize, my goal for PRS is that eventually it should be able to fix almost any problem as long as:
- All tablets are reachable, so we can check global state.
AND - The global replication state (relative positions) is consistent and compatible with making the chosen tablet the master.
You should then only need ERS in the following cases:
- The current master is unreachable.
OR - The relative replication positions have become inconsistent (e.g. alternative futures).
OR - It's unclear who the current master is, and some tablets are unreachable, which means we can't be sure if the global state is consistent.
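To make the case analysis concrete, here is a rough sketch (not the actual wrangler code) of the branches PRS now takes, mirroring the diff context above; the function and its boolean inputs are simplifications for illustration.

package reparent // hypothetical package for this sketch

// reparentCase summarizes which path PlannedReparentShard takes under
// this PR's new case analysis, given assumed boolean inputs.
func reparentCase(masterElectIsCurrentMaster, currentMasterKnown bool) string {
	switch {
	case masterElectIsCurrentMaster:
		// The master-elect is already the master: just fix replication
		// on the replicas.
		return "fix replicas"
	case !currentMasterKnown:
		// No tablet currently claims to be master: promote the
		// master-elect, provided replication positions are consistent.
		return "promote master-elect"
	default:
		// Normal planned flow: demote the current master, then promote
		// the master-elect.
		return "demote current master, then promote master-elect"
	}
}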
Force-pushed from e9195e1 to 65c43c3.
Force-pushed from 65c43c3 to df16897.
@@ -184,7 +184,7 @@ type TabletManagerClient interface {
 	// SetMaster tells a tablet to make itself a slave to the
 	// passed in master tablet alias, and wait for the row in the
 	// reparent_journal table (if timeCreatedNS is non-zero).
-	SetMaster(ctx context.Context, tablet *topodatapb.Tablet, parent *topodatapb.TabletAlias, timeCreatedNS int64, forceStartSlave bool) error
+	SetMaster(ctx context.Context, tablet *topodatapb.Tablet, parent *topodatapb.TabletAlias, timeCreatedNS int64, waitPosition string, forceStartSlave bool) error
Doesn't changing the interface here cause problems during upgrade?
Old vtctld's wrangler will call the old version of SetMaster, which won't work on an already upgraded vttablet.
When the call crosses process boundaries, it gets encoded as protobuf on the wire. The protobuf level is thus where we need to ensure compatibility when changing existing RPCs.
Adding a new, optional field to the Request struct like this should be safe. The old vtctld won't try to use the new field because it doesn't know about it, and the new vttablet will simply receive a Request protobuf with the new field unset, so it's left at its zero value.
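To illustrate the compatibility argument, a hedged sketch: it assumes the generated Go types for tabletmanagerdata as of this PR (field names are inferred from the diff; treat them as assumptions). Decoding bytes from an old sender into the newer message type leaves the new field at its zero value.

package main

import (
	"fmt"

	"github.com/golang/protobuf/proto"
	tabletmanagerdatapb "vitess.io/vitess/go/vt/proto/tabletmanagerdata"
)

func main() {
	// Simulate an old vtctld: it never sets the new WaitPosition field,
	// because its copy of the schema doesn't have it.
	old := &tabletmanagerdatapb.SetMasterRequest{
		TimeCreatedNs:   12345,
		ForceStartSlave: true,
	}
	wire, err := proto.Marshal(old)
	if err != nil {
		panic(err)
	}

	// Simulate a new vttablet: decoding the same bytes with the newer
	// schema leaves WaitPosition at its zero value (empty string), so
	// old callers get the old behavior.
	var got tabletmanagerdatapb.SetMasterRequest
	if err := proto.Unmarshal(wire, &got); err != nil {
		panic(err)
	}
	fmt.Printf("wait_position=%q force_start_slave=%v\n", got.WaitPosition, got.ForceStartSlave)
}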
LGTM
PlannedReparentShard should be able to fix replication as long as all
tablets are reachable and all replication positions are in a
mutually-consistent state.
PRS also no longer trusts that the shard record contains up-to-date
information on the master, because we update that record asynchronously
now. Instead, it looks at MasterTermStartTime values stored in each
master tablet's record, so it makes the same choice of master as
vtgates.
Signed-off-by: Anthony Yeh <enisoc@planetscale.com>
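To illustrate the MasterTermStartTime-based choice the commit message describes, a minimal sketch (not the wrangler implementation): among tablets of type MASTER, pick the one with the most recent MasterTermStartTime, matching vtgate's rule. The GetMasterTermStartTime helper and package layout are assumed from the topo package of this era.

package reparent // hypothetical package for this sketch

import (
	topodatapb "vitess.io/vitess/go/vt/proto/topodata"
	"vitess.io/vitess/go/vt/topo"
)

// pickCurrentMaster returns the tablet vtgate would consider the current
// master: the MASTER-type tablet with the latest term start time.
func pickCurrentMaster(tablets []*topo.TabletInfo) *topo.TabletInfo {
	var current *topo.TabletInfo
	for _, ti := range tablets {
		if ti.Type != topodatapb.TabletType_MASTER {
			continue
		}
		if current == nil || ti.GetMasterTermStartTime().After(current.GetMasterTermStartTime()) {
			current = ti
		}
	}
	return current
}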