make TabletExternallyReparented more robust #5151
Conversation
}()
}

// update the shard record first
Why update the shard record first? Traditionally, updating the shard record means the reparent is done. That's how we ensure that subsequent TER calls will fix the tablet records if this attempt fails.
take a look at the latest and let me know if you still think this is a problem
go/vt/wrangler/reparent.go
Outdated
@@ -498,22 +498,21 @@ func (wr *Wrangler) plannedReparentShardLocked(ctx context.Context, ev *events.R
wgSlaves.Wait()
return fmt.Errorf("failed to PopulateReparentJournal on master: %v", masterErr)
}
// Wait for the slaves to complete.
making this change breaks one of the integ tests:
https://github.com/vitessio/vitess/blob/master/test/reparent.py#L572
We end up with this error:
TopologyServer has inconsistent state for shard master
because we have updated the masterElect to be master, but not updated the shard record.
Hence I have reverted it.
cc: @enisoc
Yeah that test case raises a good point. We can't require slaves to complete in order to consider the reparent finished, because we need to be able to make progress even if some slaves are down.
// timestamp to the current time.
agent.setExternallyReparentedTime(startTime)

if topoproto.TabletAliasEqual(si.MasterAlias, tablet.Alias) {
This short circuit was important because YouTube's equivalent of Orchestrator would call TabletExternallyReparented every 10 seconds on the master, even if nothing has changed. We don't do that yet in the Orchestrator integration, but it was always intended and the integration is incomplete without that constantly repeating signal.
!! Why did it do that? Was the intent to have it basically be a self-healing loop?
// This is where updateState will block for gracePeriod, while it gives
// vtgate a chance to stop sending replica queries.
agent.updateState(ctx, newTablet, "fastTabletExternallyReparented")
// always run finalize even if the master tablet has the correct state
This is starting to be a significant departure from the original design of TER - in particular the split into a fast stage with the short-circuit based on the shard record and a slow "finalize" stage that only runs when something has actually changed.
By recommending TER as a fix for cases when PlannedReparentShard fails, we're also mixing code paths that were never intended to be mixed. TER was supposed to only be used when Vitess-native reparents are disabled.
I'm starting to think what we really need is to make PlannedReparentShard itself more robust. If PRS fails, the solution should be to run PRS again pointing at the same master. It's been a longstanding problem with Vitess reparents that there are no commands to recover from partial operations. I don't think co-opting TER for a purpose outside its intended scope is the right answer.
Is this a good summary of the original design?
https://vitess.io/docs/user-guides/reparenting/#external-reparenting
Code in master as of right now doesn't actually perform part2 of step 5 (change tablet type to spare).
It's been a longstanding problem with Vitess reparents that there are no commands to recover from partial operations.
This is a bit out of scope for this PR, but 👍 I agree completely. Partial Vitess reparents have been tricky for us to manually recover from in the past, and it'd be great to make it safe to retry without having to manually inspect the cluster.
TER was supposed to only be used when Vitess-native reparents are disabled.
This is really interesting history that I didn't know!
part2 of step 5 (changing the tablet type of the unresponsive old master to spare) seems like it would actually solve a lot of problems we've seen when a master is unavailable for a bit, a TER happens, and then the master comes back online. I would prefer the spare (or other non-serving tablet type) to setting it to replica.
Overall, I think we need to step back and redefine our goals. We shouldn't try to make TER better suited to fix when PRS experiences partial success. TER was never meant to be used on the same cluster as PRS at all.
Instead, our end goal should be that the fix for any partial failure of PRS is simply to run PRS again with the same arguments. We need to make PRS smart enough to skip things that are already consistent with the desired end state and fix up anything that isn't consistent.
I agree in principle with the above. One thing missing: a sanity check to make sure replication is not too far behind. If so, don't even start the reparent.
What is the strategy if TER fails (supposing you never ran a PRS)?
If we make PRS mean "fix up anything that's inconsistent with the idea of X being the master", then I would expect the "rollback" process to be "run PRS pointing at the old master". Or do you feel it's important for usability or automation that the command doesn't require you to specify the old master? In that case, I still would prefer something like
In the world for which TER was designed (when all Vitess-native reparents are disabled), running TER again should always be the answer. It may not perfectly handle all possible cases, but that's the intended strategy.
It comes down to whether we have enough metadata to revert the last attempted reparent. We should support that. I'm actually planning to use the new functionality.
At a high level, I'd love more context about how this PR fits into Vitess's design intentions around master failure. There are a lot of interesting historical nuggets of information around TER and PRS in this PR as is, and I'm curious if @enisoc or @sougou have any more knowledge to share.

My understanding today of Vitess's design principles around master failure (which could be wrong) was that on hard crashes of master tablets, the intention is that the masters should be deprovisioned and never restarted, as it's unsafe to have them serve again as masters. I'm curious if this was actually an explicit design principle of Vitess, or if this was just a side effect of running on Borg. With how we run Vitess at Slack, it is definitely unsafe for a master to come back again, even as a replica, after a hard crash. We run with

In this PR, it seems like if a master hard crashes and restarts, it'll be made a replica again and available for later promotion to master, which would cause errant transactions and replication breakage. For us, we'd prefer to have any lingering tablets who think they're still the master be marked as spare.
@zmagg this is very good feedback. It prompted me to go look at the history of TER changing the old master to spare vs replica. I found the commit but not an explanation 😞 It seems like there was a decision to reduce/remove usage of spare, but I think we'll need to have @enisoc or @sougou provide context.
This PR itself doesn't change anything about how we handle master failures. It just fixes some problems we found where certain failures led to states that were hard to recover from. For clarity, we've now split this into three objectives, which will become separate PRs.
But the questions raised are good ones, and answers are below:

I'm curious if this was actually an explicit design principle of Vitess, or if this was just a side effect of running on Borg.

This was an explicit design principle while at YouTube. However, people have started using mounted drives, especially while running on Kubernetes. In such cases, they rely on the durability guarantees provided by the mounted drives and prefer to recover instead of using a restore. This is because Kubernetes local storage is too ephemeral, and they want better assurance that all their data will not be lost if there's a catastrophic failure.

In this PR, it seems like if a master hard crashes and restarts, it'll be made a replica again and available for later promotion to master, which would cause errant transactions and replication breakage.

Right. But this failure mode is easy to remedy because the old master will not be able to replicate from the new master because of the errant GTIDs. If this happens, we just have to restart the vttablet in "restore mode" (empty datadir).
This would be true if
I'm very excited about PRS improvements and eliminating TER from our management. We end up with impostor double masters fairly often.
tablet := agent.Tablet()
tabletMap, err := agent.TopoServer.GetTabletMapForShard(ctx, tablet.Keyspace, tablet.Shard)
// make the channel buffer big enough that it doesn't block senders
tablets := make(chan topodatapb.Tablet, len(tabletMap))
Maybe call this tabletsToRefresh?
Since we're going to ignore errors from GetTabletMapForShard, len(tabletMap) could be 0, which would cause a deadlock below. Might as well add 1 to the buffer size.
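A minimal sketch of the suggested sizing, using a placeholder tablet type instead of the real topodatapb.Tablet:

```go
package main

import "fmt"

// tablet stands in for topodatapb.Tablet; it is only for illustration.
type tablet struct{ alias string }

func main() {
	// Pretend GetTabletMapForShard failed and returned an empty map.
	tabletMap := map[string]tablet{}

	// A buffer of len(tabletMap) would be 0 here, so an unreceived send
	// would block. Adding 1 keeps at least one slot free for the old master.
	tablets := make(chan tablet, len(tabletMap)+1)

	tablets <- tablet{alias: "zone1-0000000100"}
	close(tablets)

	for t := range tablets {
		fmt.Println("refreshing", t.alias)
	}
}
```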
Actually, we might not technically need a channel at all. Since we bail out if any errors occur, we should get the same result if we append tablets to a slice before launching each per-tablet goroutine (before we know whether the tablet record update will succeed, but after we decided to try to update it), which removes the need for synchronization. However, I'm fine keeping this as a channel if you prefer, since the proof that it isn't needed is a bit obscure.
I was being cautious here, because UpdateTabletFields reads the latest tablet info from topo, and updates and returns it. This ensures that we are using the latest tabletInfo, but maybe it doesn't matter?
Ah I see what you mean. I don't think it matters in this case since I believe tmc.RefreshState only looks at the tablet host/port, and we aren't changing those fields. There's a slight possibility the tablet might update its own host/port in between when we first read it and when we call RefreshState, but that window will always be non-zero and it doesn't get much smaller by taking the result from UpdateTabletFields, since we call those concurrently right after we fetched the latest values from GetTabletMapForShard.
That said, the above is yet another non-obvious nuance in the proof that we don't need synchronization, so it's probably better to just keep the synchronization to be defensive against future code changes.
tablet := agent.Tablet()
// update any other tablets claiming to be MASTER also to REPLICA
for alias, tabletInfo := range tabletMap {
if alias != topoproto.TabletAliasString(agent.TabletAlias) && alias != topoproto.TabletAliasString(oldMasterAlias) && tabletInfo.Tablet.Type == topodatapb.TabletType_MASTER {
You can compare directly without converting to string: topoproto.TabletAliasEqual(tabletInfo.Alias, x)
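A small sketch of the suggested comparison; the import paths are assumed from the usual Vitess layout and may differ in older trees:

```go
package main

import (
	"fmt"

	topodatapb "vitess.io/vitess/go/vt/proto/topodata"
	"vitess.io/vitess/go/vt/topo/topoproto"
)

func main() {
	a := &topodatapb.TabletAlias{Cell: "zone1", Uid: 100}
	b := &topodatapb.TabletAlias{Cell: "zone1", Uid: 100}

	// Compare the alias protos directly instead of comparing
	// topoproto.TabletAliasString(a) == topoproto.TabletAliasString(b).
	fmt.Println(topoproto.TabletAliasEqual(a, b)) // true
}
```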
@@ -160,9 +161,14 @@ func (agent *ActionAgent) finalizeTabletExternallyReparented(ctx context.Context
}
}()

tablet := agent.Tablet()
tabletMap, err := agent.TopoServer.GetTabletMapForShard(ctx, tablet.Keyspace, tablet.Shard)
We should log this error and add a comment saying we intentionally do not bail out because we still want to process whatever we found if there is a "partial result" error (some cells inaccessible), and even if this completely fails we still want to process the old master.
do you mean print to the log or collect it in errs?
I think it needs to be printed to the log and not collected in errs. If we collect it in errs, it will cause us to bail out later, and that will mean TER gets stuck/blocked if any topo cells are unavailable.

This is a common concern throughout Vitess. Any time we try to access topo cross-cell (as we're doing here, since GetTabletMapForShard has to read tablets from all cells where the shard has tablets), we need to be careful to not get blocked in the case when one or more cells is unavailable. Otherwise, every cell becomes a single point of failure for that operation.
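A rough sketch of the log-and-continue behavior described above; getTabletMapForShard here is a stand-in, not the real topo call:

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// tablet is a placeholder for the real tablet record type.
type tablet struct{ alias string }

// getTabletMapForShard stands in for the topo read, which can return a
// partial result alongside an error when some cells are unreachable.
func getTabletMapForShard() (map[string]tablet, error) {
	return map[string]tablet{"zone1-100": {alias: "zone1-100"}},
		errors.New("partial result: cell zone2 unreachable")
}

func main() {
	tabletMap, err := getTabletMapForShard()
	if err != nil {
		// Intentionally do not bail out: process whatever was returned
		// (and, in the real code, still handle the old master) even if
		// one or more cells is unavailable.
		log.Printf("ignoring error reading tablet map, continuing with partial result: %v", err)
	}
	for alias := range tabletMap {
		fmt.Println("processing", alias)
	}
}
```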
I suppose this is one of the reasons we call finalize with a relatively short timeout (30s), because any of the various things we are doing in finalize could block if the target component is unavailable.
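A sketch of bounding the finalize phase with a plain context.WithTimeout; the 30s value is the one mentioned above, and finalize here is a stand-in for the real work:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// finalize simulates a step that respects context cancellation.
func finalize(ctx context.Context) error {
	select {
	case <-time.After(50 * time.Millisecond):
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

func main() {
	// Bound the whole finalize phase so an unavailable topo cell or
	// tablet cannot block TER forever.
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	if err := finalize(ctx); err != nil {
		fmt.Println("finalize failed:", err)
		return
	}
	fmt.Println("finalize done")
}
```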
if !topoproto.TabletAliasIsZero(oldMasterAlias) {
wg.Add(1)
go func() {
log.Infof("finalizeTabletExternallyReparented: updating tablet record for old master: %v", oldMasterAlias)
By convention, defer wg.Done() should remain the first line so it's immediately clear that it always gets called.
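A tiny illustration of the convention, outside of any Vitess-specific code:

```go
package main

import (
	"fmt"
	"sync"
)

func main() {
	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		// First statement, so it is obvious at a glance that the
		// WaitGroup is always released even if the body returns early.
		defer wg.Done()

		fmt.Println("updating tablet record for old master")
	}()
	wg.Wait()
}
```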
if alias != topoproto.TabletAliasString(agent.TabletAlias) && alias != topoproto.TabletAliasString(oldMasterAlias) && tabletInfo.Tablet.Type == topodatapb.TabletType_MASTER {
log.Infof("finalizeTabletExternallyReparented: updating tablet record for another old master: %v", alias)
wg.Add(1)
go func() {
Need to pass in a copy of any loop variables needed inside the goroutine, as you do below.
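A sketch of the pitfall being pointed out: in the Go versions in use at the time, the range loop reuses one variable, so each goroutine must receive its own copy, for example as an argument:

```go
package main

import (
	"fmt"
	"sync"
)

func main() {
	tabletMap := map[string]string{
		"zone1-100": "MASTER",
		"zone1-101": "MASTER",
	}

	var wg sync.WaitGroup
	for alias := range tabletMap {
		wg.Add(1)
		// Pass alias as an argument so each goroutine gets the value
		// from its own iteration rather than the shared loop variable.
		go func(alias string) {
			defer wg.Done()
			fmt.Println("updating tablet record for", alias)
		}(alias)
	}
	wg.Wait()
}
```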
@@ -221,22 +250,24 @@ func (agent *ActionAgent) finalizeTabletExternallyReparented(ctx context.Context
errs.RecordError(err)
}
}()
if !topoproto.TabletAliasIsZero(oldMasterAlias) {
wg.Wait()
Do we need a new Wait here? These steps were previously done concurrently on purpose, since they aren't interdependent.
…master to REPLICA Signed-off-by: deepthi <deepthi@planetscale.com>
Signed-off-by: deepthi <deepthi@planetscale.com>
Force-pushed from 95150af to 7bd6618
@enisoc thanks for the detailed feedback. I have addressed the changes you requested.
LGTM overall. Had one more potential defensive suggestion.
Is there something further we can do to handle this?
It looks ok to me, afaict. I only mentioned that as justification for not bailing out on a GetTabletMapForShard error, which we don't do.
var err error
tab, err := agent.TopoServer.UpdateTabletFields(ctx, alias,
func(tablet *topodatapb.Tablet) error {
tablet.Type = topodatapb.TabletType_REPLICA
Maybe this is overly paranoid, but perhaps we should add both here and in the original code a check that the tablet type is still MASTER before we force it to REPLICA? There's a small chance we might race with a human trying to do something like force it SPARE, and then we overwrite it to REPLICA.
done.
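The requested guard might look roughly like the callback below; this is a sketch only, using the proto types but not the real UpdateTabletFields signature:

```go
package main

import (
	"fmt"

	topodatapb "vitess.io/vitess/go/vt/proto/topodata"
)

// demoteIfStillMaster only forces the type to REPLICA if the record still
// says MASTER, so a concurrent change (e.g. a human forcing SPARE) is not
// silently overwritten.
func demoteIfStillMaster(tablet *topodatapb.Tablet) bool {
	if tablet.Type != topodatapb.TabletType_MASTER {
		return false // someone already changed it; leave it alone
	}
	tablet.Type = topodatapb.TabletType_REPLICA
	return true
}

func main() {
	t := &topodatapb.Tablet{Type: topodatapb.TabletType_SPARE}
	fmt.Println("changed:", demoteIfStillMaster(t), "now:", t.Type)

	t = &topodatapb.Tablet{Type: topodatapb.TabletType_MASTER}
	fmt.Println("changed:", demoteIfStillMaster(t), "now:", t.Type)
}
```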
Signed-off-by: deepthi <deepthi@planetscale.com>
LGTM
Signed-off-by: deepthi <deepthi@planetscale.com>
I was thinking of one minor improvement in the flow: update the MasterAlias as soon as we have successfully updated the old master to be a replica. There's no use remembering the old value after this.
This way, if the next step of chasing down the rest of the tablets fails, then the updated MasterAlias is still usable by others.
I thought one of the goals was that, whenever feasible, rerunning TER should fix up anything that was missed due to partial failure. If we update the shard record but updating the impostor master fails, then we will not retry next time since we assume the shard record being updated means everything is done.
Right. I was only recommending that the MasterAlias be changed after we've successfully downgraded the old master, but before checking the rest of the tablets.
Specifically, the use case I have in mind is the one where we update the old master successfully, but the part that checks if there are other impostors chronically fails. If so, we'll never update the MasterAlias, which will cause all workflows that rely on this info to fail.
If we fail to read topo to even check for impostors, we just skip the check. It will only block if we actually do find an impostor master. In that case, do you really prefer that we leave what we know is an impostor master behind and mark the shard as done so we never retry fixing the impostor master? I'm not asking because I disagree. I just want to be clear on the trade-off we're making.
@deepthi Sugu and I talked and worked out his concerns. Here's what we're thinking now:
How does that sound to you?
Sounds good. I will make the changes.
…does not complete all steps Signed-off-by: deepthi <deepthi@planetscale.com>
LGTM
Signed-off-by: deepthi <deepthi@planetscale.com>
…oops into 1 and remove use of channel that is no longer needed Signed-off-by: deepthi <deepthi@planetscale.com>
… use one tmc for all calls to RefreshState Signed-off-by: deepthi <deepthi@planetscale.com>
lgtm
This is an attempt to fix some problems found by @sougou and @enisoc
TabletExternallyReparented should set the type to REPLICA not only on the current "oldMaster" but also on any lingering old masters from previous unsuccessful reparents.
Changes in this PR (edited from the list written by @enisoc below)
When is TER marked as finished vs failed?
Signed-off-by: deepthi <deepthi@planetscale.com>