make TabletExternallyReparented more robust #5151
Conversation
}()
}

// update the shard record first
Why update the shard record first? Traditionally, updating the shard record means the reparent is done. That's how we ensure that subsequent TER calls will fix the tablet records if this attempt fails.
take a look at the latest and let me know if you still think this is a problem
go/vt/wrangler/reparent.go
Outdated
@@ -498,22 +498,21 @@ func (wr *Wrangler) plannedReparentShardLocked(ctx context.Context, ev *events.R
wgSlaves.Wait()
return fmt.Errorf("failed to PopulateReparentJournal on master: %v", masterErr)
}
// Wait for the slaves to complete.
making this change breaks one of the integ tests:
https://github.com/vitessio/vitess/blob/master/test/reparent.py#L572
We end up with this error:
TopologyServer has inconsistent state for shard master
because we have updated the masterElect to be master, but not updated the shard record.
Hence I have reverted it.
cc: @enisoc
Yeah that test case raises a good point. We can't require slaves to complete in order to consider the reparent finished, because we need to be able to make progress even if some slaves are down.
// timestamp to the current time.
agent.setExternallyReparentedTime(startTime)

if topoproto.TabletAliasEqual(si.MasterAlias, tablet.Alias) {
This short circuit was important because YouTube's equivalent of Orchestrator would call TabletExternallyReparented every 10 seconds on the master, even if nothing has changed. We don't do that yet in the Orchestrator integration, but it was always intended and the integration is incomplete without that constantly repeating signal.
!! Why did it do that? Was the intent to have it basically be a self-healing loop?
// This is where updateState will block for gracePeriod, while it gives
// vtgate a chance to stop sending replica queries.
agent.updateState(ctx, newTablet, "fastTabletExternallyReparented")
// always run finalize even if the master tablet has the correct state
This is starting to be a significant departure from the original design of TER - in particular the split into a fast stage with the short-circuit based on the shard record and a slow "finalize" stage that only runs when something has actually changed.
By recommending TER as a fix for cases when PlannedReparentShard fails, we're also mixing code paths that were never intended to be mixed. TER was supposed to only be used when Vitess-native reparents are disabled.
I'm starting to think what we really need is to make PlannedReparentShard itself more robust. If PRS fails, the solution should be to run PRS again pointing at the same master. It's been a longstanding problem with Vitess reparents that there are no commands to recover from partial operations. I don't think co-opting TER for a purpose outside its intended scope is the right answer.
Is this a good summary of the original design?
https://vitess.io/docs/user-guides/reparenting/#external-reparenting
Code in master as of right now doesn't actually perform part2 of step 5 (change tablet type to spare).
It's been a longstanding problem with Vitess reparents that there are no commands to recover from partial operations.
This is a bit out of scope for this PR, but 👍 I agree completely. Partial Vitess reparents have been tricky for us to manually recover from in the past, and it'd be great to make it safe to retry without having to manually inspect the cluster.
TER was supposed to only be used when Vitess-native reparents are disabled.
This is really interesting history that I didn't know!
part2 of step 5 (changing the tablet type of the unresponsive old master to spare) seems like it would actually solve a lot of problems we've seen when a master is unavailable for a bit, a TER happens, and then the master comes back online. I would prefer the spare (or other non-serving tablet type) to setting it to replica.
Overall, I think we need to step back and redefine our goals. We shouldn't try to make TER better suited to fix when PRS experiences partial success. TER was never meant to be used on the same cluster as PRS at all.
Instead, our end goal should be that the fix for any partial failure of PRS is simply to run PRS again with the same arguments. We need to make PRS smart enough to skip things that are already consistent with the desired end state and fix up anything that isn't consistent.
I agree in principle with the above. One thing missing: a sanity check to make sure replication is not too far behind. If so, don't even start the reparent.
What is the strategy if TER fails (supposing you never ran a PRS)?
If we make PRS mean "fix up anything that's inconsistent with the idea of X being the master", then I would expect the "rollback" process to be "run PRS pointing at the old master". Or do you feel it's important for usability or automation that the command doesn't require you to specify the old master? In that case, I still would prefer something like
In the world for which TER was designed (when all Vitess-native reparents are disabled), running TER again should always be the answer. It may not perfectly handle all possible cases, but that's the intended strategy.
It comes down to whether we have enough metadata to revert the last attempted reparent. We should support that. I'm actually planning to use the new functionality.
At a high level, I'd love more context about how this PR fits into Vitess's design intentions around master failure. There are a lot of interesting historical nuggets of information around TER and PRS in this PR as is, and I'm curious if @enisoc or @sougou have any more knowledge to share.

My understanding today of Vitess's design principles around master failure (which could be wrong) was that on hard crashes of master tablets, the intention is that the masters should be deprovisioned and never restarted, as it's unsafe to have them serve again as masters. I'm curious if this was actually an explicit design principle of Vitess, or if this was just a side effect of running on Borg. With how we run Vitess at Slack, it is definitely unsafe for a master to come back again, even as a replica, after a hard crash. We run with

In this PR, it seems like if a master hard crashes and restarts, it'll be made a replica again and available for later promotion to master, which would cause errant transactions and replication breakage. For us, we'd prefer to have any lingering tablets who think they're still the master be marked as spare.
@zmagg this is very good feedback. It prompted me to go look at the history of TER changing the old master to spare vs replica. I found the commit but not an explanation 😞 It seems like there was a decision to reduce/remove usage of spare, but I think we'll need to have @enisoc or @sougou provide context.
This PR itself doesn't change anything about how we handle master failures. It just fixes some problems we found where certain failures led to states that were hard to recover from. For clarity, we've now split this into three objectives, which will become separate PRs.
But the questions raised are good ones, and answers are below:

I'm curious if this was actually an explicit design principle of Vitess, or if this was just a side effect of running on Borg.

This was an explicit design principle while at YouTube. However, people have started using mounted drives, especially while running on Kubernetes. In such cases, they rely on the durability guarantees provided by the mounted drives and prefer to recover instead of using a restore. This is because Kubernetes local storage is too ephemeral, and they want better assurance that all their data will not be lost if there's a catastrophic failure.

In this PR, it seems like if a master hard crashes and restarts, it'll be made a replica again and available for later promotion to master, which would cause errant transactions and replication breakage.

Right. But this failure mode is easy to remedy because the old master will not be able to replicate from the new master because of the errant GTIDs. If this happens, we just have to restart the vttablet in "restore mode" (empty datadir).
This would be true if
I'm very excited about PRS improvements and eliminating TER from our management. We end up with impostor double masters fairly often.
tablet := agent.Tablet()
tabletMap, err := agent.TopoServer.GetTabletMapForShard(ctx, tablet.Keyspace, tablet.Shard)
// make the channel buffer big enough that it doesn't block senders
tablets := make(chan topodatapb.Tablet, len(tabletMap))
Maybe call this tabletsToRefresh?
Since we're going to ignore errors from GetTabletMapForShard, len(tabletMap) could be 0, which would cause a deadlock below. Might as well add 1 to the buffer size.
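A minimal sketch of the suggested sizing, using a placeholder tablet type instead of the real topodatapb.Tablet:

```go
package main

import "fmt"

// tablet stands in for topodatapb.Tablet; it is only for illustration.
type tablet struct{ alias string }

func main() {
	// Pretend GetTabletMapForShard failed and returned an empty map.
	tabletMap := map[string]tablet{}

	// A buffer of len(tabletMap) would be 0 here, so an unreceived send
	// would block. Adding 1 keeps at least one slot free for the old master.
	tablets := make(chan tablet, len(tabletMap)+1)

	tablets <- tablet{alias: "zone1-0000000100"}
	close(tablets)

	for t := range tablets {
		fmt.Println("refreshing", t.alias)
	}
}
```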
Actually, we might not technically need a channel at all. Since we bail out if any errors occur, we should get the same result if we append tablets to a slice before launching each per-tablet goroutine (before we know whether the tablet record update will succeed, but after we decided to try to update it), which removes the need for synchronization. However, I'm fine keeping this as a channel if you prefer, since the proof that it isn't needed is a bit obscure.
I was being cautious here, because UpdateTabletFields reads the latest tablet info from topo, and updates and returns it. This ensures that we are using the latest tabletInfo, but maybe it doesn't matter?
Ah I see what you mean. I don't think it matters in this case since I believe tmc.RefreshState only looks at the tablet host/port, and we aren't changing those fields. There's a slight possibility the tablet might update its own host/port in between when we first read it and when we call RefreshState, but that window will always be non-zero and it doesn't get much smaller by taking the result from UpdateTabletFields, since we call those concurrently right after we fetched the latest values from GetTabletMapForShard.
That said, the above is yet another non-obvious nuance in the proof that we don't need synchronization, so it's probably better to just keep the synchronization to be defensive against future code changes.
tablet := agent.Tablet()
// update any other tablets claiming to be MASTER also to REPLICA
for alias, tabletInfo := range tabletMap {
if alias != topoproto.TabletAliasString(agent.TabletAlias) && alias != topoproto.TabletAliasString(oldMasterAlias) && tabletInfo.Tablet.Type == topodatapb.TabletType_MASTER {
You can compare directly without converting to string: topoproto.TabletAliasEqual(tabletInfo.Alias, x)
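A small sketch of the suggested comparison; the import paths are assumed from the usual Vitess layout and may differ in older trees:

```go
package main

import (
	"fmt"

	topodatapb "vitess.io/vitess/go/vt/proto/topodata"
	"vitess.io/vitess/go/vt/topo/topoproto"
)

func main() {
	a := &topodatapb.TabletAlias{Cell: "zone1", Uid: 100}
	b := &topodatapb.TabletAlias{Cell: "zone1", Uid: 100}

	// Compare the alias protos directly instead of comparing
	// topoproto.TabletAliasString(a) == topoproto.TabletAliasString(b).
	fmt.Println(topoproto.TabletAliasEqual(a, b)) // true
}
```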
@@ -160,9 +161,14 @@ func (agent *ActionAgent) finalizeTabletExternallyReparented(ctx context.Context
}
}()

tablet := agent.Tablet()
tabletMap, err := agent.TopoServer.GetTabletMapForShard(ctx, tablet.Keyspace, tablet.Shard)
We should log this error and add a comment saying we intentionally do not bail out because we still want to process whatever we found if there is a "partial result" error (some cells inaccessible), and even if this completely fails we still want to process the old master.
do you mean print to the log or collect it in errs?
I think it needs to be printed to the log and not collected in errs. If we collect it in errs, it will cause us to bail out later, and that will mean TER gets stuck/blocked if any topo cells are unavailable.

This is a common concern throughout Vitess. Any time we try to access topo cross-cell (as we're doing here, since GetTabletMapForShard has to read tablets from all cells where the shard has tablets), we need to be careful to not get blocked in the case when one or more cells is unavailable. Otherwise, every cell becomes a single point of failure for that operation.
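A rough sketch of the log-and-continue behavior described above; getTabletMapForShard here is a stand-in, not the real topo call:

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// tablet is a placeholder for the real tablet record type.
type tablet struct{ alias string }

// getTabletMapForShard stands in for the topo read, which can return a
// partial result alongside an error when some cells are unreachable.
func getTabletMapForShard() (map[string]tablet, error) {
	return map[string]tablet{"zone1-100": {alias: "zone1-100"}},
		errors.New("partial result: cell zone2 unreachable")
}

func main() {
	tabletMap, err := getTabletMapForShard()
	if err != nil {
		// Intentionally do not bail out: process whatever was returned
		// (and, in the real code, still handle the old master) even if
		// one or more cells is unavailable.
		log.Printf("ignoring error reading tablet map, continuing with partial result: %v", err)
	}
	for alias := range tabletMap {
		fmt.Println("processing", alias)
	}
}
```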
I suppose this is one of the reasons we call finalize with a relatively short timeout (30s), because any of the various things we are doing in finalize could block if the target component is unavailable.
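A sketch of bounding the finalize phase with a plain context.WithTimeout; the 30s value is the one mentioned above, and finalize here is a stand-in for the real work:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// finalize simulates a step that respects context cancellation.
func finalize(ctx context.Context) error {
	select {
	case <-time.After(50 * time.Millisecond):
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

func main() {
	// Bound the whole finalize phase so an unavailable topo cell or
	// tablet cannot block TER forever.
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	if err := finalize(ctx); err != nil {
		fmt.Println("finalize failed:", err)
		return
	}
	fmt.Println("finalize done")
}
```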
if !topoproto.TabletAliasIsZero(oldMasterAlias) {
wg.Add(1)
go func() {
log.Infof("finalizeTabletExternallyReparented: updating tablet record for old master: %v", oldMasterAlias)
By convention, defer wg.Done() should remain the first line so it's immediately clear that it always gets called.
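A tiny illustration of the convention, outside of any Vitess-specific code:

```go
package main

import (
	"fmt"
	"sync"
)

func main() {
	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		// First statement, so it is obvious at a glance that the
		// WaitGroup is always released even if the body returns early.
		defer wg.Done()

		fmt.Println("updating tablet record for old master")
	}()
	wg.Wait()
}
```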
if alias != topoproto.TabletAliasString(agent.TabletAlias) && alias != topoproto.TabletAliasString(oldMasterAlias) && tabletInfo.Tablet.Type == topodatapb.TabletType_MASTER {
log.Infof("finalizeTabletExternallyReparented: updating tablet record for another old master: %v", alias)
wg.Add(1)
go func() {
Need to pass in a copy of any loop variables needed inside the goroutine, as you do below.
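A sketch of the pitfall being pointed out: in the Go versions in use at the time, the range loop reuses one variable, so each goroutine must receive its own copy, for example as an argument:

```go
package main

import (
	"fmt"
	"sync"
)

func main() {
	tabletMap := map[string]string{
		"zone1-100": "MASTER",
		"zone1-101": "MASTER",
	}

	var wg sync.WaitGroup
	for alias := range tabletMap {
		wg.Add(1)
		// Pass alias as an argument so each goroutine gets the value
		// from its own iteration rather than the shared loop variable.
		go func(alias string) {
			defer wg.Done()
			fmt.Println("updating tablet record for", alias)
		}(alias)
	}
	wg.Wait()
}
```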
@@ -221,22 +250,24 @@ func (agent *ActionAgent) finalizeTabletExternallyReparented(ctx context.Context
errs.RecordError(err)
}
}()
if !topoproto.TabletAliasIsZero(oldMasterAlias) {
wg.Wait()
Do we need a new Wait here? These steps were previously done concurrently on purpose, since they aren't interdependent.
…master to REPLICA Signed-off-by: deepthi <deepthi@planetscale.com>
Signed-off-by: deepthi <deepthi@planetscale.com>
Force-pushed from 95150af to 7bd6618
@enisoc thanks for the detailed feedback. I have addressed the changes you requested.
LGTM overall. Had one more potential defensive suggestion.
Is there something further we can do to handle this?
It looks ok to me, afaict. I only mentioned that as justification for not bailing out on a GetTabletMapForShard error, which we don't do.
var err error
tab, err := agent.TopoServer.UpdateTabletFields(ctx, alias,
func(tablet *topodatapb.Tablet) error {
tablet.Type = topodatapb.TabletType_REPLICA
Maybe this is overly paranoid, but perhaps we should add both here and in the original code a check that the tablet type is still MASTER before we force it to REPLICA? There's a small chance we might race with a human trying to do something like force it SPARE, and then we overwrite it to REPLICA.
done.
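The requested guard might look roughly like the callback below; this is a sketch only, using the proto types but not the real UpdateTabletFields signature:

```go
package main

import (
	"fmt"

	topodatapb "vitess.io/vitess/go/vt/proto/topodata"
)

// demoteIfStillMaster only forces the type to REPLICA if the record still
// says MASTER, so a concurrent change (e.g. a human forcing SPARE) is not
// silently overwritten.
func demoteIfStillMaster(tablet *topodatapb.Tablet) bool {
	if tablet.Type != topodatapb.TabletType_MASTER {
		return false // someone already changed it; leave it alone
	}
	tablet.Type = topodatapb.TabletType_REPLICA
	return true
}

func main() {
	t := &topodatapb.Tablet{Type: topodatapb.TabletType_SPARE}
	fmt.Println("changed:", demoteIfStillMaster(t), "now:", t.Type)

	t = &topodatapb.Tablet{Type: topodatapb.TabletType_MASTER}
	fmt.Println("changed:", demoteIfStillMaster(t), "now:", t.Type)
}
```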
Signed-off-by: deepthi <deepthi@planetscale.com>
LGTM
Signed-off-by: deepthi <deepthi@planetscale.com>
I was thinking of one minor improvement in the flow: update the MasterAlias as soon as we have successfully updated the old master to be a replica. There's no use remembering the old value after this.
This way, if the next step of chasing down the rest of the tablets fails, then the updated MasterAlias is still usable by others.
I thought one of the goals was that, whenever feasible, rerunning TER should fix up anything that was missed due to partial failure. If we update the shard record but updating the impostor master fails, then we will not retry next time since we assume the shard record being updated means everything is done.
Right. I was only recommending that the MasterAlias be changed after we've successfully downgraded the old master, but before checking the rest of the tablets.
Specifically, the use case I have in mind is the one where we update the old master successfully, but the part that checks if there are other impostors chronically fails. If so, we'll never update the MasterAlias, which will cause all workflows that rely on this info to fail.
If we fail to read topo to even check for impostors, we just skip the check. It will only block if we actually do find an impostor master. In that case, do you really prefer that we leave what we know is an impostor master behind and mark the shard as done so we never retry fixing the impostor master? I'm not asking because I disagree. I just want to be clear on the trade-off we're making.
@deepthi Sugu and I talked and worked out his concerns. Here's what we're thinking now:
How does that sound to you?
Sounds good. I will make the changes.
…does not complete all steps Signed-off-by: deepthi <deepthi@planetscale.com>
LGTM
Signed-off-by: deepthi <deepthi@planetscale.com>
…oops into 1 and remove use of channel that is no longer needed Signed-off-by: deepthi <deepthi@planetscale.com>
… use one tmc for all calls to RefreshState Signed-off-by: deepthi <deepthi@planetscale.com>
lgtm
This is an attempt to fix some problems found by @sougou and @enisoc
TabletExternallyReparented should set the type to REPLICA not only on the current "oldMaster" but also on any lingering old masters from previous unsuccessful reparents.
Changes in this PR (edited from the list written by @enisoc below)
When is TER marked as finished vs failed?
Signed-off-by: deepthi <deepthi@planetscale.com>