Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DeleteTablet: allow deletion of old master tablet without -allow_master #5647

Merged
merged 4 commits into from
Jan 7, 2020

Conversation

deepthi
Copy link
Member

@deepthi deepthi commented Jan 2, 2020

The way we define the true master in DeleteTablet is now consistent with how we define a shard's master, taking into account the shard's MasterAlias and MasterTermStartTime.

Signed-off-by: deepthi deepthi@planetscale.com

…er flag

Signed-off-by: deepthi <deepthi@planetscale.com>
@deepthi deepthi requested a review from sougou as a code owner January 2, 2020 22:20
@deepthi deepthi requested review from enisoc and rafael January 2, 2020 22:20
Type: topodatapb.TabletType_MASTER,
}

if err := wr.InitTablet(context.Background(), tablet, false /*allowMasterOverride*/, true /*createShardAndKeyspace*/, false /*allowUpdate*/); err != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a bit confused how this test is working. Shouldn't init tablet fail here if there is already a master tablet?

if tablet.Type == topodatapb.TabletType_MASTER && si.HasMaster() && !topoproto.TabletAliasEqual(si.MasterAlias, tablet.Alias) && !allowMasterOverride {

Copy link
Member Author

@deepthi deepthi Jan 2, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

si.HasMaster() will be false (si.MasterAlias is nil until it is set).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have modified the test case to more closely reflect the real-life use case.

Copy link
Contributor

@sougou sougou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it will be better to move the topotools.DeleteTablet step to after the wasMaster step. It will make the operation idempotent. Otherwise, if the wasMaster step fails, we cannot retry the Delete because the tablet record would have been deleted already.

@deepthi
Copy link
Member Author

deepthi commented Jan 3, 2020

I think it will be better to move the topotools.DeleteTablet step to after the wasMaster step. It will make the operation idempotent. Otherwise, if the wasMaster step fails, we cannot retry the Delete because the tablet record would have been deleted already.

Good point. I will make this change.

@@ -109,7 +109,13 @@ func (wr *Wrangler) DeleteTablet(ctx context.Context, tabletAlias *topodatapb.Ta
if err != nil {
return err
}
wasMaster := ti.Type == topodatapb.TabletType_MASTER
si, err := wr.ts.GetShard(ctx, ti.Keyspace, ti.Shard)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should avoid depending on global topo being up whenever possible. In the common case where we're deleting a tablet whose record doesn't claim to be master, I think we should not bother checking the shard record.

}

// the true master tablet will have the same MasterTermStartTime as the shard
wasMaster := ti.Type == topodatapb.TabletType_MASTER && topoproto.TabletAliasEqual(si.MasterAlias, tabletAlias) && logutil.ProtoToTime(ti.MasterTermStartTime).Equal(logutil.ProtoToTime(si.MasterTermStartTime))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should be a bit more tolerant of inconsistency. There are various reasons that the timestamps in the tablet and shard records might not match. For example, perhaps the tablet has started a new term but hasn't updated the shard record yet.

Here's the way I think of it:

  • If the tablet record claims to be non-master, believe it without checking the shard record.
    • Unless we find out there's a problematic case of inconsistency here, we should err on the side of keeping dependence on global topo as low as possible. The specific case we're trying to fix now is only when a tablet record does claim to be master.
  • If the tablet record claims to be master, check the shard record to confirm.
    • If the shard record has this tablet as master, assume it is master without checking timestamps.
    • If the shard record has another tablet as master, check the timestamps.
      • If the shard record's master timestamp is newer, assume this tablet is not the master.
      • If the tablet record's mater timestamp is newer, assume this tablet is the master.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with being more tolerant of inconsistency. However, I'm on the fence about minimizing global topology dependency in this context. Here my rationale for this:

  • This is an admin operation explicitly triggered by an operator, I don't feel that strongly about minimal global topology footprint in those contexts. I would also feel different if this were an operation that could leave a serving tablet in a non serving state.
  • Given that there is still a case where you have to check the shard record, I would lean towards keeping the code as simple as possible.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dependence on global topo is tough to eliminate because it can pop up anywhere in a distributed call stack and cause the operation you were attempting to fail when global topo is unreachable. If we let more unnecessary cases creep in, it's only increasing the work we'll have to do later to meet our long-term goal of reducing dependence on global topo.

I think we can still keep the code simple while only talking to global topo when necessary. For example, if we break the logic out into a new function to answer "is this tablet potentially the master?", this function will remain clean, and we can even use that in other places to apply similar corrections where we're still trusting the tablet record alone.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah don't take me wrong. I also agree and have been working on reducing dependency in global topo. Just that benefit is not obvious in the case, given that is not really removing it.

If we can keep it simple and clean, I'm on board. My high level concern was to not add complexity when the actual dependency is still there.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would argue the extra complexity is modest and worthwhile to avoid increasing the dependency more than necessary. The kind of dependency I mean is not, "Does this code ever call global topo?" but rather, "What set of actions can I take while global topo is unavailable?"

Although this code will still contain a call to global topo either way, putting the call inside a conditional branch means we get to keep "delete non-master tablets" in the set of actions you can take while global topo is unavailable.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With that said, I also don't feel strongly about this. I'll side with whatever @deepthi thinks is the right balance.

…define shard master. Reduce dependence on global topo. Improve test case

Signed-off-by: deepthi <deepthi@planetscale.com>
Signed-off-by: deepthi <deepthi@planetscale.com>
go/vt/wrangler/tablet.go Show resolved Hide resolved
go/vt/wrangler/tablet.go Outdated Show resolved Hide resolved
Signed-off-by: deepthi <deepthi@planetscale.com>
Copy link
Member

@enisoc enisoc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@deepthi deepthi merged commit 2f0e6ae into vitessio:master Jan 7, 2020
@deepthi deepthi deleted the ds-delete-tablet-fake-master branch January 7, 2020 19:31
ajm188 pushed a commit to tinyspeck/vitess that referenced this pull request Mar 11, 2020
…ke-master

DeleteTablet: allow deletion of old master tablet without -allow_master
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants