Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow to demote master vttablet to SPARE instead of REPLICA #5520

Merged
merged 1 commit into from
Dec 20, 2019

Conversation

kalfonso
Copy link
Contributor

@kalfonso kalfonso commented Dec 6, 2019

In our setups we use SPARE tablets to take over MASTER role during reparents or failures. Currently shard sync and external reparent demote MASTER to REPLICA. This change allows to configure the tablet type of a demoted master. SPARE and REPLICA are the only ones allowed. REPLICA is the default one if flag is not set. This can also be achieved with a boolean flag if only SPARE and REPLICA make sense as the only tablet types that can be promoted to master

Signed-off-by: Karel Alfonso Sague kalfonso@squareup.com

@kalfonso kalfonso requested a review from sougou as a code owner December 6, 2019 02:12
@kalfonso kalfonso force-pushed the kalfonso.191205-demote-master-type branch 3 times, most recently from 78b0b2d to 12d46db Compare December 6, 2019 04:35
go/vt/vttablet/tabletmanager/action_agent.go Outdated Show resolved Hide resolved
@@ -171,7 +171,7 @@ func (agent *ActionAgent) finalizeTabletExternallyReparented(ctx context.Context
// Call UpdateTabletFields instead of ChangeType so that we can check the type
// before changing it and avoid unnecessary topo updates
var err error
oldMasterTablet, err = topotools.ChangeType(ctx, agent.TopoServer, oldMasterAlias, topodatapb.TabletType_REPLICA, nil)
oldMasterTablet, err = topotools.ChangeType(ctx, agent.TopoServer, oldMasterAlias, agent.DemoteMasterType, nil)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Once all components are upgraded to 4.0.0+, this RPC doesn't get called any more. It is here for backwards compatibility during rollouts. It's best to not make any changes to it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, reverted changes in this RPC path

@@ -589,6 +592,50 @@ func TestRPCTabletExternallyReparentedRerun(t *testing.T) {

}

func TestRPCTabletExternallyReparentedDemotesMasterToConfiguredTabletType(t *testing.T) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test becomes obsolete along with the RPC. The new test should go into https://github.com/vitessio/vitess/blob/master/go/vt/wrangler/testlib/external_reparent_test.go

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@deepthi deepthi requested a review from enisoc December 6, 2019 04:37
@kalfonso kalfonso force-pushed the kalfonso.191205-demote-master-type branch 2 times, most recently from f4d0535 to 0822e6f Compare December 6, 2019 06:04
Copy link
Member

@deepthi deepthi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. We'll let @enisoc take a look before approving.

Copy link
Member

@enisoc enisoc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@deepthi Do we not need to update any other demotion-related code paths? Does this really cover all types of internal/external reparent? This almost feels too easy. :)

@@ -252,6 +254,11 @@ func NewActionAgent(
return nil, err
}

tabletType, err := validateDemoteMasterType(*demoteMasterType)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The var name tabletType might mislead a reader to think it's the type that will be used at startup (since this is inside NewActionAgent). Let's call it something like demoteMasterTabletType.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

makes sense, done!

@@ -849,3 +870,12 @@ func (agent *ActionAgent) withRetry(ctx context.Context, description string, wor
}
}
}

func validateDemoteMasterType(tabletTypeParam string) (*topodatapb.TabletType, error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We treat topodatapb.TabletType as a scalar type, so we usually pass it by value:

Suggested change
func validateDemoteMasterType(tabletTypeParam string) (*topodatapb.TabletType, error) {
func validateDemoteMasterType(tabletTypeParam string) (topodatapb.TabletType, error) {

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!


func validateDemoteMasterType(tabletTypeParam string) (*topodatapb.TabletType, error) {
tabletType, err := topoproto.ParseTabletType(tabletTypeParam)
if err != nil && (tabletType != topodatapb.TabletType_SPARE && tabletType != topodatapb.TabletType_REPLICA) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like you meant for that first && to be an ||. However, even then I don't think it will do the right thing. That might return a nil err when there's actually a problem.

I think what we want is:

if err != nil {
  return topodatapb.TabletType_UNKNOWN, err
}
if tabletType != topodatapb.TabletType_SPARE && tabletType != topodatapb.TabletType_REPLICA {
  return topodatapb.TabletType_UNKNOWN, fmt.Errorf("invalid demote_master_type %v; can only be REPLICA or SPARE", tabletType)
}
return tabletType, nil

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, fixed. Using unknown tablet type as suggested.

}
t.Fatalf("old master (%v) should be replica but is: %v", topoproto.TabletAliasString(oldMaster.Tablet.Alias), tablet.Type)
}
// check the old master was converted to replica
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// check the old master was converted to replica
// check the old master was converted to spare

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@kalfonso kalfonso force-pushed the kalfonso.191205-demote-master-type branch from 0822e6f to 8d674bc Compare December 8, 2019 21:24
@kalfonso
Copy link
Contributor Author

kalfonso commented Dec 8, 2019

@enisoc @deepthi: One other area that will be impacted by this change is when checking whether a tablet is eligible to become master rpc_replication.go:781, especially when fixing semi sync replication.

I think under semi sync replication with two set of tablets SPARE and REPLICAS replicating but only allowing SPARE to become master, there could be a situation where a lagging SPARE (with respect to a REPLICA) will be promoted to MASTER causing a replica read to return must up-to-date value than the new master.

This is not a problem in our case as we use AWS Aurora and we run vttablets with semi sync disabled (default). I think demoting to a SPARE should be only allowed if semi sync is disabled to avoid the problem above. I might be missing more, please let me know.

@tirsen
Copy link
Collaborator

tirsen commented Dec 9, 2019

This flag is only safe to use together with --disable_active_reparents. You should add code that checks we are also setting that flag.

@kalfonso kalfonso force-pushed the kalfonso.191205-demote-master-type branch from 8d674bc to bb93ade Compare December 11, 2019 00:06
if tabletType != topodatapb.TabletType_SPARE && tabletType != topodatapb.TabletType_REPLICA {
return topodatapb.TabletType_UNKNOWN, fmt.Errorf("invalid demote_master_type %v; can only be REPLICA or SPARE", tabletType)
}
if tabletType == topodatapb.TabletType_SPARE && !*mysqlctl.DisableActiveReparents {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you should check for any value that is not REPLICA. I.e. you should not able to use this flag at all unless you also set --disable_active_reparents

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you check if it has been passed at all in the Golang flag API?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Refactored to parse flag inside validation

In our setups we use SPARE tablets to take over MASTER role during reparents or failures. Currently shard sync and external reparent demote MASTER to REPLICA. This change allows to configure the tablet type of a demoted master. SPARE and REPLICA are the only ones allowed. REPLICA is the default one if flag is not set. This can also be achieved with a boolean flag if only SPARE and REPLICA make sense as the only tablet types that can be promoted to master

Signed-off-by: Karel Alfonso Sague <kalfonso@squareup.com>
@kalfonso kalfonso force-pushed the kalfonso.191205-demote-master-type branch from bb93ade to c2eec14 Compare December 11, 2019 00:36
@deepthi
Copy link
Member

deepthi commented Dec 18, 2019

@enisoc @deepthi: One other area that will be impacted by this change is when checking whether a tablet is eligible to become master rpc_replication.go:781, especially when fixing semi sync replication.

I think under semi sync replication with two set of tablets SPARE and REPLICAS replicating but only allowing SPARE to become master, there could be a situation where a lagging SPARE (with respect to a REPLICA) will be promoted to MASTER causing a replica read to return must up-to-date value than the new master.

This is not a problem in our case as we use AWS Aurora and we run vttablets with semi sync disabled (default). I think demoting to a SPARE should be only allowed if semi sync is disabled to avoid the problem above. I might be missing more, please let me know.

This problem existed even before we refactored the reparenting. Do you want to handle it in this PR, or leave it for another day? If you don't include it here, could you please create an issue?

Copy link
Member

@deepthi deepthi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@deepthi
Copy link
Member

deepthi commented Dec 18, 2019

@morgo how do we get the github action statuses to update? they seem to be stuck.

@sougou
Copy link
Contributor

sougou commented Dec 20, 2019

Build is borked. Forcing this through.

@sougou sougou merged commit 89199f2 into vitessio:master Dec 20, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants