Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refresh replicas and rdonly after MigrateServedTypes except on skipRefreshState. #7327

Merged
merged 1 commit into from
Feb 25, 2021

Conversation

makinje16
Copy link
Contributor

@makinje16 makinje16 commented Jan 20, 2021

Description

This PR changes MigrateServedTypes to refresh both Replicas and Rdonly tablets after every migration. Ran into an issue during a migration in which a replica was promoted to primary but hadn't been refreshed since we initially only refreshed tablets with servedType. With the primaries migrated and refreshed but not the replicas, when there is a failover before the replicas have had their state refreshed, the replica will still believe it is in a split state causing it not to serve until RefreshState is done on the host.

Related Issue(s)

Checklist

  • Should this PR be backported?
  • Tests were added or are not required
  • Documentation was added or is not required

Deployment Notes

Impacted Areas in Vitess

Components that this PR will affect:

  • Query Serving
  • VReplication
  • Cluster Management
  • Build
  • VTAdmin

Copy link
Contributor

@ajm188 ajm188 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes look good to me, just one comment that needs to be updated now. I restarted the failed jobs for you.

I can't tell if this is flaking or if your change broke these tests https://github.com/vitessio/vitess/pull/7327/checks?check_run_id=1737826998#step:5:1932

        	            	</html>" does not contain "TabletControl.DisableQueryService set"
        	Test:       	TestMergeShardingIntShardingKey

// RefreshTabletsByShard calls RefreshState on all the tables of a
// given type in a shard. It would work for the master, but the
// RefreshTabletsByShard calls RefreshState on all of the tablets
// in a shard. It would work for the master, but the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would work for the master, but the discovery wouldn't be very efficient

This is no longer true, correct?

Copy link
Member

@rafael rafael left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@makinje16 did a first pass.

As discussed, this still needs to be squashed and sign the commits to comply with the DCO requirement.

@@ -459,13 +459,13 @@ func (wr *Wrangler) MigrateServedTypes(ctx context.Context, keyspace, shard stri
wr.Logger().Infof("WaitForDrain: Sleeping finished. Shutting down queryservice on old tablets now.")

rec := concurrency.AllErrorRecorder{}
// Refresh both source and destinations to make sure migrated primaries don't
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the comment should be something like:

// Refresh both source and destinations to make sure they all have the latest shardInfo data.

go/vt/wrangler/keyspace.go Show resolved Hide resolved
@rohit-nayak-ps
Copy link
Contributor

The fix we (@rafael, @sougou and I) discussed yesterday: to expand the check in CheckTabletQueryService() to also look for the additional status master tablet with filtered replication on for non-serving tablets allowed the test to pass further than where it was failing yesterday.

But now I get that error due to replicas and rdonly partitions appearing in the SrvKeyspace. I guess one of the tablet refreshes introduced by this PR is causing this.

@rafael , @makinje16 , I think you will need to log the srvkeyspaces in the master branch and in your PR and see at which point the partitions are deviating. I don't really know enough about the old workflows to debug this efficiently.

@rafael
Copy link
Member

rafael commented Feb 18, 2021

But now I get that error due to replicas and rdonly partitions appearing in the SrvKeyspace. I guess one of the tablet refreshes introduced by this PR is causing this.

Which error exactly are you seeing? I'm curious if it's part of the same. Is it really SrvKeyspace that is deviating or that the tablet is starting query service because all the tablet control metadata has been removed and we clean up the state when finishing the migration.

@rohit-nayak-ps
Copy link
Contributor

Which error exactly are you seeing? I'm curious if it's part of the same. Is it really SrvKeyspace that is deviating or that the tablet is starting query service because all the tablet control metadata has been removed and we clean up the state when finishing the migration.

if len(shardServedTypes) > 0 {
    return fmt.Errorf("cannot migrate MASTER away from %v/%v until everything else is migrated. Make sure that the following types are migrated first: %v", si.Keyspace(), si.ShardName(), strings.Join(shardServedTypes, ", "))
}

Looks like the replica/rdonly tablets might be serving again after a refresh.

@rafael
Copy link
Member

rafael commented Feb 24, 2021

After some debugging and refreshing a bit our memories of how this part of the system works, here is the main key-takeaway on why we had so much trouble getting this test to pass:

We noticed something that doesn’t look right in the canServe function in tm_state.go that could be problematic after cutover.

This is the function:

func (ts *tmState) canServe(tabletType topodatapb.TabletType) string {
	if !topo.IsRunningQueryService(tabletType) {
		return fmt.Sprintf("not a serving tablet type(%v)", tabletType)
	}
	if ts.tabletControls[tabletType] {
		return "TabletControl.DisableQueryService set"
	}
	if tabletType == topodatapb.TabletType_MASTER && ts.isResharding {
		return "master tablet with filtered replication on"
	}
	return ""
}

Currently, we remove tabletControls after finishing the migration. This means that if we refresh the source they could go back to serving.

Currently it works a bit by luck for the MASTER type: we create the reverse replication stream as part of the migration and query service can't start if this is present. However, RDONLY and REPLICA will go back to serving as soon as we refresh the state on those tablets.

The concrete proposal to improve this edge case is to add extra validations in canServe in tm_state.go so a tablet can't start the query service if it's range is not part of the SrvKeyspaceGraph. This change requires some extra refactoring and it's out of scope for the immediate concern of making sure we refresh tablets on the destination. @makinje16 is already working on this and will have a follow up PR for this.

In the meantime, this LGTM to merge as is.

Copy link
Member

@deepthi deepthi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code change looks good, but I have a question on one of the comments.

go/vt/wrangler/keyspace.go Outdated Show resolved Hide resolved
Signed-off-by: Malcolm Akinje <makinje@slack-corp.com>
@rafael
Copy link
Member

rafael commented Feb 25, 2021

All comments have been addressed. Merging!

@rafael rafael merged commit e40e4c5 into vitessio:master Feb 25, 2021
@rafael rafael deleted the ma-refresh-fix branch February 25, 2021 18:42
@askdba askdba added this to the v10.0 milestone Feb 26, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants