Refresh replicas and rdonly after MigrateServedTypes except on skipRefreshState. #7327
91f0950 to b4848b3
Changes look good to me, just one comment that needs to be updated now. I restarted the failed jobs for you.
I can't tell if this is flaking or if your change broke these tests https://github.com/vitessio/vitess/pull/7327/checks?check_run_id=1737826998#step:5:1932
</html>" does not contain "TabletControl.DisableQueryService set"
Test: TestMergeShardingIntShardingKey
go/vt/wrangler/keyspace.go
Outdated
// RefreshTabletsByShard calls RefreshState on all the tables of a
// given type in a shard. It would work for the master, but the
// RefreshTabletsByShard calls RefreshState on all of the tablets
// in a shard. It would work for the master, but the
It would work for the master, but the discovery wouldn't be very efficient
This is no longer true, correct?
@makinje16 did a first pass.
As discussed, the commits still need to be squashed and signed off to comply with the DCO requirement.
go/vt/wrangler/keyspace.go
Outdated
@@ -459,13 +459,13 @@ func (wr *Wrangler) MigrateServedTypes(ctx context.Context, keyspace, shard stri
wr.Logger().Infof("WaitForDrain: Sleeping finished. Shutting down queryservice on old tablets now.")

rec := concurrency.AllErrorRecorder{}
// Refresh both source and destinations to make sure migrated primaries don't
I think the comment should be something like:
// Refresh both source and destinations to make sure they all have the latest shardInfo data.
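A minimal sketch of the pattern that hunk relies on: refresh every source and destination shard concurrently and collect all errors instead of stopping at the first one. Everything here is an assumption for illustration. `errorRecorder` is a simplified stand-in for the `concurrency.AllErrorRecorder` seen in the diff, and `refreshAll`, `errRefreshFailed`, and the shard names are hypothetical, not Vitess APIs.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errRefreshFailed is a hypothetical error used only by this sketch.
var errRefreshFailed = errors.New("refresh failed")

// errorRecorder is a simplified stand-in for an "all errors" recorder:
// it keeps every error reported by concurrent workers.
type errorRecorder struct {
	mu   sync.Mutex
	errs []error
}

// RecordError appends err under the lock so goroutines can call it safely.
func (r *errorRecorder) RecordError(err error) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.errs = append(r.errs, err)
}

// refreshAll calls refresh once per shard, concurrently, and returns
// every error that was recorded (nil if all refreshes succeeded).
func refreshAll(shards []string, refresh func(shard string) error) []error {
	rec := &errorRecorder{}
	var wg sync.WaitGroup
	for _, s := range shards {
		wg.Add(1)
		go func(shard string) {
			defer wg.Done()
			if err := refresh(shard); err != nil {
				rec.RecordError(fmt.Errorf("refresh %s: %w", shard, err))
			}
		}(s)
	}
	wg.Wait() // all workers are done, so reading rec.errs is safe
	return rec.errs
}

func main() {
	// Hypothetical source and destination shards of a split.
	shards := []string{"source/0", "dest/-80", "dest/80-"}
	errs := refreshAll(shards, func(shard string) error {
		if shard == "dest/80-" {
			return errRefreshFailed // simulate one failing refresh
		}
		return nil
	})
	for _, err := range errs {
		fmt.Println(err)
	}
}
```

Collecting all errors means one unreachable tablet does not hide failures on the other shards, which matters when the goal is for every shard to end up with the latest shardInfo data.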
The fix we (@rafael, @sougou and I) discussed yesterday: to expand the check in But now I get that error due to replicas and rdonly partitions appearing in the SrvKeyspace. I guess one of the tablet refreshes introduced by this PR is causing this.
@rafael, @makinje16, I think you will need to log the SrvKeyspaces in the master branch and in your PR and see at which point the partitions deviate. I don't really know enough about the old workflows to debug this efficiently.
Which error exactly are you seeing? I'm curious whether it's part of the same problem. Is it really the SrvKeyspace that is deviating, or is the tablet starting query service because all the tablet control metadata has been removed, since we clean up the state when finishing the migration?
Looks like the replica/rdonly tablets might be serving again after a refresh.
d98f824 to 57d825d
57d825d to 01112e2
d8036c5 to 04dd074
After some debugging and refreshing our memories of how this part of the system works, here is the key takeaway on why we had so much trouble getting this test to pass: we noticed something that doesn't look right in the canServe function in This is the function:
Currently, we remove tabletControls after finishing the migration. This means that if we refresh the source tablets, they could go back to serving. It currently works somewhat by luck for the MASTER type: we create the reverse replication stream as part of the migration, and query service can't start while that stream is present. However, RDONLY and REPLICA tablets will go back to serving as soon as we refresh their state.
The concrete proposal to improve this edge case is to add extra validations in In the meantime, this LGTM to merge as is.
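To make the edge case concrete, here is a small sketch of the behavior described in this comment. It is not the actual canServe function. `TabletControl`, `shardState`, `HasReverseReplication`, and this `canServe` are simplified, hypothetical stand-ins that only model the two conditions mentioned above: a tablet control that disables query service, and the reverse replication stream that blocks MASTER.

```go
package main

import "fmt"

// TabletType is a simplified, hypothetical stand-in for the topo tablet type.
type TabletType string

const (
	Master  TabletType = "MASTER"
	Replica TabletType = "REPLICA"
	Rdonly  TabletType = "RDONLY"
)

// TabletControl is a simplified stand-in for a per-type control record.
type TabletControl struct {
	Type                TabletType
	DisableQueryService bool
}

// shardState models what a source tablet sees when its state is refreshed.
type shardState struct {
	Controls []TabletControl
	// HasReverseReplication is a hypothetical flag: the reverse replication
	// stream created by the migration exists on the source.
	HasReverseReplication bool
}

// canServe is an illustrative check, NOT the Vitess implementation.
func canServe(s shardState, t TabletType) bool {
	for _, tc := range s.Controls {
		if tc.Type == t && tc.DisableQueryService {
			return false // an explicit control keeps this type from serving
		}
	}
	// Only MASTER is protected by the reverse replication stream.
	if t == Master && s.HasReverseReplication {
		return false
	}
	return true
}

func main() {
	// During the migration: controls disable query service on the source.
	during := shardState{
		Controls: []TabletControl{
			{Type: Master, DisableQueryService: true},
			{Type: Replica, DisableQueryService: true},
			{Type: Rdonly, DisableQueryService: true},
		},
		HasReverseReplication: true,
	}
	// After the migration: the controls have been removed.
	after := shardState{HasReverseReplication: true}

	for _, t := range []TabletType{Master, Replica, Rdonly} {
		fmt.Printf("%s: during=%v after=%v\n", t, canServe(during, t), canServe(after, t))
	}
}
```

In this model, once the controls are gone nothing stops REPLICA and RDONLY on the source from serving after a refresh, while MASTER stays blocked only because of the reverse replication stream. An extra validation of the kind proposed above would need to cover the non-MASTER types explicitly.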
The code change looks good, but I have a question on one of the comments.
Signed-off-by: Malcolm Akinje <makinje@slack-corp.com>
04dd074 to aadba99
All comments have been addressed. Merging!
Description
This PR changes MigrateServedTypes to refresh both replica and rdonly tablets after every migration. We ran into an issue during a migration in which a replica was promoted to primary but hadn't been refreshed, since we initially only refreshed tablets with servedType. With the primaries migrated and refreshed but not the replicas, if there is a failover before the replicas have had their state refreshed, the promoted replica will still believe it is in a split state and will not serve until RefreshState is run on the host.
Related Issue(s)
Checklist
Deployment Notes
Impacted Areas in Vitess
Components that this PR will affect: