-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
4496 topo serving shards refactor #4631
4496 topo serving shards refactor #4631
Conversation
Initial stab at: vitessio#4496 Signed-off-by: Rafael Chacon <rafael@slack-corp.com>
Instead of updating the serving keyspace one time per shard, just do it a single time with all the shards that we need to update. Signed-off-by: Rafael Chacon <rafael@slack-corp.com>
Signed-off-by: Rafael Chacon <rafael@slack-corp.com>
* Some general code cleanup. * Discovered bug in cancelMasterMigration. When there are failures updating the topology, source shards will ended being assigned as nil. This will cause a panic reversing the migration. Signed-off-by: Rafael Chacon <rafael@slack-corp.com>
Signed-off-by: Rafael Chacon <rafael@slack-corp.com>
Signed-off-by: Rafael Chacon <rafael@slack-corp.com>
Signed-off-by: Rafael Chacon <rafael@slack-corp.com>
Signed-off-by: Rafael Chacon <rafael@slack-corp.com>
Signed-off-by: Rafael Chacon <rafael@slack-corp.com>
Signed-off-by: Rafael Chacon <rafael@slack-corp.com>
Signed-off-by: Rafael Chacon <rafael@slack-corp.com>
VtCombo was not initializing shard or serving keyspace graph. Now that serving data is derived from here, we need to make sure that this gets called before gates can serve traffic. Signed-off-by: Rafael Chacon <rafael@slack-corp.com>
Signed-off-by: Rafael Chacon <rafael@slack-corp.com>
Signed-off-by: Rafael Chacon <rafael@slack-corp.com>
…anges Signed-off-by: Rafael Chacon <rafael@slack-corp.com>
* I think this test was passing by luck because we didn't have any real operations that interacted with that cell. Given the global topology refactor part of this PR, this has changed and now this breaks. * I think the regression happened when samuel/go-zookeeper was introduced, but this was never discovered. This test was originally introduced in: vitessio@c685aaf Signed-off-by: Rafael Chacon <rafael@slack-corp.com>
Signed-off-by: Rafael Chacon <rafael@slack-corp.com>
Signed-off-by: Rafael Chacon <rafael@slack-corp.com>
Signed-off-by: Rafael Chacon <rafael@slack-corp.com>
Signed-off-by: Rafael Chacon <rafael@slack-corp.com>
Signed-off-by: Rafael Chacon <rafael@slack-corp.com>
Signed-off-by: Rafael Chacon <rafael@slack-corp.com>
…igration Signed-off-by: Rafael Chacon <rafael@slack-corp.com>
…shard Signed-off-by: Rafael Chacon <rafael@slack-corp.com>
Signed-off-by: Rafael Chacon <rafael@slack-corp.com>
…anymore Signed-off-by: Rafael Chacon <rafael@slack-corp.com>
ac08bec
to
d8afcf0
Compare
Signed-off-by: Rafael Chacon <rafael@slack-corp.com>
…egration test Signed-off-by: Rafael Chacon <rafael@slack-corp.com>
* Also adds more validations and tests when rebuilding serving keyspace Signed-off-by: Rafael Chacon <rafael@slack-corp.com>
Signed-off-by: Rafael Chacon <rafael@slack-corp.com>
Signed-off-by: Rafael Chacon <rafael@slack-corp.com>
Signed-off-by: Rafael Chacon <rafael@slack-corp.com>
Signed-off-by: Rafael Chacon <rafael@slack-corp.com>
Signed-off-by: Rafael Chacon <rafael@slack-corp.com>
Signed-off-by: Rafael Chacon <rafael@slack-corp.com>
Signed-off-by: Rafael Chacon <rafael@slack-corp.com>
@@ -115,6 +124,403 @@ func (ts *Server) GetSrvKeyspaceNames(ctx context.Context, cell string) ([]strin | |||
} | |||
} | |||
|
|||
// GetShardServingCells returns cells where this shard is serving | |||
func (ts *Server) GetShardServingCells(ctx context.Context, si *ShardInfo) (servingCells []string, err error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this function should not exist. We should instead just use all cells without assuming anything. And then deal with absence of ShardReference only if it's actually relevant.
Also, I think a ShardReference is supposed to be there in all cells, right? The only time it can be missing is if someone didn't do a Rebuild after creating a new shard.
In any case, it's better to go through all cells for these things.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This makes sense. I'll update accordingly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sugu - I was looking into this a bit more and I was thinking that it might be better do this in a future iteration.
There are some parts where I will need to make some slight modifications. For instance:
servingCells, err := wr.ts.GetShardServingCells(ctx, shardInfo)
if err != nil {
return err
}
// Check the Serving map for the shard, we don't want to
// remove a serving shard if not absolutely sure.
if !evenIfServing && len(servingCells) > 0 {
return fmt.Errorf("shard %v/%v is still serving, cannot delete it, use even_if_serving flag if needed", keyspace, shard)
}
or
func (wr *Wrangler) RemoveShardCell(ctx context.Context, keyspace, shard, cell string, force, recursive bool) error {
shardInfo, err := wr.ts.GetShard(ctx, keyspace, shard)
if err != nil {
return err
}
shardServingCells, err := wr.ts.GetShardServingCells(ctx, shardInfo)
// check the cell is in the list already
if !topo.InCellList(cell, shardServingCells) {
return fmt.Errorf("cell %v in not in shard info", cell)
}
It's not really big, but given that this PR is already huge. Thinking that making this change as a separate PR makes sense.
…-topo-serving-shards-refactor
Signed-off-by: Rafael Chacon <rafael@slack-corp.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💯
That Add/Delete split didn't get that much better, but I think the call sites read better. So, I guess it's a net win.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm slowly looking through parts of this diff to have a better understanding of what'll change when we merge it. Sending a few comments/questions.
// We rebuild keyspace iff: | ||
// 1) shard master is in a serving state. | ||
// 2) shard has served type for master (this is for backwards compatibility). | ||
if !(si.IsMasterServing || si.GetServedType(topodatapb.TabletType_MASTER) != nil) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not immediately clear to me why it wouldn't be necessarily anymore to update partition information for replica
and rdonly
served types for a shard even if the master isn't serving from that shard.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- This is one of the core parts of this change. @sougou and I realized that if a master is serving, from a logical perspective that's the only thing you need to know to generate the SrvKeyspaceGraph. You could be in one of these scenarios:
- User creates a new keyspace with new shards. In init tablet, because there are no overlapping shards IsMasterServing will be set to true. When you build the keyspace graph for this newly created keyspace, this code will generate the serving graph for all types: rdonly, replica, master.
- If you are creating shards in an existent keyspaces (to prepare for a split), InitTablet will IsMasterServing to false, and rebuild keyspace won't take those new shards into consideration for any type.
When you are in the process of migrating serving types, there is code there that will modify the serving keyspace graph to add/remove new shards accordingly. But that does not go through this code path.
_, err = wr.ts.UpdateShardFields(ctx, keyspace, shard, func(si *topo.ShardInfo) error { | ||
return si.UpdateServedTypesMap(servedType, cells, remove) | ||
}) | ||
// TODO: What should we do with this method? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like SetShardServedTypes()
should probably be deleted along with the vtctl command.
If not, this method probably at least raise an error
Description
Things to keep an eye
Pending Tasks