[topology] Global Topology Serving Shards refactor #4496

Closed
rafael opened this issue Jan 2, 2019 · 3 comments
@rafael (Member) commented Jan 2, 2019

Feature Description and motivation

In order to route to a given shard, Vitess needs to know whether or not that shard is Serving. Currently, there is no canonical way to determine this. The following is a proposal to simplify the global topology and serving keyspace graph generation so that we end up with a canonical representation of serving shards.

I see two main reasons to do this refactor now:

  • Not having a canonical way to know if a shard is serving introduces incidental complexity that bleeds into different parts of the code base (more details in the "How it works today" section below).
  • The current design has the side effect that you can't register vtgates in cells that don't have tablets. This creates problems for multi-cell region deployments, where it is possible to have vtgates in cells that don't have tablets. A workaround was proposed in [topology] Deprecate Shard cells #4482, but as part of that exploratory work it became clear that a higher-level refactor like the one proposed here is a better approach.

How it works today

Today there are three ways to know if a shard is serving or not:

  • Shard ServedTypes: This information is set in the shard topology record on creation.
  • TabletControl DisableQueryService: This is set during resharding workflows when migrating serving types.
  • SourceShards: This is set during split clone workflows. When set, tablets mark themselves as not serving.

You can see this in action in tablet state change.

It is important to note that this information is set at the cell level and gets replicated in different parts of the topology (Shard Cells, ServedTypes Cells, TabletControl, etc.).
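To make the three signals concrete, here is a minimal sketch of how the serving decision is assembled today. It assumes simplified stand-in types; Shard, TabletControl, TabletType, and isServing here are illustrative only, not the actual topodatapb messages or vttablet code.

```go
package shardserving

// TabletType is a simplified stand-in for topodata tablet types.
type TabletType int

const (
	MASTER TabletType = iota
	REPLICA
	RDONLY
)

// Shard is a stripped-down stand-in for the global Shard record.
type Shard struct {
	ServedTypes    map[TabletType][]string // tablet type -> cells it serves in
	TabletControls []TabletControl
	SourceShards   []string // non-empty while this shard is a split clone target
}

// TabletControl mimics the per-type, per-cell query service switch.
type TabletControl struct {
	TabletType          TabletType
	Cells               []string
	DisableQueryService bool
}

// inCells treats an empty cell list as "all cells" in this sketch.
func inCells(cell string, cells []string) bool {
	if len(cells) == 0 {
		return true
	}
	for _, c := range cells {
		if c == cell {
			return true
		}
	}
	return false
}

// isServing combines the three signals described above: the shard must
// advertise the tablet type for the cell, query service must not be disabled
// by a resharding workflow, and the shard must not be a split clone target.
func isServing(s *Shard, tabletType TabletType, cell string) bool {
	cells, ok := s.ServedTypes[tabletType]
	if !ok || !inCells(cell, cells) {
		return false
	}
	for _, tc := range s.TabletControls {
		if tc.TabletType == tabletType && tc.DisableQueryService && inCells(cell, tc.Cells) {
			return false
		}
	}
	return len(s.SourceShards) == 0
}
```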

Proposal

We can get to a canonical representation with the following changes:

  • Serving KeyspaceGraph information (which is stored in each cell) becomes the source of truth for serving shards.
  • Deprecate the use of ServedTypes/TabletControls.DisableQueryService/Cells from Shard record.
  • Add a new field to the Shard record to mark serving state (IsMasterServing). The value of this attribute can keep the same semantics we have today for CreateShard (i.e., on shard creation, it will guarantee that there are no overlapping shards in a serving state).
  • When generating the ServingKeyspace graph, if a record does not exist in a shard for a given cell, the IsMasterServing attribute will be used to derive RDONLY/REPLICA/MASTER serving types for the cell (see the sketch after this list).
  • When migrating serving types, instead of using TabletControls, ServingKeyspaceGraph will be updated directly in the local topology.
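As a rough illustration of the proposed shape, the sketch below reuses the simplified TabletType constants from the earlier sketch; ProposedShard, SrvKeyspacePartition, and the function names are hypothetical, not the real Vitess API. The global record carries only IsMasterServing, the per-cell serving graph is derived from it when no cell record exists, and migrations edit the local serving graph directly.

```go
// ProposedShard is a hypothetical stand-in for the global Shard record after
// deprecating ServedTypes, TabletControls.DisableQueryService and Cells.
type ProposedShard struct {
	Name            string
	IsMasterServing bool // keeps the CreateShard non-overlap guarantee
}

// SrvKeyspacePartition maps a tablet type to the shards serving it in one cell.
type SrvKeyspacePartition struct {
	TabletType    TabletType
	ServingShards []string
}

// rebuildServingGraph derives a cell's serving graph when the cell has no
// record of its own: IsMasterServing alone decides whether the shard serves
// MASTER, REPLICA and RDONLY traffic.
func rebuildServingGraph(shards []ProposedShard) []SrvKeyspacePartition {
	var serving []string
	for _, s := range shards {
		if s.IsMasterServing {
			serving = append(serving, s.Name)
		}
	}
	return []SrvKeyspacePartition{
		{TabletType: MASTER, ServingShards: serving},
		{TabletType: REPLICA, ServingShards: serving},
		{TabletType: RDONLY, ServingShards: serving},
	}
}

// migrateServedType sketches the proposed migration path: instead of writing
// TabletControls to the global Shard record, the local serving graph for the
// cell is updated directly.
func migrateServedType(partitions []SrvKeyspacePartition, tabletType TabletType, newShards []string) []SrvKeyspacePartition {
	for i := range partitions {
		if partitions[i].TabletType == tabletType {
			partitions[i].ServingShards = newShards
		}
	}
	return partitions
}
```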

This covers the core of the refactor. There is one caveat: during SplitClone, a change to the tablet API will be needed so it can write to a non-serving shard. Right now that works because non-serving shards are writable due to the use of SourceShards in tablet state change (see the sketch below).
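For context, here is a minimal sketch of the behavior the caveat relies on today, continuing the simplified Shard type and isServing helper from the "How it works today" sketch; allowWrites is illustrative, not the actual tablet code.

```go
// allowWrites illustrates why split clone works today: the target shard is
// not serving, but it still accepts writes because SourceShards is set.
func allowWrites(s *Shard, tabletType TabletType, cell string) bool {
	if isServing(s, tabletType, cell) {
		return true
	}
	// Split clone target: not serving, yet writable so the copy can proceed.
	return len(s.SourceShards) > 0
}
```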

Besides giving us a canonical representation of serving shards, this refactor also reduces the dependency on global topology and removes the complexity around keeping cell information in the Shard record.

rafael added a commit to tinyspeck/vitess that referenced this issue Feb 5, 2019
Initial stab at: vitessio#4496

Signed-off-by: Rafael Chacon <rafael@slack-corp.com>
@mpawliszyn (Collaborator) commented:

Is this upgrading the lockserver to a single point of failure for general operation of the system? Does it mean the lockserver must always be writable in order to do any type of failover or regular maintenance of the system?

The current way means that the overall configuration of the system is in the lockserver, but the authority on who is serving is the job of the tablet healthcheck. This makes the system resilient to problems with the lockserver. We have had situations where the lockserver had an issue but Vitess kept chugging along. This is exactly the kind of resilience we need for high-capacity, highly available systems.

We have noticed a large increase in the load on our lockserver and wonder if changes along these lines could be to blame.

@rafael (Member, Author) commented Feb 19, 2019

Hi @mpawliszyn -

The ultimate goal of this refactor is to reduce dependencies on the global topo for general operation of the system. Once we are done with these changes, only the lockservers in the relevant cells will need to be writable when doing a failover/PlannedReparent or adding tablets to a cell. Having said this, even when we are done with #4496 we will still be a long way from those goals. This work reduces dependencies on the global topo, but it does not remove all of them.

The following assumption is correct and won't change after this refactor:

"The current way means that the overall configuration of the system is in the lockserver, but the authority on who is serving is the job of the tablet healthcheck."

What do you mean by regular maintenance of the system?

Could you elaborate more on the last bit too? I'm not sure I'm following. None of the changes proposed in this work have been merged yet, so they can't be changing current system behavior.

@mpawliszyn (Collaborator) commented:

Yeah, if nothing has been checked in then I guess it cannot be this particular change. Glad that we are still using the tablet's own information for figuring out what is healthy and what is master.

Keeping the existing system going, by doing reparents, handling an outage on a tablet, or even OS upgrades, is the kind of regular maintenance that we should be able to do without a functioning lockserver, and it sounds like that will still be the case. Otherwise, as our systems get bigger, we are at a higher risk of cascading failures.
