[topology] Global Topology Serving Shards refactor #4496
Comments
Initial stab at: vitessio#4496 Signed-off-by: Rafael Chacon <rafael@slack-corp.com>
Is this upgrading the lockserver to a single point of failure for general operation of the system? Does it mean the lockserver must always be writable in order to do any type of failover or regular maintenance to the system? The current approach keeps the overall configuration of the system in the lockserver, but the authority over who is serving is the job of the tablet healthcheck. This makes the system resilient to any problems with the lockserver. We have had situations where the lockserver had an issue but Vitess kept chugging along. This is exactly the kind of resilience we need for high-capacity, highly available systems. We have noticed a large increase in the load on our lockserver and wonder if changes to this end could be to blame.
Hi @mpawliszyn - The ultimate goal of this refactor is to reduce dependencies on the global topo for general operations of the system. Once we are done with these changes, only the lockserver in the relevant cells should need to be available for writes when doing a failover/PlannedReparent/adding tablets to a cell. Having said this, even when we are done with #4496 we will be a long way from those goals. I'll be reducing dependencies on the global topo, but not all of them. The following assumption is correct and won't change after this refactor: the tablet healthcheck remains the authority on which tablets are healthy and serving.
What do you mean by regular maintenance of the system? Could you elaborate more on the last bit too? I'm not sure I'm following. None of the changes proposed in this work are merged yet, so they can't change current system behavior.
Yeah, if nothing has been checked in then I guess it cannot be this particular change. Glad that we are still using the tablet's own information for figuring out what is healthy and what is master. Keeping the existing system going by doing reparents and handling an outage on a tablet, or even OS upgrades, is the kind of regular maintenance that we should be able to do without a functioning lockserver, and it sounds like that will still be the case. Otherwise, as our systems get bigger, we are at a higher risk of cascading failures.
Feature Description and motivation
In order to route to a given shard, Vitess needs to know whether or not a shard is Serving. Currently, there is no canonical way to determine this. The following is a proposal to simplify Global topology and serving keyspace graph generation that will allow us to have a canonical representation of serving shards.
I see two main reasons to do this refactor right now:
How it works today
Today there are three ways to know if a shard is serving or not:
You can see this in action in tablet state change.
It is important to note that this information is set at the cell level and gets replicated in different parts of the topology (Shard cells, ServedTypes Cells, TabletControl, etc).
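To make this concrete, here is a minimal, illustrative Go sketch of the kind of multi-field, cell-scoped check that is needed today to decide whether a shard is serving. These are not the actual Vitess topo protos or API; all type, field, and function names here are assumptions for the sake of the sketch.

```go
// Simplified, illustrative model of the Shard record fields consulted today
// to decide whether a shard is serving a tablet type in a cell.
package main

import "fmt"

type ServedType struct {
	TabletType string   // e.g. "master", "replica", "rdonly"
	Cells      []string // cells in which this tablet type is served
}

type TabletControl struct {
	TabletType          string
	Cells               []string
	DisableQueryService bool
}

type Shard struct {
	Cells          []string // cells in which this shard has tablets
	ServedTypes    []ServedType
	TabletControls []TabletControl
}

func contains(cells []string, cell string) bool {
	for _, c := range cells {
		if c == cell {
			return true
		}
	}
	return false
}

// isServing mirrors the kind of check described above: the cell must be
// listed on the shard, the tablet type must be served in that cell, and
// query service must not be disabled by a TabletControl for that cell.
func isServing(s *Shard, cell, tabletType string) bool {
	if !contains(s.Cells, cell) {
		return false
	}
	served := false
	for _, st := range s.ServedTypes {
		if st.TabletType == tabletType && contains(st.Cells, cell) {
			served = true
		}
	}
	if !served {
		return false
	}
	for _, tc := range s.TabletControls {
		if tc.TabletType == tabletType && contains(tc.Cells, cell) && tc.DisableQueryService {
			return false
		}
	}
	return true
}

func main() {
	shard := &Shard{
		Cells:       []string{"cell1", "cell2"},
		ServedTypes: []ServedType{{TabletType: "replica", Cells: []string{"cell1"}}},
	}
	fmt.Println(isServing(shard, "cell1", "replica")) // true
	fmt.Println(isServing(shard, "cell2", "replica")) // false
}
```

The point of the sketch is that the answer depends on several fields that are duplicated across the topology, which is what makes a canonical representation hard to pin down today.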
Proposal
We can get to a canonical representation with the following changes. Remove the following from the Shard record:
- ServedTypes
- TabletControls.DisableQueryService
- Cells

Serving information is then maintained per cell in the serving keyspace graph, which becomes the canonical representation of serving shards for the cell.
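A minimal sketch of what the check could look like after the refactor, under the assumption that the per-cell serving keyspace graph (a SrvKeyspace-like record) becomes the single place that answers "is this shard serving this tablet type in this cell?". Type and function names are illustrative, not the real Vitess API.

```go
// Illustrative model of a per-cell serving keyspace graph: a shard is serving
// in the cell if and only if it appears in the partition for the tablet type.
package main

import "fmt"

type ShardReference struct {
	Name string // e.g. "-80", "80-"
}

type KeyspacePartition struct {
	TabletType      string
	ShardReferences []ShardReference
}

// SrvKeyspace is maintained per cell, so no global lookup is needed to route.
type SrvKeyspace struct {
	Partitions []KeyspacePartition
}

func isShardServing(sk *SrvKeyspace, tabletType, shardName string) bool {
	for _, p := range sk.Partitions {
		if p.TabletType != tabletType {
			continue
		}
		for _, ref := range p.ShardReferences {
			if ref.Name == shardName {
				return true
			}
		}
	}
	return false
}

func main() {
	sk := &SrvKeyspace{
		Partitions: []KeyspacePartition{
			{TabletType: "master", ShardReferences: []ShardReference{{Name: "-80"}, {Name: "80-"}}},
		},
	}
	fmt.Println(isShardServing(sk, "master", "-80")) // true
	fmt.Println(isShardServing(sk, "rdonly", "-80")) // false
}
```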
This covers the core of the refactor. There is one caveat: during SplitClone, a change to the tablet API will need to be made so it can write to a non-serving shard. Right now that works because non-serving shards are writable due to the use of SourceShards in tablet state change (see the sketch below).
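A hedged sketch of that caveat, with illustrative names only (the real logic lives in Vitess's tablet state change code): today a non-serving destination shard still accepts writes because SourceShards is set while it is a resharding destination, which is what lets SplitClone copy data into it. The proposal would replace this implicit rule with an explicit tablet API call to allow writes to a non-serving shard.

```go
// Illustrative model of the current behavior around SourceShards.
package main

import "fmt"

type SourceShard struct {
	Keyspace string
	Shard    string
}

type ShardInfo struct {
	IsMasterServing bool          // after the refactor, derived from the per-cell serving graph
	SourceShards    []SourceShard // non-empty while this shard is a resharding destination
}

// allowWrites models today's implicit rule: a non-serving shard still accepts
// writes when it has SourceShards, which is what lets SplitClone populate it.
func allowWrites(si *ShardInfo) bool {
	if si.IsMasterServing {
		return true
	}
	return len(si.SourceShards) > 0
}

func main() {
	dest := &ShardInfo{
		IsMasterServing: false,
		SourceShards:    []SourceShard{{Keyspace: "commerce", Shard: "0"}},
	}
	fmt.Println(allowWrites(dest)) // true: the resharding destination stays writable
}
```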
Besides giving us a canonical representation of serving shards, this refactor also reduces the dependency on global topology and removes the complexity around keeping cell information in the Shard record.