RFC: tablet throttler multi-metrics #15624

shlomi-noach · 2024-04-03T09:24:37Z

Today, the tablet throttler throttles on a single metric. That metric is dynamically configurable, but there is only ever one at a time. The default metric is replication lag, and it can be replaced with any query that returns a scalar value, e.g. one that returns Threads_running.

We want the throttler to measure multiple metrics at once, and we want to be able to throttle based on a selective list of metrics. Such metrics could be:

To that effect, we want:

Tablets to always collect multiple metrics on themselves (on their own host or their designated MySQL server)
PRIMARY tablet to always collect all available metrics from the replica tablets
Metrics should be identifiable by a designated name
Throttler check requests (mostly via throttler clients) should be able to specify the list of metrics on which they wish to throttle (e.g. "I care about replication lag, but fine to ignore load average")
Users should be able to control the list of metrics for VReplication workflows (exactly how is to be decided), and specifically for Online DDL. We will likely want to apply the same list of metrics to all workflows (i.e. different workflows do not each need their own list of metrics on which to throttle)
Modifying the list of metrics should apply dynamically to running workflows.
Throttler configuration should include expected thresholds per metric name.
We continue to apply throttler configuration across the keyspace (all tablets in all shards of a given keyspace align on the same single configuration)

Introducing multi-metrics dimension explodes the complexity of the throttler code. However, we are thankfully also able to reduce the complexity by getting rid of dimensions that we don't really use or need, and which were inherited from freno:

Clusters: today we use self and shard, but self isn't really a cluster, and the code largely handles it differently than shard. We can therefore remove the "cluster" or "store" dimension.
- Likewise we can also remove the per-cluster configuration overrides.
Store types: we only use MySQL. We can remove the dimension.
Probe settings: we always probe by tablet, and the probe layer is mostly redundant.
Other.
We will need to be backwards compatible: multi-metric PRIMARY should work with v19 replicas, and vice versa.

This will require a major rewrite, with some temporarily redundant code to support backwards compatibility. Hopefully we can simplify some existing complexities inherited from freno, or technical debt we've accumulated since.

Unit tests and endtoend tests will remain (and expand) to protect us against incompatibilities.


shlomi-noach · 2024-04-04T15:46:22Z

Observability: we should be able to track why a certain client was throttled, ie which specific metric it was throttled on.

shlomi-noach · 2024-04-04T16:00:34Z

Throttler check requests (mostly via throttler clients) should be able to specify the list of metrics on which they wish to throttle (e.g. "I care about replication lag, but fine to ignore load average")

The set of metrics specified by the client will AND with each other, ie if the client chooses to throttle based on lag,loadavg then both lag and loadavg need to individually pass for the overall check to pass.

I don't think it makes sense to OR or to have any other combination.
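The AND semantics described above can be sketched as follows. This is a minimal illustration, not the throttler's code: `MetricResult`, `check`, and the rule that a metric passes while its value is at or below its threshold are all assumptions made for the example. Returning the violating metric names also covers the observability point raised earlier, i.e. knowing which metric a client was throttled on.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricResult:
    """Hypothetical container for one collected metric."""
    value: float
    threshold: float

    def ok(self) -> bool:
        # Assumption: a metric passes while its value does not exceed its threshold.
        return self.value <= self.threshold


def check(requested, results):
    """AND all requested metrics: the check passes only if every one passes.

    Returns (passed, violations), where violations lists the metric names
    that failed, in the order they were requested.
    """
    violations = [name for name in requested if not results[name].ok()]
    return (not violations, violations)


# Illustrative values only.
results = {
    "lag": MetricResult(value=0.5, threshold=5.0),
    "loadavg": MetricResult(value=2.5, threshold=1.0),
}
print(check(["lag"], results))
print(check(["lag", "loadavg"], results))
```

A client that only asks for `lag` passes here, while one that asks for `lag,loadavg` is rejected because `loadavg` alone fails.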

shlomi-noach · 2024-04-16T11:59:23Z

As mentioned above, we want to be able to change the list of considered metrics while an operation, say an Online DDL migration, is running. For example, we may want Online DDL to start out throttling on both lag and load average, and later stop throttling on load average and remain with just lag.

IMO the way to do that is to associate metrics with an app name. All Online DDL operations use the app name "online-ddl". So the way would be to associate "online-ddl": "lag,loadavg".

That association will then either

~~make its way to the throttler client -- which then provides to the throttler the list of metrics it's interested in,~~
or, keeping the throttler client ignorant, be computed on behalf of the client by the throttler.

shlomi-noach · 2024-05-15T08:17:32Z

Metrics can be collected from the single tablet being probed, or from the collective shard.

Replication lag is normally something you wish to collect from the entire shard (including primary), because you want to know about the replicas' lag. There is a strong reason to check on all shard servers.
What about load average? Are you concerned with the load average on the PRIMARY or are you concerned about the metric on replicas? There is no clear answer and you probably want to check on PRIMARY only.

To that effect:

A metric is associated with a scope (self/shard). Each metric has a default scope. lag uses shard, others use self.
A normal check will use the default scopes (per metric).
But the user may also indicate "I wish to check the entire shard for all metrics" or "I wish to check self scope for all metrics", in which case we override the metrics' defaults.

Moreover, consider the discussion in the previous comment re: associating metrics with apps. It will be possible to fine-tune the checks even further by associating "online-ddl": "lag,shard/loadavg". Note:

the scope is not mandatory (nothing declared for lag, and so the scope for lag is the default one for this metric, which happens to be shard).
per-metric scopes are ignored by the self-checks, which are the mechanism by which the tablets collect their own metrics and by which the PRIMARY tablet collects metrics from the replicas.
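To make the `lag,shard/loadavg` notation concrete, here is a small parsing sketch. The function name, the `DEFAULT_SCOPES` table, and the precedence (an explicit per-metric scope wins, then a check-wide override, then the metric's default) are assumptions for illustration; the thread only states that lag defaults to shard, other metrics default to self, and that a check-wide scope overrides the defaults.

```python
# Per the discussion: lag defaults to the shard scope, everything else to self.
DEFAULT_SCOPES = {"lag": "shard"}
VALID_SCOPES = ("self", "shard")


def parse_app_metrics(spec, override_scope=None):
    """Parse a spec such as "lag,shard/loadavg" into (scope, metric) pairs.

    Assumed precedence: explicit "scope/metric" prefix, then override_scope,
    then the metric's default scope.
    """
    parsed = []
    for token in spec.split(","):
        token = token.strip()
        if not token:
            continue
        scope, sep, name = token.partition("/")
        if not sep:
            # No explicit scope was declared for this metric.
            name = token
            scope = override_scope or DEFAULT_SCOPES.get(name, "self")
        if scope not in VALID_SCOPES:
            raise ValueError(f"unknown scope {scope!r} in {token!r}")
        parsed.append((scope, name))
    return parsed


print(parse_app_metrics("lag,shard/loadavg"))
print(parse_app_metrics("lag,loadavg", override_scope="self"))
```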

shlomi-noach · 2024-05-15T08:19:24Z

Adding support for an all app, which is a catch-all for anything that doesn't have any specific rules. With all, it is possible to do inverted rules, such as "everything is rejected, except this app which is allowed". Or, "everything throttles at 0.7 ratio for the next 2 hours, except these two apps, one of which is exempted for the next 5 hours, the other throttled at 0.2 ratio for the next 30min". Or also "everything is exempted, but this app needs to go through normal throttling".
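The catch-all lookup can be sketched roughly like this. `AppRule`, `effective_rule`, the app names `app-a` and `app-b`, and the choice that an expired app-specific rule falls back to the all rule are hypothetical; the sketch only mirrors the "0.7 for everyone, except two apps" example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AppRule:
    """Hypothetical rule: throttle ratio (0.0 to 1.0), expiry time, exemption."""
    ratio: float
    expires_at: float  # seconds, on the same clock as `now`
    exempt: bool = False


def effective_rule(rules, app_name, now):
    """Prefer the app's own unexpired rule, else the unexpired "all" rule."""
    for key in (app_name, "all"):
        rule = rules.get(key)
        if rule is not None and rule.expires_at > now:
            return rule
    return None


HOUR = 3600
rules = {
    "all": AppRule(ratio=0.7, expires_at=2 * HOUR),
    "app-a": AppRule(ratio=0.0, expires_at=5 * HOUR, exempt=True),
    "app-b": AppRule(ratio=0.2, expires_at=HOUR // 2),
}
print(effective_rule(rules, "app-b", now=0))
print(effective_rule(rules, "some-other-app", now=0))
```

At `now=0` the app `app-b` gets its own 0.2 ratio, while any app without a rule of its own gets the 0.7 ratio from all.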

shlomi-noach · 2024-05-15T08:21:04Z

Adding vtctldclient CheckThrottler command, which returns a detailed CheckThrottlerResponse. The command takes a tablet name as argument (potentially it could also take a shard name, much like Backup and BackupShard). It takes --app-name and --scope optional arguments as well as some extra flags.

shlomi-noach · 2024-05-16T05:58:36Z

Required additions to vtctldclient UpdateThrottlerConfig:

Updating the threshold for a given metric name. Setting threshold to 0 will remove the entry.
We can use the existing --threshold flag, and add a --metric-name=... flag. If the latter is given, then --threshold must be specified. If it is not given, then we assume the "default" metric.
Setting the per-app metrics. Something like --app-name=online-ddl --app-metrics=lag,shard/loadavg. The two flags must come together: either both are given, or neither is. It's OK to provide an empty --app-metrics, in which case the throttler uses the default metrics for the given app. --app-name must not be empty. It can be "all".
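The flag constraints listed above can be written down as a small validation routine. This is only a sketch of the stated rules, not the vtctldclient implementation; `validate_flags` and the use of `None` to mean "flag not given" are assumptions for the example.

```python
def validate_flags(threshold=None, metric_name=None, app_name=None, app_metrics=None):
    """Return a list of error strings; an empty list means the combination is valid.

    None means the flag was not given at all, which is different from an
    empty string (an empty --app-metrics is explicitly allowed).
    """
    errors = []
    if metric_name is not None and threshold is None:
        errors.append("--metric-name requires --threshold")
    if (app_name is None) != (app_metrics is None):
        errors.append("--app-name and --app-metrics must be given together")
    if app_name == "":
        errors.append("--app-name must not be empty")
    return errors


print(validate_flags(threshold=1.5, metric_name="loadavg"))
print(validate_flags(app_name="online-ddl", app_metrics=""))
print(validate_flags(app_name="all"))
```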

shlomi-noach · 2024-05-16T06:03:59Z

Eventually (v21/v22/v23, depending), we will deprecate these flags in vtctldclient UpdateThrottlerConfig:

--check-as-check-self
--check-as-check-shard
We will also clean up these fields from UpdateThrottlerConfigRequest:
CheckAsCheckSelf
CheckAsCheckShard

shlomi-noach · 2024-05-19T12:29:06Z

Assigning metrics to the "all" app should apply to all apps which do not already have any explicit metrics assigned:

$ vtctldclient UpdateThrottlerConfig --app-name "all" --app-metrics "lag,loadavg" commerce
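The resolution order implied here (explicit assignment, then the "all" assignment, then the throttler's defaults) might look like the sketch below. `metrics_for_app` is a hypothetical helper, and `DEFAULT_METRICS = ["lag"]` is an assumption standing in for whatever the throttler's default metric list is.

```python
# Assumption: stands in for the throttler's built-in default metric list.
DEFAULT_METRICS = ["lag"]


def metrics_for_app(app_metrics, app_name):
    """Resolve the metrics an app is checked against.

    An app's own non-empty assignment wins; otherwise the "all" assignment
    applies; otherwise the defaults are used.
    """
    for key in (app_name, "all"):
        names = app_metrics.get(key)
        if names:
            return names
    return DEFAULT_METRICS


assignments = {"all": ["lag", "loadavg"], "online-ddl": ["lag"]}
print(metrics_for_app(assignments, "online-ddl"))
print(metrics_for_app(assignments, "some-other-app"))
```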

shlomi-noach · 2024-05-21T12:18:14Z

Addressed by #15988

shlomi-noach · 2024-05-26T05:57:12Z

Base branch PR for changes: planetscale:throttler-multi-metrics-incremental #16012, onto which we will merge multiple incremental PRs.

shlomi-noach · 2024-06-19T14:52:24Z

Beyond #15988:

Change status codes from HTTP to formal proto/constants (suggested by @timvaillancourt in Tablet throttler: multi-metric support #15988 (comment)) - Throttler: CheckThrottlerResponseCode to replace HTTP status codes #16491
Add more metrics, namely vttablet pool usage (suggested by @timvaillancourt in Tablet throttler: multi-metric support #15988 (comment))
v22: deprecate MultiMetricsEnabled (to be assumed always to be true)
v22: UpdateThrottlerConfig to remove app rules that are expired. See contents of 28dd1d2

shlomi-noach · 2024-07-03T12:18:59Z

More beyond #15988:

Ensure primary throttler uses persistent connections to the replica - Confirmed
_vt.vreplication add column describing reason for throttling - Throttler: return app name in check result, synthesize "why throttled" explanation from result #16416
Introducing new metrics: create an interface and make it a pluggable-like development experience - Throttler: SelfMetric interface, simplify adding new throttler metrics #16469
CheckThrottlerResponse to include app. This can be different than the app used in CheckThrottlerRequest in these cases:
- The requesting app had no specific rules, but fell under an all app rule.
- The requesting app was an aggregate (vreplication:vplayer:online-ddl) and was e.g. rejected based on one of the tokens (e.g. online-ddl).
  See Throttler: return app name in check result, synthesize "why throttled" explanation from result #16416
Deprecate check-as-check-self in UpdateThrottlerConfig - Deprecate UpdateThrottlerConfig's --check-as-check-self and --check-as-check-shard flags #16507
Deprecate HTTP endpoints and SQL syntax - Docs: tablet throttler deprecations in v21 website#1801
v22: remove HTTP endpoints and SQL syntax
v22: remove HTTP StatusCode and related logic (keep status code in /check header response).
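As a rough illustration of why the app in CheckThrottlerResponse can differ from the requested one, the sketch below resolves an aggregate app name to the name whose rule actually applied. `matching_app` is hypothetical, and scanning the colon-separated tokens left to right is an assumption; the real matching order may differ.

```python
def matching_app(apps_with_rules, requested_app):
    """Return the app name whose rule would apply to the request.

    Aggregate names such as "vreplication:vplayer:online-ddl" are split on
    ":" and each token is tried in turn, then the "all" catch-all. If
    nothing matches, the requested name is returned unchanged.
    """
    for token in requested_app.split(":"):
        if token in apps_with_rules:
            return token
    if "all" in apps_with_rules:
        return "all"
    return requested_app


print(matching_app({"online-ddl", "all"}, "vreplication:vplayer:online-ddl"))
print(matching_app({"all"}, "vreplication:vplayer:online-ddl"))
```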

shlomi-noach · 2024-07-11T10:00:09Z

Reopening as there is a bit of followup.

shlomi-noach · 2024-07-25T06:43:17Z

the utilisation of the connection pools in vttablet is the most reliable "catch-all",

@timvaillancourt circling back to connection pool usage, how do you choose reasonable values?? Do you only throttle when the pool is completely exhausted (ie Available() drops to zero?)
Or otherwise how do you decide that 100 used connections is fine but 120 isn't?

timvaillancourt · 2024-08-23T21:16:31Z

the utilisation of the connection pools in vttablet is the most reliable "catch-all",

@timvaillancourt circling back to connection pool usage, how do you choose reasonable values?? Do you only throttle when the pool is completely exhausted (ie Available() drops to zero?) Or otherwise how do you decide that 100 used connections is fine but 120 isn't?

@shlomi-noach our plan in txthrottler is to use a percentage of pool usage and a threshold flag to cause low-priority workloads to be potentially throttled (probabilistic). If the pool usage is below this threshold, no workloads are throttled.
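One possible reading of that plan, sketched under stated assumptions: pool usage is expressed as a percentage of capacity, nothing is throttled below the threshold, and above it a workload is throttled with a probability derived from its priority. The function name, the 0 to 100 priority scale (higher meaning more likely to be throttled), and the injectable `rng` are all hypothetical and not taken from txthrottler.

```python
import random


def should_throttle(in_use, capacity, threshold_pct, priority, rng=random.random):
    """Probabilistically throttle once pool usage reaches the threshold.

    Assumption: priority ranges from 0 (never throttled) to 100 (always
    throttled once the pool is above the threshold).
    """
    usage_pct = 100.0 * in_use / capacity
    if usage_pct < threshold_pct:
        # Below the threshold no workload is throttled.
        return False
    return rng() * 100 < priority


# 90 of 100 connections in use, with an 80% threshold.
print(should_throttle(90, 100, threshold_pct=80, priority=100))
print(should_throttle(50, 100, threshold_pct=80, priority=100))
```

This sidesteps the "100 is fine but 120 isn't" question by making the cutoff relative to pool capacity instead of an absolute connection count.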

shlomi-noach added Type: Enhancement Logical improvement (somewhere between a bug and feature) Component: Throttler labels Apr 3, 2024

shlomi-noach self-assigned this Apr 3, 2024

shlomi-noach changed the title ~~Tracking: tablet throttler multi-metrics~~ RFC: tablet throttler multi-metrics Apr 4, 2024

shlomi-noach mentioned this issue May 21, 2024

Tablet throttler: multi-metric support #15988

Merged

5 tasks

shlomi-noach mentioned this issue May 26, 2024

Throttler multi-metrics: an incremental PR #16012

Closed

5 tasks

This was referenced May 26, 2024

Tablet throttler: remove LowPriority logic #16013

Merged

Tablet throttler multi-metrics incremental PR: introducing metric names and scopes planetscale/vitess#93

Closed

This was referenced Jun 2, 2024

Throttler multi-metrics: an incremental PR #16039

Closed

Tablet throttler multi-metrics incremental PR: introducing metric names and scopes #16041

Closed

shlomi-noach mentioned this issue Jul 11, 2024

Docs: multi-metrics throttler vitessio/website#1786

Merged

shlomi-noach closed this as completed in #15988 Jul 11, 2024

shlomi-noach reopened this Jul 11, 2024

shlomi-noach mentioned this issue Jul 24, 2024

Throttler: SelfMetric interface, simplify adding new throttler metrics #16469

Merged

5 tasks

This was referenced Sep 23, 2024

Feature Request: MySQL load average & used disk space metrics in multi-metric throttler #16822

Open

mysqld system metrics, with TabletManager rpc #16850

Merged

This was referenced Oct 7, 2024

Tablet throttler: read and use MySQL host metrics #16904

Open

Multi-metrics throttler: post v21 deprecations and changes #16915

Draft
