RFC: tablet throttler multi-metrics #15624

Open
13 of 14 tasks
shlomi-noach opened this issue Apr 3, 2024 · 16 comments · Fixed by #15988
Assignees
Labels
Component: Throttler Type: Enhancement Logical improvement (somewhere between a bug and feature)

Comments

@shlomi-noach
Contributor

shlomi-noach commented Apr 3, 2024

Today, the tablet throttler uses a single metric by which to throttle. This metric is dynamically configurable, but there is only ever one. The default metric is replication lag, and it can be replaced by any query that returns a scalar value, e.g. one that returns Threads_running.

We want the throttler to measure multiple metrics at once, and we want to be able to throttle based on a selective list of metrics. Such metrics could be:

  • Replication lag
  • Threads_running
  • Custom query
  • Load average on tablet host (per core)
  • Other OS metrics

To that effect, we want:

  • Tablets to always collect multiple metrics on themselves (on their own host or their designated MySQL server)
  • PRIMARY tablet to always collect all available metrics from replica tablets
  • Metrics should be identifiable by a designated name
  • Throttler check requests (mostly via throttler clients) should be able to specify the list of metrics on which they wish to throttle (e.g. "I care about replication lag, but fine to ignore load average")
  • User should be able to control the list of metrics for VReplication workflows (to be decided exactly how). And specifically for Online DDL. We will likely want to apply the same list of metrics for all workflows (ie we don't need different workflows to each have a different list of metrics on which to throttle)
  • Modifying list of metrics should apply dynamically to running workflows.
  • Throttler configuration should include expected thresholds per metric name.
  • We continue to apply throttler configuration across the keyspace (all tablets in all shards of a given keyspace align on the same single configuration)

Introducing multi-metrics dimension explodes the complexity of the throttler code. However, we are thankfully also able to reduce the complexity by getting rid of dimensions that we don't really use or need, and which were inherited from freno:

  • Clusters: today we use self and shard, but self isn't really a cluster, and the code largely handles it differently from shard. We can therefore remove the "cluster" or "store" dimension.

    • Likewise we can also remove the per-cluster configuration overrides.
  • Store types: we only use MySQL, so we can remove the dimension.

  • Probe settings: we always probe by tablet, and the probe layer is mostly redundant.

  • Other.

  • We will need to be backwards compatible: multi-metric PRIMARY should work with v19 replicas, and vice versa.

This will cause a major rewrite, with some temporarily redundant code to support backwards compatibility. Hopefully we can simplify some existing complexities inherited from freno, or technical debt we've accumulated since.

Unit tests and endtoend tests will remain (and expand) to protect us against incompatibilities.

@shlomi-noach shlomi-noach added Type: Enhancement Logical improvement (somewhere between a bug and feature) Component: Throttler labels Apr 3, 2024
@shlomi-noach shlomi-noach self-assigned this Apr 3, 2024
@shlomi-noach shlomi-noach changed the title Tracking: tablet throttler multi-metrics RFC: tablet throttler multi-metrics Apr 4, 2024
@shlomi-noach
Contributor Author

Observability: we should be able to track why a certain client was throttled, ie which specific metric it was throttled on.

@shlomi-noach
Contributor Author

shlomi-noach commented Apr 4, 2024

Throttler check requests (mostly via throttler clients) should be able to specify the list of metrics on which they wish to throttle (e.g. "I care about replication lag, but fine to ignore load average")

  • The set of metrics specified by the client will AND with each other, ie if the client chooses to throttle based on lag,loadavg then both lag and loadavg need to individually pass for the overall check to pass.

    I don't think it makes sense to OR or to have any other combination.
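
A minimal Go sketch of the AND semantics described above, which also reports the metric that caused the rejection (the observability point raised earlier). This is not the throttler's actual code: `CheckResult` and `checkMetrics` are hypothetical names, and the choices that a value equal to its threshold passes and that an unknown metric fails the check are assumptions made for illustration.

```go
package main

import "fmt"

// CheckResult is a hypothetical result type for this sketch.
type CheckResult struct {
	OK             bool
	ViolatedMetric string // first metric that failed; empty when OK
}

// checkMetrics ANDs the requested metrics: every one must individually pass.
// Assumptions: a value equal to its threshold passes, and a metric with no
// collected value or no configured threshold fails the check.
func checkMetrics(requested []string, values, thresholds map[string]float64) CheckResult {
	for _, name := range requested {
		value, hasValue := values[name]
		threshold, hasThreshold := thresholds[name]
		if !hasValue || !hasThreshold || value > threshold {
			return CheckResult{OK: false, ViolatedMetric: name}
		}
	}
	return CheckResult{OK: true}
}

func main() {
	values := map[string]float64{"lag": 0.5, "loadavg": 2.0}
	thresholds := map[string]float64{"lag": 1.0, "loadavg": 1.0}

	// A client that only cares about lag vs. one that cares about both.
	fmt.Printf("%+v\n", checkMetrics([]string{"lag"}, values, thresholds))
	fmt.Printf("%+v\n", checkMetrics([]string{"lag", "loadavg"}, values, thresholds))
}
```

With these sample values, a client asking only for lag passes, while a client asking for lag,loadavg is rejected on loadavg.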

@shlomi-noach
Contributor Author

shlomi-noach commented Apr 16, 2024

As mentioned above, we want to be able to change the list of considered metrics while an Online DDL operation is running (as an example). For instance, we may want Online DDL to start out throttling based on both lag and load average, and later stop throttling on load average and continue with lag only.

IMO the way to do that is to associate metrics with an app name. All Online DDL operations use the app name "online-ddl". So the way would be to associate "online-ddl": "lag,loadavg".

That association will then either

  • make its way to the throttler client, which then provides the throttler with the list of metrics it's interested in,
  • or, keeping the throttler client ignorant, be computed on behalf of the client by the throttler.

@shlomi-noach
Contributor Author

shlomi-noach commented May 15, 2024

Metrics can be collected from the single tablet being probed, or from the collective shard.

  • Replication lag is normally something you wish to collect from the entire shard (including primary), because you want to know about the replicas' lag. There is a strong reason to check on all shard servers.
  • What about load average? Are you concerned with the load average on the PRIMARY or are you concerned about the metric on replicas? There is no clear answer and you probably want to check on PRIMARY only.

To that effect:

  • A metric is associated with a scope (self/shard). Each metric has a default scope. lag uses shard, others use self.
  • A normal check will use the default scopes (per metric).
  • But the user may also indicate "I wish to check the entire shard for all metrics" or "I wish to check self scope for all metrics". In which case we override the metrics' defaults.

Moreover, consider the discussion in previous comment re: associating metrics with apps. It will be even further possible to fine grain the checks by associating "online-ddl": "lag,shard/loadavg". Note:

  • the scope is not mandatory (nothing declared for lag, and so the scope for lag is the default one for this metric, which happens to be shard).
  • per-metric scopes are ignored by the self-checks, which are the mechanism by which the tablets collect their own metrics and by which the PRIMARY tablet collects metrics from the replicas.
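
The scope rules above could be sketched in Go roughly as follows. This is an illustration only: `Scope`, `ScopedMetric`, `defaultScope` and `parseAppMetrics` are hypothetical names, and the sketch assumes that a check-wide scope replaces only the per-metric defaults while an explicit `scope/metric` prefix is kept as written.

```go
package main

import (
	"fmt"
	"strings"
)

type Scope string

const (
	ScopeSelf  Scope = "self"
	ScopeShard Scope = "shard"
)

// ScopedMetric is a hypothetical pairing of a metric name with its scope.
type ScopedMetric struct {
	Scope Scope
	Name  string
}

// defaultScope follows the rule above: lag uses shard, others use self.
func defaultScope(metric string) Scope {
	if metric == "lag" {
		return ScopeShard
	}
	return ScopeSelf
}

// parseAppMetrics parses a list such as "lag,shard/loadavg". A non-empty
// checkScope replaces the per-metric defaults (assumption: it does not
// replace a scope written explicitly in the list).
func parseAppMetrics(spec string, checkScope Scope) ([]ScopedMetric, error) {
	var result []ScopedMetric
	for _, token := range strings.Split(spec, ",") {
		token = strings.TrimSpace(token)
		if token == "" {
			continue
		}
		before, after, hasScope := strings.Cut(token, "/")
		metric := ScopedMetric{Name: before, Scope: defaultScope(before)}
		if checkScope != "" {
			metric.Scope = checkScope
		}
		if hasScope {
			scope := Scope(before)
			if (scope != ScopeSelf && scope != ScopeShard) || after == "" {
				return nil, fmt.Errorf("invalid scoped metric %q", token)
			}
			metric = ScopedMetric{Name: after, Scope: scope}
		}
		result = append(result, metric)
	}
	return result, nil
}

func main() {
	metrics, err := parseAppMetrics("lag,shard/loadavg", "")
	fmt.Println(metrics, err)
}
```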

@shlomi-noach
Contributor Author

shlomi-noach commented May 15, 2024

  • Adding support for an all app, which is a catch-all for anything that doesn't have any specific rules. With all, it is possible to do inverted rules, such as "everything is rejected, except this app which is allowed". Or, "everything throttles at 0.7 ratio for the next 2 hours, except these two apps, one of which is exempted for the next 5 hours, the other throttled at 0.2 ratio for the next 30min". Or also "everything is exempted, but this app needs to go through normal throttling".
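
One way to picture the catch-all lookup is the Go sketch below, which models the second scenario (0.7 ratio for everything, one app exempted, another at 0.2 ratio). `AppRule`, `effectiveRule` and the app names `app-a` and `app-b` are hypothetical, and the sketch assumes that an app whose own rule has expired falls back to the all rule; the comment above does not say so explicitly.

```go
package main

import (
	"fmt"
	"time"
)

// AppRule is a hypothetical throttling rule for one app.
type AppRule struct {
	Ratio     float64   // fraction of checks to reject, between 0 and 1
	Exempt    bool      // when true, the app bypasses throttling
	ExpiresAt time.Time // the rule is ignored from this instant on
}

// effectiveRule returns the app's own unexpired rule, else the "all" rule.
// Assumption: an expired app-specific rule falls back to "all".
func effectiveRule(rules map[string]AppRule, app string, now time.Time) (AppRule, bool) {
	for _, name := range []string{app, "all"} {
		if rule, found := rules[name]; found && now.Before(rule.ExpiresAt) {
			return rule, true
		}
	}
	return AppRule{}, false
}

// demoStart is an arbitrary fixed instant so the example is deterministic.
var demoStart = time.Date(2024, time.May, 15, 12, 0, 0, 0, time.UTC)

// at returns the instant the given number of minutes after demoStart.
func at(minutes int) time.Time {
	return demoStart.Add(time.Duration(minutes) * time.Minute)
}

// demoRules: everything at 0.7 for 2 hours, app-a exempted for 5 hours,
// app-b at 0.2 for 30 minutes.
func demoRules() map[string]AppRule {
	return map[string]AppRule{
		"all":   {Ratio: 0.7, ExpiresAt: at(120)},
		"app-a": {Exempt: true, ExpiresAt: at(300)},
		"app-b": {Ratio: 0.2, ExpiresAt: at(30)},
	}
}

func main() {
	rules := demoRules()
	for _, app := range []string{"app-a", "app-b", "other"} {
		rule, found := effectiveRule(rules, app, at(60))
		fmt.Printf("%s after 1h: found=%t ratio=%.1f exempt=%t\n", app, found, rule.Ratio, rule.Exempt)
	}
}
```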

@shlomi-noach
Contributor Author

shlomi-noach commented May 15, 2024

  • Adding a vtctldclient CheckThrottler command, which returns a detailed CheckThrottlerResponse. The command takes a tablet name as argument (potentially it could also take a shard name, much like Backup and BackupShard). It takes --app-name and --scope optional arguments as well as some extra flags.

@shlomi-noach
Contributor Author

shlomi-noach commented May 16, 2024

Required additions to vtctldclient UpdateThrottlerConfig:

  • Updating the threshold for a given metric name. Setting threshold to 0 will remove the entry.
    We can use the existing --threshold flag, and add a --metric-name=... flag. If the latter exists, then --threshold must be specified. If it does not exist, then we assume the "default" metric.
  • Setting the per-app metrics. Something like --app-name=online-ddl --app-metrics=lag,shard/loadavg. The two flags must come together: either both exist, or neither does. It's OK to provide an empty --app-metrics, in which case the throttler uses the default metrics for the given app. --app-name must not be empty. It can be "all".
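
The flag constraints listed above can be expressed as a small validation function. The Go sketch below is hypothetical: `UpdateFlags` and `validateUpdateFlags` are illustrative names that do not come from vtctldclient, and the `...Set` booleans stand in for however the real command detects that a flag was passed.

```go
package main

import (
	"errors"
	"fmt"
)

// UpdateFlags is a hypothetical, simplified view of the proposed flags.
type UpdateFlags struct {
	ThresholdSet  bool    // --threshold was passed
	Threshold     float64 // 0 removes the entry for the metric
	MetricName    string  // --metric-name; empty means the "default" metric
	AppNameSet    bool    // --app-name was passed
	AppName       string
	AppMetricsSet bool   // --app-metrics was passed
	AppMetrics    string // may be empty: use the app's default metrics
}

// validateUpdateFlags enforces the constraints described above.
func validateUpdateFlags(f UpdateFlags) error {
	if f.MetricName != "" && !f.ThresholdSet {
		return errors.New("--metric-name requires --threshold")
	}
	if f.AppNameSet != f.AppMetricsSet {
		return errors.New("--app-name and --app-metrics must be provided together")
	}
	if f.AppNameSet && f.AppName == "" {
		return errors.New("--app-name must not be empty")
	}
	return nil
}

func main() {
	ok := UpdateFlags{AppNameSet: true, AppName: "online-ddl", AppMetricsSet: true, AppMetrics: "lag,shard/loadavg"}
	bad := UpdateFlags{MetricName: "loadavg"}
	fmt.Println(validateUpdateFlags(ok))
	fmt.Println(validateUpdateFlags(bad))
}
```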

@shlomi-noach
Contributor Author

Eventually (v21/v22/v23, depending), we will deprecate these flags in vtctldclient UpdateThrottlerConfig:

  • --check-as-check-self
  • --check-as-check-shard
    We will also clean up these fields from UpdateThrottlerConfigRequest:
  • CheckAsCheckSelf
  • CheckAsCheckShard

@shlomi-noach
Contributor Author

shlomi-noach commented May 19, 2024

  • Assigning metrics to the "all" app should apply to all apps which do not already have any explicit metrics assigned:
$ vtctldclient UpdateThrottlerConfig --app-name "all" --app-metrics "lag,loadavg" commerce
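
The resulting lookup order might look like the Go sketch below: explicit per-app metrics first, then the "all" assignment, then the throttler's default. `metricsForApp` is a hypothetical helper, and the app name `vreplication` in the example is only a placeholder for an app with no explicit assignment.

```go
package main

import "fmt"

// metricsForApp resolves the metrics to check for an app: the app's own
// assignment, else the "all" assignment, else the given defaults.
// An empty assignment is treated the same as no assignment.
func metricsForApp(appMetrics map[string][]string, app string, defaults []string) []string {
	for _, name := range []string{app, "all"} {
		if metrics := appMetrics[name]; len(metrics) > 0 {
			return metrics
		}
	}
	return defaults
}

func main() {
	appMetrics := map[string][]string{
		"all":        {"lag", "loadavg"},
		"online-ddl": {"lag"},
	}
	defaults := []string{"lag"}

	fmt.Println(metricsForApp(appMetrics, "online-ddl", defaults))
	fmt.Println(metricsForApp(appMetrics, "vreplication", defaults))
}
```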

@shlomi-noach
Contributor Author

Addressed by #15988

@shlomi-noach
Contributor Author

Base branch PR for changes: planetscale:throttler-multi-metrics-incremental #16012, onto which we will merge multiple incremental PRs.

@shlomi-noach
Contributor Author

shlomi-noach commented Jun 19, 2024

Beyond #15988:

@shlomi-noach
Contributor Author

shlomi-noach commented Jul 3, 2024

More beyond #15988:

@shlomi-noach
Contributor Author

Reopening as there is a bit of followup.

@shlomi-noach
Contributor Author

the utilisation of the connection pools in vttablet is the most reliable "catch-all",

@timvaillancourt circling back to connection pool usage, how do you choose reasonable values?? Do you only throttle when the pool is completely exhausted (ie Available() drops to zero?)
Or otherwise how do you decide that 100 used connections is fine but 120 isn't?

@timvaillancourt
Contributor

the utilisation of the connection pools in vttablet is the most reliable "catch-all",

@timvaillancourt circling back to connection pool usage, how do you choose reasonable values?? Do you only throttle when the pool is completely exhausted (ie Available() drops to zero?) Or otherwise how do you decide that 100 used connections is fine but 120 isn't?

@shlomi-noach our plan in txthrottler is to use a percentage of pool usage and a threshold flag, so that low-priority workloads are potentially (probabilistically) throttled. If the pool usage is below this threshold, no workloads are throttled.
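
A rough Go sketch of that idea, not txthrottler's actual implementation. `shouldThrottle` is a hypothetical function, and the sketch assumes a priority scale of 0 to 100 where a higher number means a lower-priority workload that is more likely to be throttled. The random roll is passed in as a parameter so the decision can be tested deterministically.

```go
package main

import (
	"fmt"
	"math/rand"
)

// shouldThrottle decides whether to throttle one request.
// Below the pool-usage threshold nothing is throttled. At or above it, a
// request is throttled when roll (a random integer in [0, 100)) is less than
// its priority. Assumption: priority is in [0, 100] and a larger value means
// lower priority, so 0 is never throttled and 100 always is.
func shouldThrottle(inUse, capacity int, thresholdPercent float64, priority, roll int) bool {
	if capacity <= 0 {
		return false
	}
	usagePercent := 100 * float64(inUse) / float64(capacity)
	if usagePercent < thresholdPercent {
		return false
	}
	return roll < priority
}

func main() {
	// Pool of 100 connections, threshold at 80% usage, workload priority 50.
	for _, inUse := range []int{50, 90} {
		throttled := shouldThrottle(inUse, 100, 80, 50, rand.Intn(100))
		fmt.Printf("in use: %d, throttled: %t\n", inUse, throttled)
	}
}
```

This sidesteps the "100 is fine but 120 isn't" question by making the threshold relative to pool capacity and by throttling only a fraction of the low-priority traffic once it is crossed.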
