diff --git a/content/en/docs/reference/features/tablet-throttler.md b/content/en/docs/reference/features/tablet-throttler.md new file mode 100644 index 000000000..12d805fbf --- /dev/null +++ b/content/en/docs/reference/features/tablet-throttler.md @@ -0,0 +1,169 @@ +--- +title: Tablet throttler +aliases: ['/docs/user-guides/tablet-throttler/','/docs/reference/tablet-throttler/'] +--- + +VTTablet runs a cooperative throttling service. This service probes the shard's MySQL topology and observes replication lag on servers. This throttler is derived from GitHub's [freno](https://github.com/github/freno). + +_Note: the Vitess documentation is transitioning from the term "Master" (with regard to MySQL replication) to "Primary". this document reflects this transition._ + +## Why throttler: maintaining low replication lag + +Vitess uses MySQL with asynchronous or semi-synchronous replication. In these modes, each shard has a primary instance that applies changes and logs them to the binary log. The replicas for that shard will get binary log entries from the primary, potentially acknowledge them (if semi-synchronous replication is enabled), and apply them. A running replica normally applies the entries as soon as possible, unless it is stopped or configured to delay. However, if the replica is busy, then it may not have the resources to apply events in a timely fashion, and can therefore start lagging. For example, if the replica is serving traffic, it may lack the necessary disk I/O or CPU to avoid lagging behind the primary. + +Maintaining low replication lag is important in production for two reasons: + +- A lagging replica may not be representative of the data on the primary. Reads from the replica reflect data that is not consistent with the data on the primary. This is noticeable on web services following read-after-write from the replica, and this can produce results not reflecting the write. +- An up-to-date replica makes for a good failover experience. If all replicas are lagging, then a failover process must choose between waiting for a replica to catch up or losing data. + +Some common database operations include mass writes to the database, including the following: + +- Online schema migrations duplicating entire tables +- Mass population of columns, such as populating the new column with derived values following an `ADD COLUMN` migration +- Purging of old data +- Purging of tables as part of safe table `DROP` operation + +These operations can easily incur replication lag. However, these operations are typically not time-limited. It is possible to rate-limit them to reduce database load. + +This is where a throttler becomes useful. A throttler can detect when replication lag is low, a cluster is healthy, and operations can proceed. It can also detect when replication lag is high and advise applications to hold the next operation. + +Applications are expected to break down their tasks into small sub-tasks. For example, instead of deleting `1,000,000` rows, an application should only delete `50` at a time. Between these sub-tasks, the application should check in with the throttler. + +The throttler is only intended for use with operations such as the above mass write cases. It should not be used for ongoing, normal OLTP queries. + +## Throttler overview + +Each `vttablet` runs an internal throttler service, and provides API endpoints to the throttler. Only the primary throttler is doing actual work at any given time. The throttlers on the replicas are mostly dormant, and wait for their turn to become "leaders," such as when the tablet transitions into `MASTER` (primary) type. + +The primary tablet's throttler continuously does the following things: + +- The throttler confirms it is still the primary tablet for its shard. +- Every `10sec`, the throttler uses the topology server to refresh the shard's tablets list. +- The throttler probes all `REPLICA` tablets for their replication lag. This is done by querying the `_vt.heartbeat` table. + - The throttler begins in dormant probe mode. As long as no application or client is actually looking for metrics, it probes the servers at multi-second intervals. + - When applications check for throttle advice, the throttler begins probing servers in subsecond intervals. It reverts to dormant probe mode if no requests are made in the duration of `1min`. +- The throttler aggregates the last probed values from all relevant tablets. This is _the cluster's metric _. + +The cluster's metric is only as accurate as the following metrics: + +- The probe interval +- The heartbeat injection interval +- The aggregation interval + +The error margin equals approximately the sum of the above values, plus additional overhead. The defaults for these intervals are as follows: ++ Probe interval: `100ms` ++ Aggregation interval: `100ms` ++ Heartbeat interval: `250ms` + +The user may override the heartbeat interval by sending `-heartbeat_interval` flag to `vttablet`. + +Thus, the aggregated interval can be off, by default, by some `500ms`. This makes it inaccurate for evaluations that require high resolution lag evaluation. This resolution is sufficient for throttling purposes. + +The throttler allows clients and applications to `check` for throttle advice. The check is an `HTTP` request, `HEAD` method, or `GET` method. Throttler returns one of the following HTTP response codes as an answer: + +- `200` (OK): The application may write to the data store. This is the desired response. +- `404` (Not Found): The check contains an unknown metric name. This can take place immediately upon startup or immediately after failover, and should resolve within 10 seconds. +- `417` (Expectation Failed): The requesting application is explicitly forbidden to write. The throttler does not implement this at this time. +- `429` (Too Many Requests): Do not write. A normal, expected state indicating there is replication lag. This is the hint for applications or clients to withhold writes. +- `500` (Internal Server Error): An internal error has occurred. Do not write. + +Normally, apps will see either `200` or `429`. An app should only ever proceed to write to the database when it receives a `200` response code. + +The throttler chooses the response by comparing the replication lag with a pre-defined _threshold_. If the lag is lower than the threshold, response can be `200` (OK). If the lag is higher than the threshold, the response would be `429` (Too Many Requests). + +The throttler only collects and evaluates lag on a set of predefined tablet types. By default, this tablet type set is `REPLICA`. See [Configuration](#Configuration). + +When the throttler sees no relevant replicas in the shard, it allows writes by responding with `HTTP 200 OK`. + +## Configuration + + +- The throttler is currently disabled by default. Use the `vttablet` option `-enable-lag-throttler` to enable the throttler. + When the throttler is disabled, it still serves `/throttler/check` API and responds with `HTTP 200 OK` to all requests. + When the throttler is enabled, it implicitly also runs heartbeat injections. +- Use the `vttablet` flag `-throttle_threshold` to set a lag threshold value. The default threshold is `1sec` and is set upon tablet startup. For example, to set a half-second lag threshold, use the flag `-throttle_threshold=0.5s`. + + + +- To set the tablet types that the throttler queries for lag, use the `vttablet` flag `-throttle_tablet_types="replica,rdonly"`. The default tablet type is `replica`; this type is always implicitly included in the tablet types list. You may add any other tablet type. Any type not specified is ignored by the throttler. + +## API & usage + +Applicaitons use the API `/throttler/check`. + +- Applications may indicate their identity via `?app=` parameter. +- Applications may also declare themselves to be _low priority_ via `?p=low` param. Managed online schema migrations (`gh-ost`, `pt-online-schema-change`) do so, as does the table purge process. + +Examples: + +- `gh-ost` uses this throttler endpoint: `/throttler/check?app=gh-ost&p=low` +- A data backfill application may use this parameter: `/throttler/check?app=backfill` (using _normal_ priority) + +A `HEAD` request is sufficient. A `GET` request also provides a `JSON` output. For example: + +- `{"StatusCode":200,"Value":0.207709,"Threshold":1,"Message":""}` +- `{"StatusCode":429,"Value":3.494452,"Threshold":1,"Message":"Threshold exceeded"}` +- `{"StatusCode":404,"Value":0,"Threshold":0,"Message":"No such metric"}` + +In the first two above examples we can see that the tablet is configured to throttle at `1sec` + +Tablet also provides `/throttler/status` endpoint. This is useful for monitoring and management purposes. + +**Example: Healthy primary tablet** + +The following command gets throttler status on a tablet hosted on `tablet1`, serving on port `15100`. + +```shell +$ curl -s http://tablet1:15100/throttler/status | jq . +``` + +This API call returns the following JSON object: + +```json +{ + "Keyspace": "commerce", + "Shard": "80-c0", + "IsLeader": true, + "IsOpen": true, + "IsDormant": false, + "AggregatedMetrics": { + "mysql/local": { + "Value": 0.193576 + } + }, + "MetricsHealth": {} +} + +``` + + +`"IsLeader": true` indicates this tablet is active, is the `primary`, and is running probes. +`"IsDormant": false,` means that an application has recently issued a `check`, and the throttler is probing for lag at high frequency. + +**Example: replica tablet** + +The following command gets throttler status on a tablet hosted on `tablet2`, serving on port `15100`. + +```shell +$ curl -s http://tablet2:15100/throttler/status | jq . +``` + +This API call returns the following JSON object: + +```json +{ + "Keyspace": "commerce", + "Shard": "80-c0", + "IsLeader": false, + "IsOpen": true, + "IsDormant": true, + "AggregatedMetrics": {}, + "MetricsHealth": {} +} +``` + + +## Resources + +- [freno](https://github.com/github/freno) project page +- [Mitigating replication lag and reducing read load with freno](https://github.blog/2017-10-13-mitigating-replication-lag-and-reducing-read-load-with-freno/), a GitHub Engineering blog post