Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

vttablet throttling #6668

Merged
merged 36 commits into from
Oct 1, 2020
Merged

Conversation

shlomi-noach
Copy link
Contributor

@shlomi-noach shlomi-noach commented Sep 3, 2020

Context: #6661

This PR introduces a vttablet throttler. Based on https://github.com/github/freno, and adapted to vitess, a vttablet throttler runs on a MASTER tablet, and probes and serves throttling on its own cluster (ie keyspace+shard). The throttler only considers REPLICA servers in the topology (ie it skips RDONLY).

A throttler also runs on replicas, but does nothing, until such time that the replica becomes a MASTER.

The logic as imported from freno is more general purpose. Large parts of the code were stripped out. Some generalization still exists (e.g. the notion of multiple clusters, even though a throttler in this PR only handles a single cluster).

The code is tightly integrated with vitess: leadership determined by Tablet.Type; detection of REPLICA servers done directly via topo.Server, etc.

Ongoing work:

  • invoke the throttler
  • serve HTTP (to be used e.g. by pt-online-schema-change, gh-ost)
  • serve functionally (to be used, eg. by table purging logic (future work))
  • manual, static configuration
  • add to tabletmanager, tablet state manager
  • auto-generate MySQL account with random password whenever turns Leader
  • hard code replication lag query to read _vt.heartbeat:
select unix_timestamp(now(6))-max(ts/1000000000) from _vt.heartbeat

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
@shlomi-noach shlomi-noach marked this pull request as draft September 3, 2020 09:56
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
@shlomi-noach shlomi-noach changed the title WIP: vttablet throlling WIP: vttablet throttling Sep 3, 2020
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
- tracking `lastCheckTimeNano`
- isDormant(): infrequent sampling when no app runs checks
- upon transitioning to leader (table type is `MASTER`), create a MySQL account with random password and reconfigure throttler to use that password
  - ensure to re-attempt user creation until successful

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
- tracking `lastCheckTimeNano`
- isDormant(): infrequent sampling when no app runs checks
- upon transitioning to leader (table type is `MASTER`), create a MySQL account with random password and reconfigure throttler to use that password
  - ensure to re-attempt user creation until successful

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
@shlomi-noach
Copy link
Contributor Author

This is now pretty much feature complete and ready to be used (notwithstanding endtoend tests)

General behavior description:

  • throttler runs on all tablets as part of TabletServer
  • but is only actually doing the real work on a MASTER tablet
  • upon transitioning to MASTER, or on first time, throttler creates or modifies a MySQL vt_tablet_throttler account with random password
  • every 10 seconds, review the roster of REPLICA tablets via topo.Server
  • continuously poll said REPLICA tablets for replication lag
  • serve /throttler/check API endpoint
    • clients will optionally add ?p=low for low priority checks. gh-ost, pt-online-schema-change will do so
    • clients will optionally identify themselves via ?app=thename
  • poll frequency is high if clients issues a check API calls recently, and sparse if no checks have been made recently

@shlomi-noach
Copy link
Contributor Author

shlomi-noach commented Sep 6, 2020

Question: how should the throttler respond if there are no REPLICA tablets in the topology? Current behavior is to throttle. Actually, it's a form of error, here's the response in such a scenario:

$ curl -s http://sn-carbon:15101/throttler/check | jq .
{
  "StatusCode":500,
  "Value":0,
  "Threshold":1,
  "Message":"No hosts found"
}

should we instead return a OK? Does it make sense?

@shlomi-noach shlomi-noach marked this pull request as ready for review September 6, 2020 05:55
@shlomi-noach shlomi-noach changed the title WIP: vttablet throttling vttablet throttling Sep 6, 2020
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
@shlomi-noach
Copy link
Contributor Author

endtoend tests are good.

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
@shlomi-noach
Copy link
Contributor Author

shlomi-noach commented Sep 6, 2020

API examples:

on MASTER tablet

$ curl -s http://sn-carbon:15100/throttler/status | jq .
{
  "Keyspace": "commerce",
  "Shard": "0",
  "IsLeader": true,
  "IsOpen": true,
  "IsDormant": false,
  "AggregatedMetrics": {
    "mysql/local": {
      "Value": 0.074892
    }
  },
  "MetricsHealth": {}
}
curl -s http://sn-carbon:15100/throttler/check | jq -c
{"StatusCode":200,"Value":0.107066,"Threshold":1,"Message":""}
$ vtctl  -topo_implementation etcd2 -topo_global_server_address localhost:2379 -topo_global_root /vitess/global  StopReplication zone1-0000000101 
$ curl -s http://sn-carbon:15100/throttler/check | jq -c
{"StatusCode":429,"Value":3.14063,"Threshold":1,"Message":"Threshold exceeded"}
vtctl  -topo_implementation etcd2 -topo_global_server_address localhost:2379 -topo_global_root /vitess/global  ChangeTabletType zone1-0000000101 RDONLY
# wait...
$ curl -s http://sn-carbon:15100/throttler/check | jq -c
{"StatusCode":500,"Value":0,"Threshold":1,"Message":"No hosts found"}
$ vtctl  -topo_implementation etcd2 -topo_global_server_address localhost:2379 -topo_global_root /vitess/global  StartReplication zone1-0000000101 
$ vtctl  -topo_implementation etcd2 -topo_global_server_address localhost:2379 -topo_global_root /vitess/global  ChangeTabletType zone1-0000000101 REPLICA
# wait...
$ curl -s http://sn-carbon:15100/throttler/check | jq -c
{"StatusCode":200,"Value":0.113081,"Threshold":1,"Message":""}

on REPLICA tablet

Only the MASTER tablet is actually probing servers and serves throttle checks. Replicas will always reject throttling checks.

$ curl -s http://sn-carbon:15101/throttler/status | jq .
{
  "Keyspace": "commerce",
  "Shard": "0",
  "IsLeader": false,
  "IsOpen": true,
  "IsDormant": true,
  "AggregatedMetrics": {},
  "MetricsHealth": {}
}
$ curl -s http://sn-carbon:15101/throttler/check | jq -c
{"StatusCode":404,"Value":0,"Threshold":0,"Message":"No such metric"}

@shlomi-noach
Copy link
Contributor Author

This PR assumes heartbeat is injected. I suggest merging #6665 before this PR.

…ing a bug where CREATE USER hangs on MySQL 8.0.21

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
@shlomi-noach
Copy link
Contributor Author

Code refactored for different behavior on Open(), Close(): tickers now suspend work when the tabletserver is Close()d.

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
@shlomi-noach
Copy link
Contributor Author

shlomi-noach commented Sep 24, 2020

ping, request for review 🙏

@shlomi-noach
Copy link
Contributor Author

Per @sougou , if there's no replicas, the throttler should report HTTP 200 OK; current behavior is to report 500. We can merge this PR as it is, and I'll quickly iterate in a followup PR. This feature is still unused by any process at this time.

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
@shlomi-noach
Copy link
Contributor Author

OK, updated the code to return HTTP 200 when no replicas are found.

@dweitzman
Copy link
Member

Just a drive-by observation: seems like it might be worthwhile to be able to configure whether rdonly tablets are considered when looking at replication delay.

Many of us are running with only master and rdonly tablet types, since the replica vs rdonly distinction wasn't critical for us but rdonly is traditionally required for operations like shard splits.

For those of us with only master and rdonly tablets and no replica, we probably tend to use rdonly in a way that youtube might have used replica and do have an interest in being able to throttle based on rdonly lag

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
go/vt/vttablet/tabletserver/throttle/mysql/instance_key.go Outdated Show resolved Hide resolved
go/vt/vttablet/tabletserver/throttle/base/util.go Outdated Show resolved Hide resolved
go/vt/vttablet/tabletserver/throttle/base/http.go Outdated Show resolved Hide resolved
go/vt/vttablet/tabletserver/throttle/throttler.go Outdated Show resolved Hide resolved
go/vt/vttablet/tabletserver/throttle/base/app_throttle.go Outdated Show resolved Hide resolved
go/vt/vttablet/tabletserver/throttle/throttler.go Outdated Show resolved Hide resolved
go/vt/vttablet/tabletserver/tabletserver.go Show resolved Hide resolved
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
@shlomi-noach
Copy link
Contributor Author

shlomi-noach commented Sep 29, 2020

Introduced -throttle_tablet_types command line argument:

  • default is "replica"
  • the user may configure e.g. -throttle_tablet_types="replica,rdonly"

it's a comma delimited list of tablet types which the throttler checks for lag. Tablet types not in this list are completely ignored. This answers @dweitzman 's comment

assert.Equal(t, http.StatusNotFound, resp.StatusCode)
}

func TestThrottlerAfterMetricsCollected(t *testing.T) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it make sense for users to configure the polling interval?

go/vt/throttler/throttler_test.go Show resolved Hide resolved
go/vt/vttablet/tabletserver/state_manager.go Outdated Show resolved Hide resolved
go/vt/vttablet/tabletserver/state_manager.go Outdated Show resolved Hide resolved
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Copy link
Member

@harshit-gangal harshit-gangal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@shlomi-noach
Copy link
Contributor Author

Throttler now enables heartbeat writer, regardless of heartbeat_enable value. This means we don't need #6665 for now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants