Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adjust the store limit according to the cluster state dynamically #1902

Merged
merged 59 commits into from
Dec 30, 2019

Conversation

shafreeck
Copy link
Contributor

@shafreeck shafreeck commented Nov 6, 2019

Notice: This PR only could be merged after #1875 and #1887

What problem does this PR solve?

Adjust the store limit according to the cluster state dynamically

One of the main job of PD is to keep the data balanced on all TiKV nodes. The scheduler in PD moves regions between TiKVs when the unbalance is detected. In the case of adding or removing a TiKV node, the unbalance occurs, and the scheduler tries to move a lot amount of data. To avoid overloading a TiKV node, PD limits the speed to move regions for each TiKV node, the strategy is called store-limit.

The store-limit is a fixed value which is 15 regions per minute by default. This value is conservative, it prevents the TiKV from overloading in almost all cases. However, the value is too low for an idle cluster which may have no or little load. It takes too much time to balance the data when adding or removing a node. This PR tries to solve this problem by introducing the measurement of cluster state and adjust the store-limit dynamically according to the cluster state.

What is changed and how it works?

  • Leverage PR Calculate the state of the cluster using the CPU usages reported by TiKV #1875, we have basic statistics for the cluster state which define the cluster as idle if there are no keys or bytes accessed or the CPU usage is low.
  • The state of the cluster is calculated when receiving a store heartbeat, and the store limit is updated according to the latest state.
  • A configuration called StoreLimitScene is added for configuring the store-limit for different states, the valid states could be: idle, low, normal, high. The configuration can be updated by the HTTP API or pd-ctl.
  • Add a new subcommand for store: store limit-scene [<scene> <rate>]

Check List

Tests

  • Unit test
  • Manual test (add detailed scripts or steps below)

Code changes

  • Has HTTP API interfaces change

Related changes

  • Need to update the documentation
  • Need to be included in the release notes

@shafreeck shafreeck added type/enhancement The issue or PR belongs to an enhancement. DNM do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. labels Nov 6, 2019
@shafreeck shafreeck self-assigned this Nov 6, 2019
@shafreeck
Copy link
Contributor Author

/run-all-tests

@shafreeck shafreeck changed the title WIP: Adjust the store limit according to the cluster state dynamically Adjust the store limit according to the cluster state dynamically Nov 7, 2019
@shafreeck shafreeck removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 7, 2019
@shafreeck shafreeck force-pushed the schedule-in-cases branch 4 times, most recently from 6e7f819 to a776c31 Compare November 8, 2019 12:49
@codecov-io
Copy link

codecov-io commented Nov 21, 2019

Codecov Report

Merging #1902 into master will decrease coverage by 0.04%.
The diff coverage is 69.79%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1902      +/-   ##
==========================================
- Coverage   76.79%   76.75%   -0.05%     
==========================================
  Files         186      188       +2     
  Lines       19049    19162     +113     
==========================================
+ Hits        14628    14707      +79     
- Misses       3305     3335      +30     
- Partials     1116     1120       +4
Impacted Files Coverage Δ
server/server.go 82.98% <ø> (+0.38%) ⬆️
server/cluster/cluster_stat.go 79.56% <100%> (ø)
server/config/config.go 86.17% <100%> (+0.46%) ⬆️
server/schedule/store_limit.go 86.95% <100%> (ø) ⬆️
server/schedule/store_limit_scenes.go 100% <100%> (ø)
server/cluster/metrics.go 100% <100%> (ø) ⬆️
server/handler.go 50.42% <100%> (+0.21%) ⬆️
server/api/router.go 99.23% <100%> (+0.01%) ⬆️
server/cluster/store_limiter.go 53.65% <53.65%> (ø)
server/cluster/cluster.go 81.16% <60%> (-0.13%) ⬇️
... and 20 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a607642...8c251f7. Read the comment docs.

@shafreeck shafreeck removed the DNM label Nov 22, 2019
@shafreeck shafreeck force-pushed the schedule-in-cases branch 2 times, most recently from 55d1677 to c767ec4 Compare December 5, 2019 12:09
Signed-off-by: Shafreeck Sea <shafreeck@gmail.com>
Signed-off-by: Shafreeck Sea <shafreeck@gmail.com>
Signed-off-by: Shafreeck Sea <shafreeck@gmail.com>
Signed-off-by: Shafreeck Sea <shafreeck@gmail.com>
state

Signed-off-by: Shafreeck Sea <shafreeck@gmail.com>
Signed-off-by: Shafreeck Sea <shafreeck@gmail.com>
Signed-off-by: Shafreeck Sea <shafreeck@gmail.com>
Signed-off-by: Shafreeck Sea <shafreeck@gmail.com>
Signed-off-by: Shafreeck Sea <shafreeck@gmail.com>
The store limit is wrapped by a struct called StoreLimit,
it exports the useful api of the ratelimit.Bucket. It records
the limit's setting mode which can be "auto" or "manual". An "manual"
set value can overwrite an "auto set value, otherwise it is forbidden.

Signed-off-by: Shafreeck Sea <shafreeck@gmail.com>
server/cluster/store_limiter.go Show resolved Hide resolved
server/schedule/store_limit_scenes.go Show resolved Hide resolved
server/cluster/metrics.go Outdated Show resolved Hide resolved
Signed-off-by: Shafreeck Sea <shafreeck@gmail.com>
Signed-off-by: Shafreeck Sea <shafreeck@gmail.com>
Signed-off-by: Shafreeck Sea <shafreeck@gmail.com>
Signed-off-by: Shafreeck Sea <shafreeck@gmail.com>
Signed-off-by: Shafreeck Sea <shafreeck@gmail.com>
Signed-off-by: Shafreeck Sea <shafreeck@gmail.com>
@shafreeck
Copy link
Contributor Author

@rleungx @lhy1024 PTAL

Copy link
Contributor

@lhy1024 lhy1024 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Do we need to add a simple comparison test result under this PR?

@shafreeck
Copy link
Contributor Author

LGTM. Do we need to add a simple comparison test result under this PR?

Do you have any ideas about the test?

Copy link
Member

@rleungx rleungx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Signed-off-by: Shafreeck Sea <shafreeck@gmail.com>
Signed-off-by: Shafreeck Sea <shafreeck@gmail.com>
@shafreeck
Copy link
Contributor Author

/run-all-tests

@shafreeck shafreeck added the status/can-merge Indicates a PR has been approved by a committer. label Dec 27, 2019
@sre-bot
Copy link
Contributor

sre-bot commented Dec 27, 2019

/run-all-tests

@sre-bot
Copy link
Contributor

sre-bot commented Dec 27, 2019

@shafreeck merge failed.

@shafreeck
Copy link
Contributor Author

/run-all-tests

1 similar comment
@shafreeck
Copy link
Contributor Author

/run-all-tests

@nolouch
Copy link
Contributor

nolouch commented Dec 28, 2019

/merge

@sre-bot
Copy link
Contributor

sre-bot commented Dec 28, 2019

/run-all-tests

@sre-bot
Copy link
Contributor

sre-bot commented Dec 28, 2019

@shafreeck merge failed.

@nolouch
Copy link
Contributor

nolouch commented Dec 30, 2019

/merge

@sre-bot
Copy link
Contributor

sre-bot commented Dec 30, 2019

/run-all-tests

@sre-bot
Copy link
Contributor

sre-bot commented Dec 30, 2019

@shafreeck merge failed.

@shafreeck
Copy link
Contributor Author

/run-all-tests

@shafreeck shafreeck merged commit 73e4f0d into tikv:master Dec 30, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
status/can-merge Indicates a PR has been approved by a committer. type/enhancement The issue or PR belongs to an enhancement.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants