slow node detection: enable evict-slow-trend on raft-kv2. #6945

Merged
24 commits merged into tikv:master from enable-slow-trend-v2
Aug 23, 2023

Conversation

LykxSassinator
Contributor

@LykxSassinator LykxSassinator commented Aug 10, 2023

What problem does this PR solve?

Issue Number: Close #6868, Ref tikv/tikv#15271 and Close tikv/tikv#15267

What is changed and how does it work?

This PR:
+ Enables the `evict-slow-trend` scheduler on `raft-kv2` by default.
+ Optimizes how the slow-node detection strategy perceives network I/O delays on TiKV nodes.

Logs recorded when building a cluster with the raft-kv2 engine (multi-RocksDB):

[2023/08/04 07:33:40.560 +00:00] [INFO] [store_config.go:231] ["sync the store config successful"] [store-address=10.233.108.15:20180] [store-config="{\n  \"coprocessor\": {\n    \"region-max-size\": \"15GiB\",\n    \"region-split-size\": \"10GiB\",\n    \"region-max-keys\": 153600000,\n    \"region-split-keys\": 102400000,\n    \"enable-region-bucket\": true,\n    \"region-bucket-size\": \"50MiB\"\n  },\n  \"storage\": {\n    \"engine\": \"raft-kv2\"\n  }\n}"] [old-config="{\n  \"coprocessor\": {\n    \"region-max-size\": \"\",\n    \"region-split-size\": \"\",\n    \"region-max-keys\": 0,\n    \"region-split-keys\": 0,\n    \"enable-region-bucket\": false,\n    \"region-bucket-size\": \"\"\n  },\n  \"storage\": {\n    \"engine\": \"\"\n  }\n}"]
[2023/08/04 07:33:40.561 +00:00] [INFO] [cluster.go:440] ["create scheduler"] [scheduler-name=evict-slow-trend-scheduler] [scheduler-args="[]"]
[2023/08/04 07:33:40.561 +00:00] [INFO] [cluster.go:446] ["add scheduler successfully"] [scheduler-name=evict-slow-trend] [scheduler-args="[]"]
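
The logs above suggest that the scheduler is created once the synced store config reports `"engine": "raft-kv2"`. A minimal sketch of that decision follows. It is not PD's actual code: the `storeConfig` type and the helpers `isRaftKV2` and `extraDefaultSchedulers` are hypothetical names introduced for illustration, and only the JSON field names and the scheduler name are taken from this PR.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// storeConfig mirrors only the part of the synced store config that this
// sketch needs. The JSON field names come from the log line above; the Go
// type itself is hypothetical.
type storeConfig struct {
	Storage struct {
		Engine string `json:"engine"`
	} `json:"storage"`
}

const (
	raftKV2Engine  = "raft-kv2"         // engine value seen in the log
	evictSlowTrend = "evict-slow-trend" // scheduler name used in this PR
)

// isRaftKV2 (hypothetical helper) reports whether the raw store config JSON
// selects the raft-kv2 engine.
func isRaftKV2(raw []byte) (bool, error) {
	var cfg storeConfig
	if err := json.Unmarshal(raw, &cfg); err != nil {
		return false, fmt.Errorf("decode store config: %w", err)
	}
	return cfg.Storage.Engine == raftKV2Engine, nil
}

// extraDefaultSchedulers (hypothetical helper) returns the scheduler names to
// add on top of the usual defaults for the given store config.
func extraDefaultSchedulers(raw []byte) ([]string, error) {
	ok, err := isRaftKV2(raw)
	if err != nil {
		return nil, err
	}
	if !ok {
		return nil, nil
	}
	return []string{evictSlowTrend}, nil
}

func main() {
	// Trimmed versions of the new and old configs from the log above.
	configs := []string{
		`{"coprocessor": {"region-max-size": "15GiB"}, "storage": {"engine": "raft-kv2"}}`,
		`{"coprocessor": {"region-max-size": ""}, "storage": {"engine": ""}}`,
	}
	for _, raw := range configs {
		names, err := extraDefaultSchedulers([]byte(raw))
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("extra schedulers: %v\n", names)
	}
}
```

Keeping the engine check separate from scheduler creation makes the decision easy to unit-test without a running cluster; how PD wires this into `cluster.go` is not shown in this conversation.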

And we got the following metrics:
[image: metrics screenshot]

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No code

Code changes

Side effects

  • Possible performance regression
  • Increased code complexity
  • Breaking backward compatibility

Related changes

Release note

None.


Signed-off-by: lucasliang <nkcs_lykx@hotmail.com>
@ti-chi-bot
Contributor

ti-chi-bot bot commented Aug 10, 2023

[REVIEW NOTIFICATION]

This pull request has been approved by:

  • bufferflies
  • nolouch

To complete the pull request process, please ask the reviewers in the list to review by filling /cc @reviewer in the comment.
After your PR has acquired the required number of LGTMs, you can assign this pull request to the committer in the list by filling /assign @committer in the comment to help you merge this pull request.

The full list of commands accepted by this bot can be found here.

Reviewer can indicate their review by submitting an approval review.
Reviewer can cancel approval by submitting a request changes review.

@ti-chi-bot ti-chi-bot bot added release-note-none Denotes a PR that doesn't merit a release note. needs-ok-to-test Indicates a PR created by contributors and need ORG member send '/ok-to-test' to start testing. labels Aug 10, 2023
@ti-chi-bot
Contributor

ti-chi-bot bot commented Aug 10, 2023

Hi @LykxSassinator. Thanks for your PR.

I'm waiting for a tikv member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@ti-chi-bot ti-chi-bot bot requested review from rleungx and Yisaer August 10, 2023 08:29
@ti-chi-bot ti-chi-bot bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Aug 10, 2023
@ti-chi-bot ti-chi-bot bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 10, 2023
Signed-off-by: lucasliang <nkcs_lykx@hotmail.com>
Signed-off-by: lucasliang <nkcs_lykx@hotmail.com>
@ti-chi-bot ti-chi-bot bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 10, 2023
@codecov

codecov bot commented Aug 10, 2023

Codecov Report

Merging #6945 (b29d736) into master (ebceb83) will increase coverage by 0.02%.
The diff coverage is 84.25%.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6945      +/-   ##
==========================================
+ Coverage   74.23%   74.26%   +0.02%     
==========================================
  Files         433      433              
  Lines       45860    45904      +44     
==========================================
+ Hits        34046    34092      +46     
- Misses       8802     8805       +3     
+ Partials     3012     3007       -5     
Flag Coverage Δ
unittests 74.26% <84.25%> (+0.02%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

@LykxSassinator
Contributor Author

/check-issue-triage-complete

Signed-off-by: lucasliang <nkcs_lykx@hotmail.com>
@ti-chi-bot ti-chi-bot bot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Aug 11, 2023
@LykxSassinator
Contributor Author

/cc @bufferflies PTAL, thx

pkg/schedule/schedulers/evict_slow_trend.go (resolved)
pkg/mcs/scheduling/server/config/config.go (outdated, resolved)
pkg/schedule/schedulers/evict_slow_trend.go (outdated, resolved)
pkg/schedule/schedulers/evict_slow_trend.go (outdated, resolved)
server/cluster/cluster.go (outdated, resolved)
Signed-off-by: lucasliang <nkcs_lykx@hotmail.com>
Signed-off-by: lucasliang <nkcs_lykx@hotmail.com>
Signed-off-by: lucasliang <nkcs_lykx@hotmail.com>
Signed-off-by: lucasliang <nkcs_lykx@hotmail.com>
Signed-off-by: lucasliang <nkcs_lykx@hotmail.com>
@ti-chi-bot ti-chi-bot bot added status/LGT2 Indicates that a PR has LGTM 2. and removed status/LGT1 Indicates that a PR has LGTM 1. labels Aug 23, 2023
@LykxSassinator
Contributor Author

/test

@ti-chi-bot
Contributor

ti-chi-bot bot commented Aug 23, 2023

@LykxSassinator: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

In response to this:

/test

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@LykxSassinator
Contributor Author

/run test

@nolouch
Contributor

nolouch commented Aug 23, 2023

/merge

@ti-chi-bot
Contributor

ti-chi-bot bot commented Aug 23, 2023

@nolouch: It seems you want to merge this PR, I will help you trigger all the tests:

/run-all-tests

You only need to trigger /merge once, and if the CI test fails, you just re-trigger the test that failed and the bot will merge the PR for you after the CI passes.

If you have any questions about the PR merge process, please refer to pr process.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.

@ti-chi-bot
Contributor

ti-chi-bot bot commented Aug 23, 2023

This pull request has been accepted and is ready to merge.

Commit hash: b29d736

@ti-chi-bot ti-chi-bot bot added the status/can-merge Indicates a PR has been approved by a committer. label Aug 23, 2023
@nolouch
Contributor

nolouch commented Aug 23, 2023

/test build

@ti-chi-bot ti-chi-bot bot merged commit 1743552 into tikv:master Aug 23, 2023
23 checks passed
@LykxSassinator LykxSassinator deleted the enable-slow-trend-v2 branch August 23, 2023 10:50

storeSlowTrendMiscGauge = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Namespace: "pd",
		Subsystem: "scheduler",
		Name:      "store_slow_trend_misc",
		Help:      "Store trend internal uncatalogued values",
	}, []string{"type"})
Help: "Store trend internal uncatelogued values",

Contributor Author

Yep, it's a spelling error. I'll tidy and clean up all misleading metrics in the next PR.

Labels
needs-ok-to-test Indicates a PR created by contributors and need ORG member send '/ok-to-test' to start testing. release-note-none Denotes a PR that doesn't merit a release note. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. status/can-merge Indicates a PR has been approved by a committer. status/LGT2 Indicates that a PR has LGTM 2.
Projects
None yet
6 participants