*: Add module health_controller and move SlowScore, SlowTrend, HealthService from PdWorker to it #16456
Signed-off-by: MyonKeminta <MyonKeminta@users.noreply.github.com>
/run-all-tests
Overall LGTM.
/// it's used in such pattern:
///
/// * Only an empty service name is used, representing the status of the
/// whole server..
Suggested change:
- /// whole server..
+ /// whole server.
}

pub struct RaftstoreReporter {
    health_controller_inner: Arc<HealthControllerInner>,
Why not use `HealthController` directly?
Or does `HealthController` have some extra inner traits that should be hidden in later work, and that is why it is implemented like this?
It's just to hide the `Arc` inside and avoid explicitly writing `Arc` when using it, which I think is a common pattern.
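For readers unfamiliar with the pattern mentioned above, here is a minimal sketch of a handle type that hides its `Arc`. The names `Controller` and `ControllerInner` and the single `healthy` flag are hypothetical stand-ins, not the actual `HealthController` definition from this PR.

```rust
use std::sync::{
    atomic::{AtomicBool, Ordering},
    Arc,
};

// Hypothetical shared state. The real `HealthControllerInner` holds more
// than a single flag; this only illustrates the shape of the pattern.
struct ControllerInner {
    healthy: AtomicBool,
}

// Public handle. Callers clone the handle itself and never write `Arc`.
#[derive(Clone)]
pub struct Controller {
    inner: Arc<ControllerInner>,
}

impl Controller {
    pub fn new() -> Self {
        Self {
            inner: Arc::new(ControllerInner {
                healthy: AtomicBool::new(true),
            }),
        }
    }

    pub fn set_healthy(&self, healthy: bool) {
        self.inner.healthy.store(healthy, Ordering::Relaxed);
    }

    pub fn is_healthy(&self) -> bool {
        self.inner.healthy.load(Ordering::Relaxed)
    }
}
```

Cloning a `Controller` only bumps the reference count, so every clone observes the same state: calling `set_healthy(false)` on one clone makes `is_healthy()` return `false` on all of them.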
pub struct RaftstoreReporterConfig {
    pub inspect_interval: Duration,

    pub unsensitive_cause: f64,
    pub unsensitive_result: f64,
    pub net_io_factor: f64,

    pub cause_spike_filter_value_gauge: IntGauge,
    pub cause_spike_filter_count_gauge: IntGauge,
    pub cause_l1_gap_gauges: IntGauge,
    pub cause_l2_gap_gauges: IntGauge,

    pub result_spike_filter_value_gauge: IntGauge,
    pub result_spike_filter_count_gauge: IntGauge,
    pub result_l1_gap_gauges: IntGauge,
    pub result_l2_gap_gauges: IntGauge,
}
It would be better to add comments describing these fields, or links to the related design.
FYI, for the params related to `SlowTrend`, you can refer directly to the definition in components/tikv_util/src/trend.rs:
tikv/components/tikv_util/src/trend.rs
Lines 412 to 431 in 5d190fa
// Responsibilities of each window:
//
// L0:
//    Eleminate very short time jitter,
//    Consider its avg value as a point in data flow
// L1:
//    `L0.avg/L1.avg` to trigger slow-event, not last long but high sensitive
//    Sensitive could be tuned by `L0.duration` and `L1.duration`
//    Include periodic fluctuations, so it's avg could be seen as baseline
//    value Its duration is also the no-detectable duration after TiKV starting
// L2:
//    `L1.avg/L2.avg` to trigger slow-event, last long but low sensitive
//    Sensitive could be tuned by `L1.duration` and `L2.duration`
//
// L* History:
//    Sample history values and calculate the margin error
//
// Spike Filter:
//    Erase very high and short time spike-values
//
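To make the window ratios in the quoted comment concrete, here is a simplified sketch. It is not the `SlowTrend` implementation: it assumes fixed-size sample windows instead of time durations, feeds every sample to all three windows, and omits the history and spike-filter parts. `Window`, `TrendSketch`, and their methods are hypothetical names.

```rust
use std::collections::VecDeque;

/// A fixed-capacity window of samples that tracks a running sum.
struct Window {
    cap: usize,
    samples: VecDeque<f64>,
    sum: f64,
}

impl Window {
    fn new(cap: usize) -> Self {
        Self { cap, samples: VecDeque::with_capacity(cap), sum: 0.0 }
    }

    /// Adds a sample, evicting the oldest one when the window is full.
    fn push(&mut self, v: f64) {
        if self.samples.len() == self.cap {
            if let Some(old) = self.samples.pop_front() {
                self.sum -= old;
            }
        }
        self.samples.push_back(v);
        self.sum += v;
    }

    fn is_full(&self) -> bool {
        self.samples.len() == self.cap
    }

    fn avg(&self) -> f64 {
        self.sum / self.samples.len() as f64
    }
}

/// Hypothetical three-level detector: L0 is the shortest window, L2 the longest.
pub struct TrendSketch {
    l0: Window,
    l1: Window,
    l2: Window,
}

impl TrendSketch {
    pub fn new(l0: usize, l1: usize, l2: usize) -> Self {
        Self { l0: Window::new(l0), l1: Window::new(l1), l2: Window::new(l2) }
    }

    pub fn record(&mut self, v: f64) {
        self.l0.push(v);
        self.l1.push(v);
        self.l2.push(v);
    }

    /// `L0.avg / L1.avg`; `None` until L1 is full or if the baseline is zero.
    pub fn l0_l1_rate(&self) -> Option<f64> {
        Self::rate(&self.l0, &self.l1)
    }

    /// `L1.avg / L2.avg`; `None` until L2 is full or if the baseline is zero.
    pub fn l1_l2_rate(&self) -> Option<f64> {
        Self::rate(&self.l1, &self.l2)
    }

    fn rate(short: &Window, long: &Window) -> Option<f64> {
        if !long.is_full() || long.avg() == 0.0 {
            return None;
        }
        Some(short.avg() / long.avg())
    }
}
```

With window sizes 2, 4, and 8, a steady stream of equal samples gives a ratio of 1.0 at both levels. A short burst of slower samples moves `l0_l1_rate` first and further, while `l1_l2_rate` reacts more slowly, which matches the "high sensitive" versus "low sensitive" split described in the comment.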
IMO, these params to `SlowTrend` will be simplified in later work. For now, just adding a reference to the definition in trend.rs is good to me.
    is_healthy: bool,
}

impl RaftstoreReporter {
If the read pool information needs to be collected, the pattern is:
- Adding a `ReadPoolReporter` reporter holding an `Arc` to the `HealthController`
- Adding related fields inside the `HealthController`
right?
Besides, a similar question to the one above: why hold an inner object?
- Yes, but it's not an `Arc` to `HealthController`. Instead, it should be an `Arc<HealthControllerInner>` or a clone of `HealthController`.
- Separating an "inner" is just to hide the `Arc` inside the `HealthController` and avoid explicitly writing `Arc` when using it, which I think is a common pattern.
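The extension path discussed in this thread could look roughly like the sketch below. Everything here is an assumption made for illustration: `ReadPoolReporter` does not exist in this PR, the `Inner` fields are invented, and taking the maximum of the two scores is a placeholder aggregation rather than the controller's actual logic.

```rust
use std::sync::{
    atomic::{AtomicU64, Ordering},
    Arc,
};

// Hypothetical stand-in for `HealthControllerInner`: one field per reporter.
#[derive(Default)]
struct Inner {
    raftstore_slow_score: AtomicU64,
    read_pool_slow_score: AtomicU64,
}

// Hypothetical stand-in for `HealthController`.
#[derive(Clone, Default)]
pub struct Controller {
    inner: Arc<Inner>,
}

impl Controller {
    pub fn new() -> Self {
        Self::default()
    }

    // Placeholder aggregation (an assumption): report the worst score.
    pub fn slow_score(&self) -> u64 {
        let raft = self.inner.raftstore_slow_score.load(Ordering::Relaxed);
        let pool = self.inner.read_pool_slow_score.load(Ordering::Relaxed);
        raft.max(pool)
    }
}

// Each reporter holds the shared inner state, not the controller handle.
pub struct RaftstoreReporter {
    inner: Arc<Inner>,
}

impl RaftstoreReporter {
    pub fn new(controller: &Controller) -> Self {
        Self { inner: Arc::clone(&controller.inner) }
    }

    pub fn report(&self, score: u64) {
        self.inner.raftstore_slow_score.store(score, Ordering::Relaxed);
    }
}

// Hypothetical new reporter added for the read pool.
pub struct ReadPoolReporter {
    inner: Arc<Inner>,
}

impl ReadPoolReporter {
    pub fn new(controller: &Controller) -> Self {
        Self { inner: Arc::clone(&controller.inner) }
    }

    pub fn report(&self, score: u64) {
        self.inner.read_pool_slow_score.store(score, Ordering::Relaxed);
    }
}
```

Adding a reporter then means adding its fields to the inner struct and a small type that writes to them, while readers keep going through the controller handle.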
In terms of using these stats, do they tolerate network spikes? Like when network latency becomes unstably high, is it possible to report stats that are stale and out-of-order? Could it possibly cause unexpected behavior when using them (in PD and client)?
pd-worker being stuck could also lead to this, but I assume that is very unlikely to happen.
Yep, it can be detected by |
Rest LGTM
@@ -1484,53 +1336,40 @@ where
        self.remote.spawn(f);
    }

-    fn set_slow_trend_to_store_stats(
+    fn write_slow_trend_metrics(
Suggested change:
- fn write_slow_trend_metrics(
+ fn flush_slow_trend_metrics(
/merge
This pull request has been accepted and is ready to merge. Commit hash: cd8d58d
@MyonKeminta: Your PR was out of date, I have automatically updated it for you. If the CI test fails, you just re-trigger the test that failed and the bot will merge the PR for you after the CI passes.
…Service from PdWorker to it (tikv#16456)

ref tikv#16297

Add module health_controller and move SlowScore, SlowTrend, HealthService from PdWorker to it

Signed-off-by: MyonKeminta <MyonKeminta@users.noreply.github.com>
Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>
Signed-off-by: dbsid <chenhuansheng@pingcap.com>
What is changed and how it works?
Issue Number: Ref #16297
What's Changed:
Review guide
The majority of this PR is just refactoring.
Code movements in this PR:
- `SlowScore`: PdWorker (pd.rs) -> `health_controller::slow_score` (health_controller/src/slow_score.rs)
- `SlowTrend`: components/tikv_util/src/trend.rs -> components/health_controller/src/trend.rs
- `health_controller::reporters::RaftstoreReporter`
- `Runner::on_timeout` in pd.rs -> `health_controller::reporters::RaftstoreReporter::tick`
- `Runner::set_slow_trend_to_store_stats` in pd.rs -> `health_controller::reporters::RaftstoreReporter::update_slow_trend` (except metrics)
- `health_controller::reporters::SlowTrendStatistics`
- `health_controller::types`
- `HealthService` is now wrapped in the `HealthController`.

The design and the structure of the health controller are explained in the documents in health_controller/src/lib.rs.
Related changes
`pingcap/docs`/`pingcap/docs-cn`:
Check List
Tests
Side effects
Release note