Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

*: Add module health_controller and move SlowScore, SlowTrend, HealthService from PdWorker to it #16456

Merged
merged 19 commits into from
Feb 2, 2024

Conversation

MyonKeminta
Copy link
Contributor

@MyonKeminta MyonKeminta commented Jan 26, 2024

What is changed and how it works?

Issue Number: Ref #16297

What's Changed:

Add module health_controller and move SlowScore, SlowTrend, HealthService from PdWorker to it

Review guide

The majority part of this PR is just refactorying.

Code movements in this PR:

  • Implementation of SlowScore: PdWorker (pd.rs) -> health_controller::slow_score (health_controller/src/slow_score.rs)
  • Implementation of SlowTrend: components/tikv_util/src/trend.rs -> components/health_controller/src/trend.rs
  • Coupling logic between SlowScore and SlowTrend and raftstore:
    • health_controller::reporters::RaftstoreReporter
      • Runner::on_timeout in pd.rs -> health_controller::reporters::RaftstoreReporter::tick
      • Runner::set_slow_trend_to_store_stats in pd.rs -> health_controller::reporter::RaftstoreReporter::update_slow_trend (except metrics)
    • health_controller::reporters::SlowTrendStatistics
    • health_controller::types
  • Accesses to gRPC's HealthService is now wrapped in the HealthController.

The design and the structure of the health controller is explained in the documents in health_controller/src/lib.rs.

Related changes

  • PR to update pingcap/docs/pingcap/docs-cn:
  • Need to cherry-pick to the release branch

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No code

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Release note

None

Signed-off-by: MyonKeminta <MyonKeminta@users.noreply.github.com>
Signed-off-by: MyonKeminta <MyonKeminta@users.noreply.github.com>
Signed-off-by: MyonKeminta <MyonKeminta@users.noreply.github.com>
Signed-off-by: MyonKeminta <MyonKeminta@users.noreply.github.com>
Signed-off-by: MyonKeminta <MyonKeminta@users.noreply.github.com>
Copy link
Contributor

ti-chi-bot bot commented Jan 26, 2024

[REVIEW NOTIFICATION]

This pull request has been approved by:

  • LykxSassinator
  • cfzjywxk

To complete the pull request process, please ask the reviewers in the list to review by filling /cc @reviewer in the comment.
After your PR has acquired the required number of LGTMs, you can assign this pull request to the committer in the list by filling /assign @committer in the comment to help you merge this pull request.

The full list of commands accepted by this bot can be found here.

Reviewer can indicate their review by submitting an approval review.
Reviewer can cancel approval by submitting a request changes review.

@ti-chi-bot ti-chi-bot bot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Jan 26, 2024
Signed-off-by: MyonKeminta <MyonKeminta@users.noreply.github.com>
Signed-off-by: MyonKeminta <MyonKeminta@users.noreply.github.com>
@MyonKeminta
Copy link
Contributor Author

/run-all-tests

Signed-off-by: MyonKeminta <MyonKeminta@users.noreply.github.com>
Signed-off-by: MyonKeminta <MyonKeminta@users.noreply.github.com>
Signed-off-by: MyonKeminta <MyonKeminta@users.noreply.github.com>
Signed-off-by: MyonKeminta <MyonKeminta@users.noreply.github.com>
Signed-off-by: MyonKeminta <MyonKeminta@users.noreply.github.com>
Signed-off-by: MyonKeminta <MyonKeminta@users.noreply.github.com>
Signed-off-by: MyonKeminta <MyonKeminta@users.noreply.github.com>
@ti-chi-bot ti-chi-bot bot added release-note-none Denotes a PR that doesn't merit a release note. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Jan 31, 2024
@MyonKeminta MyonKeminta marked this pull request as ready for review January 31, 2024 11:17
@ti-chi-bot ti-chi-bot bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 31, 2024
@MyonKeminta MyonKeminta changed the title *: Add module health_controller and move SlowScore, SlowTrend, HeanthService from PdWorker to it *: Add module health_controller and move SlowScore, SlowTrend, HealthService from PdWorker to it Jan 31, 2024
@cfzjywxk cfzjywxk requested a review from tonyxuqqi February 1, 2024 03:14
Copy link
Contributor

@LykxSassinator LykxSassinator left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall LGTM.

/// it's used in such pattern:
///
/// * Only an empty service name is used, representing the status of the
/// whole server..
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// whole server..
/// whole server.

}

pub struct RaftstoreReporter {
health_controller_inner: Arc<HealthControllerInner>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not directly use HealthController directly ?

Or maybe HealthController has some extra inner traits which should be hidden in the later work ? So implement like this ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's just to hide the Arc inside and avoid explicitly writing Arc when using it, which I think is a common pattern.

Comment on lines 18 to 34
pub struct RaftstoreReporterConfig {
pub inspect_interval: Duration,

pub unsensitive_cause: f64,
pub unsensitive_result: f64,
pub net_io_factor: f64,

pub cause_spike_filter_value_gauge: IntGauge,
pub cause_spike_filter_count_gauge: IntGauge,
pub cause_l1_gap_gauges: IntGauge,
pub cause_l2_gap_gauges: IntGauge,

pub result_spike_filter_value_gauge: IntGauge,
pub result_spike_filter_count_gauge: IntGauge,
pub result_l1_gap_gauges: IntGauge,
pub result_l2_gap_gauges: IntGauge,
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better to have comments to desribe these fieds or related links about the design.

Copy link
Contributor

@LykxSassinator LykxSassinator Feb 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI, the params related with SlowTrend can directly refer the definition in components/tikv_utils/src/trend.rs:

// Responsibilities of each window:
//
// L0:
// Eleminate very short time jitter,
// Consider its avg value as a point in data flow
// L1:
// `L0.avg/L1.avg` to trigger slow-event, not last long but high sensitive
// Sensitive could be tuned by `L0.duration` and `L1.duration`
// Include periodic fluctuations, so it's avg could be seen as baseline
// value Its duration is also the no-detectable duration after TiKV starting
// L2:
// `L1.avg/L2.avg` to trigger slow-event, last long but low sensitive
// Sensitive could be tuned by `L1.duration` and `L2.duration`
//
// L* History:
// Sample history values and calculate the margin error
//
// Spike Filter:
// Erase very high and short time spike-values
//

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO, these params to SlowTrend will be simplified in the later work. Now, just add the reference to the definition in trend.rs is good to me.

is_healthy: bool,
}

impl RaftstoreReporter {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the read pool information need to be collected, the pattern is:

  • Adding a ReadPoolReporter reporter holding a Arc to the HealthController
  • Adding related fields inside the HealthController

right?

Besides, a similar question like above, why holding an inner object?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Yes, but it's not an Arc to HealthController. Instead, it should be a Arc<HealthControllerInner> or a clone of HealthController.
  2. Separating an "inner" is just to hide the Arc inside the HealthController and avoid explicitly writing Arc when using it, which I think is a common pattern.

@ti-chi-bot ti-chi-bot bot added the status/LGT1 Indicates that a PR has LGTM 1. label Feb 1, 2024
Copy link
Contributor

@ekexium ekexium left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In terms of using these stats, do they tolerate network spikes? Like when network latency becomes unstably high, is it possible to report stats that are stale and out-of-order? Could it possibly cause unexpected behavior when using them (in PD and client)?

pd-worker being stuck could also lead to this. But I assume it very unlikely to happen.

@LykxSassinator
Copy link
Contributor

LykxSassinator commented Feb 1, 2024

In terms of using these stats, do they tolerate network spikes? Like when network latency becomes unstably high, is it possible to report stats that are stale and out-of-order? Could it possibly cause unexpected behavior when using them (in PD and client)?

pd-worker being stuck could also lead to this. But I assume it very unlikely to happen.

Yep, it can be detected by SlowTrendStatistics::record(). As it describes, it can catch up the normal jitters on network-io, Although current implementation and strategy on the detection of network-io jitters is not sensitive, it has the basic capability of covering this kind of exceptions, and the optimization on net-io jitters can be tracked in the later work.

Signed-off-by: MyonKeminta <MyonKeminta@users.noreply.github.com>
Copy link
Contributor

@LykxSassinator LykxSassinator left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rest LGTM

@@ -1484,53 +1336,40 @@ where
self.remote.spawn(f);
}

fn set_slow_trend_to_store_stats(
fn write_slow_trend_metrics(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
fn write_slow_trend_metrics(
fn flush_slow_trend_metrics(

Signed-off-by: MyonKeminta <MyonKeminta@users.noreply.github.com>
@ti-chi-bot ti-chi-bot bot added status/LGT2 Indicates that a PR has LGTM 2. and removed status/LGT1 Indicates that a PR has LGTM 1. labels Feb 2, 2024
@MyonKeminta
Copy link
Contributor Author

/merge

Copy link
Contributor

ti-chi-bot bot commented Feb 2, 2024

@MyonKeminta: It seems you want to merge this PR, I will help you trigger all the tests:

/run-all-tests

You only need to trigger /merge once, and if the CI test fails, you just re-trigger the test that failed and the bot will merge the PR for you after the CI passes.

If you have any questions about the PR merge process, please refer to pr process.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.

Copy link
Contributor

ti-chi-bot bot commented Feb 2, 2024

This pull request has been accepted and is ready to merge.

Commit hash: cd8d58d

@ti-chi-bot ti-chi-bot bot added the status/can-merge Indicates a PR has been approved by a committer. label Feb 2, 2024
Copy link
Contributor

ti-chi-bot bot commented Feb 2, 2024

@MyonKeminta: Your PR was out of date, I have automatically updated it for you.

If the CI test fails, you just re-trigger the test that failed and the bot will merge the PR for you after the CI passes.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.

@ti-chi-bot ti-chi-bot bot merged commit 00a2518 into tikv:master Feb 2, 2024
7 checks passed
@ti-chi-bot ti-chi-bot bot added this to the Pool milestone Feb 2, 2024
@MyonKeminta MyonKeminta deleted the m/add-perf-feedback branch February 2, 2024 09:41
dbsid pushed a commit to dbsid/tikv that referenced this pull request Mar 24, 2024
…Service from PdWorker to it (tikv#16456)

ref tikv#16297

Add module health_controller and move SlowScore, SlowTrend, HealthService from PdWorker to it

Signed-off-by: MyonKeminta <MyonKeminta@users.noreply.github.com>

Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>
Signed-off-by: dbsid <chenhuansheng@pingcap.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
release-note-none Denotes a PR that doesn't merit a release note. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. status/can-merge Indicates a PR has been approved by a committer. status/LGT2 Indicates that a PR has LGTM 2.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants