server/*: add trend to check etcd healthy #7682

Closed
HuSharp wants to merge 2 commits into master from check_unhealthy_lease_test

HuSharp
Member

@HuSharp HuSharp commented Jan 9, 2024

What problem does this PR solve?

Issue Number: Ref #7251

What is changed and how does it work?

  • Introduce a trend struct with 2 windows, based on the approach used for slow node detection.
    • The small window duration is 20s; the big window duration is 60s.
    • rate = (small_window_avg - big_window_avg - unSenseVal) / small_window_avg
    • rate indicates how fast the input is changing.
    • The averaged value is the fsync duration, which is recorded on the etcd server side.
    • unSenseVal depends on the environment and is calculated from the average of all fsync values.
  • An event is triggered by 2 aspects: an etcd leader change and the rate being less than the point.
    • The rate check is placed in the leader campaign logic.
    • The rate is recorded every second.
  • We have 4 formulas:
    • (A - B) / B >= m (the rate)
    • Sa = A * x, Sb = n * B * x (n is the ratio of the big window to the small window, x is the small window duration)
    • Sb = δ + Sa
    • δ = (big_dur - small_dur) * base (base is determined by the environment)
      From these we get: δ / Sb = δ / (δ + Sa) <= (m + 1 - n) => (big_dur - small_dur) * base / (base * big_dur + spike * t) <= (m + 1 - n)
  • For now, we want to monitor spike/base = 50, which means the value needs to grow to 100 times what it was before.
    • We can get: (big_dur - small_dur) * b / (small_dur * b + 100 * t * b) <= (m + 1 - n) / n
    • For now: big_dur = 60, small_dur = 20, n = 3, s = 50, and a spike time of t = 3: 40b / (60b + 50 * 3 * b) <= (m - 2) -> m = 2.19
    • I set n = 3. When n is fixed, increasing m makes the event harder to trigger.
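
The two-window rate and the arithmetic behind m = 2.19 can be sketched as follows. This is a minimal illustration under stated assumptions, not the PR's implementation: the names `Window`, `Trend`, and `min_m` are hypothetical, samples are assumed to arrive with explicit timestamps in seconds, and `min_m` only evaluates the inequality `(big_dur - small_dur) * b / (big_dur * b + spike * t) <= m + 1 - n` exactly as written above, with `spike = spike_over_base * b`.

```python
from collections import deque


class Window:
    """Fixed-duration window of (timestamp, value) samples (hypothetical helper)."""

    def __init__(self, duration):
        self.duration = duration
        self.samples = deque()
        self.total = 0.0

    def record(self, now, value):
        self.samples.append((now, value))
        self.total += value
        # Evict samples that have aged out of the window.
        while self.samples and now - self.samples[0][0] >= self.duration:
            _, old = self.samples.popleft()
            self.total -= old

    def avg(self):
        return self.total / len(self.samples) if self.samples else 0.0


class Trend:
    """Small and big windows fed with the same input (assumed layout)."""

    def __init__(self, small_dur=20.0, big_dur=60.0, un_sense_val=0.0):
        self.small = Window(small_dur)
        self.big = Window(big_dur)
        self.un_sense_val = un_sense_val

    def record(self, now, value):
        self.small.record(now, value)
        self.big.record(now, value)

    def rate(self):
        # rate = (small_window_avg - big_window_avg - unSenseVal) / small_window_avg
        small_avg = self.small.avg()
        if small_avg == 0:
            return 0.0
        return (small_avg - self.big.avg() - self.un_sense_val) / small_avg


def min_m(big_dur, small_dur, n, spike_over_base, t):
    """Smallest m satisfying the inequality above; base b cancels out."""
    ratio = (big_dur - small_dur) / (big_dur + spike_over_base * t)
    return ratio + n - 1
```

With one sample per second, 60 samples of 1.0 followed by 20 samples of 10.0 leave the small window averaging 10.0 and the big window averaging 4.0, so `rate()` returns 0.6 when `un_sense_val` is 0. `min_m(60, 20, 3, 50, 3)` evaluates 40/210 + 2, which rounds to 2.19.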

[Image: rate graph]
[Image: fsync graph]

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)

Release note

None.

Contributor

ti-chi-bot bot commented Jan 9, 2024

[REVIEW NOTIFICATION]

This pull request has not been approved.

To complete the pull request process, please ask the reviewers in the list to review by filling /cc @reviewer in the comment.
After your PR has acquired the required number of LGTMs, you can assign this pull request to the committer in the list by filling /assign @committer in the comment to help you merge this pull request.

The full list of commands accepted by this bot can be found here.

Reviewer can indicate their review by submitting an approval review.
Reviewer can cancel approval by submitting a request changes review.

Contributor

ti-chi-bot bot commented Jan 9, 2024

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@ti-chi-bot ti-chi-bot bot added release-note-none Denotes a PR that doesn't merit a release note. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. labels Jan 9, 2024
@ti-chi-bot ti-chi-bot bot requested review from JmPotato and rleungx January 9, 2024 09:09
@ti-chi-bot ti-chi-bot bot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Jan 9, 2024
@HuSharp HuSharp removed request for JmPotato and rleungx January 9, 2024 09:10
@HuSharp HuSharp force-pushed the check_unhealthy_lease_test branch 6 times, most recently from 917406f to bb948c4 Compare January 9, 2024 11:13
@HuSharp HuSharp changed the title from "lease: add trend to check etcd" to "server/*: add trend to check etcd healthy" Jan 9, 2024
@HuSharp HuSharp force-pushed the check_unhealthy_lease_test branch from bb948c4 to 7d0da1b Compare January 9, 2024 23:46

codecov bot commented Jan 9, 2024

Codecov Report

Merging #7682 (65f5f51) into master (a90e13e) will decrease coverage by 0.06%.
Report is 2 commits behind head on master.
The diff coverage is 82.95%.

❗ Current head 65f5f51 differs from pull request most recent head de02922. Consider uploading reports for the commit de02922 to get more accurate results

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #7682      +/-   ##
==========================================
- Coverage   73.82%   73.77%   -0.06%     
==========================================
  Files         429      432       +3     
  Lines       47542    47668     +126     
==========================================
+ Hits        35100    35165      +65     
- Misses       9458     9492      +34     
- Partials     2984     3011      +27     
Flag Coverage Δ
unittests 73.77% <82.95%> (-0.06%) ⬇️

Flags with carried forward coverage won't be shown.

@HuSharp HuSharp force-pushed the check_unhealthy_lease_test branch 4 times, most recently from 6c4c64f to e6a5e29 Compare January 10, 2024 07:54
@ti-chi-bot ti-chi-bot bot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Jan 10, 2024
@HuSharp HuSharp force-pushed the check_unhealthy_lease_test branch 5 times, most recently from 4b6993f to 0789d3e Compare January 11, 2024 03:49
@HuSharp HuSharp force-pushed the check_unhealthy_lease_test branch 8 times, most recently from 009de55 to 3756ffd Compare January 11, 2024 15:57
@HuSharp HuSharp force-pushed the check_unhealthy_lease_test branch 10 times, most recently from 7916105 to fa54400 Compare January 18, 2024 00:35
@HuSharp HuSharp force-pushed the check_unhealthy_lease_test branch 2 times, most recently from 3eb0bea to 87f57b9 Compare January 18, 2024 08:07
Signed-off-by: husharp <jinhao.hu@pingcap.com>
@HuSharp HuSharp force-pushed the check_unhealthy_lease_test branch from 87f57b9 to 3607af1 Compare January 18, 2024 08:17
Signed-off-by: husharp <jinhao.hu@pingcap.com>
@HuSharp
Member Author

HuSharp commented Feb 1, 2024

Fixed by #7737.

@HuSharp HuSharp closed this Feb 1, 2024
Labels
do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. release-note-none Denotes a PR that doesn't merit a release note. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.