Lightning: increase backoff if split fails #49518
Conversation
Hi @mittalrishabh. Thanks for your PR. I'm waiting for a pingcap member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test`. Once the patch is verified, the new status will be reflected by the `ok-to-test` label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Hi @mittalrishabh. Thanks for your PR. PRs from untrusted users cannot be marked as trusted with `/ok-to-test` in this repo. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/ok-to-test
Codecov Report
Attention: Patch coverage is
Additional details and impacted files:
@@ Coverage Diff @@
## master #49518 +/- ##
================================================
- Coverage 71.9600% 67.2674% -4.6926%
================================================
Files 1438 2558 +1120
Lines 345749 849351 +503602
================================================
+ Hits 248801 571337 +322536
- Misses 76712 254230 +177518
- Partials 20236 23784 +3548
Flags with carried forward coverage won't be shown.
Some background about this PR: #49517 (comment)
There is already a backoff in region split, it's just not long enough:
tidb/br/pkg/lightning/backend/local/localhelper.go, lines 171 to 174 at 0719329
It is only 4 seconds. BR can suspend splits for anywhere from 60 seconds to 10 minutes, and I don't want to increase this backoff time, as that would add backoff to every region split, which would be very expensive.
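To make the trade-off concrete, here is a minimal sketch of the approach being discussed: back off only after a failed batch split, so the happy path pays no extra cost. This is not the actual code in br/pkg/lightning/backend/local; `splitWithBackoff`, the retry count, and the durations are all illustrative.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// splitWithBackoff retries doBatchSplit, sleeping with exponentially
// growing waits only after a failure, so a successful split pays no
// extra latency and the per-split cost of the happy path is unchanged.
func splitWithBackoff(doBatchSplit func() error, retries int, initial time.Duration) error {
	backoff := initial
	var err error
	for i := 0; i < retries; i++ {
		if err = doBatchSplit(); err == nil {
			return nil
		}
		time.Sleep(backoff)
		backoff *= 2 // double the wait, giving BR time to re-enable the schedulers
	}
	return fmt.Errorf("batch split failed after %d retries: %w", retries, err)
}

func main() {
	attempts := 0
	// Simulate a split that fails twice (e.g. while BR has the schedulers
	// paused) and then succeeds. Durations are shortened for the demo.
	err := splitWithBackoff(func() error {
		attempts++
		if attempts < 3 {
			return errors.New("scheduler paused")
		}
		return nil
	}, 5, 10*time.Millisecond)
	fmt.Printf("attempts=%d err=%v\n", attempts, err)
}
```

Sleeping only on failure is what keeps the common case cheap, while the doubling waits give the paused schedulers time to come back.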
Can you please review this PR?
rest lgtm
Let me know if you agree with a 10-second initial backoff.
Just noticed that SplitAndScatterRegionByRanges has a maximum 4-second sleep, repeated up to 8 times, so I think sleeping 10 seconds here is good.
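For context, this is my own arithmetic rather than anything stated above: the existing SplitAndScatterRegionByRanges path sleeps at most 4 s × 8 = 32 s in total, while an exponential backoff starting at 10 s already exceeds that after three failed attempts (10 + 20 + 40 = 70 s), so the new schedule waits longer than the old one almost immediately.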
I need the lgtm label to check in.
PTAL @D3Hunter
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: D3Hunter, lance6716. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
@mittalrishabh: The following test failed, say `/retest` to rerun all failed tests.
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
/retest
Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io>
In response to a cherrypick label: new pull request created to branch |
In response to a cherrypick label: new pull request created to branch |
In response to a cherrypick label: new pull request created to branch |
Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io>
What problem does this PR solve?
Issue Number: close #49517
Problem Summary:
The Lightning job fails intermittently on production clusters after enabling BR, because of batch split failures. During the backup process, BR temporarily disables the schedulers for 1-2 minutes. However, Lightning has no backoff mechanism and retries the split up to 5 times without pausing or spacing out the retries.
What changed and how does it work?
This PR adds an exponential backoff before each retry; the job now retries for up to 930 seconds in total before it fails.
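For illustration only, since the description doesn't spell out the schedule: one doubling sequence that sums to 930 seconds over five retries is 30 + 60 + 120 + 240 + 480 = 930 s, which more than covers the 1-2 minutes (and up to 10 minutes) for which BR may keep the schedulers paused. The actual initial wait and retry count live in the code; these numbers are just one consistent example.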
Check List
Tests
Side effects
NO
Documentation
NO
Release note
Please refer to Release Notes Language Style Guide to write a quality release note.