Lightning: increase backoff if split fails #49518

mittalrishabh · 2023-12-16T05:35:47Z

What problem does this PR solve?

Issue Number: close #49517

Problem Summary:
Lightning jobs fail intermittently after BR is enabled on production clusters, because batch split fails. During a backup, BR temporarily disables the schedulers for 1-2 minutes. The Lightning job, which does not have a backoff mechanism, retries the split up to 5 times without pausing or spacing out the retries.

What changed and how does it work?

This PR adds an exponential backoff before each retry. Lightning now keeps retrying for up to 930 seconds before it fails the job.

Check List

Tests

Unit test
Integration test
Manual test (add detailed scripts or steps below)
No need to test
- I checked and no code files have been changed.

Side effects
NO

Performance regression: Consumes more CPU
Performance regression: Consumes more Memory
Breaking backward compatibility

Documentation
NO

Release note

Please refer to Release Notes Language Style Guide to write a quality release note.

None

ti-chi-bot · 2023-12-16T05:35:57Z

Hi @mittalrishabh. Thanks for your PR.

I'm waiting for a pingcap member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

tiprow · 2023-12-16T05:36:04Z

Hi @mittalrishabh. Thanks for your PR.

PRs from untrusted users cannot be marked as trusted with /ok-to-test in this repo meaning untrusted PR authors can never trigger tests themselves. Collaborators can still trigger tests on the PR using /test all.

lance6716 · 2023-12-16T12:37:23Z

/ok-to-test

codecov · 2023-12-16T12:51:31Z

Codecov Report

Attention: Patch coverage is 20.00000% with 8 lines in your changes missing coverage. Please review.

Project coverage is 67.2674%. Comparing base (2065be4) to head (dd863e2).
Report is 2326 commits behind head on master.

Additional details and impacted files

@@               Coverage Diff                @@
##             master     #49518        +/-   ##
================================================
- Coverage   71.9600%   67.2674%   -4.6926%     
================================================
  Files          1438       2558      +1120     
  Lines        345749     849351    +503602     
================================================
+ Hits         248801     571337    +322536     
- Misses        76712     254230    +177518     
- Partials      20236      23784      +3548

Flag	Coverage Δ
integration	`37.2742% <20.0000%> (?)`
unit	`79.2959% <20.0000%> (+7.3358%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Components	Coverage Δ
dumpling	`73.6130% <ø> (+17.3001%)`	⬆️
parser	`∅ <ø> (∅)`
br	`70.6383% <20.0000%> (+18.9987%)`	⬆️

br/pkg/lightning/backend/local/local.go

D3Hunter · 2023-12-19T02:52:44Z

Some background on this PR: #49517 (comment)

D3Hunter · 2023-12-19T03:44:40Z

There is already a backoff in region split, it is just not long enough:

tidb/br/pkg/lightning/backend/local/localhelper.go

Lines 171 to 174 in 0719329

    
    waitTime *= 2
    if waitTime > retrySplitMaxWaitTime {
    	waitTime = retrySplitMaxWaitTime
    }

mittalrishabh · 2023-12-19T04:27:23Z

> there are already backoff in region split, just not long enough
>
> tidb/br/pkg/lightning/backend/local/localhelper.go
>
> Lines 171 to 174 in 0719329
>
>     waitTime *= 2
>     if waitTime > retrySplitMaxWaitTime {
>     	waitTime = retrySplitMaxWaitTime
>     }

That backoff only goes up to 4 seconds, while BR can suspend splits for anywhere from 60 seconds to 10 minutes. I don't want to increase that backoff time, because it would add backoff to every region split, which would be very expensive.

mittalrishabh · 2024-01-02T23:40:15Z

Can you please review this PR?

lance6716

rest lgtm

br/pkg/lightning/backend/local/local.go

mittalrishabh · 2024-01-03T19:49:30Z

Let me know if you agree with a 10-second initial backoff.

lance6716

I just noticed that SplitAndScatterRegionByRanges sleeps for at most 4 seconds per retry, for up to 8 retries, so I think sleeping 10 seconds here is good.

mittalrishabh · 2024-01-05T12:48:33Z

I need the lgtm label to check this in.

lance6716 · 2024-01-08T02:03:36Z

PTAL @D3Hunter

ti-chi-bot · 2024-01-08T04:08:22Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: D3Hunter, lance6716

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~br/pkg/lightning/OWNERS~~ [D3Hunter,lance6716]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

ti-chi-bot · 2024-01-08T04:08:26Z

[LGTM Timeline notifier]

Timeline:

2024-01-05 02:20:23.331794411 +0000 UTC m=+2396314.369021322: ☑️ agreed by lance6716.
2024-01-08 04:08:25.378208741 +0000 UTC m=+243494.962462429: ☑️ agreed by D3Hunter.

ti-chi-bot · 2024-01-08T04:50:03Z

@mittalrishabh: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name	Commit	Details	Required	Rerun command
idc-jenkins-ci-tidb/check_dev_2	`dd863e2`	link	unknown	`/test check-dev2`

Full PR test history. Your PR dashboard.

lance6716 · 2024-01-08T05:02:51Z

/retest

Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io>

ti-chi-bot · 2024-01-08T05:10:06Z

In response to a cherrypick label: new pull request created to branch release-6.5: #50164.

close pingcap#49517

close #49517

ti-chi-bot · 2024-03-20T05:42:26Z

In response to a cherrypick label: new pull request created to branch release-7.5: #51929.

close #49517

ti-chi-bot · 2024-10-28T08:05:25Z

In response to a cherrypick label: new pull request created to branch release-7.1: #56874.

Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io>

close #49517

ti-chi-bot bot added ok-to-test Indicates a PR is ready to be tested. and removed needs-ok-to-test Indicates a PR created by contributors and need ORG member send '/ok-to-test' to start testing. labels Dec 16, 2023

D3Hunter reviewed Dec 19, 2023

br/pkg/lightning/backend/local/local.go (outdated)

ti-chi-bot bot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Dec 19, 2023

mittalrishabh changed the title ~~add backoff if split fails~~ Lightning: add backoff if split fails Dec 19, 2023

ti-chi-bot bot added do-not-merge/needs-triage-completed needs-cherry-pick-release-6.5 Should cherry pick this PR to release-6.5 branch. and removed do-not-merge/invalid-title do-not-merge/needs-triage-completed labels Dec 19, 2023

lance6716 reviewed Jan 3, 2024

br/pkg/lightning/backend/local/local.go (outdated)

mittalrishabh added 3 commits January 2, 2024 19:41

add backoff if split fails

bf7aff8

review comments

dbc7e22

review comments

dd863e2

lance6716 approved these changes Jan 5, 2024

ti-chi-bot bot added approved needs-1-more-lgtm Indicates a PR needs 1 more LGTM. labels Jan 5, 2024

D3Hunter approved these changes Jan 8, 2024

ti-chi-bot bot added lgtm and removed needs-1-more-lgtm Indicates a PR needs 1 more LGTM. labels Jan 8, 2024

D3Hunter changed the title ~~Lightning: add backoff if split fails~~ Lightning: increase backoff if split fails Jan 8, 2024

ti-chi-bot bot merged commit 2a564d4 into pingcap:master Jan 8, 2024
27 of 28 checks passed

ti-chi-bot pushed a commit to ti-chi-bot/tidb that referenced this pull request Jan 8, 2024

This is an automated cherry-pick of pingcap#49518

de4f01d

Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io>

ti-chi-bot mentioned this pull request Jan 8, 2024

Lightning: increase backoff if split fails (#49518) #50164

Merged

13 tasks

AilinKid pushed a commit to AilinKid/tidb that referenced this pull request Jan 17, 2024

Lightning: increase backoff if split fails (pingcap#49518)

846fb38

close pingcap#49517

ti-chi-bot bot pushed a commit that referenced this pull request Jan 25, 2024

Lightning: increase backoff if split fails (#49518) (#50164)

6f6386a

close #49517

Benjamin2037 added the needs-cherry-pick-release-7.5 Should cherry pick this PR to release-7.5 branch. label Mar 20, 2024

ti-chi-bot mentioned this pull request Mar 20, 2024

Lightning: increase backoff if split fails (#49518) #51929

Merged

13 tasks

ti-chi-bot bot pushed a commit that referenced this pull request May 10, 2024

Lightning: increase backoff if split fails (#49518) (#51929)

5678c38

close #49517

ti-chi-bot bot added the needs-cherry-pick-release-7.1 Should cherry pick this PR to release-7.1 branch. label Oct 28, 2024

ti-chi-bot pushed a commit to ti-chi-bot/tidb that referenced this pull request Oct 28, 2024

This is an automated cherry-pick of pingcap#49518

e4c593c

Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io>

ti-chi-bot mentioned this pull request Oct 28, 2024

Lightning: increase backoff if split fails (#49518) #56874

Merged

13 tasks

ti-chi-bot bot pushed a commit that referenced this pull request Nov 11, 2024

Lightning: increase backoff if split fails (#49518) (#56874)

884b46e

close #49517
