
BR: Restart backup when backup job/pod unexpected failed by k8s #4895

Merged

Conversation

WizardXiao (Contributor) commented Feb 22, 2023

What problem does this PR solve?

Closes #4805

What is changed and how does it work?

Add a check in the reconcile loop for backup jobs/pods that were unexpectedly failed by Kubernetes, and record a retry mark in the Backup CR so the backup can be restarted under the configured backoff policy.
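In rough outline (a minimal sketch with invented type and field names, not the actual tidb-operator API), the reconcile-side check could look like this:

```go
// Minimal sketch only: the type and function names below are illustrative,
// not the real tidb-operator API.
package backup

import "time"

// RetryRecord stands in for the retry mark recorded in the Backup CR status.
type RetryRecord struct {
	RetryNum        int       // how many retries have been recorded so far
	DetectFailedAt  time.Time // when the failed job/pod was detected
	ExpectedRetryAt time.Time // when the backup should be restarted
	OriginalReason  string    // why Kubernetes failed the job/pod (evicted, OOMKilled, ...)
}

// checkAndMarkRetry would be called from the reconcile loop. jobFailed/podFailed
// are assumed to be derived from the Job conditions and Pod phase; the policy
// values come from the user-configured exponentialBackoffRetryPolicy.
func checkAndMarkRetry(records []RetryRecord, jobFailed, podFailed bool, reason string,
	maxRetryTimes int, minRetryDuration time.Duration) (*RetryRecord, bool) {
	if !jobFailed && !podFailed {
		return nil, false // backup is still healthy, nothing to record
	}
	retryNum := len(records) + 1
	if retryNum > maxRetryTimes {
		return nil, false // retries exhausted; reconcile marks the backup failed instead
	}
	// Exponential backoff: 1x, 2x, 4x, ... of minRetryDuration.
	backoff := minRetryDuration << (retryNum - 1)
	now := time.Now()
	return &RetryRecord{
		RetryNum:        retryNum,
		DetectFailedAt:  now,
		ExpectedRetryAt: now.Add(backoff),
		OriginalReason:  reason,
	}, true
}
```

The OriginalReason field in this sketch mirrors the "add original reason of job or pod failure" commit in this PR's history.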

Code changes

  • Has Go code change
  • Has CI related scripts change

Tests

  • Unit test
  • E2E test
  • Manual test
  • No code

Manual test for exceeding maxRetryTimes:

The exponentialBackoffRetryPolicy setting is:

exponentialBackoffRetryPolicy:
  maxRetryTimes: 2
  minRetryDuration: 60
  retryTimeout: 30
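Assuming the usual exponential doubling of the retry interval starting from minRetryDuration (an assumption about the semantics, with minRetryDuration read as seconds), this policy allows two retries and fails the backup on the third kill, which matches steps 8-10 below. A small sketch of that schedule:

```go
// Sketch of the retry schedule implied by the policy above (illustrative only).
package main

import (
	"fmt"
	"time"
)

func main() {
	maxRetryTimes := 2
	minRetryDuration := 60 * time.Second // assumed unit for minRetryDuration: 60

	for failure := 1; failure <= 3; failure++ {
		if failure > maxRetryTimes {
			fmt.Printf("failure #%d: retries exhausted, mark the backup failed\n", failure)
			continue
		}
		backoff := minRetryDuration << (failure - 1) // 60s, then 120s
		fmt.Printf("failure #%d: schedule a retry after %s\n", failure, backoff)
	}
}
```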

  1. Normal running (screenshot)
  2. First kill in the pod (screenshot)
  3. First retry (screenshot)
  4. Pod restart after the first retry (screenshot)
  5. Second kill in the pod (screenshot)
  6. Second retry (screenshot)
  7. Pod restart after the second retry (screenshot)
  8. Third kill in the pod (screenshot)
  9. Third retry attempted and failed (screenshot)
  10. Pod failed after the second retry (screenshot)

Manual test for retryTimeout:

The exponentialBackoffRetryPolicy setting is:

exponentialBackoffRetryPolicy:
  maxRetryTimes: 2
  minRetryDuration: 60
  retryTimeout: 1
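Here only retryTimeout changes, so a single kill is enough to drive the backup into timeout (step 3 below). The check itself reduces to something like this sketch, under the assumed semantics that the timeout window starts at the first detected job/pod failure:

```go
// Sketch of a retryTimeout check with assumed semantics: the window covers
// everything from the first detected job/pod failure onwards.
package backup

import "time"

func retryTimedOut(firstDetectFailedAt time.Time, retryTimeout time.Duration, now time.Time) bool {
	return now.Sub(firstDetectFailedAt) > retryTimeout
}
```

In this sketch, retryTimedOut would be evaluated on each reconcile before another retry is scheduled.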

  1. Normal running (screenshot)
  2. Kill in the pod (screenshot)
  3. Timeout (screenshot)

Side effects

  • Breaking backward compatibility
  • Other side effects:

Related changes

  • Need to cherry-pick to the release branch
  • Need to update the documentation

Release Notes

Please refer to Release Notes Language Style Guide before writing the release note.

Restart the backup when the backup job/pod is unexpectedly failed by Kubernetes

ti-chi-bot (Member) commented Feb 22, 2023

[REVIEW NOTIFICATION]

This pull request has been approved by:

  • WangLe1321
  • grovecai

To complete the pull request process, please ask the reviewers in the list to review by filling /cc @reviewer in the comment.
After your PR has acquired the required number of LGTMs, you can assign this pull request to the committer in the list by filling /assign @committer in the comment to help you merge this pull request.

The full list of commands accepted by this bot can be found here.

Reviewer can indicate their review by submitting an approval review.
Reviewer can cancel approval by submitting a request changes review.

codecov-commenter commented Feb 22, 2023

Codecov Report

Merging #4895 (72d5140) into master (00b4df8) will increase coverage by 8.46%.
The diff coverage is 68.90%.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #4895      +/-   ##
==========================================
+ Coverage   59.45%   67.91%   +8.46%     
==========================================
  Files         227      231       +4     
  Lines       25828    29188    +3360     
==========================================
+ Hits        15355    19824    +4469     
+ Misses       9014     7845    -1169     
- Partials     1459     1519      +60     
Flag       Coverage Δ
e2e        52.43% <68.28%> (?)
unittest   59.06% <17.93%> (-0.40%) ⬇️

Review thread on pkg/backup/backup.go (outdated, resolved)
WizardXiao (Contributor, Author) commented:

I will refine the code later.

WizardXiao (Contributor, Author) commented:

/test pull-e2e-kind pull-e2e-kind-across-kubernetes pull-e2e-kind-tikv-scale-simultaneously pull-e2e-kind-tngm

@WizardXiao WizardXiao changed the title BR: Restart backup when k8s failed to restart BR: Restart backup when backup job/pod unexpected failed by k8s Mar 3, 2023
Review thread on pkg/backup/backup.go (resolved)
Review thread on pkg/controller/backup/backup_controller.go (outdated, resolved)
grovecai (Contributor) left a comment

LGTM

Ehco1996 (Contributor) left a comment

LGTM too, but I am still a newbie for operator.

ti-chi-bot (Member) commented:

@Ehco1996: Thanks for your review. The bot only counts approvals from reviewers and higher roles in list, but you're still welcome to leave your comments.

In response to this:

LGTM too, but I am still a newbie for operator.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.

WizardXiao (Contributor, Author) replied to "LGTM too, but I am still a newbie for operator":

Thanks

WizardXiao (Contributor, Author) commented:

/test pull-e2e-kind-across-kubernetes pull-e2e-kind-tikv-scale-simultaneously

WizardXiao (Contributor, Author) commented:

/test pull-e2e-kind-across-kubernetes

@WizardXiao WizardXiao merged commit 98237ee into pingcap:master Mar 6, 2023
charleszheng44 pushed a commit to charleszheng44/tidb-operator that referenced this pull request Mar 7, 2023
…cap#4895)

* init code for test
* just clean before backup data
* delete test code
* import pingcap/errors
* add check version
* remove test code
* add running status check
* add restart condition to clarify logic
* fix status update
* fix ut
* init code
* update crd reference
* fix miss update retry count
* add retry limit as constant
* init runnable code
* refine main controller logic
* add some note
* address some comments
* init e2e test code
* add e2e env to extend backup time
* add e2e env for test
* fix complie
* just test kill pod
* refine logic
* use pkill to kill pod
* fix reconcile
* add kill pod log
* add more log
* add more log
* try kill pod only
* wait and kill running backup pod
* add wait for pod failed
* fix wait pod running
* use killall backup to kill pod
* use pkill -9 backup
* kill pod until pod is failed
* add ps to debug
* connect commands by semicolon
* kill pod by signal 15
* use panic simulate kill pod
* test all kill pod test
* remove useless log
* add original reason of job or pod failure
* rename BackupRetryFailed to BackupRetryTheFailed
WizardXiao added a commit to WizardXiao/tidb-operator that referenced this pull request Mar 10, 2023
…cap#4895)

WizardXiao added a commit that referenced this pull request Mar 11, 2023
* feat: support tiflash backup and restore during volume snapshot (#4812)
* feat: calc the backup size from snapshot storage usage (#4819)
* fix backup failed when pod was auto restarted by k8s (#4883)
  * init code for test
  * just clean before backup data
  * delete test code
  * import pingcap/errors
  * add check version
  * remove test code
  * add running status check
  * add restart condition to clarify logic
  * fix status update
  * fix ut
* br: ensure pvc names sequential for ebs restore (#4888)
* BR: Restart backup when backup job/pod unexpected failed by k8s (#4895)
* BR: Auto truncate log backup in backup schedule (#4904)
  * init schedule log backup code
  * add run log backup code
  * update api
  * refine some nodes
  * refine cacluate logic
  * add ut
  * fix make check
  * add log backup test
  * refine code
  * fix notes
  * refine function names
  * fix conflict
* fix: add a new check for encryption during the volume snapshot restore (#4914)
* br: volume-snapshot may lead to a panic when there is no block change between two snapshot (#4922)
* br: refine BackoffRetryPolicy time format (#4925)
  * refine BackoffRetryPolicy time format
  * fix some ut

---------

Co-authored-by: fengou1 <85682690+fengou1@users.noreply.github.com>
Co-authored-by: WangLe1321 <wangle1321@163.com>
Successfully merging this pull request may close these issues:

  • Setting Restart policy of backup job to onfailure