fix backup failed when pod was auto restarted by k8s #4883

WizardXiao · 2023-02-16T06:25:50Z

What problem does this PR solve?

Ref #4805

What is changed and how does it work?

Code changes

Has Go code change
Has CI related scripts change

Tests

Unit test
E2E test
Manual test
No code

Kubernetes may restart a backup pod before the backup has finished, for example:

Draining a node, which reschedules the node's pods onto other nodes.
Deleting the pod while its job still exists, so the job creates a new pod.
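
As a rough illustration of how a newly started pod might recognise that it is re-running an unfinished backup, here is a minimal Go sketch. The `ConditionType` values, `BackupStatus`, and `IsRestarted` are hypothetical stand-ins written for this example; they are not the operator's actual API, and the real check in this PR may differ.

```go
package main

// ConditionType is a hypothetical stand-in for a backup condition type.
type ConditionType string

const (
	Scheduled ConditionType = "Scheduled"
	Running   ConditionType = "Running"
	Complete  ConditionType = "Complete"
	Failed    ConditionType = "Failed"
)

// BackupStatus records which conditions have already been observed for a
// backup. It is an assumption made for this sketch, not the real CRD status.
type BackupStatus struct {
	Conditions map[ConditionType]bool
}

// IsRestarted reports whether a pod that is just starting is re-running a
// backup that an earlier pod had begun but not finished. The assumed rule:
// the backup was already marked Running and has reached neither Complete
// nor Failed.
func IsRestarted(s BackupStatus) bool {
	if s.Conditions[Complete] || s.Conditions[Failed] {
		return false
	}
	return s.Conditions[Running]
}
```

Under these assumptions, a backup with no recorded conditions is treated as a first run, while one that already carries `Running` without a terminal condition is treated as restarted.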

Detailed tests

Drain node test

The backup was running on worker2. After worker2 was drained, Kubernetes restarted the backup pod on worker3.
The new pod cleaned up data before running the br command to back up data.
The backup succeeded.

Delete pod test

The backup pod was deleted, and the job then started a new pod.
The new pod cleaned up data before running the br command to back up data.
The backup succeeded.
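
The "clean data, then run br" step exercised by both tests can be sketched as follows. This is only an illustration under stated assumptions: `Storage`, `MemStorage`, and `RunBackup` are hypothetical names invented for the example, the in-memory store stands in for the real backup destination, and cleaning only when the pod was restarted is an assumption rather than something the PR description states.

```go
package main

import (
	"errors"
	"strings"
)

// Storage is a hypothetical minimal view of the backup destination.
type Storage interface {
	List(prefix string) ([]string, error)
	Delete(key string) error
}

// MemStorage is an in-memory Storage used only for illustration.
type MemStorage struct {
	Objects map[string][]byte
}

// List returns every key that starts with prefix.
func (m *MemStorage) List(prefix string) ([]string, error) {
	var keys []string
	for k := range m.Objects {
		if strings.HasPrefix(k, prefix) {
			keys = append(keys, k)
		}
	}
	return keys, nil
}

// Delete removes a single key; deleting a missing key is a no-op.
func (m *MemStorage) Delete(key string) error {
	delete(m.Objects, key)
	return nil
}

// RunBackup deletes leftovers under prefix when restarted is true, then
// calls run, which stands in for invoking the br command.
func RunBackup(s Storage, prefix string, restarted bool, run func() error) error {
	if prefix == "" {
		// Guard against wiping the whole destination by accident.
		return errors.New("refusing to run with an empty backup prefix")
	}
	if restarted {
		keys, err := s.List(prefix)
		if err != nil {
			return err
		}
		for _, k := range keys {
			if err := s.Delete(k); err != nil {
				return err
			}
		}
	}
	return run()
}
```

Scoping the cleanup to the backup's own prefix keeps other backups in the same destination untouched, which is why the sketch rejects an empty prefix.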

Side effects

Breaking backward compatibility
Other side effects:

Related changes

Need to cherry-pick to the release branch
Need to update the documentation

Release Notes

Please refer to Release Notes Language Style Guide before writing the release note.

Fix the issue that a backup fails when its pod is automatically restarted by Kubernetes.

ti-chi-bot · 2023-02-16T06:25:52Z

[REVIEW NOTIFICATION]

This pull request has been approved by:

fengou1
grovecai

To complete the pull request process, please ask the reviewers in the list to review by filling /cc @reviewer in the comment.
After your PR has acquired the required number of LGTMs, you can assign this pull request to the committer in the list by filling /assign @committer in the comment to help you merge this pull request.

The full list of commands accepted by this bot can be found here.

Reviewer can indicate their review by submitting an approval review.
Reviewer can cancel approval by submitting a request changes review.

codecov-commenter · 2023-02-16T06:36:53Z

Codecov Report

Merging #4883 (854305e) into master (c4e1c3a) will decrease coverage by 6.69%.
The diff coverage is 100.00%.

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #4883      +/-   ##
==========================================
- Coverage   59.43%   52.75%   -6.69%     
==========================================
  Files         226      205      -21     
  Lines       25694    25836     +142     
==========================================
- Hits        15272    13630    -1642     
- Misses       8966    10646    +1680     
- Partials     1456     1560     +104

| Flag | Coverage Δ |
|---|---|
| e2e | `52.75% <100.00%> (?)` |
| unittest | `?` |


cmd/backup-manager/app/backup/backup.go

WizardXiao · 2023-02-21T06:31:10Z

/test pull-e2e-kind pull-e2e-kind-tikv-scale-simultaneously

grovecai

LGTM, pls add e2e test

WizardXiao · 2023-02-21T13:38:51Z

/test pull-e2e-kind pull-e2e-kind-across-kubernetes pull-e2e-kind-basic pull-e2e-kind-serial pull-e2e-kind-tngm

WizardXiao · 2023-02-21T20:46:45Z

/test pull-e2e-kind pull-e2e-kind-across-kubernetes pull-e2e-kind-basic pull-e2e-kind-serial

WizardXiao · 2023-02-21T22:21:00Z

/test pull-e2e-kind pull-e2e-kind-across-kubernetes

WizardXiao · 2023-02-21T23:21:37Z

/test pull-e2e-kind-across-kubernetes

* init code for test
* just clean before backup data
* delete test code
* import pingcap/errors
* add check version
* remove test code
* add running status check
* add restart condition to clarify logic
* fix status update
* fix ut

* feat: support tiflash backup and restore during volume snapshot (#4812)
* feat: calc the backup size from snapshot storage usage (#4819)
* fix backup failed when pod was auto restarted by k8s (#4883)
  * init code for test
  * just clean before backup data
  * delete test code
  * import pingcap/errors
  * add check version
  * remove test code
  * add running status check
  * add restart condition to clarify logic
  * fix status update
  * fix ut
* br: ensure pvc names sequential for ebs restore (#4888)
* BR: Restart backup when backup job/pod unexpected failed by k8s (#4895)
  * init code for test
  * just clean before backup data
  * delete test code
  * import pingcap/errors
  * add check version
  * remove test code
  * add running status check
  * add restart condition to clarify logic
  * fix status update
  * fix ut
  * init code
  * update crd reference
  * fix miss update retry count
  * add retry limit as constant
  * init runnable code
  * refine main controller logic
  * add some note
  * address some comments
  * init e2e test code
  * add e2e env to extend backup time
  * add e2e env for test
  * fix complie
  * just test kill pod
  * refine logic
  * use pkill to kill pod
  * fix reconcile
  * add kill pod log
  * add more log
  * add more log
  * try kill pod only
  * wait and kill running backup pod
  * add wait for pod failed
  * fix wait pod running
  * use killall backup to kill pod
  * use pkill -9 backup
  * kill pod until pod is failed
  * add ps to debug
  * connect commands by semicolon
  * kill pod by signal 15
  * use panic simulate kill pod
  * test all kill pod test
  * remove useless log
  * add original reason of job or pod failure
  * rename BackupRetryFailed to BackupRetryTheFailed
* BR: Auto truncate log backup in backup schedule (#4904)
  * init schedule log backup code
  * add run log backup code
  * update api
  * refine some nodes
  * refine cacluate logic
  * add ut
  * fix make check
  * add log backup test
  * refine code
  * fix notes
  * refine function names
  * fix conflict
* fix: add a new check for encryption during the volume snapshot restore (#4914)
* br: volume-snapshot may lead to a panic when there is no block change between two snapshot (#4922)
* br: refine BackoffRetryPolicy time format (#4925)
  * refine BackoffRetryPolicy time format
  * fix some ut

---------

Co-authored-by: fengou1 <85682690+fengou1@users.noreply.github.com>
Co-authored-by: WangLe1321 <wangle1321@163.com>

init code for test

ffc46ec

ti-chi-bot requested review from csuzhangxc and liubog2008 February 16, 2023 06:25

WizardXiao added 2 commits February 17, 2023 13:45

just clean before backup data

12dfcc1

delete test code

9854c8d

WizardXiao changed the title ~~[WIP] support restart when backup pod evicted~~ [WIP] support backup pod auto restart by k8s Feb 17, 2023

WizardXiao added 2 commits February 17, 2023 14:41

import pingcap/errors

2c69633

add check version

8dfacaf

WizardXiao changed the title ~~[WIP] support backup pod auto restart by k8s~~ support backup pod auto restart by k8s Feb 20, 2023

WizardXiao and others added 3 commits February 20, 2023 08:57

Merge branch 'master' into support-snapshot-backup-restart

9ad3958

remove test code

c31f6ba

Merge branch 'support-snapshot-backup-restart' of https://github.com/…

0bb3a6b

…pingcap/tidb-operator into support-snapshot-backup-restart

WizardXiao changed the title ~~support backup pod auto restart by k8s~~ fix backup failed when pod was auto restarted by k8s Feb 20, 2023

fengou1 self-requested a review February 20, 2023 02:08

WangLe1321 reviewed Feb 20, 2023

cmd/backup-manager/app/backup/backup.go (outdated)

WizardXiao added 3 commits February 20, 2023 15:38

add running status check

acecff3

add restart condition to clarify logic

09bbb33

fix status update

f69d4ef

fengou1 approved these changes Feb 20, 2023


ti-chi-bot added the status/LGT1 label Feb 20, 2023

fix ut

afcf636

grovecai approved these changes Feb 21, 2023


ti-chi-bot added status/LGT2 and removed status/LGT1 labels Feb 21, 2023

Merge branch 'master' into support-snapshot-backup-restart

854305e

WizardXiao merged commit babb7d8 into master Feb 22, 2023

WizardXiao mentioned this pull request Mar 10, 2023

dp: cherry pick dp prs to release-1.4 #4929

Merged

10 tasks
