
BR: Restart backup when backup job/pod unexpected failed by k8s #4895

Merged

Conversation

WizardXiao (Contributor) commented Feb 22, 2023

What problem does this PR solve?

Closes #4805

What is changed and how does it work?

Add a check in the reconcile loop for backup jobs/pods that were unexpectedly failed by Kubernetes, and record a retry mark in the Backup CR so the backup can be restarted under the configured backoff policy.
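In rough outline (a minimal sketch with invented type and field names, not the actual tidb-operator API), the reconcile-side check could look like this:

```go
// Minimal sketch only: the type and function names below are illustrative,
// not the real tidb-operator API.
package backup

import "time"

// RetryRecord stands in for the retry mark recorded in the Backup CR status.
type RetryRecord struct {
	RetryNum        int       // how many retries have been recorded so far
	DetectFailedAt  time.Time // when the failed job/pod was detected
	ExpectedRetryAt time.Time // when the backup should be restarted
	OriginalReason  string    // why Kubernetes failed the job/pod (evicted, OOMKilled, ...)
}

// checkAndMarkRetry would be called from the reconcile loop. jobFailed/podFailed
// are assumed to be derived from the Job conditions and Pod phase; the policy
// values come from the user-configured exponentialBackoffRetryPolicy.
func checkAndMarkRetry(records []RetryRecord, jobFailed, podFailed bool, reason string,
	maxRetryTimes int, minRetryDuration time.Duration) (*RetryRecord, bool) {
	if !jobFailed && !podFailed {
		return nil, false // backup is still healthy, nothing to record
	}
	retryNum := len(records) + 1
	if retryNum > maxRetryTimes {
		return nil, false // retries exhausted; reconcile marks the backup failed instead
	}
	// Exponential backoff: 1x, 2x, 4x, ... of minRetryDuration.
	backoff := minRetryDuration << (retryNum - 1)
	now := time.Now()
	return &RetryRecord{
		RetryNum:        retryNum,
		DetectFailedAt:  now,
		ExpectedRetryAt: now.Add(backoff),
		OriginalReason:  reason,
	}, true
}
```

The OriginalReason field in this sketch mirrors the "add original reason of job or pod failure" commit in this PR's history.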

Code changes

  • Has Go code change
  • Has CI related scripts change

Tests

  • Unit test
  • E2E test
  • Manual test
  • No code

Manual test for exceeding maxRetryTimes:

The exponentialBackoffRetryPolicy setting is:

exponentialBackoffRetryPolicy:
  maxRetryTimes: 2
  minRetryDuration: 60
  retryTimeout: 30
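Assuming the usual exponential doubling of the retry interval starting from minRetryDuration (an assumption about the semantics, with minRetryDuration read as seconds), this policy allows two retries and fails the backup on the third kill, which matches steps 8-10 below. A small sketch of that schedule:

```go
// Sketch of the retry schedule implied by the policy above (illustrative only).
package main

import (
	"fmt"
	"time"
)

func main() {
	maxRetryTimes := 2
	minRetryDuration := 60 * time.Second // assumed unit for minRetryDuration: 60

	for failure := 1; failure <= 3; failure++ {
		if failure > maxRetryTimes {
			fmt.Printf("failure #%d: retries exhausted, mark the backup failed\n", failure)
			continue
		}
		backoff := minRetryDuration << (failure - 1) // 60s, then 120s
		fmt.Printf("failure #%d: schedule a retry after %s\n", failure, backoff)
	}
}
```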

  1. Normal running (screenshot)
  2. First kill in the pod (screenshot)
  3. First retry (screenshot)
  4. Pod restart after the first retry (screenshot)
  5. Second kill in the pod (screenshot)
  6. Second retry (screenshot)
  7. Pod restart after the second retry (screenshot)
  8. Third kill in the pod (screenshot)
  9. Third retry attempted and failed (screenshot)
  10. Pod failed after the second retry (screenshot)

Manual test for retryTimeout:

The exponentialBackoffRetryPolicy setting is:

exponentialBackoffRetryPolicy:
  maxRetryTimes: 2
  minRetryDuration: 60
  retryTimeout: 1
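Here only retryTimeout changes, so a single kill is enough to drive the backup into timeout (step 3 below). The check itself reduces to something like this sketch, under the assumed semantics that the timeout window starts at the first detected job/pod failure:

```go
// Sketch of a retryTimeout check with assumed semantics: the window covers
// everything from the first detected job/pod failure onwards.
package backup

import "time"

func retryTimedOut(firstDetectFailedAt time.Time, retryTimeout time.Duration, now time.Time) bool {
	return now.Sub(firstDetectFailedAt) > retryTimeout
}
```

In this sketch, retryTimedOut would be evaluated on each reconcile before another retry is scheduled.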

  1. Normal running (screenshot)
  2. Kill in the pod (screenshot)
  3. Timeout (screenshot)

Side effects

  • Breaking backward compatibility
  • Other side effects:

Related changes

  • Need to cherry-pick to the release branch
  • Need to update the documentation

Release Notes

Please refer to Release Notes Language Style Guide before writing the release note.

Restart the backup when the backup job/pod is unexpectedly failed by Kubernetes

ti-chi-bot (Member) commented Feb 22, 2023

[REVIEW NOTIFICATION]

This pull request has been approved by:

  • WangLe1321
  • grovecai

To complete the pull request process, please ask the reviewers in the list to review by filling /cc @reviewer in the comment.
After your PR has acquired the required number of LGTMs, you can assign this pull request to the committer in the list by filling /assign @committer in the comment to help you merge this pull request.

The full list of commands accepted by this bot can be found here.

Reviewer can indicate their review by submitting an approval review.
Reviewer can cancel approval by submitting a request changes review.

codecov-commenter commented Feb 22, 2023

Codecov Report

Merging #4895 (72d5140) into master (00b4df8) will increase coverage by 8.46%.
The diff coverage is 68.90%.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #4895      +/-   ##
==========================================
+ Coverage   59.45%   67.91%   +8.46%     
==========================================
  Files         227      231       +4     
  Lines       25828    29188    +3360     
==========================================
+ Hits        15355    19824    +4469     
+ Misses       9014     7845    -1169     
- Partials     1459     1519      +60     
Flag       Coverage Δ
e2e        52.43% <68.28%> (?)
unittest   59.06% <17.93%> (-0.40%) ⬇️

Review thread on pkg/backup/backup.go (outdated, resolved)
WizardXiao (Contributor, Author) commented:

I will refine the code later.

WizardXiao (Contributor, Author) commented:

/test pull-e2e-kind pull-e2e-kind-across-kubernetes pull-e2e-kind-tikv-scale-simultaneously pull-e2e-kind-tngm

@WizardXiao WizardXiao changed the title BR: Restart backup when k8s failed to restart BR: Restart backup when backup job/pod unexpected failed by k8s Mar 3, 2023
Review thread on pkg/backup/backup.go (resolved)
Review thread on pkg/controller/backup/backup_controller.go (outdated, resolved)
grovecai (Contributor) left a comment

LGTM

Ehco1996 (Contributor) left a comment

LGTM too, but I am still a newbie for operator.

ti-chi-bot (Member) commented:

@Ehco1996: Thanks for your review. The bot only counts approvals from reviewers and higher roles in list, but you're still welcome to leave your comments.

In response to this:

LGTM too, but I am still a newbie for operator.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.

WizardXiao (Contributor, Author) replied to "LGTM too, but I am still a newbie for operator":

Thanks

WizardXiao (Contributor, Author) commented:

/test pull-e2e-kind-across-kubernetes pull-e2e-kind-tikv-scale-simultaneously

WizardXiao (Contributor, Author) commented:

/test pull-e2e-kind-across-kubernetes

@WizardXiao WizardXiao merged commit 98237ee into pingcap:master Mar 6, 2023
charleszheng44 pushed a commit to charleszheng44/tidb-operator that referenced this pull request Mar 7, 2023
…cap#4895)

* init code for test
* just clean before backup data
* delete test code
* import pingcap/errors
* add check version
* remove test code
* add running status check
* add restart condition to clarify logic
* fix status update
* fix ut
* init code
* update crd reference
* fix miss update retry count
* add retry limit as constant
* init runnable code
* refine main controller logic
* add some note
* address some comments
* init e2e test code
* add e2e env to extend backup time
* add e2e env for test
* fix complie
* just test kill pod
* refine logic
* use pkill to kill pod
* fix reconcile
* add kill pod log
* add more log
* add more log
* try kill pod only
* wait and kill running backup pod
* add wait for pod failed
* fix wait pod running
* use killall backup to kill pod
* use pkill -9 backup
* kill pod until pod is failed
* add ps to debug
* connect commands by semicolon
* kill pod by signal 15
* use panic simulate kill pod
* test all kill pod test
* remove useless log
* add original reason of job or pod failure
* rename BackupRetryFailed to BackupRetryTheFailed
WizardXiao added a commit to WizardXiao/tidb-operator that referenced this pull request Mar 10, 2023
…cap#4895)

WizardXiao added a commit that referenced this pull request Mar 11, 2023
* feat: support tiflash backup and restore during volume snapshot (#4812)
* feat: calc the backup size from snapshot storage usage (#4819)
* fix backup failed when pod was auto restarted by k8s (#4883)
  * init code for test
  * just clean before backup data
  * delete test code
  * import pingcap/errors
  * add check version
  * remove test code
  * add running status check
  * add restart condition to clarify logic
  * fix status update
  * fix ut
* br: ensure pvc names sequential for ebs restore (#4888)
* BR: Restart backup when backup job/pod unexpected failed by k8s (#4895)
* BR: Auto truncate log backup in backup schedule (#4904)
  * init schedule log backup code
  * add run log backup code
  * update api
  * refine some nodes
  * refine cacluate logic
  * add ut
  * fix make check
  * add log backup test
  * refine code
  * fix notes
  * refine function names
  * fix conflict
* fix: add a new check for encryption during the volume snapshot restore (#4914)
* br: volume-snapshot may lead to a panic when there is no block change between two snapshot (#4922)
* br: refine BackoffRetryPolicy time format (#4925)
  * refine BackoffRetryPolicy time format
  * fix some ut

---------

Co-authored-by: fengou1 <85682690+fengou1@users.noreply.github.com>
Co-authored-by: WangLe1321 <wangle1321@163.com>
Successfully merging this pull request may close these issues:

  • Setting Restart policy of backup job to onfailure