-
Notifications
You must be signed in to change notification settings - Fork 499
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
br: fix backup stuck due to init pod creating stuck #5457
Conversation
Codecov Report
Additional details and impacted files@@ Coverage Diff @@
## master #5457 +/- ##
==========================================
+ Coverage 61.42% 61.63% +0.21%
==========================================
Files 230 241 +11
Lines 29152 32977 +3825
==========================================
+ Hits 17908 20327 +2419
- Misses 9483 10778 +1295
- Partials 1761 1872 +111
|
pkg/backup/backup/backup_manager.go
Outdated
@@ -244,7 +244,7 @@ func (bm *backupManager) checkVolumeBackupInitializeJobRunning(backup *v1alpha1. | |||
// all the volume snapshots has created | |||
return nil | |||
} | |||
if !v1alpha1.IsVolumeBackupInitialized(backup) || v1alpha1.IsVolumeBackupInitializeFailed(backup) { | |||
if v1alpha1.IsVolumeBackupInitializeFailed(backup) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Noob question. checkVolumeBackupInitializeJobRunning is used to detect init job failures. Based on the "err" returned from this function. IF IsVolumeBackupInitializeFailed == true, then why we return nil, rather than returning an error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because the aim of this function is to detect init job failure and set backup CR to IsVolumeBackupInitializeFailed
. So if IsVolumeBackupInitializeFailed == true
, it means we don't need this check. If we return error here, it can block the main logic of the backup controller.
pkg/backup/backup/backup_manager.go
Outdated
@@ -244,7 +244,7 @@ func (bm *backupManager) checkVolumeBackupInitializeJobRunning(backup *v1alpha1. | |||
// all the volume snapshots has created | |||
return nil | |||
} | |||
if !v1alpha1.IsVolumeBackupInitialized(backup) || v1alpha1.IsVolumeBackupInitializeFailed(backup) { | |||
if v1alpha1.IsVolumeBackupInitializeFailed(backup) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let me understand the removal of IsVolumeBackupInitialized. Now if the init job is stuck and eventually ttled, then this method won't find the job, and return "init job deleted before ..." error here
return updateErr |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, the job still exists but it has failed status due to ttl. So we will find job failed and set backup failed here
return controller.IgnoreErrorf("backup %s/%s job was completed or failed, set it VolumeBackupInitializeFailed", ns, name) |
In addition, I verified the change by unit test. We can't remove the condition directly because in the first reconcile, the backup job is not created and we will get a 'not found' error. So I modified the condition.
/run-pull-e2e-kind-br |
/run-pull-e2e-kind-br |
/run-pull-e2e-kind-br |
/run-pull-e2e-kind-br |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: BornChanger The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
[LGTM Timeline notifier]Timeline:
|
/cherry-pick release-1.5 |
@csuzhangxc: new pull request created to branch In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository. |
What problem does this PR solve?
Closes #5456
What is changed and how does it work?
Code changes
Tests
Side effects
Related changes
Release Notes
Please refer to Release Notes Language Style Guide before writing the release note.