
br: fix backup stuck due to init pod creating stuck #5457

Merged
merged 5 commits into pingcap:master on Dec 18, 2023

Conversation

@WangLe1321 (Contributor)

What problem does this PR solve?

Closes #5456

What is changed and how does it work?

Code changes

  • Has Go code change
  • Has CI related scripts change

Tests

  • Unit test
  • E2E test
  • Manual test
  • No code

Side effects

  • Breaking backward compatibility
  • Other side effects:

Related changes

  • Need to cherry-pick to the release branch
  • Need to update the documentation

Release Notes

Please refer to Release Notes Language Style Guide before writing the release note.



codecov-commenter commented Dec 13, 2023

Codecov Report

Merging #5457 (59fae01) into master (b061322) will increase coverage by 0.21%.
The diff coverage is 100.00%.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #5457      +/-   ##
==========================================
+ Coverage   61.42%   61.63%   +0.21%     
==========================================
  Files         230      241      +11     
  Lines       29152    32977    +3825     
==========================================
+ Hits        17908    20327    +2419     
- Misses       9483    10778    +1295     
- Partials     1761     1872     +111     
| Flag | Coverage Δ |
| --- | --- |
| e2e | 21.90% <0.00%> (?) |
| unittest | 61.58% <100.00%> (+0.15%) ⬆️ |

```diff
@@ -244,7 +244,7 @@ func (bm *backupManager) checkVolumeBackupInitializeJobRunning(backup *v1alpha1.
 		// all the volume snapshots has created
 		return nil
 	}
-	if !v1alpha1.IsVolumeBackupInitialized(backup) || v1alpha1.IsVolumeBackupInitializeFailed(backup) {
+	if v1alpha1.IsVolumeBackupInitializeFailed(backup) {
```
Contributor:

Noob question: checkVolumeBackupInitializeJobRunning is used to detect init job failures, based on the error returned from this function. If IsVolumeBackupInitializeFailed == true, why do we return nil rather than returning an error?

Contributor Author:

Because the aim of this function is to detect an init job failure and set the backup CR to VolumeBackupInitializeFailed. So if IsVolumeBackupInitializeFailed == true, this check is no longer needed. If we returned an error here, it would block the main logic of the backup controller.

```diff
@@ -244,7 +244,7 @@ func (bm *backupManager) checkVolumeBackupInitializeJobRunning(backup *v1alpha1.
 		// all the volume snapshots has created
 		return nil
 	}
-	if !v1alpha1.IsVolumeBackupInitialized(backup) || v1alpha1.IsVolumeBackupInitializeFailed(backup) {
+	if v1alpha1.IsVolumeBackupInitializeFailed(backup) {
```
Contributor:

Let me understand the removal of IsVolumeBackupInitialized. Now, if the init job is stuck and is eventually cleaned up by its TTL, this method won't find the job and will return the "init job deleted before ..." error here. Right?

Contributor Author:

No, the job still exists, but it has a failed status due to the TTL. So we will find that the job failed and set the backup failed here:

```go
return controller.IgnoreErrorf("backup %s/%s job was completed or failed, set it VolumeBackupInitializeFailed", ns, name)
```

In addition, I verified the change with a unit test. We can't remove the condition entirely, because on the first reconcile the backup job has not been created yet and we would get a "not found" error. So I modified the condition instead.

@ti-chi-bot ti-chi-bot bot added size/L and removed size/XS labels Dec 13, 2023
@WangLe1321 (Contributor Author)
/run-pull-e2e-kind-br

@WangLe1321 (Contributor Author)
/run-pull-e2e-kind-br

@ti-chi-bot ti-chi-bot bot removed the lgtm label Dec 14, 2023
@WangLe1321 (Contributor Author)
/run-pull-e2e-kind-br

@WangLe1321 (Contributor Author)
/run-pull-e2e-kind-br

@ti-chi-bot ti-chi-bot bot added the lgtm label Dec 18, 2023

ti-chi-bot bot commented Dec 18, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: BornChanger

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


ti-chi-bot bot commented Dec 18, 2023

[LGTM Timeline notifier]

Timeline:

  • 2023-12-13 10:46:05.69931562 +0000 UTC m=+439456.736542547: ☑️ agreed by BornChanger.
  • 2023-12-14 08:21:11.070191455 +0000 UTC m=+517162.107418382: ✖️🔁 reset by ti-chi-bot[bot].
  • 2023-12-18 09:25:21.693361978 +0000 UTC m=+866612.730588905: ☑️ agreed by BornChanger.

@csuzhangxc csuzhangxc merged commit 533439c into pingcap:master Dec 18, 2023
7 checks passed
@csuzhangxc (Member)

/cherry-pick release-1.5

@ti-chi-bot (Member)

@csuzhangxc: new pull request created to branch release-1.5: #5465.

In response to this:

/cherry-pick release-1.5

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.

csuzhangxc pushed a commit that referenced this pull request Dec 18, 2023
Co-authored-by: WangLe1321 <wangle1321@163.com>

Successfully merging this pull request may close these issues.

backup is stuck when the init pod creating stuck and exceed ttl
6 participants