
br: Fail backup if any snapshots failed #53038

Merged

Conversation

michaelmdeng
Contributor

What problem does this PR solve?

During EBS snapshot backup, if any snapshots fail, the backup will be stuck indefinitely checking the snapshots that failed. Since we only run a single backup at a time, this also blocks subsequent backups.

Issue Number: close #53037

Problem Summary: Backup stuck when EBS snapshot failed

What changed and how does it work?

Update the snapshot backup process to fail as soon as it detects any snapshot in the error state. We observe rare snapshot failures due to underlying EBS hardware issues.
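
As a rough sketch of the approach (illustrative only, not the PR's actual diff), a wait loop built on the aws-sdk-go v1 ec2 API could bail out as soon as any snapshot reports the error state. The function name waitSnapshots and the 5-second polling interval are assumptions made for this example:

```go
// Illustrative fail-fast wait loop (not the PR's actual code).
package main

import (
	"fmt"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ec2"
)

// waitSnapshots polls the given snapshot IDs until they all complete, and
// returns an error as soon as any snapshot enters the "error" state instead
// of polling it forever.
func waitSnapshots(svc *ec2.EC2, ids []string) error {
	pending := make(map[string]struct{}, len(ids))
	for _, id := range ids {
		pending[id] = struct{}{}
	}
	for len(pending) > 0 {
		out, err := svc.DescribeSnapshots(&ec2.DescribeSnapshotsInput{
			SnapshotIds: aws.StringSlice(ids),
		})
		if err != nil {
			return err
		}
		for _, s := range out.Snapshots {
			switch aws.StringValue(s.State) {
			case ec2.SnapshotStateCompleted:
				delete(pending, aws.StringValue(s.SnapshotId))
			case ec2.SnapshotStateError:
				// Fail the whole backup rather than waiting indefinitely.
				return fmt.Errorf("snapshot %s failed: %s",
					aws.StringValue(s.SnapshotId), aws.StringValue(s.StateMessage))
			}
		}
		time.Sleep(5 * time.Second) // polling interval chosen arbitrarily for the sketch
	}
	return nil
}

func main() {
	svc := ec2.New(session.Must(session.NewSession()))
	if err := waitSnapshots(svc, []string{"snap-0123456789abcdef0"}); err != nil {
		fmt.Println("backup failed:", err)
	}
}
```

The diff quoted in the review below adds this kind of error-state branch to the snapshot wait path, so a failed snapshot surfaces as a backup error instead of an endless wait.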

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No need to test
    • I checked and no code files have been changed.

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Documentation

  • Affects user behaviors
  • Contains syntax changes
  • Contains variable changes
  • Contains experimental features
  • Changes MySQL compatibility

Release note

Please refer to Release Notes Language Style Guide to write a quality release note.

None

@sre-bot
Contributor

sre-bot commented May 6, 2024

CLA assistant check
All committers have signed the CLA.


ti-chi-bot bot commented May 6, 2024

Welcome @michaelmdeng!

It looks like this is your first PR to pingcap/tidb 🎉.

I'm the bot that helps you request reviewers, add labels, and more. See available commands.

We want to make sure your contribution gets all the attention it needs!



Thank you, and welcome to pingcap/tidb. 😃

@ti-chi-bot ti-chi-bot bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. needs-ok-to-test Indicates a PR created by contributors and need ORG member send '/ok-to-test' to start testing. labels May 6, 2024

ti-chi-bot bot commented May 6, 2024

Hi @michaelmdeng. Thanks for your PR.

I'm waiting for a pingcap member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@michaelmdeng michaelmdeng force-pushed the mdeng/error-failed-snapshots branch from 9b88ab9 to b15b738 Compare May 6, 2024 18:19
@michaelmdeng michaelmdeng changed the title [br] Fail backup if any snapshots failed br: Fail backup if any snapshots failed May 6, 2024
Comment on lines +254 to +256
} else if *s.State == ec2.SnapshotStateError {
	log.Error("snapshot failed", zap.String("id", *s.SnapshotId), zap.String("error", (*s.StateMessage)))
	return 0, errors.Errorf("snapshot %s failed", *s.SnapshotId)
Contributor


Looks great. Backup metadata is still generated and contains the snapshot IDs, so cleanup should delete the snapshots.

snapIDMap, volAZs, err = ec2Session.CreateSnapshots(backupInfo)
if err != nil {
	// TODO maybe we should consider remove snapshots already exists in a failure
	return errors.Trace(err)
}
// Step.3 save backup meta file to s3.
// NOTE: maybe define the meta file in kvproto in the future.
// but for now json is enough.
backupInfo.SetClusterVersion(normalizedVer.String())
backupInfo.SetFullBackupType(string(cfg.FullBackupType))
backupInfo.SetResolvedTS(resolvedTs)
backupInfo.SetSnapshotIDs(snapIDMap)
backupInfo.SetVolumeAZs(volAZs)
err = saveMetaFile(c, backupInfo, client.GetStorage())
if err != nil {
	return err
}
if !cfg.SkipPauseGCAndScheduler {
	log.Info("snapshot started, restore schedule")
	if restoreE := goBackupToNormal(ctx); restoreE != nil {
		log.Warn("failed to restore removed schedulers, you may need to restore them manually", zap.Error(restoreE))
	}
}
log.Info("wait async snapshots finish")
totalSize, err = ec2Session.WaitSnapshotsCreated(snapIDMap, progress)
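
For reference, here is a minimal cleanup sketch (not part of this PR) of how the snapshot IDs saved in the backup metadata could be used to delete the snapshots after a failed backup. It assumes the aws-sdk-go v1 ec2 API; the helper name deleteSnapshots and the volume-ID to snapshot-ID map shape are illustrative assumptions:

```go
// Illustrative cleanup helper (not part of this PR).
package main

import (
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ec2"
)

// deleteSnapshots best-effort deletes every snapshot ID recorded in the
// backup metadata (modelled here as a volume-ID -> snapshot-ID map).
func deleteSnapshots(svc *ec2.EC2, snapIDMap map[string]string) {
	for volID, snapID := range snapIDMap {
		_, err := svc.DeleteSnapshot(&ec2.DeleteSnapshotInput{
			SnapshotId: aws.String(snapID),
		})
		if err != nil {
			log.Printf("failed to delete snapshot %s (volume %s): %v", snapID, volID, err)
			continue
		}
		log.Printf("deleted snapshot %s (volume %s)", snapID, volID)
	}
}

func main() {
	svc := ec2.New(session.Must(session.NewSession()))
	deleteSnapshots(svc, map[string]string{"vol-0abc1234": "snap-0abc1234"})
}
```

Whether such cleanup should run automatically on failure is the open question noted in the TODO above.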

Contributor


I wonder, though: how much time does it take for AWS to mark the snapshots as failed?

@ti-chi-bot ti-chi-bot bot added approved needs-1-more-lgtm Indicates a PR needs 1 more LGTM. labels May 8, 2024

ti-chi-bot bot commented May 8, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: BornChanger, YuJuncen

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot bot added lgtm and removed needs-1-more-lgtm Indicates a PR needs 1 more LGTM. labels May 8, 2024

ti-chi-bot bot commented May 8, 2024

[LGTM Timeline notifier]

Timeline:

  • 2024-05-08 09:26:00.62802725 +0000 UTC m=+1040514.385162823: ☑️ agreed by YuJuncen.
  • 2024-05-08 09:39:45.780055231 +0000 UTC m=+1041339.537190803: ☑️ agreed by BornChanger.


codecov bot commented May 8, 2024

Codecov Report

Attention: Patch coverage is 0%, with 3 lines in your changes missing coverage. Please review.

Project coverage is 74.3287%. Comparing base (b421b72) to head (1447e43).
Report is 23 commits behind head on master.

Additional details and impacted files
@@               Coverage Diff                @@
##             master     #53038        +/-   ##
================================================
+ Coverage   72.0329%   74.3287%   +2.2957%     
================================================
  Files          1499       1510        +11     
  Lines        431222     443332     +12110     
================================================
+ Hits         310622     329523     +18901     
+ Misses       101291      93586      -7705     
- Partials      19309      20223       +914     
Flag Coverage Δ
integration 48.0147% <0.0000%> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

Components Coverage Δ
dumpling 53.9957% <ø> (ø)
parser ∅ <ø> (∅)
br 50.6534% <0.0000%> (+12.3075%) ⬆️


tiprow bot commented May 8, 2024

@michaelmdeng: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
fast_test_tiprow b15b738 link true /test fast_test_tiprow

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@nkg-
Contributor

nkg- commented May 8, 2024

@BornChanger: Is the test failure relevant?

@michaelmdeng
Contributor Author

@BornChanger: Is the test failure relevant?

Can I get a retest? Updated to fix lint errors.

@michaelmdeng michaelmdeng force-pushed the mdeng/error-failed-snapshots branch from f402bd4 to c08d578 Compare May 8, 2024 18:07
@ti-chi-bot ti-chi-bot bot merged commit f304f04 into pingcap:master May 8, 2024
21 of 22 checks passed
@nkg-
Contributor

nkg- commented May 8, 2024

Please cherry-pick to release-6.5 and release-7.5.

@BornChanger BornChanger added needs-cherry-pick-release-6.5 Should cherry pick this PR to release-6.5 branch. needs-cherry-pick-release-7.5 Should cherry pick this PR to release-7.5 branch. labels May 8, 2024
@ti-chi-bot
Member

In response to a cherrypick label: new pull request created to branch release-7.5: #53112.

ti-chi-bot pushed a commit to ti-chi-bot/tidb that referenced this pull request May 8, 2024
Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io>
@ti-chi-bot
Member

In response to a cherrypick label: new pull request created to branch release-6.5: #53113.

ti-chi-bot bot pushed a commit that referenced this pull request May 13, 2024
terry1purcell pushed a commit to terry1purcell/tidb that referenced this pull request May 17, 2024
RidRisR pushed a commit to RidRisR/tidb that referenced this pull request May 23, 2024
mittalrishabh pushed a commit to mittalrishabh/tidb that referenced this pull request May 30, 2024
close pingcap#53037

Co-authored-by: Michael Deng <33045922+michaelmdeng@users.noreply.github.com>
ti-chi-bot bot pushed a commit that referenced this pull request Jun 3, 2024
@BornChanger BornChanger added the needs-cherry-pick-release-8.1 Should cherry pick this PR to release-8.1 branch. label Jul 29, 2024
@ti-chi-bot
Member

In response to a cherrypick label: new pull request created to branch release-8.1: #55009.

Labels
  • approved
  • lgtm
  • needs-cherry-pick-release-6.5 Should cherry pick this PR to release-6.5 branch.
  • needs-cherry-pick-release-7.5 Should cherry pick this PR to release-7.5 branch.
  • needs-cherry-pick-release-8.1 Should cherry pick this PR to release-8.1 branch.
  • needs-ok-to-test Indicates a PR created by contributors and need ORG member send '/ok-to-test' to start testing.
  • release-note-none Denotes a PR that doesn't merit a release note.
  • size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Development

Successfully merging this pull request may close these issues.

Snapshot backup stuck when EBS snapshot failed
6 participants