ebs br: add resilience to single TiKV crash #5572

Merged
merged 5 commits into from
Mar 20, 2024


@BornChanger BornChanger commented Mar 13, 2024

What problem does this PR solve?

This PR makes restore resilient to a single TiKV crash caused by WAL corruption.

What is changed and how does it work?

Add a new attribute, spec.tolerateSingleTiKVOutage, to the restore CR. When it is set to true, a restore to the TidbCluster (tc) can tolerate a single TiKV outage caused by WAL corruption.
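As a rough illustration of where the new attribute sits, the sketch below serializes a trimmed-down spec with the flag off and on. Only the field name tolerateSingleTiKVOutage comes from this PR; the restoreSpec type, the renderSpec helper, and the omitempty tag are assumptions made for the example and are not taken from the tidb-operator source.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// restoreSpec is a hypothetical, trimmed-down stand-in for the restore CR spec.
// Only the JSON field name is taken from the PR description; the omitempty
// tag is an assumption.
type restoreSpec struct {
	TolerateSingleTiKVOutage bool `json:"tolerateSingleTiKVOutage,omitempty"`
}

// renderSpec (hypothetical helper) serializes the spec under a "spec" key.
func renderSpec(tolerate bool) string {
	out, err := json.Marshal(map[string]restoreSpec{
		"spec": {TolerateSingleTiKVOutage: tolerate},
	})
	if err != nil {
		panic(err)
	}
	return string(out)
}

func main() {
	fmt.Println(renderSpec(false)) // flag left at its zero value
	fmt.Println(renderSpec(true))  // flag switched on for the workaround
}
```

Under these assumptions the flag defaults to false, so a restore CR that does not mention it keeps the original strict behavior.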

Code changes

  • Has Go code change
  • Has CI related scripts change

Tests

  • Unit test
  • E2E test
  • Manual test
  • No code

Side effects

  • Breaking backward compatibility
  • Other side effects:

Related changes

  • Need to cherry-pick to the release branch
  • Need to update the documentation

Release Notes

Please refer to Release Notes Language Style Guide before writing the release note.


@ti-chi-bot ti-chi-bot bot requested a review from howardlau1999 March 13, 2024 15:43
@ti-chi-bot ti-chi-bot bot added the size/L label Mar 13, 2024
@BornChanger BornChanger changed the title ebs br: add support to do wal check during warmup and resilience to s… ebs br: add support to do wal check during warmup and resilience to single TiKV crash Mar 13, 2024
@BornChanger BornChanger force-pushed the tolerate_single_wal_corruption branch from 5a1fe13 to b34d39f Compare March 13, 2024 15:56
Signed-off-by: BornChanger <dawn_catcher@126.com>
Signed-off-by: BornChanger <dawn_catcher@126.com>
@BornChanger BornChanger force-pushed the tolerate_single_wal_corruption branch from b34d39f to ade2e19 Compare March 20, 2024 10:00
@ti-chi-bot ti-chi-bot bot added size/S and removed size/L labels Mar 20, 2024
@BornChanger BornChanger changed the title ebs br: add support to do wal check during warmup and resilience to single TiKV crash ebs br: add resilience to single TiKV crash Mar 20, 2024
Signed-off-by: BornChanger <dawn_catcher@126.com>
@ti-chi-bot ti-chi-bot bot added size/M and removed size/S labels Mar 20, 2024
@BornChanger
Contributor Author

/run-pull-e2e-kind-br

Signed-off-by: BornChanger <dawn_catcher@126.com>
@BornChanger
Contributor Author

/run-pull-e2e-kind-br

Contributor

ti-chi-bot bot commented Mar 20, 2024

@YuJuncen: adding LGTM is restricted to approvers and reviewers in OWNERS files.

In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Contributor

ti-chi-bot bot commented Mar 20, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: WangLe1321, YuJuncen

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot bot removed the lgtm label Mar 20, 2024
Contributor

ti-chi-bot bot commented Mar 20, 2024

[LGTM Timeline notifier]

Timeline:

  • 2024-03-20 11:23:45.247044419 +0000 UTC m=+1462252.269290805: ☑️ agreed by WangLe1321.
  • 2024-03-20 12:32:23.587982292 +0000 UTC m=+1466370.610228680: ✖️🔁 reset by ti-chi-bot[bot].

Contributor

ti-chi-bot bot commented Mar 20, 2024

New changes are detected. LGTM label has been removed.

@codecov-commenter

codecov-commenter commented Mar 20, 2024

Codecov Report

Attention: Patch coverage is 0%, with 1 line in your changes missing coverage. Please review.

Project coverage is 61.51%. Comparing base (10ece31) to head (be4d0ec).

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #5572      +/-   ##
==========================================
- Coverage   61.52%   61.51%   -0.02%     
==========================================
  Files         235      235              
  Lines       30314    30314              
==========================================
- Hits        18651    18647       -4     
- Misses       9796     9798       +2     
- Partials     1867     1869       +2     
Flag Coverage Δ
unittest 61.51% <0.00%> (-0.02%) ⬇️

@csuzhangxc csuzhangxc merged commit 5f4207d into pingcap:master Mar 20, 2024
5 of 6 checks passed
@csuzhangxc
Member

/cherry-pick release-1.5

@ti-chi-bot
Member

@csuzhangxc: new pull request created to branch release-1.5: #5585.

In response to this:

/cherry-pick release-1.5

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.

@ti-chi-bot
Member

@csuzhangxc: new pull request could not be created: failed to create pull request against pingcap/tidb-operator#release-1.5 from head ti-chi-bot:cherry-pick-5572-to-release-1.5: status code 422 not one of [201], body: {"message":"Validation Failed","errors":[{"resource":"PullRequest","code":"custom","message":"A pull request already exists for ti-chi-bot:cherry-pick-5572-to-release-1.5."}],"documentation_url":"https://docs.github.com/rest/pulls/pulls#create-a-pull-request"}

In response to this:

/cherry-pick release-1.5

csuzhangxc pushed a commit that referenced this pull request Mar 20, 2024
Signed-off-by: BornChanger <dawn_catcher@126.com>
Co-authored-by: BornChanger <dawn_catcher@126.com>
if len(tc.Status.TiKV.Stores) != int(tc.Spec.TiKV.Replicas) {
func (tc *TidbCluster) AllTiKVsAreAvailable(tolerateSingleTiKVOutage bool) bool {
if (!tolerateSingleTiKVOutage && len(tc.Status.TiKV.Stores) != int(tc.Spec.TiKV.Replicas)) ||
(tolerateSingleTiKVOutage && int(tc.Spec.TiKV.Replicas-1) != len(tc.Status.TiKV.Stores)) {
Contributor

@BornChanger
Would this check return false if we enable tolerateSingleTiKVOutage and don't observe a WAL corruption (i.e. replicas == stores)? Should it be stores >= replicas - 1?

Contributor Author

This is not a general fix but a workaround for a single TiKV node stuck in a crash loop. The intended process is that we first observe one TiKV crashing due to the WAL corruption issue, and only THEN turn on tolerateSingleTiKVOutage in the restore CR.
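To make the difference discussed in this thread concrete, here is a minimal sketch that compares the store-count condition from the diff with the reviewer's suggested relaxation. It models only the count comparison, assumes that condition guards an early "not available" result, and ignores any other checks the real AllTiKVsAreAvailable method may perform; the function names and plain int parameters are hypothetical.

```go
package main

import "fmt"

// allTiKVsAvailableStrict mirrors the count comparison shown in the diff:
// with tolerance off, stores must equal replicas; with tolerance on,
// stores must equal replicas-1 exactly.
func allTiKVsAvailableStrict(stores, replicas int, tolerate bool) bool {
	if tolerate {
		return stores == replicas-1
	}
	return stores == replicas
}

// allTiKVsAvailableRelaxed is the reviewer's suggested alternative:
// with tolerance on, at most one store may be missing.
func allTiKVsAvailableRelaxed(stores, replicas int, tolerate bool) bool {
	if tolerate {
		return stores >= replicas-1
	}
	return stores == replicas
}

func main() {
	const replicas = 3
	for _, stores := range []int{3, 2, 1} {
		fmt.Printf("stores=%d replicas=%d strict=%v relaxed=%v\n",
			stores, replicas,
			allTiKVsAvailableStrict(stores, replicas, true),
			allTiKVsAvailableRelaxed(stores, replicas, true))
	}
}
```

With the flag on and all three stores present, the strict form reports false while the relaxed form reports true. That is the case the reviewer raised, and it is why the author stresses enabling the flag only after a crash has been observed.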

7 participants