
fix: error in the log when no pd member fails #4608

Merged (8 commits, Aug 2, 2022)

Conversation

@hoyhbx (Contributor) commented Jun 30, 2022

What problem does this PR solve?

When failing over a broken PD member, this line emits an error even when no PD member has failed (i.e. pdDeletedFailureReplicas=0). Inspecting the operator log, we found the message "PD failover replicas (0) reaches the limit (0), skip failover" written constantly, which is very confusing to users; there is actually no error when no PD member has failed. We believe that when pdDeletedFailureReplicas=0, the operator should not emit this error message.
This PR provides a straightforward fix for this problem.

What is changed and how does it work?

We fix it by adding one more condition to the if statement in this line, checking that pdDeletedFailureReplicas is larger than 0, as illustrated below:

if pdDeletedFailureReplicas >= *tc.Spec.PD.MaxFailoverCount && pdDeletedFailureReplicas > 0 {
	klog.Errorf("PD failover replicas (%d) reaches the limit (%d), skip failover", pdDeletedFailureReplicas, *tc.Spec.PD.MaxFailoverCount)
	return nil
}
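To make the effect of the extra condition easy to see in isolation, here is a minimal, self-contained sketch of the guard (the helper name `shouldSkipFailover` and the nil handling are our own assumptions, not the operator's actual API):

```go
package main

import "fmt"

// shouldSkipFailover is a hypothetical standalone version of the guard in
// pdFailover.Failover: skip only when at least one failed member has already
// been deleted AND that count has reached the configured limit.
// maxFailoverCount is a pointer, mirroring tc.Spec.PD.MaxFailoverCount.
func shouldSkipFailover(deletedFailureReplicas int32, maxFailoverCount *int32) bool {
	if maxFailoverCount == nil {
		return false // illustrative assumption: no limit configured means no skip
	}
	return deletedFailureReplicas >= *maxFailoverCount && deletedFailureReplicas > 0
}

func main() {
	zero, one := int32(0), int32(1)
	// Before the fix, deleted=0 >= max=0 was true and the error was logged;
	// the added `deleted > 0` condition suppresses that case.
	fmt.Println(shouldSkipFailover(0, &zero)) // false
	fmt.Println(shouldSkipFailover(1, &one))  // true
}
```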

Code changes

  • Has Go code change
  • Has CI related scripts change

Tests

  • Unit test
  • E2E test
  • Manual test
  • No code
    This is a simple fix, so we believe none of the tests above are needed.

Side effects

  • Breaking backward compatibility
  • Other side effects:

Related changes

  • Need to cherry-pick to the release branch
  • Need to update the documentation

Release Notes

Please refer to Release Notes Language Style Guide before writing the release note.

NONE

@ti-chi-bot (Member) commented Jun 30, 2022

[REVIEW NOTIFICATION]

This pull request has been approved by:

  • DanielZhangQD
  • KanShiori

To complete the pull request process, please ask the reviewers in the list to review by filling /cc @reviewer in the comment.
After your PR has acquired the required number of LGTMs, you can assign this pull request to the committer in the list by filling /assign @committer in the comment to help you merge this pull request.

The full list of commands accepted by this bot can be found here.

Reviewer can indicate their review by submitting an approval review.
Reviewer can cancel approval by submitting a request changes review.

@CLAassistant commented Jun 30, 2022

CLA assistant check
All committers have signed the CLA.

@ti-chi-bot (Member):
@hoyhbx: GitHub didn't allow me to request PR reviews from the following users: reviewer.

Note that only pingcap members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

/cc @Reviewer

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@@ -66,7 +66,7 @@ func (f *pdFailover) Failover(tc *v1alpha1.TidbCluster) error {
 	}
 
 	pdDeletedFailureReplicas := tc.GetPDDeletedFailureReplicas()
-	if pdDeletedFailureReplicas >= *tc.Spec.PD.MaxFailoverCount {
+	if pdDeletedFailureReplicas >= *tc.Spec.PD.MaxFailoverCount && pdDeletedFailureReplicas > 0 {
@DanielZhangQD (Contributor):
For your case, you have set Spec.PD.MaxFailoverCount=0, right? Which means that you want to disable failover for PD, in this case, I think we should update the code here https://github.com/pingcap/tidb-operator/blob/master/pkg/manager/member/pd_member_manager.go#L241 as below:
if m.deps.CLIConfig.AutoFailover && tc.Spec.PD.MaxFailoverCount != nil && *tc.Spec.PD.MaxFailoverCount > 0 {
WDYT?
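The suggested guard can be sketched on its own as follows (the helper name `failoverEnabled` is hypothetical; the condition itself is the one proposed above):

```go
package main

import "fmt"

// failoverEnabled is a hypothetical helper expressing the suggested guard in
// pd_member_manager.go: PD failover runs only when the operator-level
// AutoFailover flag is on and the cluster sets MaxFailoverCount to a value > 0.
func failoverEnabled(autoFailover bool, maxFailoverCount *int32) bool {
	return autoFailover && maxFailoverCount != nil && *maxFailoverCount > 0
}

func main() {
	zero, three := int32(0), int32(3)
	fmt.Println(failoverEnabled(true, &zero))   // false: MaxFailoverCount=0 disables failover
	fmt.Println(failoverEnabled(true, &three))  // true
	fmt.Println(failoverEnabled(false, &three)) // false: operator flag off
}
```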

@hoyhbx (Contributor, Author):

@DanielZhangQD Thanks for your reply. Yes, we think that your suggestion is correct, and we have updated the code in the latest commit in this PR now.

@DanielZhangQD (Contributor) left a comment:

LGTM

@codecov-commenter commented Jul 8, 2022

Codecov Report

Merging #4608 (9b906a5) into master (af0fda7) will increase coverage by 9.64%.
The diff coverage is 0.00%.

@@            Coverage Diff             @@
##           master    #4608      +/-   ##
==========================================
+ Coverage   62.65%   72.30%   +9.64%     
==========================================
  Files         186      190       +4     
  Lines       20849    23342    +2493     
==========================================
+ Hits        13063    16877    +3814     
+ Misses       6589     5273    -1316     
+ Partials     1197     1192       -5     
Flag Coverage Δ
e2e 58.99% <0.00%> (?)
unittest 62.65% <0.00%> (-0.01%) ⬇️

@@ -238,7 +238,7 @@ func (m *pdMemberManager) syncPDStatefulSetForTidbCluster(tc *v1alpha1.TidbClust
 		return err
 	}
 
-	if m.deps.CLIConfig.AutoFailover {
+	if m.deps.CLIConfig.AutoFailover && tc.Spec.PD.MaxFailoverCount != nil && *tc.Spec.PD.MaxFailoverCount > 0 {
@KanShiori (Collaborator):

What happens if a failover has already occurred and the max failover count is then set to 0?

@KanShiori (Collaborator):

Then the Recover func can never be reached?

Contributor:

The failover Pods will not be scaled in; the user needs to remove the failure member from status.pd manually, and then the failover Pods will be scaled in.

Contributor:
@hoyhbx After checking the logic of the other component, I think @KanShiori 's above comment is reasonable, we should be able to recover the previous failover Pods even if we set MaxFailoverCount to 0.
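The resulting control flow can be sketched as below. All names here (`syncPDFailover`, its parameters) are hypothetical illustrations of the behavior agreed on in this thread, not the operator's actual API: creating new failover Pods is gated on a positive MaxFailoverCount, but recovery of Pods created by an earlier failover still proceeds after the user clears the failure members.

```go
package main

import "fmt"

// syncPDFailover sketches the discussed control flow: only the failover step
// is gated by MaxFailoverCount > 0; recovery (scaling extra Pods back in once
// no failure members remain in the status) runs regardless of the limit.
func syncPDFailover(autoFailover bool, maxFailoverCount *int32, hasFailureMembers bool) (doFailover, doRecover bool) {
	if !autoFailover {
		return false, false
	}
	if hasFailureMembers {
		// Only add replacement Pods while failover is enabled by a positive limit.
		doFailover = maxFailoverCount != nil && *maxFailoverCount > 0
		return doFailover, false
	}
	// No failure members recorded (e.g. removed manually from status.pd):
	// scale the extra failover Pods back in even if MaxFailoverCount is 0.
	return false, true
}

func main() {
	zero := int32(0)
	f, r := syncPDFailover(true, &zero, false)
	fmt.Println(f, r) // false true: recovery still reachable with the limit set to 0
}
```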

@DanielZhangQD (Contributor):
Could you please add the check in L244?

@hoyhbx (Contributor, Author):
Sure, I have changed it. @DanielZhangQD

@DanielZhangQD (Contributor):
/run-all-tests

@DanielZhangQD (Contributor):
/run-all-tests

@DanielZhangQD (Contributor):
/merge

@ti-chi-bot (Member):
This pull request has been accepted and is ready to merge.

Commit hash: 9b906a5

@DanielZhangQD (Contributor):
/test pull-e2e-kind-serial

@ti-chi-bot ti-chi-bot merged commit c59fdf4 into pingcap:master Aug 2, 2022
xhebox pushed a commit to KanShiori/tidb-operator that referenced this pull request Sep 16, 2022