fix: error in the log when no pd member fails #4608
Conversation
[REVIEW NOTIFICATION] This pull request has been approved by:
To complete the pull request process, please ask the reviewers in the list to review. The full list of commands accepted by this bot can be found here. Reviewers can indicate their review by submitting an approval review.
@hoyhbx: GitHub didn't allow me to request PR reviews from the following users: reviewer. Note that only pingcap members and repo collaborators can review this PR, and authors cannot review their own PRs.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
pkg/manager/member/pd_failover.go
@@ -66,7 +66,7 @@ func (f *pdFailover) Failover(tc *v1alpha1.TidbCluster) error {
 	}

 	pdDeletedFailureReplicas := tc.GetPDDeletedFailureReplicas()
-	if pdDeletedFailureReplicas >= *tc.Spec.PD.MaxFailoverCount {
+	if pdDeletedFailureReplicas >= *tc.Spec.PD.MaxFailoverCount && pdDeletedFailureReplicas > 0 {
For your case, you have set Spec.PD.MaxFailoverCount=0, right? That means you want to disable failover for PD. In this case, I think we should update the code here https://github.com/pingcap/tidb-operator/blob/master/pkg/manager/member/pd_member_manager.go#L241 as below:
if m.deps.CLIConfig.AutoFailover && tc.Spec.PD.MaxFailoverCount != nil && *tc.Spec.PD.MaxFailoverCount > 0 {
WDYT?
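For illustration only, here is a minimal, self-contained Go sketch of the guard suggested above; the types are simplified stand-ins for the real TidbCluster spec, not the actual operator code. Go's short-circuit evaluation of && means the pointer is only dereferenced after the nil check passes.

```go
package main

import "fmt"

// pdSpec is a hypothetical stand-in for the PD spec; MaxFailoverCount is a
// pointer because the field is optional in the CRD.
type pdSpec struct {
	MaxFailoverCount *int32
}

// failoverEnabled mirrors the shape of the suggested condition: failover is
// attempted only when auto-failover is on, the field is set, and it is > 0.
func failoverEnabled(autoFailover bool, pd pdSpec) bool {
	return autoFailover && pd.MaxFailoverCount != nil && *pd.MaxFailoverCount > 0
}

func main() {
	zero, three := int32(0), int32(3)
	fmt.Println(failoverEnabled(true, pdSpec{MaxFailoverCount: &three})) // true
	fmt.Println(failoverEnabled(true, pdSpec{MaxFailoverCount: &zero}))  // false: failover disabled
	fmt.Println(failoverEnabled(true, pdSpec{MaxFailoverCount: nil}))    // false, and no nil dereference
}
```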
@DanielZhangQD Thanks for your reply. Yes, we agree that your suggestion is correct, and we have updated the code accordingly in the latest commit of this PR.
LGTM
Codecov Report
@@ Coverage Diff @@
## master #4608 +/- ##
==========================================
+ Coverage 62.65% 72.30% +9.64%
==========================================
Files 186 190 +4
Lines 20849 23342 +2493
==========================================
+ Hits 13063 16877 +3814
+ Misses 6589 5273 -1316
+ Partials 1197 1192 -5
@@ -238,7 +238,7 @@ func (m *pdMemberManager) syncPDStatefulSetForTidbCluster(tc *v1alpha1.TidbClust
 		return err
 	}

-	if m.deps.CLIConfig.AutoFailover {
+	if m.deps.CLIConfig.AutoFailover && tc.Spec.PD.MaxFailoverCount != nil && *tc.Spec.PD.MaxFailoverCount > 0 {
What happens if a failover has already occurred and the max failover count is then set to 0?
Does that mean the code can no longer reach the Recover func?
The failover Pods will not be scaled in; the user needs to manually remove the failure member from status.pd, and then the failover Pods will be scaled in.
@hoyhbx After checking the logic of the other components, I think @KanShiori's comment above is reasonable: we should be able to recover the previous failover Pods even if MaxFailoverCount is set to 0.
Could you please add the check in L244?
Sure, I have changed it. @DanielZhangQD
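To summarize the outcome of this thread, here is a rough, self-contained sketch of the control flow that was agreed on; the names and types are hypothetical simplifications, not the actual pd_member_manager.go code. New failovers are gated by MaxFailoverCount, while Recover stays reachable so that previously created failover Pods can still be scaled in after MaxFailoverCount is lowered to 0.

```go
package main

import "fmt"

// cluster and noopFailover are hypothetical stand-ins for the real
// TidbCluster and PD failover types, used only to illustrate the control flow.
type cluster struct {
	maxFailoverCount *int32
	allMembersReady  bool
	failureMembers   int
}

type noopFailover struct{}

func (noopFailover) Failover(c *cluster) error {
	fmt.Println("failover triggered")
	return nil
}

func (noopFailover) Recover(c *cluster) {
	fmt.Println("recover triggered")
	c.failureMembers = 0
}

func syncFailover(autoFailover bool, f noopFailover, c *cluster) error {
	failoverEnabled := autoFailover && c.maxFailoverCount != nil && *c.maxFailoverCount > 0
	switch {
	case c.allMembersReady && c.failureMembers > 0:
		// Recover is not gated by MaxFailoverCount, so earlier failover Pods
		// can still be cleaned up once all members are healthy again.
		f.Recover(c)
	case failoverEnabled && !c.allMembersReady:
		// New failovers happen only while MaxFailoverCount is set and > 0.
		return f.Failover(c)
	}
	return nil
}

func main() {
	zero := int32(0)
	c := &cluster{maxFailoverCount: &zero, allMembersReady: true, failureMembers: 1}
	_ = syncFailover(true, noopFailover{}, c) // prints "recover triggered" despite MaxFailoverCount == 0
}
```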
/run-all-tests
/run-all-tests
/merge
This pull request has been accepted and is ready to merge. Commit hash: 9b906a5
/test pull-e2e-kind-serial
What problem does this PR solve?
When failing over a broken PD member, this line gives us an error even if no PD member has failed (i.e. pdDeletedFailureReplicas=0). When we inspect the operator log, we find that the message "PD failover replicas (0) reaches the limit (0), skip failover" is written constantly and is very confusing to users, even though there is actually no error when no PD member fails. We believe that when pdDeletedFailureReplicas=0, the operator should not emit the message "PD failover replicas (0) reaches the limit (0), skip failover". This PR provides a straightforward fix for this problem.
What is changed and how does it work?
We fix it by adding one more condition to the if statement on this line, checking whether pdDeletedFailureReplicas is larger than 0, as illustrated below.
Code changes
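As a reference, here is a small, self-contained sketch of the condition described above (a simplified stand-in mirroring the diff earlier in this conversation, not the operator code itself); with the extra pdDeletedFailureReplicas > 0 check, the "reaches the limit (0)" message is no longer emitted when nothing has failed. Note that, following the review discussion, the final version of this PR moves the guard into pd_member_manager.go instead.

```go
package main

import "fmt"

// checkFailoverLimit mirrors the shape of the fixed condition. The real code
// operates on the TidbCluster object; plain int32 values are used here so the
// example is runnable on its own.
func checkFailoverLimit(pdDeletedFailureReplicas, maxFailoverCount int32) {
	if pdDeletedFailureReplicas >= maxFailoverCount && pdDeletedFailureReplicas > 0 {
		fmt.Printf("PD failover replicas (%d) reaches the limit (%d), skip failover\n",
			pdDeletedFailureReplicas, maxFailoverCount)
		return
	}
	fmt.Println("no failover limit hit (or nothing to fail over)")
}

func main() {
	checkFailoverLimit(0, 0) // no longer logs the confusing "reaches the limit (0)" message
	checkFailoverLimit(3, 3) // still logs once the limit is genuinely reached
}
```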
Tests
This is a simple fix and we believe no additional tests are needed.
Side effects
Related changes
Release Notes
Please refer to Release Notes Language Style Guide before writing the release note.