
fix: error in the log when no pd member fails #4608

Merged (8 commits, Aug 2, 2022)

Conversation

@hoyhbx (Contributor) commented Jun 30, 2022

What problem does this PR solve?

When failing over a broken PD member, this line emits an error even when no PD member has failed (i.e. pdDeletedFailureReplicas=0). Inspecting the operator log, we found the message "PD failover replicas (0) reaches the limit (0), skip failover" written constantly, which is very confusing to users; there is actually no error when no PD member has failed. We believe that when pdDeletedFailureReplicas=0, the operator should not emit this error message.
This PR provides a straightforward fix for this problem.

What is changed and how does it work?

We fix it by adding one more condition to the if statement in this line, checking that pdDeletedFailureReplicas is larger than 0, as illustrated below:

if pdDeletedFailureReplicas >= *tc.Spec.PD.MaxFailoverCount && pdDeletedFailureReplicas > 0 {
	klog.Errorf("PD failover replicas (%d) reaches the limit (%d), skip failover", pdDeletedFailureReplicas, *tc.Spec.PD.MaxFailoverCount)
	return nil
}
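To make the effect of the extra condition easy to see in isolation, here is a minimal, self-contained sketch of the guard (the helper name `shouldSkipFailover` and the nil handling are our own assumptions, not the operator's actual API):

```go
package main

import "fmt"

// shouldSkipFailover is a hypothetical standalone version of the guard in
// pdFailover.Failover: skip only when at least one failed member has already
// been deleted AND that count has reached the configured limit.
// maxFailoverCount is a pointer, mirroring tc.Spec.PD.MaxFailoverCount.
func shouldSkipFailover(deletedFailureReplicas int32, maxFailoverCount *int32) bool {
	if maxFailoverCount == nil {
		return false // illustrative assumption: no limit configured means no skip
	}
	return deletedFailureReplicas >= *maxFailoverCount && deletedFailureReplicas > 0
}

func main() {
	zero, one := int32(0), int32(1)
	// Before the fix, deleted=0 >= max=0 was true and the error was logged;
	// the added `deleted > 0` condition suppresses that case.
	fmt.Println(shouldSkipFailover(0, &zero)) // false
	fmt.Println(shouldSkipFailover(1, &one))  // true
}
```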

Code changes

  • Has Go code change
  • Has CI related scripts change

Tests

  • Unit test
  • E2E test
  • Manual test
  • No code
    This is a simple fix, so we believe none of the tests above are needed.

Side effects

  • Breaking backward compatibility
  • Other side effects:

Related changes

  • Need to cherry-pick to the release branch
  • Need to update the documentation

Release Notes

Please refer to Release Notes Language Style Guide before writing the release note.

NONE

@ti-chi-bot (Member) commented Jun 30, 2022

[REVIEW NOTIFICATION]

This pull request has been approved by:

  • DanielZhangQD
  • KanShiori

To complete the pull request process, please ask the reviewers in the list to review by filling /cc @reviewer in the comment.
After your PR has acquired the required number of LGTMs, you can assign this pull request to the committer in the list by filling /assign @committer in the comment to help you merge this pull request.

The full list of commands accepted by this bot can be found here.

Reviewer can indicate their review by submitting an approval review.
Reviewer can cancel approval by submitting a request changes review.

@CLAassistant commented Jun 30, 2022

CLA assistant check
All committers have signed the CLA.

@ti-chi-bot (Member):
@hoyhbx: GitHub didn't allow me to request PR reviews from the following users: reviewer.

Note that only pingcap members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

/cc @Reviewer

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@@ -66,7 +66,7 @@ func (f *pdFailover) Failover(tc *v1alpha1.TidbCluster) error {
 	}
 
 	pdDeletedFailureReplicas := tc.GetPDDeletedFailureReplicas()
-	if pdDeletedFailureReplicas >= *tc.Spec.PD.MaxFailoverCount {
+	if pdDeletedFailureReplicas >= *tc.Spec.PD.MaxFailoverCount && pdDeletedFailureReplicas > 0 {
@DanielZhangQD (Contributor):
For your case, you have set Spec.PD.MaxFailoverCount=0, right? Which means that you want to disable failover for PD, in this case, I think we should update the code here https://github.com/pingcap/tidb-operator/blob/master/pkg/manager/member/pd_member_manager.go#L241 as below:
if m.deps.CLIConfig.AutoFailover && tc.Spec.PD.MaxFailoverCount != nil && *tc.Spec.PD.MaxFailoverCount > 0 {
WDYT?
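The suggested guard can be sketched on its own as follows (the helper name `failoverEnabled` is hypothetical; the condition itself is the one proposed above):

```go
package main

import "fmt"

// failoverEnabled is a hypothetical helper expressing the suggested guard in
// pd_member_manager.go: PD failover runs only when the operator-level
// AutoFailover flag is on and the cluster sets MaxFailoverCount to a value > 0.
func failoverEnabled(autoFailover bool, maxFailoverCount *int32) bool {
	return autoFailover && maxFailoverCount != nil && *maxFailoverCount > 0
}

func main() {
	zero, three := int32(0), int32(3)
	fmt.Println(failoverEnabled(true, &zero))   // false: MaxFailoverCount=0 disables failover
	fmt.Println(failoverEnabled(true, &three))  // true
	fmt.Println(failoverEnabled(false, &three)) // false: operator flag off
}
```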

@hoyhbx (Contributor, Author):

@DanielZhangQD Thanks for your reply. Yes, we think that your suggestion is correct, and we have updated the code in the latest commit in this PR now.

@DanielZhangQD (Contributor) left a comment:

LGTM

@codecov-commenter commented Jul 8, 2022

Codecov Report

Merging #4608 (9b906a5) into master (af0fda7) will increase coverage by 9.64%.
The diff coverage is 0.00%.

@@            Coverage Diff             @@
##           master    #4608      +/-   ##
==========================================
+ Coverage   62.65%   72.30%   +9.64%     
==========================================
  Files         186      190       +4     
  Lines       20849    23342    +2493     
==========================================
+ Hits        13063    16877    +3814     
+ Misses       6589     5273    -1316     
+ Partials     1197     1192       -5     
Flag Coverage Δ
e2e 58.99% <0.00%> (?)
unittest 62.65% <0.00%> (-0.01%) ⬇️

@@ -238,7 +238,7 @@ func (m *pdMemberManager) syncPDStatefulSetForTidbCluster(tc *v1alpha1.TidbClust
 		return err
 	}
 
-	if m.deps.CLIConfig.AutoFailover {
+	if m.deps.CLIConfig.AutoFailover && tc.Spec.PD.MaxFailoverCount != nil && *tc.Spec.PD.MaxFailoverCount > 0 {
@KanShiori (Collaborator):

What happens if a failover has already occurred and the max failover count is then set to 0?

@KanShiori (Collaborator):

Then the Recover func can never be reached?

Contributor:

The failover Pods will not be scaled in; the user needs to remove the failure member from status.pd manually, and then the failover Pods will be scaled in.

Contributor:
@hoyhbx After checking the logic of the other component, I think @KanShiori 's above comment is reasonable, we should be able to recover the previous failover Pods even if we set MaxFailoverCount to 0.
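The resulting control flow can be sketched as below. All names here (`syncPDFailover`, its parameters) are hypothetical illustrations of the behavior agreed on in this thread, not the operator's actual API: creating new failover Pods is gated on a positive MaxFailoverCount, but recovery of Pods created by an earlier failover still proceeds after the user clears the failure members.

```go
package main

import "fmt"

// syncPDFailover sketches the discussed control flow: only the failover step
// is gated by MaxFailoverCount > 0; recovery (scaling extra Pods back in once
// no failure members remain in the status) runs regardless of the limit.
func syncPDFailover(autoFailover bool, maxFailoverCount *int32, hasFailureMembers bool) (doFailover, doRecover bool) {
	if !autoFailover {
		return false, false
	}
	if hasFailureMembers {
		// Only add replacement Pods while failover is enabled by a positive limit.
		doFailover = maxFailoverCount != nil && *maxFailoverCount > 0
		return doFailover, false
	}
	// No failure members recorded (e.g. removed manually from status.pd):
	// scale the extra failover Pods back in even if MaxFailoverCount is 0.
	return false, true
}

func main() {
	zero := int32(0)
	f, r := syncPDFailover(true, &zero, false)
	fmt.Println(f, r) // false true: recovery still reachable with the limit set to 0
}
```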

@DanielZhangQD (Contributor):
Could you please add the check in L244?

@hoyhbx (Contributor, Author):
Sure, I have changed it. @DanielZhangQD

@DanielZhangQD (Contributor):
/run-all-tests

@DanielZhangQD (Contributor):
/run-all-tests

@DanielZhangQD (Contributor):
/merge

@ti-chi-bot (Member):
This pull request has been accepted and is ready to merge.

Commit hash: 9b906a5

@DanielZhangQD (Contributor):
/test pull-e2e-kind-serial

@ti-chi-bot ti-chi-bot merged commit c59fdf4 into pingcap:master Aug 2, 2022
xhebox pushed a commit to KanShiori/tidb-operator that referenced this pull request Sep 16, 2022