
Fixed the problem where pump would get stuck when local pds are down #4377

Merged
merged 5 commits into pingcap:master from fix-pump-x-k8s on Jan 20, 2022

Conversation

@just1900 (Contributor) commented Jan 19, 2022

What problem does this PR solve?

Closes #4361

What is changed and how does it work?

For the two problems mentioned in #4361:

  1. Add all peer members to the endpoints when initializing the etcd client.

  2. Instead of adding a timeout to every context in the pump client, make clientv3.New() return an error when the underlying endpoints are not available (see etcd-io/etcd#9877: "clientv3.New() won't return error when no endpoint is available"), so that subsequent client calls no longer get stuck indefinitely. A sketch of the approach is shown below.
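
A minimal sketch of the two changes above, not the exact code in this PR: the package and function names (etcdutil, newPumpEtcdClient), the peerMemberURLs parameter, and the 5-second dial timeout are illustrative, and the import path assumes etcd v3.4 (use go.etcd.io/etcd/client/v3 for v3.5+).

```go
package etcdutil // hypothetical package name, for illustration only

import (
	"fmt"
	"time"

	"go.etcd.io/etcd/clientv3" // etcd v3.4 import path
	"google.golang.org/grpc"
)

// newPumpEtcdClient builds an etcd client for pump that fails fast instead of
// hanging when no PD endpoint is reachable.
func newPumpEtcdClient(peerMemberURLs []string) (*clientv3.Client, error) {
	cli, err := clientv3.New(clientv3.Config{
		// (1) Include every PD peer member, not only the local PD service,
		// so the client can still reach PDs in the other Kubernetes
		// cluster when all local PDs are down.
		Endpoints:   peerMemberURLs,
		DialTimeout: 5 * time.Second,
		// (2) grpc.WithBlock() makes clientv3.New() wait for a real
		// connection, so it returns an error (e.g. context deadline
		// exceeded) when no endpoint is available, instead of handing
		// back a lazy client whose later calls hang (etcd-io/etcd#9877).
		DialOptions: []grpc.DialOption{grpc.WithBlock()},
	})
	if err != nil {
		return nil, fmt.Errorf("connect etcd endpoints %v: %w", peerMemberURLs, err)
	}
	return cli, nil
}
```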

Code changes

  • Has Go code change
  • Has CI related scripts change

Tests

  • Unit test
  • E2E test
  • Manual test
  • No code

Side effects

  • Breaking backward compatibility
  • Other side effects:

Related changes

  • Need to cherry-pick to the release branch
  • Need to update the documentation

Release Notes

Please refer to Release Notes Language Style Guide before writing the release note.

Fixed the problem where Pump sync would get stuck when the PDs of one Kubernetes cluster are all down in an across-Kubernetes deployment.

@ti-chi-bot (Member) commented Jan 19, 2022

[REVIEW NOTIFICATION]

This pull request has been approved by:

  • csuzhangxc
  • july2993

To complete the pull request process, please ask the reviewers in the list to review by filling /cc @reviewer in the comment.
After your PR has acquired the required number of LGTMs, you can assign this pull request to the committer in the list by filling /assign @committer in the comment to help you merge this pull request.

The full list of commands accepted by this bot can be found here.

Reviewer can indicate their review by submitting an approval review.
Reviewer can cancel approval by submitting a request changes review.

@codecov-commenter commented Jan 19, 2022

Codecov Report

Merging #4377 (02e3972) into master (ea8e787) will increase coverage by 3.58%.
The diff coverage is 66.66%.

@@            Coverage Diff             @@
##           master    #4377      +/-   ##
==========================================
+ Coverage   62.64%   66.22%   +3.58%     
==========================================
  Files         184      188       +4     
  Lines       19575    21969    +2394     
==========================================
+ Hits        12263    14550    +2287     
- Misses       6166     6186      +20     
- Partials     1146     1233      +87     
Flag       Coverage Δ
e2e        40.85% <66.66%> (?)
unittest   62.62% <0.00%> (-0.03%) ⬇️

@just1900 changed the title from "Fixed the problem where pump would get stuck when local pd is done" to "Fixed the problem where pump would get stuck when local pds are down" on Jan 19, 2022
@@ -401,8 +401,6 @@ var _ = ginkgo.Describe("[Across Kubernetes]", func() {
tc1 := GetTCForAcrossKubernetes(ns1, tcName1, version, clusterDomain, nil)
tc2 := GetTCForAcrossKubernetes(ns2, tcName2, version, clusterDomain, tc1)
tc3 := GetTCForAcrossKubernetes(ns3, tcName3, version, clusterDomain, tc1)
// FIXME(jsut1900): remove this after #4361 get fixed.
Contributor

Why skip TiKV in L526?

Contributor Author

In this test we fail TiKV before failing PD, though it should make no difference to restart a failed TiKV Pod.

Contributor

It's different: the first part checks that the Pods can restart successfully after all TiKV Pods are down, and the second part checks that the Pods can restart successfully after all PD Pods are down.

Contributor Author

addressed in #4382

@DanielZhangQD (Contributor)

/merge

@ti-chi-bot (Member)

This pull request has been accepted and is ready to merge.

Commit hash: 13bde1b

@ti-chi-bot (Member)

@just1900: Your PR was out of date, I have automatically updated it for you.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.

@just1900 (Contributor Author)

/test pull-e2e-kind-br

@DanielZhangQD (Contributor)

/test pull-e2e-kind-across-kubernetes

@DanielZhangQD (Contributor)

/test pull-e2e-kind

@DanielZhangQD (Contributor)

/test pull-e2e-kind-basic

@DanielZhangQD (Contributor)

/test pull-e2e-kind-across-kubernetes

@DanielZhangQD (Contributor)

/test pull-e2e-kind-br

@just1900 (Contributor Author)

/run-all-tests

@DanielZhangQD (Contributor)

/test pull-e2e-kind-basic

@DanielZhangQD (Contributor)

/test pull-e2e-kind

@DanielZhangQD (Contributor)

/test pull-e2e-kind-br

@DanielZhangQD (Contributor)

/test pull-e2e-kind-across-kubernetes

@ti-chi-bot ti-chi-bot merged commit 009bc87 into pingcap:master Jan 20, 2022
@just1900 just1900 deleted the fix-pump-x-k8s branch January 21, 2022 02:03
Development

Successfully merging this pull request may close these issues.

Sync for Pump is blocked when the PDs in one Kubernetes cluster are all down in across Kubernetes deployment
6 participants