
Fixed the problem where pump would get stuck when local pds are down #4377

Merged
merged 5 commits into pingcap:master from fix-pump-x-k8s on Jan 20, 2022

Conversation

@just1900 (Contributor) commented Jan 19, 2022

What problem does this PR solve?

Closes #4361

What is changed and how does it work?

For the two problems mentioned in #4361:

  1. Add all peer members to the endpoints when initializing the etcd client.

  2. Instead of adding a timeout to every context in the pump client, make clientv3.New() return an error when the underlying endpoints are not available (see etcd-io/etcd#9877: "clientv3.New() won't return error when no endpoint is available"), so that subsequent client calls no longer get stuck indefinitely. A sketch of the approach is shown below.
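
A minimal sketch of the two changes above, not the exact code in this PR: the package and function names (etcdutil, newPumpEtcdClient), the peerMemberURLs parameter, and the 5-second dial timeout are illustrative, and the import path assumes etcd v3.4 (use go.etcd.io/etcd/client/v3 for v3.5+).

```go
package etcdutil // hypothetical package name, for illustration only

import (
	"fmt"
	"time"

	"go.etcd.io/etcd/clientv3" // etcd v3.4 import path
	"google.golang.org/grpc"
)

// newPumpEtcdClient builds an etcd client for pump that fails fast instead of
// hanging when no PD endpoint is reachable.
func newPumpEtcdClient(peerMemberURLs []string) (*clientv3.Client, error) {
	cli, err := clientv3.New(clientv3.Config{
		// (1) Include every PD peer member, not only the local PD service,
		// so the client can still reach PDs in the other Kubernetes
		// cluster when all local PDs are down.
		Endpoints:   peerMemberURLs,
		DialTimeout: 5 * time.Second,
		// (2) grpc.WithBlock() makes clientv3.New() wait for a real
		// connection, so it returns an error (e.g. context deadline
		// exceeded) when no endpoint is available, instead of handing
		// back a lazy client whose later calls hang (etcd-io/etcd#9877).
		DialOptions: []grpc.DialOption{grpc.WithBlock()},
	})
	if err != nil {
		return nil, fmt.Errorf("connect etcd endpoints %v: %w", peerMemberURLs, err)
	}
	return cli, nil
}
```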

Code changes

  • Has Go code change
  • Has CI related scripts change

Tests

  • Unit test
  • E2E test
  • Manual test
  • No code

Side effects

  • Breaking backward compatibility
  • Other side effects:

Related changes

  • Need to cherry-pick to the release branch
  • Need to update the documentation

Release Notes

Please refer to Release Notes Language Style Guide before writing the release note.

Fixed the problem where Pump sync would get stuck when the PDs of one Kubernetes cluster are all down in an across-Kubernetes deployment.

@ti-chi-bot (Member) commented Jan 19, 2022

[REVIEW NOTIFICATION]

This pull request has been approved by:

  • csuzhangxc
  • july2993

To complete the pull request process, please ask the reviewers in the list to review by filling /cc @reviewer in the comment.
After your PR has acquired the required number of LGTMs, you can assign this pull request to the committer in the list by filling /assign @committer in the comment to help you merge this pull request.

The full list of commands accepted by this bot can be found here.

Reviewer can indicate their review by submitting an approval review.
Reviewer can cancel approval by submitting a request changes review.

@codecov-commenter commented Jan 19, 2022

Codecov Report

Merging #4377 (02e3972) into master (ea8e787) will increase coverage by 3.58%.
The diff coverage is 66.66%.

@@            Coverage Diff             @@
##           master    #4377      +/-   ##
==========================================
+ Coverage   62.64%   66.22%   +3.58%     
==========================================
  Files         184      188       +4     
  Lines       19575    21969    +2394     
==========================================
+ Hits        12263    14550    +2287     
- Misses       6166     6186      +20     
- Partials     1146     1233      +87     
Flag       Coverage Δ
e2e        40.85% <66.66%> (?)
unittest   62.62% <0.00%> (-0.03%) ⬇️

@just1900 changed the title from "Fixed the problem where pump would get stuck when local pd is done" to "Fixed the problem where pump would get stuck when local pds are down" on Jan 19, 2022
@@ -401,8 +401,6 @@ var _ = ginkgo.Describe("[Across Kubernetes]", func() {
tc1 := GetTCForAcrossKubernetes(ns1, tcName1, version, clusterDomain, nil)
tc2 := GetTCForAcrossKubernetes(ns2, tcName2, version, clusterDomain, tc1)
tc3 := GetTCForAcrossKubernetes(ns3, tcName3, version, clusterDomain, tc1)
// FIXME(jsut1900): remove this after #4361 get fixed.
Contributor

Why skip TiKV in L526?

Contributor Author

In this test we fail TiKV before failing PD, though it should make no difference to restart a failed TiKV Pod.

Contributor

It's different: the first part checks that the Pods can restart successfully after all TiKV Pods are down, and the second part checks that the Pods can restart successfully after all PD Pods are down.

Contributor Author

addressed in #4382

@DanielZhangQD (Contributor)

/merge

@ti-chi-bot (Member)

This pull request has been accepted and is ready to merge.

Commit hash: 13bde1b

@ti-chi-bot (Member)

@just1900: Your PR was out of date, I have automatically updated it for you.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.

@just1900 (Contributor Author)

/test pull-e2e-kind-br

@DanielZhangQD (Contributor)

/test pull-e2e-kind-across-kubernetes

@DanielZhangQD (Contributor)

/test pull-e2e-kind

@DanielZhangQD (Contributor)

/test pull-e2e-kind-basic

@DanielZhangQD (Contributor)

/test pull-e2e-kind-across-kubernetes

@DanielZhangQD (Contributor)

/test pull-e2e-kind-br

@just1900 (Contributor Author)

/run-all-tests

@DanielZhangQD (Contributor)

/test pull-e2e-kind-basic

@DanielZhangQD (Contributor)

/test pull-e2e-kind

@DanielZhangQD (Contributor)

/test pull-e2e-kind-br

@DanielZhangQD (Contributor)

/test pull-e2e-kind-across-kubernetes

@ti-chi-bot ti-chi-bot merged commit 009bc87 into pingcap:master Jan 20, 2022
@just1900 just1900 deleted the fix-pump-x-k8s branch January 21, 2022 02:03
Development

Successfully merging this pull request may close these issues.

Sync for Pump is blocked when the PDs in one Kubernetes cluster are all down in across Kubernetes deployment
6 participants