
Fix/pd scale in topo not updated #824

Merged — 5 commits merged into pingcap:master on Oct 19, 2020

Conversation

Contributor

@9547 9547 commented Sep 28, 2020

What problem does this PR solve?

fix #786

What is changed and how it works?

Delete PD instances from topo.PDServers when those instances are removed during scale-in.
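The core of the fix is filtering the removed PD nodes out of the in-memory topology before the metadata is persisted. A minimal sketch of that idea, using a hypothetical simplified `PDSpec` (the real spec type lives in tiup's `pkg/cluster/spec` and has many more fields):

```go
package main

import "fmt"

// PDSpec is a simplified stand-in for tiup's PD server spec
// (hypothetical fields for illustration only).
type PDSpec struct {
	Host string
	Port int
}

// filterOutDeleted returns the PD servers that remain after scale-in,
// dropping any server whose host:port matches a deleted node ID.
func filterOutDeleted(servers []PDSpec, deleted map[string]bool) []PDSpec {
	kept := make([]PDSpec, 0, len(servers))
	for _, s := range servers {
		id := fmt.Sprintf("%s:%d", s.Host, s.Port)
		if !deleted[id] {
			kept = append(kept, s)
		}
	}
	return kept
}

func main() {
	servers := []PDSpec{
		{"172.19.0.103", 2379},
		{"172.19.0.104", 2379},
		{"172.19.0.105", 2379},
	}
	deleted := map[string]bool{"172.19.0.103:2379": true}
	fmt.Println(len(filterOutDeleted(servers, deleted))) // prints 2
}
```

Because the remaining topology is what gets rendered into the TiDB startup script, pruning the list here is what keeps scaled-in PD endpoints out of `--path` on the next regeneration.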

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)

Code changes

  • Has exported function/method change
  • Has exported variable/fields change
  • Has interface methods change
  • Has persistent data change

Side effects

  • Possible performance regression
  • Increased code complexity
  • Breaking backward compatibility

Related changes

  • Need to cherry-pick to the release branch
  • Need to update the documentation

Release notes:

Fix: scaling in PD servers does not update the TiDB startup script

@codecov-commenter

codecov-commenter commented Sep 28, 2020

Codecov Report

Merging #824 into master will increase coverage by 2.19%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #824      +/-   ##
==========================================
+ Coverage   50.53%   52.73%   +2.19%     
==========================================
  Files         258      262       +4     
  Lines       18805    18951     +146     
==========================================
+ Hits         9504     9994     +490     
+ Misses       7794     7399     -395     
- Partials     1507     1558      +51     
Flag Coverage Δ
#cluster 44.73% <100.00%> (+2.72%) ⬆️
#dm 25.16% <0.00%> (-0.02%) ⬇️
#integrate 47.75% <100.00%> (+1.87%) ⬆️
#playground 22.17% <ø> (ø)
#tiup 10.83% <ø> (?)
#unittest 18.80% <0.00%> (+0.01%) ⬆️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
components/cluster/command/scale_in.go 93.33% <100.00%> (+19.52%) ⬆️
pkg/cluster/operation/scale_in.go 54.00% <100.00%> (+2.61%) ⬆️
pkg/cluster/executor/executor.go 61.81% <0.00%> (-1.15%) ⬇️
pkg/utils/mock/mock.go 7.69% <0.00%> (-0.31%) ⬇️
pkg/repository/remote/common.go 0.00% <0.00%> (ø)
pkg/repository/remote/modify.go 0.00% <0.00%> (ø)
pkg/repository/remote/upload.go 0.00% <0.00%> (ø)
pkg/tui/tui.go 80.00% <0.00%> (ø)
pkg/repository/mirror.go 48.00% <0.00%> (+0.79%) ⬆️
... and 37 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update da151c5...6d2098c.

@9547
Contributor Author

9547 commented Sep 28, 2020

The test_scale_core_tls test case failed, and the failure can be reproduced in a local environment, e.g.:

bash /tiup-cluster/tests/tiup-cluster/run.sh test_scale_core_tls

It fails while scaling in PD:

start scale in pd
+ [ Serial ] - SSHKeySet: privateKey=/root/.tiup/storage/cluster/clusters/test_scale_core_25107/ssh/id_rsa, publicKey=/root/.tiup/storage/cluster/clusters/test_scale_core_25107/ssh/id_rsa.pub
+ [Parallel] - UserSSH: user=tidb, host=172.19.0.101
+ [Parallel] - UserSSH: user=tidb, host=172.19.0.104
+ [Parallel] - UserSSH: user=tidb, host=172.19.0.105
+ [Parallel] - UserSSH: user=tidb, host=172.19.0.103
+ [Parallel] - UserSSH: user=tidb, host=172.19.0.102
+ [Parallel] - UserSSH: user=tidb, host=172.19.0.104
+ [Parallel] - UserSSH: user=tidb, host=172.19.0.101
+ [Parallel] - UserSSH: user=tidb, host=172.19.0.105
+ [Parallel] - UserSSH: user=tidb, host=172.19.0.101
+ [Parallel] - UserSSH: user=tidb, host=172.19.0.102
+ [Parallel] - UserSSH: user=tidb, host=172.19.0.103
+ [Parallel] - UserSSH: user=tidb, host=172.19.0.103
+ [Parallel] - UserSSH: user=tidb, host=172.19.0.104
+ [Parallel] - UserSSH: user=tidb, host=172.19.0.105
+ [Parallel] - UserSSH: user=tidb, host=172.19.0.104
+ [Parallel] - UserSSH: user=tidb, host=172.19.0.101
+ [Parallel] - UserSSH: user=tidb, host=172.19.0.101
+ [Parallel] - UserSSH: user=tidb, host=172.19.0.105
+ [Parallel] - UserSSH: user=tidb, host=172.19.0.103
+ [ Serial ] - ClusterOperate: operation=ScaleInOperation, options={Roles:[] Nodes:[172.19.0.103:2379] Force:false SSHTimeout:5 OptTimeout:120 APITimeout:300 IgnoreConfigCheck:false NativeSSH:false SSHType: CleanupData:false CleanupLog:false RetainDataRoles:[] RetainDataNodes:[]}
Error: failed to scale in: no endpoint available, the last err is: error requesting https://172.19.0.105:2379/pd/api/v1/members/name/pd-172.19.0.103-2379, response: "etcdserver: unhealthy cluster"
, code 500

Digging into pd.log:

[2020/09/28 22:31:22.322 +00:00] [WARN] [config_logging.go:279] ["rejected connection"] [remote-addr=172.19.0.104:45176] [server-name=] [error="remote error: tls: bad certificate"]
[2020/09/28 22:31:22.322 +00:00] [WARN] [config_logging.go:279] ["rejected connection"] [remote-addr=172.19.0.104:45174] [server-name=] [error="remote error: tls: bad certificate"]
[2020/09/28 22:31:22.424 +00:00] [WARN] [config_logging.go:279] ["rejected connection"] [remote-addr=172.19.0.104:45194] [server-name=] [error="remote error: tls: bad certificate"]
[2020/09/28 22:31:22.425 +00:00] [WARN] [config_logging.go:279] ["rejected connection"] [remote-addr=172.19.0.104:45196] [server-name=] [error="remote error: tls: bad certificate"]
[2020/09/28 22:31:22.522 +00:00] [WARN] [config_logging.go:279] ["rejected connection"] [remote-addr=172.19.0.104:45212] [server-name=] [error="remote error: tls: bad certificate"]
[2020/09/28 22:31:22.522 +00:00] [WARN] [config_logging.go:279] ["rejected connection"] [remote-addr=172.19.0.104:45214] [server-name=] [error="remote error: tls: bad certificate"]
[2020/09/28 22:31:22.622 +00:00] [WARN] [config_logging.go:279] ["rejected connection"] [remote-addr=172.19.0.104:45232] [server-name=] [error="remote error: tls: bad certificate"]
[2020/09/28 22:31:22.622 +00:00] [WARN] [config_logging.go:279] ["rejected connection"] [remote-addr=172.19.0.104:45234] [server-name=] [error="remote error: tls: bad certificate"]

Maybe the TLS certificate is not updated after the PD scale-in, or there is some other error. I'll dig deeper into it later.
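For context on the "tls: bad certificate" rejections above: requests to the PD API in a TLS-enabled cluster need a client that presents the cluster's client certificate and trusts the cluster CA. A minimal sketch of building such a client (the file paths are illustrative, not tiup's actual storage layout) — a stale or mismatched certificate at this layer is one way to produce exactly those PD-side rejections:

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"fmt"
	"net/http"
	"os"
)

// newTLSClient builds an HTTP client that trusts the cluster CA and
// presents a client certificate, as PD requires over TLS.
func newTLSClient(caPath, certPath, keyPath string) (*http.Client, error) {
	caPEM, err := os.ReadFile(caPath)
	if err != nil {
		return nil, err
	}
	pool := x509.NewCertPool()
	if !pool.AppendCertsFromPEM(caPEM) {
		return nil, fmt.Errorf("invalid CA certificate in %s", caPath)
	}
	cert, err := tls.LoadX509KeyPair(certPath, keyPath)
	if err != nil {
		return nil, err
	}
	return &http.Client{
		Transport: &http.Transport{
			TLSClientConfig: &tls.Config{
				RootCAs:      pool,
				Certificates: []tls.Certificate{cert},
			},
		},
	}, nil
}

func main() {
	// Without real certificate files this exercises only the error path.
	_, err := newTLSClient("ca.crt", "client.crt", "client.key")
	fmt.Println(err != nil)
}
```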

@9547 9547 force-pushed the fix/pd-scale-in-topo-not-updated branch from 6d2098c to 2304ef8 on October 9, 2020 at 14:01
@codecov-io

codecov-io commented Oct 9, 2020

Codecov Report

Merging #824 into master will increase coverage by 0.03%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #824      +/-   ##
==========================================
+ Coverage   52.98%   53.01%   +0.03%     
==========================================
  Files         261      261              
  Lines       19005    18999       -6     
==========================================
+ Hits        10069    10072       +3     
+ Misses       7376     7368       -8     
+ Partials     1560     1559       -1     
Flag Coverage Δ
#cluster 44.86% <100.00%> (+0.03%) ⬆️
#dm 25.25% <0.00%> (-0.02%) ⬇️
#integrate 47.82% <100.00%> (+0.02%) ⬆️
#playground 22.17% <ø> (ø)
#tiup 10.78% <ø> (ø)
#unittest 20.81% <0.00%> (+0.01%) ⬆️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
components/cluster/command/scale_in.go 93.33% <100.00%> (+19.52%) ⬆️
pkg/cluster/operation/scale_in.go 54.00% <100.00%> (+1.91%) ⬆️
pkg/cluster/task/update_meta.go 63.04% <0.00%> (-4.35%) ⬇️
pkg/cluster/api/pdapi.go 53.22% <0.00%> (+0.64%) ⬆️

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b8f201d...0b80670.

@9547
Contributor Author

9547 commented Oct 10, 2020

Maybe this should be merged after #836.

@lonng lonng added this to the v1.2.1 milestone Oct 15, 2020
@lonng lonng added the type/bug-fix Categorizes PR as a bug-fix label Oct 15, 2020
@9547 9547 force-pushed the fix/pd-scale-in-topo-not-updated branch 2 times, most recently from 1af06e9 to df5f156 on October 17, 2020 at 09:56
@9547 9547 force-pushed the fix/pd-scale-in-topo-not-updated branch from df5f156 to 0b80670 on October 17, 2020 at 09:58
Member

@lucklove lucklove left a comment


LGTM

@ti-srebot ti-srebot added the status/LGT1 Indicates that a PR has LGTM 1. label Oct 19, 2020
@lucklove lucklove merged commit 308892a into pingcap:master Oct 19, 2020
@9547 9547 deleted the fix/pd-scale-in-topo-not-updated branch October 20, 2020 12:29
lucklove pushed a commit that referenced this pull request Oct 23, 2020
Labels: status/LGT1 (indicates that a PR has LGTM 1), type/bug-fix (categorizes PR as a bug-fix)

Successfully merging this pull request may close these issues: Scale-in PD node does not update TiDB startup script.

7 participants