
Fix/pd scale in topo not updated #824

Merged — 5 commits merged into pingcap:master on Oct 19, 2020

Conversation

Contributor

@9547 9547 commented Sep 28, 2020

What problem does this PR solve?

fix #786

What is changed and how it works?

Delete PD instances from topo.PDServers when those instances are removed during scale-in.
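The core of the fix is filtering the removed PD nodes out of the in-memory topology before the metadata is persisted. A minimal sketch of that idea, using a hypothetical simplified `PDSpec` (the real spec type lives in tiup's `pkg/cluster/spec` and has many more fields):

```go
package main

import "fmt"

// PDSpec is a simplified stand-in for tiup's PD server spec
// (hypothetical fields for illustration only).
type PDSpec struct {
	Host string
	Port int
}

// filterOutDeleted returns the PD servers that remain after scale-in,
// dropping any server whose host:port matches a deleted node ID.
func filterOutDeleted(servers []PDSpec, deleted map[string]bool) []PDSpec {
	kept := make([]PDSpec, 0, len(servers))
	for _, s := range servers {
		id := fmt.Sprintf("%s:%d", s.Host, s.Port)
		if !deleted[id] {
			kept = append(kept, s)
		}
	}
	return kept
}

func main() {
	servers := []PDSpec{
		{"172.19.0.103", 2379},
		{"172.19.0.104", 2379},
		{"172.19.0.105", 2379},
	}
	deleted := map[string]bool{"172.19.0.103:2379": true}
	fmt.Println(len(filterOutDeleted(servers, deleted))) // prints 2
}
```

Because the remaining topology is what gets rendered into the TiDB startup script, pruning the list here is what keeps scaled-in PD endpoints out of `--path` on the next regeneration.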

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)

Code changes

  • Has exported function/method change
  • Has exported variable/fields change
  • Has interface methods change
  • Has persistent data change

Side effects

  • Possible performance regression
  • Increased code complexity
  • Breaking backward compatibility

Related changes

  • Need to cherry-pick to the release branch
  • Need to update the documentation

Release notes:

Fix: scaling in PD servers does not update the TiDB startup script

@codecov-commenter

codecov-commenter commented Sep 28, 2020

Codecov Report

Merging #824 into master will increase coverage by 2.19%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #824      +/-   ##
==========================================
+ Coverage   50.53%   52.73%   +2.19%     
==========================================
  Files         258      262       +4     
  Lines       18805    18951     +146     
==========================================
+ Hits         9504     9994     +490     
+ Misses       7794     7399     -395     
- Partials     1507     1558      +51     
Flag Coverage Δ
#cluster 44.73% <100.00%> (+2.72%) ⬆️
#dm 25.16% <0.00%> (-0.02%) ⬇️
#integrate 47.75% <100.00%> (+1.87%) ⬆️
#playground 22.17% <ø> (ø)
#tiup 10.83% <ø> (?)
#unittest 18.80% <0.00%> (+0.01%) ⬆️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
components/cluster/command/scale_in.go 93.33% <100.00%> (+19.52%) ⬆️
pkg/cluster/operation/scale_in.go 54.00% <100.00%> (+2.61%) ⬆️
pkg/cluster/executor/executor.go 61.81% <0.00%> (-1.15%) ⬇️
pkg/utils/mock/mock.go 7.69% <0.00%> (-0.31%) ⬇️
pkg/repository/remote/common.go 0.00% <0.00%> (ø)
pkg/repository/remote/modify.go 0.00% <0.00%> (ø)
pkg/repository/remote/upload.go 0.00% <0.00%> (ø)
pkg/tui/tui.go 80.00% <0.00%> (ø)
pkg/repository/mirror.go 48.00% <0.00%> (+0.79%) ⬆️
... and 37 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update da151c5...6d2098c.

@9547
Contributor Author

9547 commented Sep 28, 2020

The test_scale_core_tls test case failed, and the failure can be reproduced in a local environment, e.g.:

bash /tiup-cluster/tests/tiup-cluster/run.sh test_scale_core_tls

It fails while scaling in PD:

start scale in pd
+ [ Serial ] - SSHKeySet: privateKey=/root/.tiup/storage/cluster/clusters/test_scale_core_25107/ssh/id_rsa, publicKey=/root/.tiup/storage/cluster/clusters/test_scale_core_25107/ssh/id_rsa.pub
+ [Parallel] - UserSSH: user=tidb, host=172.19.0.101
+ [Parallel] - UserSSH: user=tidb, host=172.19.0.104
+ [Parallel] - UserSSH: user=tidb, host=172.19.0.105
+ [Parallel] - UserSSH: user=tidb, host=172.19.0.103
+ [Parallel] - UserSSH: user=tidb, host=172.19.0.102
+ [Parallel] - UserSSH: user=tidb, host=172.19.0.104
+ [Parallel] - UserSSH: user=tidb, host=172.19.0.101
+ [Parallel] - UserSSH: user=tidb, host=172.19.0.105
+ [Parallel] - UserSSH: user=tidb, host=172.19.0.101
+ [Parallel] - UserSSH: user=tidb, host=172.19.0.102
+ [Parallel] - UserSSH: user=tidb, host=172.19.0.103
+ [Parallel] - UserSSH: user=tidb, host=172.19.0.103
+ [Parallel] - UserSSH: user=tidb, host=172.19.0.104
+ [Parallel] - UserSSH: user=tidb, host=172.19.0.105
+ [Parallel] - UserSSH: user=tidb, host=172.19.0.104
+ [Parallel] - UserSSH: user=tidb, host=172.19.0.101
+ [Parallel] - UserSSH: user=tidb, host=172.19.0.101
+ [Parallel] - UserSSH: user=tidb, host=172.19.0.105
+ [Parallel] - UserSSH: user=tidb, host=172.19.0.103
+ [ Serial ] - ClusterOperate: operation=ScaleInOperation, options={Roles:[] Nodes:[172.19.0.103:2379] Force:false SSHTimeout:5 OptTimeout:120 APITimeout:300 IgnoreConfigCheck:false NativeSSH:false SSHType: CleanupData:false CleanupLog:false RetainDataRoles:[] RetainDataNodes:[]}
Error: failed to scale in: no endpoint available, the last err is: error requesting https://172.19.0.105:2379/pd/api/v1/members/name/pd-172.19.0.103-2379, response: "etcdserver: unhealthy cluster"
, code 500

Digging into pd.log:

[2020/09/28 22:31:22.322 +00:00] [WARN] [config_logging.go:279] ["rejected connection"] [remote-addr=172.19.0.104:45176] [server-name=] [error="remote error: tls: bad certificate"]
[2020/09/28 22:31:22.322 +00:00] [WARN] [config_logging.go:279] ["rejected connection"] [remote-addr=172.19.0.104:45174] [server-name=] [error="remote error: tls: bad certificate"]
[2020/09/28 22:31:22.424 +00:00] [WARN] [config_logging.go:279] ["rejected connection"] [remote-addr=172.19.0.104:45194] [server-name=] [error="remote error: tls: bad certificate"]
[2020/09/28 22:31:22.425 +00:00] [WARN] [config_logging.go:279] ["rejected connection"] [remote-addr=172.19.0.104:45196] [server-name=] [error="remote error: tls: bad certificate"]
[2020/09/28 22:31:22.522 +00:00] [WARN] [config_logging.go:279] ["rejected connection"] [remote-addr=172.19.0.104:45212] [server-name=] [error="remote error: tls: bad certificate"]
[2020/09/28 22:31:22.522 +00:00] [WARN] [config_logging.go:279] ["rejected connection"] [remote-addr=172.19.0.104:45214] [server-name=] [error="remote error: tls: bad certificate"]
[2020/09/28 22:31:22.622 +00:00] [WARN] [config_logging.go:279] ["rejected connection"] [remote-addr=172.19.0.104:45232] [server-name=] [error="remote error: tls: bad certificate"]
[2020/09/28 22:31:22.622 +00:00] [WARN] [config_logging.go:279] ["rejected connection"] [remote-addr=172.19.0.104:45234] [server-name=] [error="remote error: tls: bad certificate"]

Maybe the TLS certificate is not updated after the PD scale-in, or there is some other error. I'll dig deeper into it later.
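For context on the "tls: bad certificate" rejections above: requests to the PD API in a TLS-enabled cluster need a client that presents the cluster's client certificate and trusts the cluster CA. A minimal sketch of building such a client (the file paths are illustrative, not tiup's actual storage layout) — a stale or mismatched certificate at this layer is one way to produce exactly those PD-side rejections:

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"fmt"
	"net/http"
	"os"
)

// newTLSClient builds an HTTP client that trusts the cluster CA and
// presents a client certificate, as PD requires over TLS.
func newTLSClient(caPath, certPath, keyPath string) (*http.Client, error) {
	caPEM, err := os.ReadFile(caPath)
	if err != nil {
		return nil, err
	}
	pool := x509.NewCertPool()
	if !pool.AppendCertsFromPEM(caPEM) {
		return nil, fmt.Errorf("invalid CA certificate in %s", caPath)
	}
	cert, err := tls.LoadX509KeyPair(certPath, keyPath)
	if err != nil {
		return nil, err
	}
	return &http.Client{
		Transport: &http.Transport{
			TLSClientConfig: &tls.Config{
				RootCAs:      pool,
				Certificates: []tls.Certificate{cert},
			},
		},
	}, nil
}

func main() {
	// Without real certificate files this exercises only the error path.
	_, err := newTLSClient("ca.crt", "client.crt", "client.key")
	fmt.Println(err != nil)
}
```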

@9547 9547 force-pushed the fix/pd-scale-in-topo-not-updated branch from 6d2098c to 2304ef8 on October 9, 2020 at 14:01
@codecov-io

codecov-io commented Oct 9, 2020

Codecov Report

Merging #824 into master will increase coverage by 0.03%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #824      +/-   ##
==========================================
+ Coverage   52.98%   53.01%   +0.03%     
==========================================
  Files         261      261              
  Lines       19005    18999       -6     
==========================================
+ Hits        10069    10072       +3     
+ Misses       7376     7368       -8     
+ Partials     1560     1559       -1     
Flag Coverage Δ
#cluster 44.86% <100.00%> (+0.03%) ⬆️
#dm 25.25% <0.00%> (-0.02%) ⬇️
#integrate 47.82% <100.00%> (+0.02%) ⬆️
#playground 22.17% <ø> (ø)
#tiup 10.78% <ø> (ø)
#unittest 20.81% <0.00%> (+0.01%) ⬆️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
components/cluster/command/scale_in.go 93.33% <100.00%> (+19.52%) ⬆️
pkg/cluster/operation/scale_in.go 54.00% <100.00%> (+1.91%) ⬆️
pkg/cluster/task/update_meta.go 63.04% <0.00%> (-4.35%) ⬇️
pkg/cluster/api/pdapi.go 53.22% <0.00%> (+0.64%) ⬆️

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b8f201d...0b80670.

@9547
Contributor Author

9547 commented Oct 10, 2020

Maybe this should be merged after #836.

@lonng lonng added this to the v1.2.1 milestone Oct 15, 2020
@lonng lonng added the type/bug-fix Categorizes PR as a bug-fix label Oct 15, 2020
@9547 9547 force-pushed the fix/pd-scale-in-topo-not-updated branch 2 times, most recently from 1af06e9 to df5f156 on October 17, 2020 at 09:56
@9547 9547 force-pushed the fix/pd-scale-in-topo-not-updated branch from df5f156 to 0b80670 on October 17, 2020 at 09:58
Member

@lucklove lucklove left a comment


LGTM

@ti-srebot ti-srebot added the status/LGT1 Indicates that a PR has LGTM 1. label Oct 19, 2020
@lucklove lucklove merged commit 308892a into pingcap:master Oct 19, 2020
@9547 9547 deleted the fix/pd-scale-in-topo-not-updated branch October 20, 2020 12:29
lucklove pushed a commit that referenced this pull request Oct 23, 2020
Labels: status/LGT1 (indicates that a PR has LGTM 1), type/bug-fix (categorizes PR as a bug-fix)

Successfully merging this pull request may close these issues: Scale-in PD node does not update TiDB startup script.

7 participants