1.15 - kubeadm join --control-plane fails on clusters created with <= 1.12 #1950
Comments
/kind bug
FYI, 1.12 is no longer supported by the kubeadm team, which forces me to close the issue, but we can continue the discussion. With the release of 1.17 you need to have at least 1.15 to be in the support skew.
/close
did you try the workaround here: #1269? also, instead of joining before the upgrade, did you try upgrading to 1.13 and then joining a new member? ping @fabriziopandini
@neolit123: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
@neolit123 This bug report is about adding more control plane nodes to a 1.15.6 cluster, initially created by kubeadm 1.12 or older and upgraded continually. That should be within the support window, right? As I noted, there is a workaround (changing the peer URL by hand), but I'm thinking that kubeadm should handle it more transparently, since there are probably a lot more clusters like ours out there.
looks like i misunderstood that part.
this is the second report we got about this since 1.12. for the first one, this is what we've added in the docs:
IIRC, kubeadm upgrade had a special case to fix this in 1.13 (which for some reason might not have worked in your case). @fabriziopandini should confirm if he remembers.
There are no instructions on how to set up HA for an existing cluster, and HA is only beta in 1.15, so there might be more people than me hitting this issue eventually, once instructions are posted on the website. I'm trying to make my own experience easier for the time when I get to upgrading production. I'm not sure what was done in the last round, but I think it was only making etcd listen on the host IP, not adjusting the member address.
Related: #1471
we did a survey which suggested that the kubeadm user base tries to stay in the support skew - i.e. upgrade much faster.
looks like a lot more affected users were present there.
you are suggesting that https://github.com/kubernetes/kubernetes/pull/75956/files was not sufficient?
Note that we are not upgrading from 1.12 to 1.15 right now; our clusters are almost 600 days old and have been upgraded every few months. We try our best to stay within the support period, but we have been waiting for HA to become more mature before upgrading our single master to multi-master.
My Go-reading skills aren't the best, but to me it looks like it's all related to certificates. And that part works fine. But it's only half a solution for adding more masters, since the initial etcd node thinks it's listening on localhost only and tells that to the second etcd joining.
I've taken a look at the state before and after the upgrade from 1.13 to 1.14. In 1.13 the etcd manifest looked like this:
And it was updated to this in 1.14 (and it still looks like this in 1.15):
So it fixed the listen addresses, but not the advertised peer address (a sketch of the difference follows below). And the "peer addrs" there seems to be what's being used later on, when joining more master nodes.
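For illustration, since the actual manifests were elided above, here is a hedged sketch of the kind of flag difference being described, using the addresses from this report (the flag names are real etcd flags; the exact values are assumptions):

```sh
# Hypothetical 1.13-era etcd flags (other flags omitted) - everything
# bound to localhost:
etcd --listen-client-urls=https://127.0.0.1:2379 \
     --listen-peer-urls=https://127.0.0.1:2380 \
     --initial-advertise-peer-urls=https://127.0.0.1:2380

# Hypothetical 1.14/1.15 flags - the listen URLs now include the host IP,
# but the peer URL persisted in the etcd data directory at bootstrap
# is still https://127.0.0.1:2380:
etcd --listen-client-urls=https://127.0.0.1:2379,https://192.168.33.10:2379 \
     --listen-peer-urls=https://192.168.33.10:2380
```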
this seems to me as if the etcd pod was not restarted. odd.
I think it's a persisted configuration, so it's only read from the manifest during the initial setup of etcd. Changing it requires running commands with etcdctl or similar, like kubeadm does when adding a second node to the etcd cluster. That's why I was thinking that it could possibly be fixed automatically by kubeadm when it's already in the process of making changes to etcd state.
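As a minimal sketch, the persisted member list can be inspected with etcdctl, assuming the standard kubeadm certificate paths (adjust if yours differ):

```sh
# Run on the first control plane node; the peer URLs shown here are the
# persisted member configuration, which survives pod restarts:
ETCDCTL_API=3 etcdctl \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/peer.crt \
  --key=/etc/kubernetes/pki/etcd/peer.key \
  --endpoints=https://127.0.0.1:2379 \
  member list
```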
changes in the etcd static pod manifest would trigger a pod restart, which is an etcd restart with a new value for that argument. are you sure this is persisted? i don't see how - something is not right here.
I did some tests (a sketch follows below): changing the flag in the manifest restarts the pod, but the member list stays the same.
Then had a look at the etcd documentation, where I found what I was suspecting:
The --initial-* flags, including --initial-advertise-peer-urls, are only used during the initial bootstrap of a member and are ignored afterwards; the member configuration lives in the data directory and can only be changed through the members API.
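A sketch of the kind of test described, assuming the default kubeadm manifest location and the same etcdctl flags as in the earlier example:

```sh
# Point the advertised peer URL at the host IP in the static pod manifest;
# the kubelet restarts the etcd pod with the new argument:
sudo sed -i \
  's#--initial-advertise-peer-urls=https://127.0.0.1:2380#--initial-advertise-peer-urls=https://192.168.33.10:2380#' \
  /etc/kubernetes/manifests/etcd.yaml

# The member list is unchanged afterwards, because --initial-* flags are
# ignored once the member has bootstrapped:
ETCDCTL_API=3 etcdctl \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/peer.crt \
  --key=/etc/kubernetes/pki/etcd/peer.key \
  --endpoints=https://127.0.0.1:2379 \
  member list
```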
Here is a log from joining a second control plane node after using the etcdctl member update workaround:
With the exception of this completing successfully, there is only one difference in the logs between the original attempt that timed out and failed, and this last one that succeeded. From the original:
From this last one:
I've also compared the logs of etcd before and after changing the peer addrs with etcdctl member update.
To summarize how it looks to me:
I have sent a PR for this. Not sure if it's right.
going back to this issue and the PR kubernetes/kubernetes#86150: otherwise we have to backport the PR to all branches in the current support skew (1.15, 1.16, 1.17). aside from this bug, i don't think MemberUpdate() is currently needed, and for upgrades it can be considered reconfiguration, which is something upgrades should not do.
/kind documentation
I am also having the same issue on CentOS 8 with containerd, setting up a new cluster. The first master works; when I add the second master, the etcd pod on the first master dies.
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
Stale issues rot after 30d of inactivity. If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
hi, please send a PR to the kubeadm troubleshooting guide with some steps on how to recover from this: you know best what is needed here, and i don't have the bandwidth to reproduce this and document what has to be done. thank you.
Rotten issues close after 30d of inactivity. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
@fejta-bot: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Versions
kubeadm version (use `kubeadm version`): v1.15.6
Environment: Dev
Kubernetes version (use `kubectl version`): v1.15.6
Kernel (use `uname -a`): 3.10.0-957.1.3.el7.x86_64
What happened?
I have several clusters created with kubeadm v1.10 to v1.12 that have been upgraded along the way, currently on 1.14 and 1.15. I'm experimenting with adding more masters to set up HA. Adding masters on clusters created with kubeadm 1.15 works fine, but when adding masters to older clusters upgraded to 1.15, it fails waiting for the etcd nodes to join.
This is a continuation of #1269, which doesn't seem to be properly resolved.
The original issue relates to etcd not listening on a host port, making it impossible for the new node to connect. That was fixed. However, the etcd member list seems to be untouched, so it looks as follows (reconstructed after the node list below):
First master: demomaster1test (192.168.33.10).
Second master: demomaster2test (192.168.33.20). (To be added)
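The member list output itself was not preserved above; the following is an illustrative reconstruction of the state being described, based on the addresses in this report (not a verbatim log):

```sh
# The single member still advertises the localhost peer URL from the
# original bootstrap, while the client URL was fixed to the host IP:
$ etcdctl member list
abcdef0123456789, started, demomaster1test, https://127.0.0.1:2380, https://192.168.33.10:2379
```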
From the join on the second control plane node we can see that it successfully adds the second etcd member to the cluster using the correct address, then receives a member list with the localhost address of the first member, and then eventually times out:
From the logs of the first etcd we can see the second etcd joining; then the first etcd starts leader election, fails to get contact with the second etcd, and shuts down:
On the second etcd we get this:
The second etcd keeps trying to connect to the first etcd on localhost.
What we can see from the generated etcd.yaml manifest on the second master is this:
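The manifest content was elided above; the relevant flag presumably looks something like this, reconstructed from the names and addresses in this report (an assumption, not the verbatim manifest):

```sh
# From the etcd static pod args on demomaster2test - the existing member
# is listed with its stale localhost peer URL:
--initial-cluster=demomaster2test=https://192.168.33.20:2380,demomaster1test=https://127.0.0.1:2380
```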
It has demomaster1test configured at https://127.0.0.1:2380, which results in "connection refused", as we can see from the logs. Trying to change that value to https://192.168.33.10:2380 instead results in the following in the logs:
The configuration of the address in the manifest doesn't match the member list and it aborts.
The result in any case is that etcd on both control plane nodes shuts down, and the apiserver becomes unavailable as a consequence, bricking the entire cluster.
A possible fix is to change the etcd member peer address before adding a second master, like this:
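The exact command was elided above; a sketch of how this is usually done with etcdctl, assuming the standard kubeadm certificate paths, where <MEMBER_ID> is the ID of demomaster1test as printed by member list:

```sh
# Update the advertised peer URL of the first master from localhost to
# its host IP, so that joining members can reach it:
ETCDCTL_API=3 etcdctl \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/peer.crt \
  --key=/etc/kubernetes/pki/etcd/peer.key \
  --endpoints=https://127.0.0.1:2379 \
  member update <MEMBER_ID> --peer-urls=https://192.168.33.10:2380
```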
After doing so I was able to add a second master.
What you expected to happen?
The peer address of the first etcd should have been updated to the host IP, either as part of an etcd upgrade or when adding the second control plane node.
How to reproduce it (as minimally and precisely as possible)?
Adapted from the instructions at https://kubernetes.io/docs/setup/production-environment/tools/kubeadm/high-availability/
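The concrete steps were not captured here; a rough sketch of the flow being described, with illustrative versions and placeholders rather than the author's exact commands:

```sh
# On the first node, create a single-master cluster with an old kubeadm:
kubeadm init ...                  # kubeadm <= 1.12

# Upgrade release by release into the current support skew:
kubeadm upgrade apply v1.13.x
kubeadm upgrade apply v1.14.x
kubeadm upgrade apply v1.15.x

# Then try to add a second control plane node with kubeadm 1.15:
kubeadm join <endpoint>:6443 --control-plane \
  --token <token> --discovery-token-ca-cert-hash sha256:<hash>
# -> the join times out waiting for the etcd cluster to become healthy
```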
Anything else we need to know?