[Bug]: Creation stuck on waiting for the system-upgrade-controller #934
Comments
It's related to this; TL;DR: [...]
@mducoli @DoubleREW Please try setting [...]; that should make it work. We also need to investigate why it's failing on v1.27. @kube-hetzner/core, please don't hesitate to take this on, I'm a bit short on time lately. And a PR is welcome if anyone finds a fix to make v1.27 work with single-node clusters. Don't hesitate to refer to the debug section in the readme.
@mysticaltech
In Kubernetes 1.27 there was a change where a new feature (kubernetes/kubernetes#108838) was added. This will attempt to listen on port 10260.

Listening ports in K3s 1.27:

# ss -tlpen
State Recv-Q Send-Q Local Address:Port Peer Address:Port Process
LISTEN 0 4096 127.0.0.1:10248 0.0.0.0:* users:(("k3s-server",pid=1938,fd=165)) ino:20446 sk:102e cgroup:/system.slice/k3s.service <->
LISTEN 0 4096 127.0.0.1:10249 0.0.0.0:* users:(("k3s-server",pid=1938,fd=200)) ino:22471 sk:102f cgroup:/system.slice/k3s.service <->
LISTEN 0 4096 127.0.0.1:10256 0.0.0.0:* users:(("k3s-server",pid=1938,fd=198)) ino:22469 sk:1030 cgroup:/system.slice/k3s.service <->
LISTEN 0 4096 127.0.0.1:10257 0.0.0.0:* users:(("k3s-server",pid=1938,fd=177)) ino:20994 sk:1031 cgroup:/system.slice/k3s.service <->
LISTEN 0 4096 127.0.0.1:10259 0.0.0.0:* users:(("k3s-server",pid=1938,fd=155)) ino:20796 sk:1032 cgroup:/system.slice/k3s.service <->
LISTEN 0 4096 10.0.255.101:2379 0.0.0.0:* users:(("k3s-server",pid=1938,fd=11)) ino:20227 sk:1033 cgroup:/system.slice/k3s.service <->
LISTEN 0 4096 10.0.255.101:2380 0.0.0.0:* users:(("k3s-server",pid=1938,fd=9)) ino:20225 sk:1034 cgroup:/system.slice/k3s.service <->
LISTEN 0 4096 127.0.0.1:2380 0.0.0.0:* users:(("k3s-server",pid=1938,fd=10)) ino:20226 sk:1035 cgroup:/system.slice/k3s.service <->
LISTEN 0 4096 127.0.0.1:2381 0.0.0.0:* users:(("k3s-server",pid=1938,fd=15)) ino:19263 sk:1036 cgroup:/system.slice/k3s.service <->
LISTEN 0 4096 127.0.0.1:2379 0.0.0.0:* users:(("k3s-server",pid=1938,fd=12)) ino:20228 sk:1037 cgroup:/system.slice/k3s.service <->
LISTEN 0 4096 127.0.0.1:6444 0.0.0.0:* users:(("k3s-server",pid=1938,fd=18)) ino:19392 sk:1038 cgroup:/system.slice/k3s.service <->
LISTEN 0 4096 127.0.0.1:9879 0.0.0.0:* users:(("cilium-agent",pid=3805,fd=36)) ino:29376 sk:1039 cgroup:/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podb4dd4c59_73d5_4c9d_8e32_67ffe04260e5.slice/cri-containerd-0ad1a1b8dcadffaca1a7414c87e093087902f697b29e5052d6548461db831f00.scope <->
LISTEN 0 4096 127.0.0.1:9890 0.0.0.0:* users:(("cilium-agent",pid=3805,fd=6)) ino:28724 sk:103a cgroup:/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podb4dd4c59_73d5_4c9d_8e32_67ffe04260e5.slice/cri-containerd-0ad1a1b8dcadffaca1a7414c87e093087902f697b29e5052d6548461db831f00.scope <->
LISTEN 0 4096 127.0.0.1:9891 0.0.0.0:* users:(("cilium-operator",pid=3363,fd=3)) ino:26911 sk:103b cgroup:/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pode8ff8932_d26c_408f_8961_e0fd7f116aa5.slice/cri-containerd-b1897e301212c9a42b2f06fbb943a832db245fea6afdf8f34759358cd0fc4193.scope <->
LISTEN 0 4096 127.0.0.1:10010 0.0.0.0:* users:(("containerd",pid=1989,fd=11)) ino:20295 sk:103c cgroup:/system.slice/k3s.service <->
LISTEN 0 128 0.0.0.0:22 0.0.0.0:* users:(("sshd",pid=1317,fd=3)) ino:18093 sk:103d cgroup:/system.slice/sshd.service <->
LISTEN 0 4096 127.0.0.1:9234 0.0.0.0:* users:(("cilium-operator",pid=3363,fd=9)) ino:26922 sk:103e cgroup:/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pode8ff8932_d26c_408f_8961_e0fd7f116aa5.slice/cri-containerd-b1897e301212c9a42b2f06fbb943a832db245fea6afdf8f34759358cd0fc4193.scope <->
LISTEN 0 4096 127.0.0.1:36121 0.0.0.0:* users:(("cilium-agent",pid=3805,fd=30)) ino:28497 sk:103f fwmark:0xb00 cgroup:/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podb4dd4c59_73d5_4c9d_8e32_67ffe04260e5.slice/cri-containerd-0ad1a1b8dcadffaca1a7414c87e093087902f697b29e5052d6548461db831f00.scope <->
LISTEN 0 4096 *:9962 *:* users:(("cilium-agent",pid=3805,fd=8)) ino:28433 sk:1040 cgroup:/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podb4dd4c59_73d5_4c9d_8e32_67ffe04260e5.slice/cri-containerd-0ad1a1b8dcadffaca1a7414c87e093087902f697b29e5052d6548461db831f00.scope v6only:0 <->
LISTEN 0 4096 *:4244 *:* users:(("cilium-agent",pid=3805,fd=55)) ino:29421 sk:1041 cgroup:/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podb4dd4c59_73d5_4c9d_8e32_67ffe04260e5.slice/cri-containerd-0ad1a1b8dcadffaca1a7414c87e093087902f697b29e5052d6548461db831f00.scope v6only:0 <->
LISTEN 0 4096 *:4240 *:* users:(("cilium-agent",pid=3805,fd=66)) ino:30256 sk:1042 cgroup:/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podb4dd4c59_73d5_4c9d_8e32_67ffe04260e5.slice/cri-containerd-0ad1a1b8dcadffaca1a7414c87e093087902f697b29e5052d6548461db831f00.scope v6only:0 <->
LISTEN 0 4096 *:10250 *:* users:(("k3s-server",pid=1938,fd=150)) ino:20442 sk:1043 cgroup:/system.slice/k3s.service v6only:0 <->
LISTEN 0 128 [::]:22 [::]:* users:(("sshd",pid=1317,fd=4)) ino:18095 sk:1044 cgroup:/system.slice/sshd.service v6only:1 <->
LISTEN 0 4096 *:10260 *:* users:(("k3s-server",pid=1938,fd=199)) ino:23674 sk:1045 cgroup:/system.slice/k3s.service v6only:0 <->
LISTEN 0 4096 *:6443 *:* users:(("k3s-server",pid=1938,fd=7)) ino:20222 sk:1046 cgroup:/system.slice/k3s.service v6only:0 <->

With version 1.27 K3s binds itself to port 10260, and the Hetzner CCM then fails to start:

# kubectl logs -f -n kube-system hcloud-cloud-controller-manager-b6b64df54-jkfwt
Flag --allow-untagged-cloud has been deprecated, This flag is deprecated and will be removed in a future release. A cluster-id will be required on cloud instances.
I0815 10:43:05.610031 1 serving.go:348] Generated self-signed cert in-memory
I0815 10:43:06.907824 1 serving.go:348] Generated self-signed cert in-memory
W0815 10:43:06.907870 1 client_config.go:618] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
failed to create listener: failed to listen on 0.0.0.0:10260: listen tcp 0.0.0.0:10260: bind: address already in use
Error: failed to create listener: failed to listen on 0.0.0.0:10260: listen tcp 0.0.0.0:10260: bind: address already in use

I don't know exactly why the Hetzner CCM is trying to listen on this port. It seems this only happens for a few seconds on startup, and after that the port is released. I saw the behaviour on 1.26 for a short while during deployment:

# ss -tlpen
[...]
LISTEN 0 4096 *:10260 *:* users:(("hcloud-cloud-co",pid=11545,fd=3)) ino:152602 sk:9dae cgroup:/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod1dd0976e_3b5d_419d_adf0_5a1cdecbb677.slice/cri-containerd-30c55c84ce8ebb29b4584a59efe59a69c24aa9cba792ae9dc5062cafbad451db.scope v6only:0 <->
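To reproduce that observation, a small loop like the one below can be used to watch who holds port 10260 while the HCCM Pod starts (a minimal sketch; the interval and grep pattern are arbitrary):

# Print a timestamp and any listener on port 10260, once per second,
# while the HCCM Pod is starting up.
while true; do
  date '+%T'
  ss -tlpn | grep ':10260 ' || echo "port 10260 not bound"
  sleep 1
done

During HCCM startup the listener should appear for a few seconds and then disappear again, matching the behaviour described above.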
Hello 👋 Are you still running the k3s-included cloud-provider in addition to HCCM? That might explain why k3s is also trying to register that port.
Hey @apricote, thank you for reaching out! 🙂 I hope we don't. Here are all the places where we set [...]:
Here is where we disable the embedded CCM:
Here are the arguments we start the k3s server with:

[
"server",
"--advertise-address", "10.0.255.101",
"--cluster-cidr", "10.0.128.0/17",
"--cluster-init", "true",
"--disable", "local-storage",
"--disable", "traefik",
"--disable-cloud-controller", "true",
"--disable-network-policy", "true",
"--flannel-backend", "none",
"--flannel-iface", "eth1",
"--kube-controller-manager-arg", "flex-volume-plugin-dir=/var/lib/kubelet/volumeplugins",
"--kubelet-arg", "cloud-provider=external",
"--kubelet-arg", "volume-plugin-dir=/var/lib/kubelet/volumeplugins",
"--node-ip", "10.0.255.101",
"--node-label", "k3s_upgrade=true",
"--node-name", "k3a-p-ctl-hoq-hnr-lxs",
"--selinux", "true",
"--service-cidr", "10.0.64.0/18",
"--tls-san", "78.46.202.228",
"--token", "********",
]

-> See ...

I hope we don't have a misconfiguration here. Before 1.27 this port binding was not happening in K3s, but the Hetzner CCM seems to do it at every startup for a couple of seconds (I still don't know why, and I could not find anything in the code) and releases the port afterwards. I'm not sure if it is now the intention of Kubernetes to always bind to this port, as I don't fully understand the intent of their change in kubernetes/kubernetes#108838.
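To double-check on a node that the embedded cloud controller is really off, the flags of the running k3s server can be inspected directly (a minimal sketch, assuming the process command line matches "k3s server" and pgrep is available on the host):

# Print the command-line flags of the running k3s server, one per line,
# and filter for anything cloud-related.
tr '\0' '\n' < /proc/"$(pgrep -o -f 'k3s server')"/cmdline | grep -i cloud

If --disable-cloud-controller and cloud-provider=external show up, the embedded CCM should indeed be disabled, matching the arguments listed above.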
From my POV it's a bug/undesired behavior in hccm (coming from k/cloud-provider). I tried debugging it today, because the code that serves the port is behind two checks that should be negative (feature flag + len(webhooks) > 0), but I was not successful yet. Will continue to investigate tomorrow.
Just some additional info in case it was not clear from the issue history. In this particular case, K3s is deployed on a single node. First I thought it could be specific to single-node deployments, but it seems to happen in cluster deployments as well. Here is a test with 3 control plane and 3 worker nodes, and the status changes of the HCCM Pod:

# kubectl get pods -A -o wide | grep hcloud-cloud-controller-manager
kube-system hcloud-cloud-controller-manager-b6b64df54-726vw 0/1 Pending 0 35s <none> <none> <none> <none>
[...]
# kubectl get pods -A -o wide | grep hcloud-cloud-controller-manager
kube-system hcloud-cloud-controller-manager-b6b64df54-726vw 0/1 ContainerCreating 0 36s 10.0.255.101 k3a-p-ctl-hoq-hnr-pvj <none> <none>
[...]
# kubectl get pods -A -o wide | grep hcloud-cloud-controller-manager
kube-system hcloud-cloud-controller-manager-b6b64df54-726vw 1/1 Running 0 40s 10.0.255.101 k3a-p-ctl-hoq-hnr-pvj <none> <none>
[...]
# kubectl get pods -A -o wide | grep hcloud-cloud-controller-manager
kube-system hcloud-cloud-controller-manager-b6b64df54-726vw 0/1 Error 0 43s 10.0.255.101 k3a-p-ctl-hoq-hnr-pvj <none> <none>
# kubectl get pods -A -o wide | grep hcloud-cloud-controller-manager
[...]
# kubectl get pods -A -o wide | grep hcloud-cloud-controller-manager
kube-system hcloud-cloud-controller-manager-b6b64df54-726vw 0/1 CrashLoopBackOff 4 (80s ago) 3m45s 10.0.255.101 k3a-p-ctl-hoq-hnr-pvj <none> <none>
[...]

-> The HCCM Pod has been scheduled on a control plane node and fails for the exact same reason:

# kubectl logs -n kube-system hcloud-cloud-controller-manager-b6b64df54-726vw
Flag --allow-untagged-cloud has been deprecated, This flag is deprecated and will be removed in a future release. A cluster-id will be required on cloud instances.
I0815 14:49:15.065990 1 serving.go:348] Generated self-signed cert in-memory
I0815 14:49:15.850851 1 serving.go:348] Generated self-signed cert in-memory
W0815 14:49:15.850888 1 client_config.go:618] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
failed to create listener: failed to listen on 0.0.0.0:10260: listen tcp 0.0.0.0:10260: bind: address already in use
Error: failed to create listener: failed to listen on 0.0.0.0:10260: listen tcp 0.0.0.0:10260: bind: address already in use
Usage:
cloud-controller-manager [flags]
[...]

... and the K3s control plane nodes bind themselves on port 10260:

# ss -tlpen | grep 10260
LISTEN 0 4096 *:10260 *:* users:(("k3s-server",pid=1936,fd=196)) ino:23034 sk:f cgroup:/system.slice/k3s.service v6only:0 <->

But here comes the interesting part: this only happens if we use the Klipper LB and disable the Hetzner Load Balancer! If you are lucky and the HCCM was scheduled on a worker node, everything is still fine, because there is no port conflict. But if it was scheduled on a control plane node, it fails due to the port conflict. In case the Hetzner LB is used, K3s does not bind on port 10260. We normally don't allow scheduling on control plane nodes with [...]. Why do K3s control plane nodes bind on port 10260?
-> My wild guess would be that this feature is now in both the K3s CCM and the HCCM. Since the HCCM is not provided as a managed service by Hetzner Cloud but runs inside the K3s cluster, the two can conflict when the K3s CCM (probably partially enabled because of the Klipper LB) and the HCCM end up on the same node.
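A quick way to correlate both observations on a running cluster is to check whether the Klipper LB (svclb) Pods exist and whether k3s-server itself holds the port on a control plane node (a sketch; names are taken from the outputs above):

# Are Klipper LB (svclb) Pods deployed instead of a Hetzner Load Balancer?
kubectl -n kube-system get pods -o wide | grep svclb

# Run on a control plane node: is k3s-server the process holding port 10260?
ss -tlpen | grep 10260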
I have found the source of the port binding and opened an issue upstream for it: kubernetes/kubernetes#120043. As a workaround you can set [...].
Fix is published as v1.17.2. Thanks for the debugging help (and noticing the issue).
I think the upstream bug is fixed? hetznercloud/hcloud-cloud-controller-manager#492. Just run [...]. Edit: I just edited the version via [...].
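For anyone who wants to pick up the fixed HCCM release before the module update lands, pointing the Deployment at the new image is one way to do it (a sketch; the Deployment and container names are assumed from the Pod names shown earlier, the image name follows the usual hetznercloud registry naming, and v1.17.2 is the release mentioned above):

# Switch the HCCM Deployment to the fixed release and wait for the rollout.
kubectl -n kube-system set image deployment/hcloud-cloud-controller-manager \
  hcloud-cloud-controller-manager=hetznercloud/hcloud-cloud-controller-manager:v1.17.2
kubectl -n kube-system rollout status deployment/hcloud-cloud-controller-manager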
Should be fixed now as #938 has been merged and released.
Description
Hi, I'm trying to deploy a single-node cluster, but the creation gets stuck waiting for the system-upgrade-controller deployment to become available (stuck on "Still creating...").
Here is the log:
I read many issues but didn't find anything useful (#623 and #311 in particular).
I'm fairly new to Kubernetes and this project, so I don't know if this is useful, but after logging into the machine I found that the system-upgrade-controller couldn't start because the node "had untolerated taint {node.cloudprovider.kubernetes.io/uninitialized: true}".
I also tried more than 10 different kube.tf configurations, but nothing changed.
Thanks for your help.
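For reference, the taint mentioned above can be inspected directly. Something like the following shows whether nodes are still marked as uninitialized and why the controller Pod stays pending (a sketch; the system-upgrade namespace is the usual default for the system-upgrade-controller and may differ in your setup):

# List each node together with the keys of its taints; the
# node.cloudprovider.kubernetes.io/uninitialized taint should disappear
# once the cloud controller manager has initialized the node.
kubectl get nodes -o custom-columns='NAME:.metadata.name,TAINTS:.spec.taints[*].key'

# Show the controller Pods and the scheduling events explaining why they are pending.
kubectl -n system-upgrade get pods
kubectl -n system-upgrade describe pods | grep -A 5 Events: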
Kube.tf file
Screenshots
No response
Platform
Linux