Unable to add a second control plane: control-plane-join/etcd fails to start etcd as learner member #2997
Comments
There are no SIG labels on this issue. Please add an appropriate label by using one of the /sig commands. Please see the group list for a listing of the SIGs, working groups, and committees available.
This issue is currently awaiting triage. If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label.
/transfer kubeadm
Any more context about this?
Looks like the learner member never syncs with the existing leader member.
This I don't like: you seem to be using the IP of the existing control-plane machine to join the second control-plane machine. Doing the switch now is not trivial... https://kubernetes.io/docs/setup/production-environment/tools/kubeadm/ha-topology/ The above is my only suspect for why you might be getting this strange error. FWIW, we have e2e tests for learner mode and it's already used in production by users, as it's enabled by default in kubeadm 1.29.
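For reference, the HA topology docs above recommend pointing controlPlaneEndpoint at a load balancer (or a DNS name you can repoint later) rather than one machine's IP. A minimal sketch of such a config, where lb.example.internal:6443 is a hypothetical address, not one from this cluster:

# Sketch: write a ClusterConfiguration whose endpoint is an LB/VIP,
# so joining control planes never depends on a single machine's IP.
cat <<'EOF' > kubeadm-config.yaml
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
kubernetesVersion: v1.29.0
controlPlaneEndpoint: "lb.example.internal:6443"  # hypothetical LB, not a node IP
EOF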
cc @pacoxu
I wondered if you can join successfully with
The
Can you check the etcd logs from the two nodes? "Sync failed" makes me think the time is not synced (do you use ntp or chronyd to keep the clocks the same?). (Only a guess.)
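For reference, a quick way to check whether the clocks are in sync on both nodes (standard systemd and chrony tooling, not commands taken from this thread):

timedatectl status    # look for "System clock synchronized: yes"
chronyc tracking      # if chronyd is in use; shows the offset from the NTP source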
The new node can reach the master etcd:

openssl s_client -connect 10.254.0.4:2380 --cert /etc/kubernetes/pki/etcd/server.crt --key /etc/kubernetes/pki/etcd/server.key
CONNECTED(00000003)
Can't use SSL_get_servername
depth=0 CN = a**************e
verify error:num=20:unable to get local issuer certificate
verify return:1
depth=0 CN = a**************e
verify error:num=21:unable to verify the first certificate
verify return:1
depth=0 CN = a**************e
verify return:1
---
Certificate chain
0 s:CN = a**************e
i:C = FR, ST = Bouches du Rhône, L = Marseille, O =***********, OU = Admin, CN = ********-etcd-CA, emailAddress = admin-ca@*********
a:PKEY: rsaEncryption, 2048 (bit); sigalg: RSA-SHA256
v:NotBefore: Mar 19 08:42:18 2023 GMT; NotAfter: Dec 22 19:05:22 2024 GMT

I know using an LB is the right way, but it should work without one. @pacoxu with
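As an aside, the verify errors above (num=20 and num=21) are expected when s_client is not given the etcd CA; passing it should produce a clean verification. A sketch using the same paths as above:

openssl s_client -connect 10.254.0.4:2380 \
  -CAfile /etc/kubernetes/pki/etcd/ca.crt \
  -cert /etc/kubernetes/pki/etcd/server.crt \
  -key /etc/kubernetes/pki/etcd/server.key
# expect "Verify return code: 0 (ok)" instead of errors 20/21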
This is more like an etcd problem and we may need some help from etcd maintainers.
/sig etcd
https://etcd.io/docs/v3.4/op-guide/runtime-configuration/#error-cases-when-promoting-a-learner-member
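For context, kubeadm promotes the learner automatically, but the manual equivalent described in the linked etcd docs looks roughly like this sketch (the member ID is a placeholder, and $ETCDCTL_PKI holds the cert flags used later in this thread):

# learners show IS LEARNER=true in the member list
sudo ETCDCTL_API=3 etcdctl $ETCDCTL_PKI member list -w table
# promoting a learner that has not caught up fails with
# "can only promote a learner member which is in sync with leader"
sudo ETCDCTL_API=3 etcdctl $ETCDCTL_PKI member promote 8e9e05c52164694d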
It's simple: it just means that the learner isn't in sync with the leader's data. Specifically, the learner's index < 90% of the leader's index. Please execute the following two commands and share the output.
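The exact commands aren't preserved above; a plausible way to compare the learner's raft index against the leader's (JSON field names per etcdctl 3.5 output, jq required) is:

# print endpoint, raftIndex and raftAppliedIndex for every member
sudo ETCDCTL_API=3 etcdctl $ETCDCTL_PKI endpoint status --cluster -w json \
  | jq '.[] | [.Endpoint, .Status.raftIndex, .Status.raftAppliedIndex]'
# the learner counts as "in sync" once its raftIndex reaches
# roughly 90% of the leader's raftIndex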
The timeout is 2 minutes when waiting for the learner to become ready.
This will be configurable with v1beta4 (which is not released/enabled yet).
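As a sketch of what that could look like, assuming the v1beta4 timeouts structure and the etcdAPICall field name from the (unreleased) v1beta4 design:

# Hypothetical v1beta4 JoinConfiguration fragment; the API is not
# released yet, so treat the field names as assumptions.
cat <<'EOF' > join-config.yaml
apiVersion: kubeadm.k8s.io/v1beta4
kind: JoinConfiguration
timeouts:
  etcdAPICall: 4m0s   # raise the 2m default if the learner syncs slowly
EOF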
@eltorio please reply to this when possible #2997 (comment)
Normally it gets ready within seconds, e.g. 10.
It might be better to include a "match" percentage in the log so that the user can see the progress more clearly.
If technically possible, that doesn't seem like a bad idea, as long as the percentage progresses well and is accurate in the output.
@neolit123 unfortunately I only saw your message now… I joined the new control plane again; first I removed the stale member and cleaned up the node:

MEMBER=mikado
# look up the etcd member ID for this node (cluster_etcd_get_member_list is a
# local helper that prints the etcd member list)
ETCDID=$(cluster_etcd_get_member_list | sed -n "s/^\([a-f0-9]*\), .*, $MEMBER.*/\1/p")
# remove the stale member from the etcd cluster
sudo ETCDCTL_API=3 etcdctl --cacert /etc/kubernetes/pki/etcd/ca.crt --cert /etc/kubernetes/pki/etcd/server.crt --key /etc/kubernetes/pki/etcd/server.key member remove $ETCDID
# drain and delete the node from Kubernetes
kubectl drain $MEMBER --delete-emptydir-data --force --ignore-daemonsets
kubectl delete node $MEMBER
# stop every remaining control-plane and CNI process on the node
ssh root@$MEMBER "systemctl stop kubelet ; killall kube-scheduler ; killall kube-controller-manager ; killall kube-apiserver ; killall kube-proxy ; killall kubelet ; killall etcd ; killall csi-node-driver-registrar ; killall cilium-envoy ; killall cilium-agent ; killall cilium-operator ; killall cilium-health-responder ; killall cilium-etcd-operator ; killall cilium-etcd"

The next time, the join succeeded:

rlemeill@a******e:~$ sudo ETCDCTL_API=3 etcdctl $ETCDCTL_PKI member list
d19d13a36f304930, started, mikado, https://192.168.2.8:2380, https://192.168.2.8:2379, false
d464065f53ed8f0d, started, a***********e, https://10.254.0.4:2380, https://10.254.0.4:2379, false
rlemeill@a******e:~$ echo $ETCDCTL_PKI
--cacert /etc/kubernetes/pki/etcd/ca.crt --cert /etc/kubernetes/pki/etcd/server.crt --key /etc/kubernetes/pki/etcd/server.key

rlemeill@a*******e:~$ sudo ETCDCTL_API=3 etcdctl $ETCDCTL_PKI endpoint status -w table --cluster
+--------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+--------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://192.168.2.8:2379 | d19d13a36f304930 | 3.5.10 | 163 MB | false | false | 4 | 9479541 | 9479541 | |
| https://10.254.0.4:2379 | d464065f53ed8f0d | 3.5.10 | 163 MB | true | false | 4 | 9479542 | 9479542 | |
+--------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
It might also be useful to issue a defrag before joining:

rlemeill@a********e:~$ sudo ETCDCTL_API=3 etcdctl $ETCDCTL_PKI defrag --cluster
Finished defragmenting etcd member[https://192.168.2.8:2379]
Finished defragmenting etcd member[https://10.254.0.4:2379]
rlemeill@a**********e:~$ sudo ETCDCTL_API=3 etcdctl $ETCDCTL_PKI endpoint status -w table --cluster
+--------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+--------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://192.168.2.8:2379 | d19d13a36f304930 | 3.5.10 | 23 MB | false | false | 4 | 9482093 | 9482093 | |
| https://10.254.0.4:2379 | d464065f53ed8f0d | 3.5.10 | 23 MB | true | false | 4 | 9482093 | 9482093 | |
+--------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
Please raise a ticket (and PR) in the etcd community, thx.
It might be a temporary issue, e.g. a network issue?
I don't think we should do that from kubeadm. Users should follow external docs in case they want to perform additional ops on the etcd cluster. Thanks for confirming it now works, but it seems there are no other action items for kubeadm here. The potential percentage change can be tracked in separate etcd/kubeadm tickets.
/kind support
We can increase the timeout to > 2 min, but that's a lot of time already, and v1beta4 makes it customizable.
FYI, etcd-io/etcd#17288 is addressing #2997 (comment)
thanks @ahrtr
What happened?
On an existing Kubernetes 1.29.0 cluster with 1 control plane and 6 workers, adding a second control plane fails at the control-plane-join/etcd phase.
What did you expect to happen?
A second control plane should become active.
How can we reproduce it (as minimally and precisely as possible)?
Anything else we need to know?
This is due to the Cilium kube-proxy replacement.
kubeadm-config.yaml with
kubectl -n kube-system get cm kubeadm-config -o yaml
etcd member list
sudo ETCDCTL_API=3 etcdctl --cacert /etc/kubernetes/pki/etcd/ca.crt --cert /etc/kubernetes/pki/etcd/server.crt --key /etc/kubernetes/pki/etcd/server.key member list
Kubernetes version
Cloud provider
OS version
Install tools
Container runtime (CRI) and version (if applicable)
Related plugins (CNI, CSI, ...) and versions (if applicable)