Unable to add a second control plane: control-plane-join/etcd fails to start etcd as learner member #2997
Comments
There are no SIG labels on this issue. Please add an appropriate label by using one of the /sig commands. Please see the group list for a listing of the SIGs, working groups, and committees available.
This issue is currently awaiting triage. If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label.
/transfer kubeadm
Any more context about this?
Looks like the learner member never syncs with the existing leader member.
This I don't like: you seem to be using the IP of the existing control-plane machine to join the second control-plane machine. Doing the switch now is not trivial... https://kubernetes.io/docs/setup/production-environment/tools/kubeadm/ha-topology/ The above is my only suspect for why you might be getting this strange error. FWIW, we have e2e tests for learner mode and it's already used in production by users, as it's enabled by default in kubeadm 1.29.
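For reference, the HA topology docs above recommend pointing controlPlaneEndpoint at a load balancer (or a DNS name you can repoint later) rather than one machine's IP. A minimal sketch of such a config, where lb.example.internal:6443 is a hypothetical address, not one from this cluster:

# Sketch: write a ClusterConfiguration whose endpoint is an LB/VIP,
# so joining control planes never depends on a single machine's IP.
cat <<'EOF' > kubeadm-config.yaml
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
kubernetesVersion: v1.29.0
controlPlaneEndpoint: "lb.example.internal:6443"  # hypothetical LB, not a node IP
EOF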
cc @pacoxu
I wondered if you can join successfully with
The
Can you check the etcd logs from the two nodes? "Sync failed" makes me think the time is not synced (do you use ntp or chronyd to keep the clocks the same?). (Only a guess.)
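For reference, a quick way to check whether the clocks are in sync on both nodes (standard systemd and chrony tooling, not commands taken from this thread):

timedatectl status    # look for "System clock synchronized: yes"
chronyc tracking      # if chronyd is in use; shows the offset from the NTP source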
The new node can reach the master etcd:

openssl s_client -connect 10.254.0.4:2380 --cert /etc/kubernetes/pki/etcd/server.crt --key /etc/kubernetes/pki/etcd/server.key
CONNECTED(00000003)
Can't use SSL_get_servername
depth=0 CN = a**************e
verify error:num=20:unable to get local issuer certificate
verify return:1
depth=0 CN = a**************e
verify error:num=21:unable to verify the first certificate
verify return:1
depth=0 CN = a**************e
verify return:1
---
Certificate chain
0 s:CN = a**************e
i:C = FR, ST = Bouches du Rhône, L = Marseille, O =***********, OU = Admin, CN = ********-etcd-CA, emailAddress = admin-ca@*********
a:PKEY: rsaEncryption, 2048 (bit); sigalg: RSA-SHA256
v:NotBefore: Mar 19 08:42:18 2023 GMT; NotAfter: Dec 22 19:05:22 2024 GMT

I know using an LB is the right way, but it should work without one. @pacoxu with
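As an aside, the verify errors above (num=20 and num=21) are expected when s_client is not given the etcd CA; passing it should produce a clean verification. A sketch using the same paths as above:

openssl s_client -connect 10.254.0.4:2380 \
  -CAfile /etc/kubernetes/pki/etcd/ca.crt \
  -cert /etc/kubernetes/pki/etcd/server.crt \
  -key /etc/kubernetes/pki/etcd/server.key
# expect "Verify return code: 0 (ok)" instead of errors 20/21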
This is more like an etcd problem and we may need some help from etcd maintainers.
/sig etcd
https://etcd.io/docs/v3.4/op-guide/runtime-configuration/#error-cases-when-promoting-a-learner-member
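For context, kubeadm promotes the learner automatically, but the manual equivalent described in the linked etcd docs looks roughly like this sketch (the member ID is a placeholder, and $ETCDCTL_PKI holds the cert flags used later in this thread):

# learners show IS LEARNER=true in the member list
sudo ETCDCTL_API=3 etcdctl $ETCDCTL_PKI member list -w table
# promoting a learner that has not caught up fails with
# "can only promote a learner member which is in sync with leader"
sudo ETCDCTL_API=3 etcdctl $ETCDCTL_PKI member promote 8e9e05c52164694d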
It's simple: it just means that the learner isn't in sync with the leader's data. Specifically, the learner's index < 90% of the leader's index. Please execute the following two commands and share the output.
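The exact commands aren't preserved above; a plausible way to compare the learner's raft index against the leader's (JSON field names per etcdctl 3.5 output, jq required) is:

# print endpoint, raftIndex and raftAppliedIndex for every member
sudo ETCDCTL_API=3 etcdctl $ETCDCTL_PKI endpoint status --cluster -w json \
  | jq '.[] | [.Endpoint, .Status.raftIndex, .Status.raftAppliedIndex]'
# the learner counts as "in sync" once its raftIndex reaches
# roughly 90% of the leader's raftIndex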
The timeout is 2 minutes when waiting for the learner to become ready.
This will be configurable with v1beta4 (which is not released/enabled yet).
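As a sketch of what that could look like, assuming the v1beta4 timeouts structure and the etcdAPICall field name from the (unreleased) v1beta4 design:

# Hypothetical v1beta4 JoinConfiguration fragment; the API is not
# released yet, so treat the field names as assumptions.
cat <<'EOF' > join-config.yaml
apiVersion: kubeadm.k8s.io/v1beta4
kind: JoinConfiguration
timeouts:
  etcdAPICall: 4m0s   # raise the 2m default if the learner syncs slowly
EOF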
@eltorio please reply to this when possible #2997 (comment)
Normally it gets ready within seconds, e.g. 10.
It might be better to include a "match" percentage in the log so that the user can see the progress more clearly.
If technically possible, that doesn't seem like a bad idea, as long as the percentage progresses well and is accurate in the output.
@neolit123 unfortunately I only saw your message now… I joined the new control plane again; first I removed the stale member and cleaned up the node:

MEMBER=mikado
# look up the etcd member ID for this node (cluster_etcd_get_member_list is a
# local helper that prints the etcd member list)
ETCDID=$(cluster_etcd_get_member_list | sed -n "s/^\([a-f0-9]*\), .*, $MEMBER.*/\1/p")
# remove the stale member from the etcd cluster
sudo ETCDCTL_API=3 etcdctl --cacert /etc/kubernetes/pki/etcd/ca.crt --cert /etc/kubernetes/pki/etcd/server.crt --key /etc/kubernetes/pki/etcd/server.key member remove $ETCDID
# drain and delete the node from Kubernetes
kubectl drain $MEMBER --delete-emptydir-data --force --ignore-daemonsets
kubectl delete node $MEMBER
# stop every remaining control-plane and CNI process on the node
ssh root@$MEMBER "systemctl stop kubelet ; killall kube-scheduler ; killall kube-controller-manager ; killall kube-apiserver ; killall kube-proxy ; killall kubelet ; killall etcd ; killall csi-node-driver-registrar ; killall cilium-envoy ; killall cilium-agent ; killall cilium-operator ; killall cilium-health-responder ; killall cilium-etcd-operator ; killall cilium-etcd"

The next time, the join succeeded:

rlemeill@a******e:~$ sudo ETCDCTL_API=3 etcdctl $ETCDCTL_PKI member list
d19d13a36f304930, started, mikado, https://192.168.2.8:2380, https://192.168.2.8:2379, false
d464065f53ed8f0d, started, a***********e, https://10.254.0.4:2380, https://10.254.0.4:2379, false
rlemeill@a******e:~$ echo $ETCDCTL_PKI
--cacert /etc/kubernetes/pki/etcd/ca.crt --cert /etc/kubernetes/pki/etcd/server.crt --key /etc/kubernetes/pki/etcd/server.key

rlemeill@a*******e:~$ sudo ETCDCTL_API=3 etcdctl $ETCDCTL_PKI endpoint status -w table --cluster
+--------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+--------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://192.168.2.8:2379 | d19d13a36f304930 | 3.5.10 | 163 MB | false | false | 4 | 9479541 | 9479541 | |
| https://10.254.0.4:2379 | d464065f53ed8f0d | 3.5.10 | 163 MB | true | false | 4 | 9479542 | 9479542 | |
+--------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
It might also be useful to issue a defrag before joining:

rlemeill@a********e:~$ sudo ETCDCTL_API=3 etcdctl $ETCDCTL_PKI defrag --cluster
Finished defragmenting etcd member[https://192.168.2.8:2379]
Finished defragmenting etcd member[https://10.254.0.4:2379]
rlemeill@a**********e:~$ sudo ETCDCTL_API=3 etcdctl $ETCDCTL_PKI endpoint status -w table --cluster
+--------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+--------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://192.168.2.8:2379 | d19d13a36f304930 | 3.5.10 | 23 MB | false | false | 4 | 9482093 | 9482093 | |
| https://10.254.0.4:2379 | d464065f53ed8f0d | 3.5.10 | 23 MB | true | false | 4 | 9482093 | 9482093 | |
+--------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
Please raise a ticket (and PR) in the etcd community, thx.
It might be a temporary issue, e.g. a network issue?
I don't think we should do that from kubeadm. Users should follow external docs in case they want to perform additional ops on the etcd cluster. Thanks for confirming it now works, but it seems there are no other action items for kubeadm here. The potential percentage change can be tracked in separate etcd/kubeadm tickets.
/kind support
We can increase the timeout to > 2 min, but that's a lot of time already, and v1beta4 makes it customizable.
FYI, etcd-io/etcd#17288 is addressing #2997 (comment)
thanks @ahrtr
What happened?
On an existing Kubernetes 1.29.0 cluster with 1 control plane and 6 workers, adding a second control plane fails at the control-plane-join/etcd phase.
What did you expect to happen?
A second control plane should become active.
How can we reproduce it (as minimally and precisely as possible)?
Anything else we need to know?
This is due to the Cilium kube-proxy replacement.
kubeadm-config.yaml with
kubectl -n kube-system get cm kubeadm-config -o yaml
etcd member list
sudo ETCDCTL_API=3 etcdctl --cacert /etc/kubernetes/pki/etcd/ca.crt --cert /etc/kubernetes/pki/etcd/server.crt --key /etc/kubernetes/pki/etcd/server.key member list
Kubernetes version
Cloud provider
OS version
Install tools
Container runtime (CRI) and version (if applicable)
Related plugins (CNI, CSI, ...) and versions (if applicable)