Description
What keywords did you search in kubeadm issues before filing this one?
etcd join race
Is this a BUG REPORT or FEATURE REQUEST?
BUG REPORT
Versions
kubeadm version (use kubeadm version):
kubeadm version: &version.Info{Major:"1", Minor:"15+", GitVersion:"v1.15.4-1+d5ee6cddf7e896", GitCommit:"d5ee6cddf7e896fb8556cad24a610df657ecd824", GitTreeState:"clean", BuildDate:"2019-10-03T22:18:19Z", GoVersion:"go1.12.9", Compiler:"gc", Platform:"linux/amd64"}
Environment:
- Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"15+", GitVersion:"v1.15.4-1+d5ee6cddf7e896", GitCommit:"d5ee6cddf7e896fb8556cad24a610df657ecd824", GitTreeState:"clean", BuildDate:"2019-10-03T22:19:15Z", GoVersion:"go1.12.9", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"15+", GitVersion:"v1.15.4-1+d5ee6cddf7e896", GitCommit:"d5ee6cddf7e896fb8556cad24a610df657ecd824", GitTreeState:"clean", BuildDate:"2019-10-03T22:16:41Z", GoVersion:"go1.12.9", Compiler:"gc", Platform:"linux/amd64"}
- Cloud provider or hardware configuration:
- OS (e.g. from /etc/os-release):
NAME="VMware Photon OS"
VERSION="3.0"
ID=photon
VERSION_ID=3.0
PRETTY_NAME="VMware Photon OS/Linux"
- Kernel (e.g. uname -a):
Linux 42332700031c0a2cfd7ef78ebc8af387 4.19.84-1.ph3-esx #1-photon SMP Tue Nov 19 00:39:50 UTC 2019 x86_64 GNU/Linux
- Others:
This should be reproducible in any environment if our analysis is valid.
What happened?
We are setting up a 3-node (or multi-node) control-plane cluster. We understand that etcd learner mode is coming, but we cannot refactor our code to adopt it for now, and we really need to join the nodes concurrently, so we are wondering whether any workaround or suggestion can be offered.
We use kubeadm to bootstrap a 3-control-plane cluster. However, we need to inject some customization along the way, so we call the join phases one by one instead of simply running kubeadm join.
This issue only happens when adding the 3rd member.
When we call kubeadm join phase control-plane-join etcd on the 3rd node, we very rarely observe that the generated etcd manifest (/etc/kubernetes/manifests/etcd.yaml) has an incorrect --initial-cluster value.
Assuming etcd-0 is the first member, etcd-1 the second, and etcd-2 the third, a correct --initial-cluster value for etcd-2 might look like this:
--initial-cluster=etcd-0=https://192.168.0.1:2380,etcd-1=https://192.168.0.2:2380,etcd-2=https://192.168.0.3:2380
However, in this rare case, we get something like this:
--initial-cluster=etcd-0=https://192.168.0.1:2380,etcd-2=https://192.168.0.2:2380,etcd-2=https://192.168.0.3:2380
Basically, the name of etcd-1 was incorrectly set to etcd-2. This incorrect manifest causes the etcd container to fail to start with the following complaint:
etcdmain: error validating peerURLs {"ClusterID":"31c63dd3d7c3da6a","Members":[{"ID":"1b98ed58f9be3e7d","RaftAttributes":{"PeerURLs":["https://192.168.0.2:2380"]},"Attributes":{"Name":"etcd-1","ClientURLs":["https://192.168.0.2:2379"]}},{"ID":"6d631ff1c84da117","RaftAttributes":{"PeerURLs":["https://192.168.0.3:2380"]},"Attributes":{"Name":"","ClientURLs":[]}},{"ID":"f0c11b3401371571","RaftAttributes":{"PeerURLs":["https://192.168.0.1:2380"]},"Attributes":{"Name":"etcd-0","ClientURLs":["https://192.168.0.1:2379"]}}],"RemovedMemberIDs":[]}: member count is unequal\n","stream":"stderr","time":"2020-01-08T17:27:52.63704563Z"}
We think this error message appears because the manifest's --initial-cluster contains only 2 unique names while the etcd cluster actually has 3 members.
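As a purely illustrative sketch (this is not etcd's validation code; the names and URLs are the ones from the example above), reading the broken flag as a name-to-URL map shows why only 2 unique names survive:

package main

import (
	"fmt"
	"strings"
)

func main() {
	// The incorrect flag value from the generated manifest above.
	initialCluster := "etcd-0=https://192.168.0.1:2380,etcd-2=https://192.168.0.2:2380,etcd-2=https://192.168.0.3:2380"

	unique := map[string]string{}
	for _, entry := range strings.Split(initialCluster, ",") {
		parts := strings.SplitN(entry, "=", 2)
		unique[parts[0]] = parts[1] // the second "etcd-2" entry overwrites the first
	}

	// Prints 2, while the cluster actually has 3 members, which matches
	// the "member count is unequal" complaint.
	fmt.Println("unique names in --initial-cluster:", len(unique))
}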
We spent some time tracing the code to see what the issue could be, and we have a theory.
- Calling kubeadm join phase control-plane-join etcd
- Then the above command calls this
- It then calls etcdClient.AddMember()
- In func (c *Client) AddMember(name string, peerAddrs string), name is the current master's Name and peerAddrs is the current master's peer URL. Then, in L290: resp, err = cli.MemberAdd(ctx, []string{peerAddrs}), it calls the real MemberAdd, which returns a []Member that includes the member currently being added. So the response of this MemberAdd() contains all previous members plus the current one.
Once AddMember() receives the response, it runs this loop:
for _, m := range resp.Members {
// fixes the entry for the joining member (that doesn't have a name set in the initialCluster returned by etcd)
if m.Name == "" {
ret = append(ret, Member{Name: name, PeerURL: m.PeerURLs[0]})
} else {
ret = append(ret, Member{Name: m.Name, PeerURL: m.PeerURLs[0]})
}
}
Here resp is the response from MemberAdd() as described above, and this section fills in the given name for members that do not have a Name set. We think the expectation is that the member currently being added is the only one without a Name: the loop iterates over resp.Members, finds the member without a Name, and sets name as that member's Name.
But in this case resp.Members returned 3 members (because this happens on the 3rd member), and 2 of them somehow did not have a Name: the 2nd member had just joined the etcd cluster, but its etcd container was still coming up. In that situation, etcdctl member list would return something like:
1b98ed58f9be3e7d, started, etcd-0, https://20.20.0.37:2380, https://192.168.0.1:2379
6d631ff1c84da117, unstarted, , https://192.168.0.2:2380, <-- this is etcd-1, but not started yet so no Name
In this case, 2 out of the 3 members in the response do not have a Name, so the for loop above assigned the 3rd member's name (etcd-2) to both the 2nd and 3rd members.
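To make this concrete, here is a small standalone sketch (not kubeadm's code itself; it just applies the loop quoted above to a hypothetical racy MemberAdd response):

package main

import "fmt"

// Member mirrors only the fields relevant here; it is an illustration, not kubeadm's actual type.
type Member struct {
	Name    string
	PeerURL string
}

func main() {
	// Hypothetical MemberAdd response in the racy case: etcd-1 has been added
	// but its etcd instance has not started yet, so its Name is still empty,
	// just like the member being added right now (etcd-2).
	respMembers := []Member{
		{Name: "etcd-0", PeerURL: "https://192.168.0.1:2380"},
		{Name: "", PeerURL: "https://192.168.0.2:2380"}, // etcd-1, added but not started
		{Name: "", PeerURL: "https://192.168.0.3:2380"}, // etcd-2, just added
	}
	name := "etcd-2" // the name the joining node passes to AddMember

	// Same logic as the loop quoted above: every member without a Name gets the
	// joining member's name, so both unnamed entries end up as etcd-2.
	var ret []Member
	for _, m := range respMembers {
		if m.Name == "" {
			ret = append(ret, Member{Name: name, PeerURL: m.PeerURL})
		} else {
			ret = append(ret, Member{Name: m.Name, PeerURL: m.PeerURL})
		}
	}

	fmt.Println(ret)
	// [{etcd-0 https://192.168.0.1:2380} {etcd-2 https://192.168.0.2:2380} {etcd-2 https://192.168.0.3:2380}]
	// i.e. exactly the duplicated-name --initial-cluster we observed.
}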
We concluded that this issue only happens when the 3rd member runs MemberAdd while the 2nd member has been added but its etcd instance has not yet started, which makes this a race.
For this ticket, we want to understand:
- Is this analysis valid? Could this really happen?
- What could be the fix here? (A possible direction is sketched below.) We can apply it in our private repo while waiting for the adoption of etcd learner mode.
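One direction we have been considering is sketched here. This is only our assumption of a possible fix, not an accepted kubeadm change; it reuses the Member struct and the resp, name, peerAddrs, and ret variables from the AddMember code quoted above, and renames only the member whose peer URL matches the one just added instead of renaming every member with an empty Name:

for _, m := range resp.Members {
	memberName := m.Name
	// Identify the joining member by its peer URL rather than by an empty Name,
	// so another added-but-not-yet-started member cannot be renamed by mistake.
	if m.PeerURLs[0] == peerAddrs {
		memberName = name
	}
	ret = append(ret, Member{Name: memberName, PeerURL: m.PeerURLs[0]})
}

This still leaves open what name to write into --initial-cluster for a member that has been added but not yet started (its real name is unknown until it starts), so waiting or retrying until all members report a Name might also be needed; guidance on the right approach would be appreciated.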
What you expected to happen?
The generated etcd manifest should have the correct --initial-cluster value.
How to reproduce it (as minimally and precisely as possible)?
- Create a single-node k8s cluster
- Prepare 2 other nodes, but do not join them yet
- Join them to the cluster concurrently as control-plane nodes
Note that this happens very rarely; the frequency is probably 1 in 1000.