
Concurrent CP join race condition for etcd join #2005

Closed

@echu23

Description

What keywords did you search in kubeadm issues before filing this one?

etcd join race

I did find #1319 #2001 #1793

Is this a BUG REPORT or FEATURE REQUEST?

BUG REPORT

Versions

kubeadm version (use kubeadm version):

kubeadm version: &version.Info{Major:"1", Minor:"15+", GitVersion:"v1.15.4-1+d5ee6cddf7e896", GitCommit:"d5ee6cddf7e896fb8556cad24a610df657ecd824", GitTreeState:"clean", BuildDate:"2019-10-03T22:18:19Z", GoVersion:"go1.12.9", Compiler:"gc", Platform:"linux/amd64"}

Environment:

  • Kubernetes version (use kubectl version):

Client Version: version.Info{Major:"1", Minor:"15+", GitVersion:"v1.15.4-1+d5ee6cddf7e896", GitCommit:"d5ee6cddf7e896fb8556cad24a610df657ecd824", GitTreeState:"clean", BuildDate:"2019-10-03T22:19:15Z", GoVersion:"go1.12.9", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"15+", GitVersion:"v1.15.4-1+d5ee6cddf7e896", GitCommit:"d5ee6cddf7e896fb8556cad24a610df657ecd824", GitTreeState:"clean", BuildDate:"2019-10-03T22:16:41Z", GoVersion:"go1.12.9", Compiler:"gc", Platform:"linux/amd64"}

  • Cloud provider or hardware configuration:
  • OS (e.g. from /etc/os-release):
    NAME="VMware Photon OS"
    VERSION="3.0"
    ID=photon
    VERSION_ID=3.0
    PRETTY_NAME="VMware Photon OS/Linux"
  • Kernel (e.g. uname -a):
    Linux 42332700031c0a2cfd7ef78ebc8af387 4.19.84-1.ph3-esx #1-photon SMP Tue Nov 19 00:39:50 UTC 2019 x86_64 GNU/Linux
  • Others:

This should be reproducible in any environment if our analysis is valid.

What happened?

We are setting up a 3-node (or multi-node) control plane. We understand that the etcd learner is coming, but we can't refactor our code to adopt it right now, and we really need to join the nodes concurrently, so we are wondering whether any workaround or suggestion can be offered.

We use kubeadm to bootstrap a 3-control-plane cluster, but we need to inject some customization along the way, so we call the join phases one by one instead of simply running kubeadm join.
This issue only happens when adding the 3rd member.

When we call this command on the 3rd node

kubeadm join phase control-plane-join etcd

Very rarely we observed that the generated etcd manifest (/etc/kubernetes/manifests/etcd.yaml) has an incorrect --initial-cluster value.

Assuming etcd-0 is the first member, etcd-1 the second, and etcd-2 the third, a correct --initial-cluster value for etcd-2 might look like this:

--initial-cluster=etcd-0=https://192.168.0.1:2380,etcd-1=https://192.168.0.2:2380,etcd-2=https://192.168.0.3:2380

However, in this rare case, we are getting something like this

--initial-cluster=etcd-0=https://192.168.0.1:2380,etcd-2=https://192.168.0.2:2380,etcd-2=https://192.168.0.3:2380

Basically, the name of etcd-1 was incorrectly configured as etcd-2. This incorrect manifest causes the etcd container to fail to start and complain:

etcdmain: error validating peerURLs {"ClusterID":"31c63dd3d7c3da6a","Members":[{"ID":"1b98ed58f9be3e7d","RaftAttributes":{"PeerURLs":["https://192.168.0.2:2380"]},"Attributes":{"Name":"etcd-1","ClientURLs":["https://192.168.0.2:2379"]}},{"ID":"6d631ff1c84da117","RaftAttributes":{"PeerURLs":["https://192.168.0.3:2380"]},"Attributes":{"Name":"","ClientURLs":[]}},{"ID":"f0c11b3401371571","RaftAttributes":{"PeerURLs":["https://192.168.0.1:2380"]},"Attributes":{"Name":"etcd-0","ClientURLs":["https://192.168.0.1:2379"]}}],"RemovedMemberIDs":[]}: member count is unequal\n","stream":"stderr","time":"2020-01-08T17:27:52.63704563Z"}

We think this error message appears because the manifest's --initial-cluster has only 2 unique names while the etcd cluster actually has 3 members.
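
Just to illustrate the mismatch (this is not etcd's actual validation code), parsing that bad flag value as name=peerURL pairs collapses the three entries into two unique names, while the cluster fetched from the peers has three members:

package main

import (
    "fmt"
    "strings"
)

func main() {
    // The incorrect flag value from the generated etcd.yaml above.
    initialCluster := "etcd-0=https://192.168.0.1:2380,etcd-2=https://192.168.0.2:2380,etcd-2=https://192.168.0.3:2380"

    // --initial-cluster is a comma-separated list of name=peerURL pairs;
    // duplicate names overwrite each other once keyed by name.
    entries := strings.Split(initialCluster, ",")
    byName := map[string]string{}
    for _, entry := range entries {
        parts := strings.SplitN(entry, "=", 2)
        byName[parts[0]] = parts[1]
    }

    fmt.Println("entries:", len(entries))     // 3
    fmt.Println("unique names:", len(byName)) // 2, hence "member count is unequal"
}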

We spent some time tracking through the code to see what the issue could be, and we have a theory.

  1. Calling kubeadm join phase control-plane-join etcd

  2. Then the above command calls this

  3. It then calls etcdClient.AddMember()

  4. In func (c *Client) AddMember(name string, peerAddrs string), name is the joining control-plane node's member name and peerAddrs is its peer URL.

Then in L290: resp, err = cli.MemberAdd(ctx, []string{peerAddrs})

This calls the real MemberAdd, which returns a []Member that includes the member currently being added.

So the response of this MemberAdd() contains all the previous members plus the current one.
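
For reference, a minimal sketch of that call using the v3 client directly (the endpoint and TLS wiring here are placeholders, not kubeadm's code; we assume the go.etcd.io/etcd/clientv3 package):

package main

import (
    "context"
    "fmt"
    "time"

    "go.etcd.io/etcd/clientv3"
)

func main() {
    // Placeholder endpoints and no TLS here; kubeadm builds the real client
    // from the cluster's etcd certificates.
    cli, err := clientv3.New(clientv3.Config{
        Endpoints:   []string{"https://192.168.0.1:2379"},
        DialTimeout: 5 * time.Second,
    })
    if err != nil {
        panic(err)
    }
    defer cli.Close()

    ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
    defer cancel()

    // Register the joining node (etcd-2) by peer URL only; no name is passed.
    resp, err := cli.MemberAdd(ctx, []string{"https://192.168.0.3:2380"})
    if err != nil {
        panic(err)
    }

    // resp.Members lists every member, including the one just added.
    // A member whose etcd process has not started yet reports Name == "".
    for _, m := range resp.Members {
        fmt.Printf("id=%x name=%q peerURLs=%v\n", m.ID, m.Name, m.PeerURLs)
    }
}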

Once AddMember() receives the response

for _, m := range resp.Members {
    // fixes the entry for the joining member (that doesn't have a name set in the initialCluster returned by etcd)
    if m.Name == "" {
        ret = append(ret, Member{Name: name, PeerURL: m.PeerURLs[0]})
    } else {
        ret = append(ret, Member{Name: m.Name, PeerURL: m.PeerURLs[0]})
    }
}

Here resp is the response from MemberAdd() as described above, and this section fills in the given name for any member that does not have a Name. The expectation is that the member currently being added is the only one without a Name: the loop walks resp.Members, finds the member whose Name is empty, and sets name as that member's Name.

But in this case resp.Members returned 3 members (because this happens on the 3rd member), and somehow 2 of them did not have a Name: the 2nd member had just joined the etcd cluster, but its etcd container was still coming up. At that point, etcdctl member list would return something like:

cat ../../../../commands/etcdctl_member-list.txt
1b98ed58f9be3e7d, started, etcd-0, https://20.20.0.37:2380, https://192.168.0.1:2379
6d631ff1c84da117, unstarted, , https://192.168.0.2:2380, <-- this is etcd-1, but not started yet so no Name

In this case, 2 out of 3 Members do not have a Name, so the loop above inserted the 3rd member's name (etcd-2) into both the 2nd and the 3rd Member.

We concluded that this issue only happens when the 3rd member runs MemberAdd while the 2nd member has not yet started, which makes it a race.
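
If the analysis is valid, one possible direction for a fix (our sketch, not kubeadm's current code) would be to match the joining member by the peer URL we just passed to MemberAdd instead of by an empty Name, so another unstarted member can never be mislabeled as the joining one:

for _, m := range resp.Members {
    // Only the member whose peer URL matches the one we just passed to
    // MemberAdd gets the supplied name; every other member keeps whatever
    // etcd reports, even if that is still empty because it has not started.
    if len(m.PeerURLs) > 0 && m.PeerURLs[0] == peerAddrs {
        ret = append(ret, Member{Name: name, PeerURL: m.PeerURLs[0]})
    } else {
        ret = append(ret, Member{Name: m.Name, PeerURL: m.PeerURLs[0]})
    }
}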

For this ticket, we want to understand:

  1. Is this analysis valid? Could this really happen?
  2. What could be the fix here? We can fix it in our private repo while waiting for the adoption of the etcd learner.
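
As a stopgap in our own wrapper, while waiting for the etcd learner work, we could also narrow the window by polling the member list until every existing member reports a non-empty name before calling AddMember. A hypothetical helper (waitForAllMembersStarted is our name, not a kubeadm API), assuming the same v3 client as above:

package main

import (
    "context"
    "time"

    "go.etcd.io/etcd/clientv3"
)

// waitForAllMembersStarted is a hypothetical pre-join guard: it blocks until
// every member returned by MemberList reports a non-empty Name (i.e. its etcd
// process has started), so a subsequent MemberAdd response contains at most
// one unnamed member, the node currently joining.
func waitForAllMembersStarted(ctx context.Context, cli *clientv3.Client) error {
    for {
        resp, err := cli.MemberList(ctx)
        if err != nil {
            return err
        }
        allStarted := true
        for _, m := range resp.Members {
            if m.Name == "" {
                allStarted = false
                break
            }
        }
        if allStarted {
            return nil
        }
        select {
        case <-ctx.Done():
            return ctx.Err()
        case <-time.After(2 * time.Second):
        }
    }
}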

What you expected to happen?

The generated etcd manifest should have the correct --initial-cluster value.

How to reproduce it (as minimally and precisely as possible)?

  1. Create a single-node k8s cluster
  2. Prepare 2 other nodes, but do not join them yet
  3. Join them concurrently

Note that this happens really rarely; the frequency is probably 1 in 1000.

Anything else we need to know?

Metadata

Labels

area/HA, area/etcd, kind/bug, kind/design, lifecycle/active, priority/important-longterm
