
Concurrent CP join race condition for etcd join #2005

Closed

@echu23

Description

What keywords did you search in kubeadm issues before filing this one?

etcd join race

I did find #1319 #2001 #1793

Is this a BUG REPORT or FEATURE REQUEST?

BUG REPORT

Versions

kubeadm version (use kubeadm version):

kubeadm version: &version.Info{Major:"1", Minor:"15+", GitVersion:"v1.15.4-1+d5ee6cddf7e896", GitCommit:"d5ee6cddf7e896fb8556cad24a610df657ecd824", GitTreeState:"clean", BuildDate:"2019-10-03T22:18:19Z", GoVersion:"go1.12.9", Compiler:"gc", Platform:"linux/amd64"}

Environment:

  • Kubernetes version (use kubectl version):

Client Version: version.Info{Major:"1", Minor:"15+", GitVersion:"v1.15.4-1+d5ee6cddf7e896", GitCommit:"d5ee6cddf7e896fb8556cad24a610df657ecd824", GitTreeState:"clean", BuildDate:"2019-10-03T22:19:15Z", GoVersion:"go1.12.9", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"15+", GitVersion:"v1.15.4-1+d5ee6cddf7e896", GitCommit:"d5ee6cddf7e896fb8556cad24a610df657ecd824", GitTreeState:"clean", BuildDate:"2019-10-03T22:16:41Z", GoVersion:"go1.12.9", Compiler:"gc", Platform:"linux/amd64"}

  • Cloud provider or hardware configuration:
  • OS (e.g. from /etc/os-release):
    NAME="VMware Photon OS"
    VERSION="3.0"
    ID=photon
    VERSION_ID=3.0
    PRETTY_NAME="VMware Photon OS/Linux"
  • Kernel (e.g. uname -a):
    Linux 42332700031c0a2cfd7ef78ebc8af387 4.19.84-1.ph3-esx #1-photon SMP Tue Nov 19 00:39:50 UTC 2019 x86_64 GNU/Linux
  • Others:

This should be reproducible in any environment if our analysis is valid.

What happened?

We are setting up a 3-node (or multi-node) control plane. We understand that the etcd learner is coming, but we can't refactor our code to adopt it right now, and we really need to join the nodes concurrently, so we are wondering whether any workaround or suggestion can be offered.

We use kubeadm to bootstrap a 3-control-plane cluster, but we need to inject some customization along the way, so we call the join phases one by one instead of simply running kubeadm join.
This issue only happens when adding the 3rd member.

When we call this command on the 3rd node

kubeadm join phase control-plane-join etcd

Very rarely we observed that the generated etcd manifest (/etc/kubernetes/manifests/etcd.yaml) has an incorrect --initial-cluster value.

Assuming etcd-0 is the first member, etcd-1 the second, and etcd-2 the third, a correct --initial-cluster value for etcd-2 might look like this:

--initial-cluster=etcd-0=https://192.168.0.1:2380,etcd-1=https://192.168.0.2:2380,etcd-2=https://192.168.0.3:2380

However, in this rare case, we are getting something like this

--initial-cluster=etcd-0=https://192.168.0.1:2380,etcd-2=https://192.168.0.2:2380,etcd-2=https://192.168.0.3:2380

Basically, the name of etcd-1 was incorrectly configured as etcd-2. This incorrect manifest causes the etcd container to fail to start and complain:

etcdmain: error validating peerURLs {"ClusterID":"31c63dd3d7c3da6a","Members":[{"ID":"1b98ed58f9be3e7d","RaftAttributes":{"PeerURLs":["https://192.168.0.2:2380"]},"Attributes":{"Name":"etcd-1","ClientURLs":["https://192.168.0.2:2379"]}},{"ID":"6d631ff1c84da117","RaftAttributes":{"PeerURLs":["https://192.168.0.3:2380"]},"Attributes":{"Name":"","ClientURLs":[]}},{"ID":"f0c11b3401371571","RaftAttributes":{"PeerURLs":["https://192.168.0.1:2380"]},"Attributes":{"Name":"etcd-0","ClientURLs":["https://192.168.0.1:2379"]}}],"RemovedMemberIDs":[]}: member count is unequal\n","stream":"stderr","time":"2020-01-08T17:27:52.63704563Z"}

We think this error message appears because the manifest's --initial-cluster has only 2 unique names while the etcd cluster actually has 3 members.
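
Just to illustrate the mismatch (this is not etcd's actual validation code), parsing that bad flag value as name=peerURL pairs collapses the three entries into two unique names, while the cluster fetched from the peers has three members:

package main

import (
    "fmt"
    "strings"
)

func main() {
    // The incorrect flag value from the generated etcd.yaml above.
    initialCluster := "etcd-0=https://192.168.0.1:2380,etcd-2=https://192.168.0.2:2380,etcd-2=https://192.168.0.3:2380"

    // --initial-cluster is a comma-separated list of name=peerURL pairs;
    // duplicate names overwrite each other once keyed by name.
    entries := strings.Split(initialCluster, ",")
    byName := map[string]string{}
    for _, entry := range entries {
        parts := strings.SplitN(entry, "=", 2)
        byName[parts[0]] = parts[1]
    }

    fmt.Println("entries:", len(entries))     // 3
    fmt.Println("unique names:", len(byName)) // 2, hence "member count is unequal"
}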

We spent some time tracking through the code to see what the issue could be, and we have a theory.

  1. Calling kubeadm join phase control-plane-join etcd

  2. Then the above command calls this

  3. It then calls etcdClient.AddMember()

  4. In func (c *Client) AddMember(name string, peerAddrs string), name is the joining control-plane node's member name and peerAddrs is its peer URL.

Then in L290: resp, err = cli.MemberAdd(ctx, []string{peerAddrs})

This calls the real MemberAdd, which returns a []Member that includes the member currently being added.

So the response of this MemberAdd() contains all the previous members plus the current one.
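
For reference, a minimal sketch of that call using the v3 client directly (the endpoint and TLS wiring here are placeholders, not kubeadm's code; we assume the go.etcd.io/etcd/clientv3 package):

package main

import (
    "context"
    "fmt"
    "time"

    "go.etcd.io/etcd/clientv3"
)

func main() {
    // Placeholder endpoints and no TLS here; kubeadm builds the real client
    // from the cluster's etcd certificates.
    cli, err := clientv3.New(clientv3.Config{
        Endpoints:   []string{"https://192.168.0.1:2379"},
        DialTimeout: 5 * time.Second,
    })
    if err != nil {
        panic(err)
    }
    defer cli.Close()

    ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
    defer cancel()

    // Register the joining node (etcd-2) by peer URL only; no name is passed.
    resp, err := cli.MemberAdd(ctx, []string{"https://192.168.0.3:2380"})
    if err != nil {
        panic(err)
    }

    // resp.Members lists every member, including the one just added.
    // A member whose etcd process has not started yet reports Name == "".
    for _, m := range resp.Members {
        fmt.Printf("id=%x name=%q peerURLs=%v\n", m.ID, m.Name, m.PeerURLs)
    }
}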

Once AddMember() receives the response

for _, m := range resp.Members {
    // fixes the entry for the joining member (that doesn't have a name set in the initialCluster returned by etcd)
    if m.Name == "" {
        ret = append(ret, Member{Name: name, PeerURL: m.PeerURLs[0]})
    } else {
        ret = append(ret, Member{Name: m.Name, PeerURL: m.PeerURLs[0]})
    }
}

Here resp is the response from MemberAdd() as described above, and this section fills in the given name for any member that does not have a Name. The expectation is that the member currently being added is the only one without a Name: the loop walks resp.Members, finds the member whose Name is empty, and sets name as that member's Name.

But in this case resp.Members returned 3 members (because this happens on the 3rd member), and somehow 2 of them did not have a Name: the 2nd member had just joined the etcd cluster, but its etcd container was still coming up. At that point, etcdctl member list would return something like:

cat ../../../../commands/etcdctl_member-list.txt
1b98ed58f9be3e7d, started, etcd-0, https://20.20.0.37:2380, https://192.168.0.1:2379
6d631ff1c84da117, unstarted, , https://192.168.0.2:2380, <-- this is etcd-1, but not started yet so no Name

In this case, 2 out of 3 Members do not have a Name, so the loop above inserted the 3rd member's name (etcd-2) into both the 2nd and the 3rd Member.

We concluded that this issue only happens when the 3rd member runs MemberAdd while the 2nd member has not yet started, which makes it a race.
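
If the analysis is valid, one possible direction for a fix (our sketch, not kubeadm's current code) would be to match the joining member by the peer URL we just passed to MemberAdd instead of by an empty Name, so another unstarted member can never be mislabeled as the joining one:

for _, m := range resp.Members {
    // Only the member whose peer URL matches the one we just passed to
    // MemberAdd gets the supplied name; every other member keeps whatever
    // etcd reports, even if that is still empty because it has not started.
    if len(m.PeerURLs) > 0 && m.PeerURLs[0] == peerAddrs {
        ret = append(ret, Member{Name: name, PeerURL: m.PeerURLs[0]})
    } else {
        ret = append(ret, Member{Name: m.Name, PeerURL: m.PeerURLs[0]})
    }
}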

For this ticket, we want to understand:

  1. Is this analysis valid? Could this really happen?
  2. What could be the fix here? We can fix it in our private repo while waiting for the adoption of the etcd learner.
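
As a stopgap in our own wrapper, while waiting for the etcd learner work, we could also narrow the window by polling the member list until every existing member reports a non-empty name before calling AddMember. A hypothetical helper (waitForAllMembersStarted is our name, not a kubeadm API), assuming the same v3 client as above:

package main

import (
    "context"
    "time"

    "go.etcd.io/etcd/clientv3"
)

// waitForAllMembersStarted is a hypothetical pre-join guard: it blocks until
// every member returned by MemberList reports a non-empty Name (i.e. its etcd
// process has started), so a subsequent MemberAdd response contains at most
// one unnamed member, the node currently joining.
func waitForAllMembersStarted(ctx context.Context, cli *clientv3.Client) error {
    for {
        resp, err := cli.MemberList(ctx)
        if err != nil {
            return err
        }
        allStarted := true
        for _, m := range resp.Members {
            if m.Name == "" {
                allStarted = false
                break
            }
        }
        if allStarted {
            return nil
        }
        select {
        case <-ctx.Done():
            return ctx.Err()
        case <-time.After(2 * time.Second):
        }
    }
}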

What you expected to happen?

The generated etcd manifest should have the correct --initial-cluster value.

How to reproduce it (as minimally and precisely as possible)?

  1. Create a single-node k8s cluster
  2. Prepare 2 other nodes, but do not join them yet
  3. Join them concurrently

Note that this happens really rarely; the frequency is probably 1 in 1000.

Anything else we need to know?

Metadata

Labels

area/HA, area/etcd, kind/bug, kind/design, lifecycle/active, priority/important-longterm
