Race condition when joining nodes #1319

Closed
ereslibre opened this issue Dec 13, 2018 · 1 comment · Fixed by kubernetes/kubernetes#72030
Labels
kind/bug: Categorizes issue or PR as related to a bug.
lifecycle/active: Indicates that an issue or PR is actively being worked on by a contributor.
priority/important-soon: Must be staffed and worked on either currently, or very soon, ideally in time for the next release.

Comments

@ereslibre
Contributor

ereslibre commented Dec 13, 2018

Is this a BUG REPORT or FEATURE REQUEST?

BUG REPORT

Versions

kubeadm version (use kubeadm version):

kubeadm version: &version.Info{Major:"1", Minor:"14+", GitVersion:"v1.14.0-alpha.0.974+f89aa180476882", GitCommit:"f89aa180476882e3d53883773f9e6988c32bb9ed", GitTreeState:"clean", BuildDate:"2018-12-12T08:46:14Z", GoVersion:"go1.11.2", Compiler:"gc", Platform:"linux/amd64"}

Previous versions might also be affected.

Environment:

  • Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"14+", GitVersion:"v1.14.0-alpha.0.974+f89aa180476882", GitCommit:"f89aa180476882e3d53883773f9e6988c32bb9ed", GitTreeState:"clean", BuildDate:"2018-12-12T08:46:14Z", GoVersion:"go1.11.2", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"14+", GitVersion:"v1.14.0-alpha.0.974+f89aa180476882", GitCommit:"f89aa180476882e3d53883773f9e6988c32bb9ed", GitTreeState:"clean", BuildDate:"2018-12-12T08:46:42Z", GoVersion:"go1.11.2", Compiler:"gc", Platform:"linux/amd64"}

What happened?

Upon joining a node to an existing cluster, I got an error:

    kubernetes_worker: [preflight] Running pre-flight checks
    kubernetes_worker: [discovery] Trying to connect to API Server "172.28.128.3:6443"
    kubernetes_worker: [discovery] Created cluster-info discovery client, requesting info from "https://172.28.128.3:6443"
    kubernetes_worker: [discovery] Cluster info signature and contents are valid and no TLS pinning was specified, will use API Server "172.28.128.3:6443"
    kubernetes_worker: [discovery] Successfully established connection with API Server "172.28.128.3:6443"
    kubernetes_worker: [join] Reading configuration from the cluster...
    kubernetes_worker: [join] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -oyaml'
    kubernetes_worker: [kubelet] Downloading configuration for the kubelet from the "kubelet-config-1.14" ConfigMap in the kube-system namespace
    kubernetes_worker: [kubelet-start] Writing kubelet configuration to file "/var/lib/kubelet/config.yaml"
    kubernetes_worker: [kubelet-start] Writing kubelet environment file with flags to file "/var/lib/kubelet/kubeadm-flags.env"
    kubernetes_worker: [kubelet-start] Activating the kubelet service
    kubernetes_worker: [tlsbootstrap] Waiting for the kubelet to perform the TLS Bootstrap...
    kubernetes_worker: failed to create API client configuration from kubeconfig: invalid configuration: [unable to read client-cert /var/lib/kubelet/pki/kubelet-client-current.pem for default-auth due to open /var/lib/kubelet/pki/kubelet-client-current.pem: no such file or directory, unable to read client-key /var/lib/kubelet/pki/kubelet-client-current.pem for default-auth due to open /var/lib/kubelet/pki/kubelet-client-current.pem: no such file or directory]

The node actually joins the cluster, because the kubelet properly performs the TLS bootstrap, but the kubeadm join process does not continue: the CRI socket annotation is missing from the node, and so is anything else that would be performed after this check (if it's not a control plane there are no further actions, since BootstrapKubelet() is the last task to be performed).
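
For context on where that error text comes from: it is produced by client-go's kubeconfig validation, not by kubeadm itself. Below is a minimal sketch of how it can be reproduced once kubelet.conf exists but the client certificate it references does not (the path is taken from the log above; this is my illustration, not kubeadm code):

    package main

    import (
        "fmt"

        "k8s.io/client-go/tools/clientcmd"
    )

    func main() {
        // Path taken from the log above: the kubeconfig used after the TLS bootstrap,
        // which references /var/lib/kubelet/pki/kubelet-client-current.pem.
        const kubeletKubeConfig = "/etc/kubernetes/kubelet.conf"

        // client-go validates the auth info while building the client config; if the
        // referenced client certificate/key files are not on disk yet, this returns the
        // "invalid configuration: [unable to read client-cert ...]" error seen above.
        if _, err := clientcmd.BuildConfigFromFlags("", kubeletKubeConfig); err != nil {
            fmt.Println("kubeconfig not usable yet:", err)
        }
    }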

What you expected to happen?

Kubeadm finishes the join process successfully, and the CRI socket annotation is present on the node.

How to reproduce it (as minimally and precisely as possible)?

Create a master with kubeadm init, then join a new machine using the kubeadm join command printed by the init process.

This is a race condition, so it only happens sometimes.

Anything else we need to know?

I am working on this issue at the moment, and I will prepare a PR very soon.

The rationale behind the problem is the following:

How we get into this: during kubeadm join, waitForTLSBootstrappedClient() only waits for the kubelet's kubeconfig file (/etc/kubernetes/kubelet.conf) to exist on disk; the clientset is then created from that kubeconfig in a separate, non-retried step. The kubelet writes this kubeconfig pointing at /var/lib/kubelet/pki/kubelet-client-current.pem, so there is a small window in which the kubeconfig already exists while the client certificate and key have not been written yet. If kubeadm builds the clientset inside that window, it fails with the error shown above.
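
To make the ordering concrete, here is a simplified sketch of the current flow as I understand it (the function name bootstrappedClientSet(), the hardcoded path, and the retry interval/timeout are my approximations, not the actual kubeadm source): the wait only checks that the kubeconfig file exists, and the clientset is built afterwards without retries.

    package join

    import (
        "os"
        "time"

        "k8s.io/apimachinery/pkg/util/wait"
        clientset "k8s.io/client-go/kubernetes"
        kubeconfigutil "k8s.io/kubernetes/cmd/kubeadm/app/util/kubeconfig"
    )

    const kubeletKubeConfig = "/etc/kubernetes/kubelet.conf"

    func bootstrappedClientSet() (*clientset.Clientset, error) {
        // Step 1: wait only for the kubeconfig file to show up on disk.
        if err := wait.PollImmediate(5*time.Second, 4*time.Minute, func() (bool, error) {
            _, err := os.Stat(kubeletKubeConfig)
            return err == nil, nil
        }); err != nil {
            return nil, err
        }

        // Step 2: a single, non-retried attempt to build the clientset. If this runs
        // in the window where kubelet.conf exists but kubelet-client-current.pem does
        // not, it fails with the "unable to read client-cert" error above, even though
        // the kubelet finishes the TLS bootstrap moments later.
        return kubeconfigutil.ClientSetFromFile(kubeletKubeConfig)
    }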

I see basically three options:

  1. We wait in join.go for the explicit kubelet client certificate and key files too (hardcoded paths in kubeadm), in addition to the kubeconfig.

    1. I don't like this solution because it's fragile: if the kubelet changes the location of those files, we have to change it as well.
  2. We unmarshal the kubeconfig file once we know it's available and manually perform the check for the certificate and key.

    1. I don't like this one much either, because we would need to replicate logic that is already available in client-go (like getAuthInfo()).
    2. I didn't find a straightforward way of asking: give me the credentials (if any) for this kubeconfig, without also performing the validation.
  3. Include kubeconfigutil.ClientSetFromFile() in the polling (inside waitForTLSBootstrappedClient()), after the kubeconfig is available. This would ensure that either: a) we time out here if the kubelet could not correctly complete the TLS bootstrap, or b) the next step in the join process will succeed.

    1. This is the one I prefer, because it doesn't make us replicate any logic: we keep retrying just as we already do for the kubeconfig file, but also ensure that we can build a clientset from it (see the sketch after this list).
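
A minimal sketch of option 3, assuming the current shape of waitForTLSBootstrappedClient() (the interval/timeout values and the hardcoded path are placeholders; kubeconfigutil.ClientSetFromFile() is the existing helper mentioned above):

    package join

    import (
        "fmt"
        "time"

        "k8s.io/apimachinery/pkg/util/wait"
        kubeconfigutil "k8s.io/kubernetes/cmd/kubeadm/app/util/kubeconfig"
    )

    // Retry ClientSetFromFile() inside the polling loop, so the wait only succeeds
    // once the kubeconfig *and* the certificate/key files it references are usable.
    func waitForTLSBootstrappedClient() error {
        fmt.Println("[tlsbootstrap] Waiting for the kubelet to perform the TLS Bootstrap...")

        kubeletKubeConfig := "/etc/kubernetes/kubelet.conf"
        return wait.PollImmediate(5*time.Second, 4*time.Minute, func() (bool, error) {
            // Building a clientset goes through client-go's kubeconfig validation,
            // which also requires the referenced client certificate and key to exist,
            // so we keep polling until the whole configuration is actually usable.
            _, err := kubeconfigutil.ClientSetFromFile(kubeletKubeConfig)
            return err == nil, nil
        })
    }

This keeps the existing TLS bootstrap timeout semantics; it only tightens what the wait considers a successful bootstrap.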

What do you think? Please confirm which solution you'd prefer and I will open a PR.

@rosti

rosti commented Dec 13, 2018

Thank you for reporting and working on this, @ereslibre!

/kind bug
/priority important-soon
/lifecycle active

@k8s-ci-robot added the kind/bug, lifecycle/active, and priority/important-soon labels on Dec 13, 2018