Race condition when joining nodes #1319
Labels
kind/bug
Categorizes issue or PR as related to a bug.
lifecycle/active
Indicates that an issue or PR is actively being worked on by a contributor.
priority/important-soon
Must be staffed and worked on either currently, or very soon, ideally in time for the next release.
Is this a BUG REPORT or FEATURE REQUEST?
BUG REPORT
Versions
kubeadm version (use `kubeadm version`):
Previous versions might also be affected.
Environment:
Kubernetes version (use `kubectl version`):
What happened?
Upon joining a node to an existing cluster, I got an error:
The node does join the cluster, because the kubelet performs the TLS bootstrap correctly. However, the `kubeadm join` process does not continue, so the annotation for the CRI socket is missing, along with any further actions that would run after this check (for a node that is not a control plane there are none, as `BootstrapKubelet()` is the last task to be performed).
What you expected to happen?
kubeadm finishes the join process successfully, and the annotation for the CRI socket is present on the node.
How to reproduce it (as minimally and precisely as possible)?
Create a master with `kubeadm init`, and then join a new machine using the `kubeadm join` command provided in the output of the init process.
This is a race condition, and happens only sometimes.
Anything else we need to know?
I am working on this issue at the moment, and I will prepare a PR very soon.
The rationale behind the problem is the following:
`ClientSetFromFile()`.
How we get into this:
Right before we call this function (`ClientSetFromFile()`), we poll for the kubelet kubeconfig file to be present on disk. `waitForTLSBootstrappedClient()` only polls for the kubelet's kubeconfig file.
When the polling is done (the kubeconfig file exists on disk), we continue with the join process and call `ClientSetFromFile()`. Further down the call chain, this function calls `ToClientSet()`, which in turn calls `ClientConfig()`.
`ClientConfig()` then performs some checks and calls `ConfirmUsable()`.
`ConfirmUsable()` calls `validateAuthInfo()`, which returns validation errors if the client certificate or key cannot be found on disk.
So if the kubeconfig file already exists but the certificate or key it references has not been written yet, the join fails at this point.
I see basically three options:
1. We wait in `join.go` for the explicit kubelet client certificate and key too (hardcoded paths in kubeadm), in addition to the kubeconfig.
2. We unmarshal the kubeconfig file once we know it is available and manually perform the check on the certificate and key.
   `client-go` (like `getAuthInfo()`).
3. Include `kubeconfigutil.ClientSetFromFile()` in the polling (inside `waitForTLSBootstrappedClient()`), after the kubeconfig is available. This would ensure that either: a) we time out here if the kubelet could not correctly do the TLS bootstrap, or b) the next step in the join process will succeed.
What do you think? Please confirm which solution you'd prefer and I will open a PR.