
Additional phases and subphases in kubeadm init/join #2457

Closed
micahhausler opened this issue Apr 22, 2021 · 5 comments
Labels: area/phases, kind/feature, priority/awaiting-more-evidence

Comments

micahhausler (Member) commented Apr 22, 2021

Is this a BUG REPORT or FEATURE REQUEST?

FEATURE REQUEST
/kind feature
/area phases

We're prototyping support for joining Bottlerocket nodes in cluster-api clusters, and have encountered the need for finer-grained control over starting the Kubernetes control plane. (Bottlerocket is a free and open-source Linux-based operating system meant for hosting containers.)

Bottlerocket omits many components, such as a package manager, SSH, and a shell, and instead relies on user-specified containers ("host containers") for initial configuration. It does not support cloud-config, instead parsing TOML-based user data at boot for configuration. Configuration values, including kubelet settings, can be set or accessed via a local API.

Bottlerocket additionally doesn't allow writes to /etc by most processes (including host containers), but the Bottlerocket API supports updating parts of the kubelet settings file and the kubeconfig stored in /etc. Because /etc is read-only, we've had to override several paths that kubeadm writes into static pod manifests (pointing /etc/kubernetes/pki and /etc/kubernetes/{scheduler,controller-manager}.conf at something writable), along with some other hacky workarounds. We'd really like the bootstrap process to be cleaner and not rely on failures and timeouts, and we think adding a few more phases under kubeadm join would really help us out without creating anything provider-specific.

Here are roughly the current steps we have to take to create an initial Control Plane node (a condensed shell sketch follows the list):

  • Symlink /etc/kubernetes in a host container to the writable /var/lib/kubeadm on the host (/.bottlerocket/rootfs/var/lib/kubeadm in the host-container)
  • kubeadm init phase certs all
  • kubeadm init phase kubeconfig all
  • kubeadm init phase control-plane all
  • kubeadm init phase etcd local
  • Read static pod manifests from the container's /etc/kubernetes/manifests and use the Bottlerocket apiclient to write them to the host’s /etc/kubernetes/manifests
    apiclient set \
      "kubernetes.static-pods.${pod}.manifest"="$(base64 -w 0 /etc/kubernetes/manifests/${pod}.yaml)" \
      "kubernetes.static-pods.${pod}.enabled"=true    
    
  • Sleep until kube-apiserver is up (hacky)
  • kubeadm init phase bootstrap-token
  • kubeadm token create
  • Use the Bottlerocket apiclient to set the API endpoint, CA data, bootstrap token, etc. for the kubelet
    apiclient set \
      kubernetes.api-server="${API}" \
      kubernetes.cluster-certificate="${CA}" \
      kubernetes.cluster-dns-ip="${DNS}" \
      kubernetes.bootstrap-token="${TOKEN}" \
      kubernetes.authentication-mode="tls" \
      kubernetes.standalone-mode="false"
    
  • Sleep ~30s to give the kubelet time to join (hacky)
  • kubeadm init --skip-phases preflight,kubelet-start,certs,kubeconfig,bootstrap-token,control-plane,etcd
  • Use the Bottlerocket apiclient to set the cluster DNS IP for kubelet
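
Strung together, the flow looks roughly like the following shell sketch (pod names, sleep durations, and the ${API}/${CA}/${DNS} values are illustrative, not an exact copy of our scripts):

# Condensed sketch of the init flow above; paths and values are illustrative.
ln -s /.bottlerocket/rootfs/var/lib/kubeadm /etc/kubernetes

kubeadm init phase certs all
kubeadm init phase kubeconfig all
kubeadm init phase control-plane all
kubeadm init phase etcd local

# Hand the generated static pod manifests to the host via the Bottlerocket API.
for pod in kube-apiserver kube-controller-manager kube-scheduler etcd; do
  apiclient set \
    "kubernetes.static-pods.${pod}.manifest"="$(base64 -w 0 /etc/kubernetes/manifests/${pod}.yaml)" \
    "kubernetes.static-pods.${pod}.enabled"=true
done

sleep 120   # hacky: wait until kube-apiserver is up (duration illustrative)

kubeadm init phase bootstrap-token
TOKEN="$(kubeadm token create)"

# Point the host kubelet at the new control plane.
apiclient set \
  kubernetes.api-server="${API}" \
  kubernetes.cluster-certificate="${CA}" \
  kubernetes.cluster-dns-ip="${DNS}" \
  kubernetes.bootstrap-token="${TOKEN}" \
  kubernetes.authentication-mode="tls" \
  kubernetes.standalone-mode="false"

sleep 30    # hacky: give the kubelet time to join

kubeadm init --skip-phases preflight,kubelet-start,certs,kubeconfig,bootstrap-token,control-plane,etcd

# finally, (re)set the cluster DNS IP for the kubelet via apiclient
apiclient set kubernetes.cluster-dns-ip="${DNS}"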

Similarly, for joining additional Control Plane nodes we have to (a condensed sketch follows the list):

  • Symlink /etc/kubernetes in a host container to the writable /var/lib/kubeadm on the host (/.bottlerocket/rootfs/var/lib/kubeadm in the host-container)
  • timeout 10 kubeadm join --skip-phases preflight (which writes out keys, certs, static pod manifests, and kubelet config, but fails because static pods aren’t really written)
  • Use the Bottlerocket apiclient to set the API endpoint, CA data, bootstrap token, etc. for the kubelet
  • Sleep ~30s to give the kubelet time to join (hacky)
  • Use the Bottlerocket apiclient to write static pod manifests read from /var/lib/kubeadm/manifests
  • Sleep ~30s to let the static pods come up (hacky)
  • kubeadm join --skip-phases preflight,control-plane-prepare,kubelet-start to write out the etcd manifest, after which kubeadm will fail because the static pod hasn’t been written
  • Use the Bottlerocket apiclient to write etcd static pod manifest
  • Run kubeadm join --skip-phases preflight,control-plane-prepare,kubelet-start,control-plane-join/etcd and succeed
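
As a condensed sketch, with ${JOIN_ARGS} standing in for the usual join flags (again illustrative):

ln -s /.bottlerocket/rootfs/var/lib/kubeadm /etc/kubernetes

# Writes keys, certs, static pod manifests, and kubelet config, then fails
# because the static pods are never actually started; cut the wait short.
timeout 10 kubeadm join ${JOIN_ARGS} --skip-phases preflight || true

# Point the host kubelet at the control plane (API endpoint, CA data, token, ...).
apiclient set \
  kubernetes.api-server="${API}" \
  kubernetes.cluster-certificate="${CA}" \
  kubernetes.bootstrap-token="${TOKEN}"
sleep 30    # hacky: give the kubelet time to join

# Hand the control-plane static pod manifests to the host, same apiclient loop
# as in the init sketch but reading from /var/lib/kubeadm/manifests.
for pod in kube-apiserver kube-controller-manager kube-scheduler; do
  apiclient set \
    "kubernetes.static-pods.${pod}.manifest"="$(base64 -w 0 /var/lib/kubeadm/manifests/${pod}.yaml)" \
    "kubernetes.static-pods.${pod}.enabled"=true
done
sleep 30    # hacky: let the static pods come up

# Writes the etcd manifest, then fails because the etcd static pod isn't started.
kubeadm join ${JOIN_ARGS} --skip-phases preflight,control-plane-prepare,kubelet-start || true
apiclient set \
  "kubernetes.static-pods.etcd.manifest"="$(base64 -w 0 /var/lib/kubeadm/manifests/etcd.yaml)" \
  "kubernetes.static-pods.etcd.enabled"=true

# Run the remaining phases and succeed.
kubeadm join ${JOIN_ARGS} --skip-phases preflight,control-plane-prepare,kubelet-start,control-plane-join/etcd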

And then for a worker node join we have to (sketch after the list):

  • Symlink /etc/kubernetes in a host container to the writable /var/lib/kubeadm on the host (/.bottlerocket/rootfs/var/lib/kubeadm in the host-container)
  • timeout 10 kubeadm join --skip-phases preflight, which writes out keys and the kubelet config to a writable path but intentionally fails because the host kubelet's kubeconfig isn't actually updated.
  • Use the Bottlerocket apiclient to set the kubelet kubeconfig and config file options
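
Condensed, that looks roughly like (values illustrative):

ln -s /.bottlerocket/rootfs/var/lib/kubeadm /etc/kubernetes

# Writes keys and the kubelet config to the writable path, then intentionally
# fails because the host kubelet's kubeconfig is never updated.
timeout 10 kubeadm join ${JOIN_ARGS} --skip-phases preflight || true

# Point the host kubelet at the cluster via the Bottlerocket API.
apiclient set \
  kubernetes.api-server="${API}" \
  kubernetes.cluster-certificate="${CA}" \
  kubernetes.bootstrap-token="${TOKEN}" \
  kubernetes.cluster-dns-ip="${DNS}"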

We think the following additional phases and subphases would be really useful when the host's /etc isn't writable by kubeadm.

kubeadm init

In kubeadm init:

  • a discrete subphase to wait on the kube-apiserver to start
  • a discrete subphase to wait on the kubelet to join the cluster

kubeadm join (control plane)

  • Additional subphases to kubeadm join phase kubelet-start for
    • Write kubelet config
    • Write kubelet certs
    • Start kubelet
    • Wait for ready kubelet
  • Additional subphases for kubeadm join phase control-plane-join/etcd
    • Write etcd certs
    • Write etcd configuration
    • Wait for ready etcd

kubeadm join (kubelet only)

Covered by the request above for separate config/certs/start/wait phases (see the hypothetical sketch below).
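
To make the request concrete, here is a hypothetical sketch of how the proposed (sub)phases might be invoked; none of these phase names exist today and the naming is purely illustrative:

# kubeadm init: discrete wait subphases (hypothetical names)
kubeadm init phase wait-control-plane apiserver
kubeadm init phase wait-control-plane kubelet

# kubeadm join (control plane): kubelet-start split into subphases (hypothetical)
kubeadm join phase kubelet-start write-config
kubeadm join phase kubelet-start write-certs
kubeadm join phase kubelet-start start
kubeadm join phase kubelet-start wait

# kubeadm join (control plane): control-plane-join/etcd split into subphases (hypothetical)
kubeadm join phase control-plane-join etcd-certs
kubeadm join phase control-plane-join etcd-manifest
kubeadm join phase control-plane-join etcd-wait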

Versions

kubeadm version (use kubeadm version): All
Environment:

  • Kubernetes version (use kubectl version): All
  • Cloud provider or hardware configuration: N/A
  • OS (e.g. from /etc/os-release): Bottlerocket
  • Kernel (e.g. uname -a): 5.4.105
  • Others:

Anything else we need to know?

cc @bcressey @jaxesn @vignesh-goutham

k8s-ci-robot added the kind/feature and area/phases labels on Apr 22, 2021
neolit123 (Member) commented

hi @micahhausler
context for others - we briefly discussed this problem during the SIG Cluster Lifecycle meeting last week.

i must point out that i have zero experience with Bottlerocket.
trying to understand more about the problems at hand, it feels like usage of the Bottlerocket apiclient is the main reason for breaking the kubeadm init process into phases and skipping some of them, is that correct?
can you instead create an out-of-band mechanism (something like a state machine) that watches for disk/cluster state and processes bottlerocket apiclient data based on what kubeadm prepares, in parallel?
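
for example, something roughly like this running next to a plain kubeadm init (just a sketch of the idea; the config path and the apiclient keys are taken from your snippets above):

# rough sketch: run kubeadm in the background and push what it writes to
# the host via the bottlerocket apiclient as the files appear on disk
kubeadm init --config /var/lib/kubeadm/kubeadm.yaml &

until [ -f /var/lib/kubeadm/manifests/kube-apiserver.yaml ]; do sleep 1; done
apiclient set \
  "kubernetes.static-pods.kube-apiserver.manifest"="$(base64 -w 0 /var/lib/kubeadm/manifests/kube-apiserver.yaml)" \
  "kubernetes.static-pods.kube-apiserver.enabled"=true
# ...likewise for the other manifests and the kubelet settings...

wait    # kubeadm's own wait-for-control-plane logic then finishes the init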

the kubeadm phases are not perfect and we do see occasional requests for changes. this request is by far the largest and most elaborate one to date, yet we must also weigh how many other users would see this change as beneficial and how many users it would break.

as a rule of thumb for phases the general advice is to do this:

# do something out of band like "foo"
kubeadm init --skip-phases foo

and not break the whole init/join commands into all their phases:

kubeadm init phase a ... 
# do something out of band "b"
kubeadm init phase c ... 

this is difficult to support for both kubeadm users and maintainers, because all changes become breaking once you expose all the implementation details of the software.

[1] as an example, if we wish to separate the kubeadm join phase kubelet-start phase into sub-phases, kubelet-start would no longer be able to execute all the sub-phases (they are Cobra sub-commands essentially) and this means we'd need a new all sub-phase / sub-command. this is a breaking change to existing users that previously called kubelet-start on demand.
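
to illustrate with kubelet-start (the sub-phase names here are hypothetical):

# today: kubelet-start is a leaf phase and does everything it owns
kubeadm join phase kubelet-start

# after splitting it into sub-phases (cobra sub-commands), the old call would
# have to become something like:
kubeadm join phase kubelet-start all
kubeadm join phase kubelet-start write-config
# while scripts that call "kubeadm join phase kubelet-start" directly break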

i can comment on your requests, but also please respond to my state machine idea above.
cc @fabriziopandini @SataQiu @pacoxu for comments too.

In kubeadm init:
a discrete subphase to wait on the kube-apiserver to start
a discrete subphase to wait on the kubelet to join the cluster

kubeadm init actually has a hidden phase to wait for the kubelet and kube-apiserver to report 200 at /healthz:
https://github.com/kubernetes/kubernetes/blob/master/cmd/kubeadm/app/cmd/init.go#L185
i don't recall why it was hidden...it waits for kube-apiserver and kubelet in parallel, so it's not trivial to split it into 2 subsequent phases.

Additional subphases to kubeadm join phase kubelet-start for
Write kubelet config
Write kubelet certs
Start kubelet
Wait for ready kubelet

complexity explained in [1]
i don't recall why we did not expose these sub-phases in the original design, but i recall it had something to do with the maintenance burden of these kubelet implementation details.

Additional subphases for kubeadm join phase control-plane-join/etcd
Write etcd certs
Write etcd configuration
Wait for ready etcd

similar breaking change to [1].

neolit123 added the priority/awaiting-more-evidence label on Apr 26, 2021
neolit123 added this to the v1.22 milestone on Apr 26, 2021
micahhausler (Member, Author) commented

it feels like usage of the Bottlerocket apiclient is the main reason for breaking the kubeadm init process into phases and skipping some of them, is that correct?

👍 You got it.

can you instead create an out-of-band mechanism (something like a state machine) that watches for disk/cluster state and processes bottlerocket apiclient data based on what kubeadm prepares, in parallel?

I think this is doable for most of the waiting steps (wait for the kubelet to come online, wait for kube-apiserver to come online).

the kubeadm phases are not perfect and we do see occasional requests for changes. this request is by far the largest and most elaborate one to date, yet we must also weigh how many other users would see this change as beneficial and how many users it would break.

this is difficult to support for both kubeadm users and maintainers, because all changes become breaking once you expose all the implementation details of the software.

This all makes sense. I think the most painful parts for us are when we have to wait for kubeadm join to time out when we know it will fail. Right now we're running timeout 10 kubeadm join ... and counting on everything we need happening within 10 seconds, because we don't want kubeadm to wait 5 minutes for the kubelet, 4 minutes for the API server, or 40 seconds for etcd.

If we know that kubelet/kube-apiserver/etcd will fail, could we either pass a custom timeout for some of those? Or just a fast-fail option?

neolit123 (Member) commented Apr 28, 2021

This all makes sense. I think the most painful parts for us are when we have to wait for kubeadm join to time out when we know it will fail. Right now we're running timeout 10 kubeadm join ... and counting on everything we need happening within 10 seconds, because we don't want kubeadm to wait 5 minutes for the kubelet, 4 minutes for the API server, or 40 seconds for etcd.

i think timeout N kubeadm join is actually an OK solution and we do something similar in e2e tests.
but a timeout of 10 seconds is probably a bit too short for control-plane nodes; it should probably be >1 minute.
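
e.g. (durations are just an example):

# give join enough time to write everything it can, without waiting out the
# full kubelet / etcd timeouts
timeout 120 kubeadm join ... --skip-phases preflight
if [ $? -eq 124 ]; then
  echo "kubeadm join was cut short by the timeout (expected)"
fi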

If we know that kubelet/kube-apiserver/etcd will fail, could we either pass a custom timeout for some of those? Or just a fast-fail option?

one timeout that can be controlled is ClusterConfiguration.apiServer.timeoutForControlPlane (defaults to 4 minutes), which is the timeout for status 200 on the api-server /healthz.
https://pkg.go.dev/k8s.io/kubernetes/cmd/kubeadm/app/apis/kubeadm/v1beta2
another configurable timeout is for discovery:
https://pkg.go.dev/k8s.io/kubernetes/cmd/kubeadm/app/apis/kubeadm/v1beta2#Discovery
(cluster-info ConfigMap fetch / validation)
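
for example, with v1beta2 (the field names are real, the values and endpoints are illustrative), passed via kubeadm init/join --config:

# init-config.yaml: shorten the api-server /healthz wait (default 4m0s)
apiVersion: kubeadm.k8s.io/v1beta2
kind: ClusterConfiguration
apiServer:
  timeoutForControlPlane: 1m0s

# join-config.yaml: shorten the discovery (cluster-info) wait (default 5m0s)
apiVersion: kubeadm.k8s.io/v1beta2
kind: JoinConfiguration
discovery:
  timeout: 1m0s
  bootstrapToken:
    token: "abcdef.0123456789abcdef"
    apiServerEndpoint: "10.0.0.10:6443"
    unsafeSkipCAVerification: true

kubeadm init --config init-config.yaml
kubeadm join --config join-config.yaml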

but the kubelet TLS bootstrap and etcd timeouts are not configurable.
i think at some point i proposed that we should expose more of these to be configurable, but if i recall correctly my idea was overruled by consensus and the argument that most users will not need this. could be wrong about the history here...

we are currently working on v1beta3 and if you'd like we could log a separate issue with a proposal for an API change to control etcd / kubelet tls timeouts, but this would mean these controls would only be possible for kubeadm >1.22. there are already some pending changes here with established priority:
#1796
v1beta2 is locked for changes at this point.

micahhausler (Member, Author) commented

but the kubelet TLS bootstrap and etcd timeouts are not configurable.
i think at some point i proposed that we should expose more of these to be configurable, but if i recall correctly my idea was overruled by consensus and the argument that most users will not need this. could be wrong about the history here...

Additional user(s) reporting for duty with a real world use case 😄

we are currently working on v1beta3 and if you'd like we could log a separate issue with a proposal for an API change to control etcd / kubelet tls timeouts, but this would mean these controls would only be possible for kubeadm >1.22.

Will do.

/close

k8s-ci-robot (Contributor) commented

@micahhausler: Closing this issue.

