Proposal: containerized mount utilities in pods #589
Conversation
### Controller
* There will be new parameters to kube-controller-manager and kubelet:
  * `--experimental-mount-namespace`, which specifies a dedicated namespace where all pods with mount utilities reside.
  * `--experimental-mount-plugins`, which contains a comma-separated list of all volume plugins that should run their utilities in pods instead of on the host. The list must also include all flex volume drivers. Without this option, controllers and kubelet would not know whether a plugin should use a pod with mount utilities or the host directly, especially on startup when the daemon set may not yet be fully deployed on all nodes.
* If so, it finds a pod in the dedicated namespace with label `mount.kubernetes.io/foo=true` (or `mount.kubernetes.io/flexvolume/foo=true` for flex volumes) and calls the volume plugin with `VolumeExec` pointing to the pod. All utilities that are executed by the volume plugin for attach/detach/provision/delete are executed in the pod as `kubectl exec <pod> xxx` (of course, we'll use the clientset interface instead of executing `kubectl`).
I admit I don't like this `experimental-mount-plugins` cmdline option. Can anyone find a bulletproof way for kubelet / a controller to reliably determine whether a volume plugin should execute its utilities on the host or in a pod? Especially when kubelet starts in a fresh cluster, the pod with mount utilities may not be running yet and kubelet must know whether it should wait for it or not. Kubelet must not try to run the utilities on the host, because there may be a wrong version or wrong configuration of the utilities there.
## User story
Admin wants to run Kubernetes on a distro that does not ship `mount.glusterfs`, which is needed for GlusterFS volumes.
1. Admin installs Kubernetes in any way.
2. Admin modifies the controller-manager and kubelet command lines to include `--experimental-mount-namespace=foo` and `--experimental-mount-plugins=kubernetes.io/glusterfs` so Kubernetes knows which volume plugins should use utilities in pods and in which namespace to find these pods. This should probably be part of the installation in the future.
Why not leave discovery of the appropriate mount plugins as a vendor-specific requirement? Kubelet execs a script or binary that knows which container or service to talk to for each type of storage.
You then need to deploy the script to all nodes and masters, and that's exactly what I'd like to avoid. Otherwise I could deploy the mount utilities directly there, right? I see GCI, Atomic Host and CoreOS as mostly immutable images with some configuration in /etc that just starts Kubernetes with the right options (and even that is complicated enough!).
All these container-optimized distros do have some writable stateful partitions. That would be necessary for other parts of the system like CNI. How does this align with CSI?
CSI does not dictate any specific way its drivers will run. @saad-ali expects they will run as a daemonset.
With the CNI approach (a script in /opt/cni/bin), we would need a way to deploy it on a master. This is OK for OpenShift, but would it be fine for GKE, where users do not have access to the masters and so can't install an attach/detach/provision/delete script for their favorite storage? And how would it talk to Kubernetes to find the right pod in which to do the attaching/provisioning?
Why does GKE need to support this on the masters? User pods will not be scheduled to the masters, so they would not need to have the binaries installed.
The PV controller on the master needs a way to execute Ceph utilities when creating a volume, and the attach/detach controller uses the same utilities to attach/detach the volume. Now it uses plain exec. When we move the Ceph utilities off the master, we need to tell controller-manager where the utilities are.
* User creates a pod that uses a GlusterFS volume. Kubelet finds a sidecar template for gluster, injects it into the pod and runs it before any mount operation. It then uses `docker exec mount <what> <where>` to mount Gluster volumes for the pod. After that, it starts init containers and the "real" pod containers.
* User deletes the pod. Kubelet kills all "real" containers in the pod and uses the sidecar container to unmount gluster volumes. Finally, it kills the sidecar container.
-> User does not need to configure anything and sees the pod Running as usual.
Have you considered packaging the mount scripts into the infra container? Kubelet would then have to exec into the infra container to mount a volume. This alters the pod lifecycle in the kubelet, where volumes are now set up prior to starting a pod.
The advantage is that all storage-related processes belonging to a pod are contained within the pod's boundary and their lifecycle is tied to that of the pod.
- rkt does not use an infrastructure container; it "holds" the network namespace in another way.
- Using long-running pods better reflects CSI, as it will run one long-running process on each node. @saad-ali, can you confirm?
I will add a note about it to the proposal.
Note added. To be honest, the infrastructure container would look compelling to me if we did not want to mimic the long-running processes of CSI.
The infra container is an implementation detail of the docker integration; I'd not recommend using it. In fact, CRI in its current state would not allow you to exec into an "infra" container.
Note added (and thanks for spotting this)
## Implementation notes
Flex volumes won't be changed in the alpha implementation of this PR. Flex volumes will still need their utilities (and binaries in /usr/libexec/kubernetes) on all hosts.
Is there some reason for this flex volume note?
As mentioned above, we are hoping that flex utils will eventually be moved to pods as well, with label `mount.kubernetes.io/flexvolume/foo=true`, but we are not considering that as part of the alpha implementation.
Disadvantages:
* One container for all mount utilities. Admin needs to make a single container that holds utilities for e.g. both gluster and nfs and whatnot.
* Needs some refactoring in kubelet - now kubelet mounts everything and then starts containers. We would need kubelet to start some container(s) first, then mount, then run the rest. This is probably possible, but needs better analysis (and I got lost in kubelet...)
The sidecar container approach above also requires about the same level of kubelet refactoring. Might want to add it to the "drawbacks" of side-car too.
Added to drawbacks.
Disadvantages:
* One container for all mount utilities. Admin needs to make a single container that holds utilities for e.g. both gluster and nfs and whatnot.
* Needs some refactoring in kubelet - now kubelet mounts everything and then starts containers. We would need kubelet to start some container(s) first, then mount, then run the rest. This is probably possible, but needs better analysis (and I got lost in kubelet...)
* Short-living processes instead of long-running ones that would mimic CSI.
What's the advantage of the long-running processes?
The advantage is that it mimics our current design of CSI and we can catch bugs or even discover that it's not ideal before CSI is standardized.
### Infrastructure containers
Mount utilities could also be part of the infrastructure container that holds the network namespace (when using Docker). Today it's typically a simple `pause` container that does not do anything; it could hold mount utilities too.
As mentioned above, this'd only work for the legacy, pre-CRI docker integration.
## User story
Admin wants to run Kubernetes on a distro that does not ship `mount.glusterfs`, which is needed for GlusterFS volumes.
1. Admin installs Kubernetes in any way.
2. Admin modifies the controller-manager and kubelet command lines to include `--experimental-mount-namespace=foo` and `--experimental-mount-plugins=kubernetes.io/glusterfs` so Kubernetes knows which volume plugins should use utilities in pods and in which namespace to find these pods. This should probably be part of the installation in the future.
I think the discovery process is far from ideal. One would need to enumerate a list of plugins via a kubelet flag (which is static) before kubelet starts and before the (dynamic) DaemonSet pods are created. Any change to the plugin list will require restarting kubelet. Can we try finding other discovery methods?
I agree that the discovery is not ideal. What are the other options? AFAIK there is no magic way to configure kubelet dynamically. Is it possible to have a config object somewhere where all kubelets and controller-manager would reliably get the list of volume plugins that are supposed to be containerized?
The list is needed only at startup, when kubelet gets its first pods from the scheduler - a pod that uses e.g. a gluster volume may be scheduled before the daemon set for gluster is started or before the daemon set controller spawns a pod on the node. With the list, kubelet knows that it should wait. Without the list, it blindly tries to mount the Gluster volume on the host, which is likely to fail with something as ugly as `wrong fs type, bad option, bad superblock on 192.168.0.1:/foo, missing codepage or helper program, or other error`. mount stderr and exit codes are not helpful at all here.
When all daemon sets are up and running, we don't need `--experimental-mount-plugins` at all and dynamic discovery works.
I removed `--experimental-mount-plugins` for now, but it will behave exactly as I described in the previous comment - weird errors may appear in pod events during kubelet startup before a pod with mount utilities is scheduled and started.
I updated the proposal with the current development:
* `--experimental-mount-namespace`, which specifies a dedicated namespace where all pods with mount utilities reside. It would default to `kube-mount`.
* Whenever the PV or attach/detach controller needs to call a volume plugin, it looks for *any* running pod in the specified namespace with label `mount.kubernetes.io/foo=true` (or `mount.kubernetes.io/flexvolume/foo=true` for flex volumes) and calls the volume plugin so that all mount utilities are executed as `kubectl exec <pod> xxx` (of course, we'll use the clientset interface instead of executing `kubectl`).
* If such a pod does not exist, it executes the mount utilities on the host as usual.
* During alpha, no controller-manager changes will be done. That means that the Ceph RBD provisioner will still require `/usr/bin/rbd` installed on the master. All other volume plugins will work without any problem, as they don't execute any utility when attaching/detaching/provisioning/deleting a volume.
The rbd provisioner has been pulled out to here: https://github.com/kubernetes-incubator/external-storage/tree/master/ceph/rbd so the container can be built with the right ceph version already.
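To make the pod lookup described in the excerpt above concrete, here is a minimal client-go sketch. The `kube-mount` namespace, the label scheme and the "any running pod" rule come from the proposal; the function name, variable names and the use of a recent client-go API are assumptions for illustration only.

```go
// Sketch: find any running pod with mount utilities for a given volume plugin.
// Assumes a recent client-go; error handling is trimmed for brevity.
package mountpods

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// FindMountPod returns the name of any running pod in the dedicated namespace
// that carries the per-plugin label, e.g. mount.kubernetes.io/glusterfs=true.
func FindMountPod(ctx context.Context, cs kubernetes.Interface, plugin string) (string, error) {
	pods, err := cs.CoreV1().Pods("kube-mount").List(ctx, metav1.ListOptions{
		LabelSelector: fmt.Sprintf("mount.kubernetes.io/%s=true", plugin),
		FieldSelector: "status.phase=Running",
	})
	if err != nil {
		return "", err
	}
	if len(pods.Items) == 0 {
		return "", fmt.Errorf("no running pod with mount utilities for plugin %q", plugin)
	}
	return pods.Items[0].Name, nil // any running pod is good enough per the proposal
}
```

The controller would then exec the utility in that pod through the same remote-command API that `kubectl exec` uses, rather than shelling out to `kubectl` itself.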
* One DaemonSet must provide *all* utilities that are needed to provision, attach, mount, unmount, detach and delete a volume for one volume plugin, including `mkfs` and `fsck` utilities if they're needed.
  * E.g. `mkfs.ext4` is likely to be available on all hosts, but a pod with mount utilities should not depend on that nor use it.
* The only exception are kernel modules. They are not portable across distros and they should be on the host.
* It is expected that these daemon sets will run privileged pods that will see host's `/proc`, `/dev`, `/sys`, `/var/lib/kubelet` and such. Especially `/var/lib/kubelet` must be mounted with shared mount propagation so kubelet can see mounts created by the pods.
Are mounts constrained to be performed under `/var/lib/kubelet`? If so, this seems to be a contract detail between controller/kubelet/daemonset that should be mentioned.
No, any directory can be shared. It's up to the system admin + the author of the privileged pods to make sure it can be shared (i.e. it's on a mount with shared mount propagation) and that it's safe to share (e.g. systemd inside a container does not like /sys/fs/cgroup being shared to the host; I don't remember the exact error message, but it simply won't start).
Force-pushed from 0486ab2 to 85b39a3.
I updated the proposal with the latest discussion on sig-node and with @tallclair.
This is basically a new proposal and needs a complete re-review. I left the original proposal as a separate commit so we can roll back easily.
Force-pushed from 85b39a3 to f803f7b.
We considered this user story:
* Admin installs Kubernetes.
* Admin configures Kubernetes to use a sidecar container with template XXX for glusterfs mount/unmount operations and a pod with template YYY for glusterfs provision/attach/detach/delete operations. These templates would be yaml files stored somewhere.
* User creates a pod that uses a GlusterFS volume. Kubelet finds a sidecar template for gluster, injects it into the pod and runs it before any mount operation. It then uses `docker exec mount <what> <where>` to mount Gluster volumes for the pod. After that, it starts init containers and the "real" pod containers.
Do we need to extend the pod spec for this (sidecar template injection) operation, or can it be done with the existing pod spec, or achieved by kube jobs or a similar mechanism?
-> User does not need to configure anything and sees the pod Running as usual.
-> Admin needs to set up the templates.
Similarly, when attaching/detaching a volume, attach/detach controller would spawn a pod on a random node and the controller would then use `kubectl exec <the pod> <any attach/detach utility>` to attach/detach the volume. E.g. Ceph RBD volume plugin needs to execute things during attach/detach. After the volume is attached, the controller would kill the pod.
Can it be a random node, or should it be the same node where the pod is getting scheduled?
I think it may be driver-specific. For some drivers, it's probably best to exec into the container on the host that is going to have the volume?
For my k8s systems, I tend to have user-provided code running in containers. I usually segregate these onto differently labeled nodes than the control plane. In this configuration, the container doing the reach-back to, say, OpenStack to move volumes around from VM to VM should never run on the user-reachable nodes, as access to the secret for volume manipulation would be really bad. With k8s 1.7+, the secret is inaccessible to nodes that don't have a pod referencing the secret. So targeted exec would be much, much better for that use case.
For attach/detach operations a random node is IMO OK. All state is kept in the attach/detach controller and volume plugins, not in the utilities that are executed by a volume plugin. Note that Ceph RBD is the only plugin that executes something during attach.
As for chasing secrets, that's actually a benefit of a pod with mount utilities - any secrets that are needed to talk to the backend storage can easily be made available to the pod via env variables or a Secret volume. And since only `os.exec` will be delegated to the pod, the whole command line will be provided to the pod, including all necessary credentials.
Similarly, when attaching/detaching a volume, attach/detach controller would spawn a pod on a random node and the controller would then use `kubectl exec <the pod> <any attach/detach utility>` to attach/detach the volume. E.g. Ceph RBD volume plugin needs to execute things during attach/detach. After the volume is attached, the controller would kill the pod.
Advantages:
* It's probably easier to update the templates than update the DaemonSet.
I have one doubt here: how are we going to control the version of the required mount utils? For example, if the mount utils need to be a particular version, can we specify that in the template? Does that also mean there can be more than one sidecar container if the user wishes?
* It's probably easier to update the templates than update the DaemonSet.
Drawbacks:
* Admin needs to store the templates somewhere. Where?
Can't we make use of a ConfigMap or a similar mechanism for these templates? Just a thought 👍
* One DaemonSet must provide *all* utilities that are needed to provision, attach, mount, unmount, detach and delete a volume for a volume plugin, including `mkfs` and `fsck` utilities if they're needed.
  * E.g. `mkfs.ext4` is likely to be available on all hosts, but a pod with mount utilities should not depend on that nor use it.
* The only exception are kernel modules. They are not portable across distros and they *should* be on the host.
* It is expected that these daemon sets will run privileged pods that will see host's `/proc`, `/dev`, `/sys`, `/var/lib/kubelet` and such. Especially `/var/lib/kubelet` must be mounted with shared mount propagation so kubelet can see mounts created by the pods.
If we have one DaemonSet per volume plugin and we share `/dev` among these containers, there is a risk or a security concern, isn't there?
Not sure it's any worse than what is there today.
There are some things to take care of when mounting `/dev` into a container, e.g. you need to take care of the `pts` device so you don't break the console. And there are other things to take care of as well.
Because of this I wonder if it makes sense to add an API flag to signal that a container should get the host's proc, sys and dev paths. If we had such a flag, it would be much better defined what a container gets when it is told to get the host view of these three directories.
Also, a side note: we cannot prevent it, but mounting the host's `/dev` directory into a privileged container can cause quite a lot of confusion (actually, any setup where more than one udev is running can cause problems).
To be more precise on the "things" - please take a look here https://github.com/kubevirt/libvirt/blob/master/libvirtd.sh#L5-L42 to see the workarounds *cough* hacks *cough* we need to do to have libvirt running "properly" in a container (we don't use all features, just a subset, and they work well so far).
We can't influence how Docker (or another container runtime) creates/binds /dev and /sys. Once this flag is available in Docker/Moby and CRI we could expose it via the Kubernetes API, but that's a long process. Until then we're stuck with workarounds done inside the container.
I'll make sure we ship a well-documented sample of such a mount container. That's why it's an alpha feature - so we learn all these workarounds before going to beta/stable.
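To make the requirements quoted above concrete (a privileged pod that sees the host's `/dev`, `/sys` and a bidirectionally shared `/var/lib/kubelet`), here is a minimal sketch using the Go client-go types. The container name, image and the `mountPropagation` field (which became available in Kubernetes releases after this proposal was written) are assumptions for illustration, not part of the proposal.

```go
// Sketch: container spec fragment for a pod with mount utilities.
package mountpods

import corev1 "k8s.io/api/core/v1"

// MountUtilityContainer returns a privileged container that mounts host paths.
// The matching hostPath volumes ("dev", "sys", "kubelet-dir") must be declared
// in the PodSpec of the DaemonSet template.
func MountUtilityContainer() corev1.Container {
	privileged := true
	bidirectional := corev1.MountPropagationBidirectional
	return corev1.Container{
		Name:  "glusterfs-mount-utils",              // hypothetical name
		Image: "example.com/glusterfs-utils:latest", // hypothetical image
		SecurityContext: &corev1.SecurityContext{
			Privileged: &privileged,
		},
		VolumeMounts: []corev1.VolumeMount{
			{Name: "dev", MountPath: "/dev"},
			{Name: "sys", MountPath: "/sys"},
			{
				Name:             "kubelet-dir",
				MountPath:        "/var/lib/kubelet",
				MountPropagation: &bidirectional, // mounts created here propagate back to the host
			},
		},
	}
}
```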
* All volume plugins need to be updated to use a new `mount.Exec` interface to call external utilities like `mount`, `mkfs`, `rbd lock` and such. Implementation of the interface will be provided by the caller and will lead either to a simple `os.exec` on the host or a gRPC call to a socket in the `/var/lib/kubelet/plugin-sockets/` directory.
### Controllers
TODO: how will controller-manager talk to a remote pod? It's relatively easy to do something like `kubectl exec <mount pod>` from controller-manager, however it's harder to *discover* the right pod.
Maybe we could make use of a labelling/selector mechanism based on the pod content.
What are the tradeoffs of using exec vs. HTTP to serve this? My hunch is that this should just be a service model, with a Kubernetes service that provides the volume plugin (how the controller manager identifies the service could be up for debate - predefined name? labels? namespace?). The auth{n/z} is a bit more complicated with that model though.
`kubectl exec` is easy to implement, does not need a new protocol and can be restrained by RBAC. With HTTP, we need to define and maintain the protocol and its implementation, have a db for auth{n,z}, generate certificates, ...
> how the controller manager identifies the service could be up for debate - predefined name? labels? namespace?
Getting rid of namespaces / labels was the reason why we have gRPC over UNIX sockets. If we have half of the system using gRPC and the second half using `kubectl exec`, why don't we use `kubectl exec` (or gRPC) for everything?
* Update the pod.
* Remove the taint.
Is there a way to do this with a DaemonSet rolling update? Is there any better way to do this upgrade?
IIRC, rolling update is yet to come for DaemonSets; need to check the current status though. However, there is an option called `--cascade=false` and it is possible to do a rolling update manually or in a scripted way; not sure if that is what you are looking for.
DaemonSets support rolling update as of 1.6 (https://kubernetes.io/docs/tasks/manage-daemon/update-daemon-set/)
I am asking if there are some tricks to do a DaemonSet rolling update that would drain a node first before updating the pod. Otherwise I need to fall back to `--cascade=false` and do the update manually as @humblec suggests.
## Goal
Kubernetes should be able to run all utilities that are needed to provision/attach/mount/unmount/detach/delete volumes in *pods* instead of running them on *the host*. The host can be a minimal Linux distribution without tools to create e.g. Ceph RBD or mount GlusterFS volumes.
## Secondary objectives
What are the goals around adding (or removing) volume plugins dynamically? In other words, do you expect the pods serving the volume plugins to be deployed at cluster creation time, or at a later time? How about removing plugins?
Volume plugins are not real plugins; they're hardcoded in Kubernetes.
It does not really matter when the pods with mount utilities are deployed - I would expect them to be deployed during Kubernetes installation, because the cluster admin plans storage ahead (e.g. has an existing NFS server), however I can imagine that the cluster admin could deploy pods for Gluster volumes later as the NFS server becomes full or so.
The only exception are flex plugin drivers. In 1.7, they needed to be installed before kubelet and controller-manager started. In #833 we're trying to change that to a more dynamic model, where flex drivers can be added/removed dynamically, and this proposal could easily be extended with flex drivers running in pods. So admins could dynamically install/remove flex drivers running in pods. Again, I would expect that this would mostly be done during installation of a cluster. And #833 is a better place to discuss it.
## Requirements on DaemonSets with mount utilities
These are rules that need to be followed by DaemonSet authors:
* One DaemonSet can serve mount utilities for one or more volume plugins. We expect that one volume plugin per DaemonSet will be the most popular choice.
* One DaemonSet must provide *all* utilities that are needed to provision, attach, mount, unmount, detach and delete a volume for a volume plugin, including `mkfs` and `fsck` utilities if they're needed.
> that are needed to provision
Provisioning is a "cluster level" operation, and is handled by the volume controller rather than the kubelet, right? In that case, I don't think they need to be handled by the same pod. In practice it's probably often the same utilities that handle both, but I don't think it should be a hard requirement.
Yes, technically it does not need to be the same pod.
On the other hand, the only internal volume plugin that needs to execute something during provisioning or attach/detach (i.e. initiated by controller-manager) is Ceph RBD, which needs `/usr/bin/rbd`. The same utility is then needed by kubelet to finish attachment of the device.
* One DaemonSet must provide *all* utilities that are needed to provision, attach, mount, unmount, detach and delete a volume for a volume plugin, including `mkfs` and `fsck` utilities if they're needed.
  * E.g. `mkfs.ext4` is likely to be available on all hosts, but a pod with mount utilities should not depend on that nor use it.
* The only exception are kernel modules. They are not portable across distros and they *should* be on the host.
* It is expected that these daemon sets will run privileged pods that will see host's `/proc`, `/dev`, `/sys`, `/var/lib/kubelet` and such. Especially `/var/lib/kubelet` must be mounted with shared mount propagation so kubelet can see mounts created by the pods.
> Especially `/var/lib/kubelet` must be mounted with shared mount propagation so kubelet can see mounts created by the pods.
This only applies if the kubelet is running in a container, right? Also, it needs slave mount propagation, not shared, right? (Pardon my ignorance of this subject.)
No, shared is needed.
* slave: (u)mount events on the host show up in containers; events in the containers don't affect the host.
* shared: (u)mount events initiated from either the host or the container show up on the other side.
If you want a (u)mount event in the mount utility container to show up to kubelet, it needs shared.
> This only applies if the Kubelet is running in a container, right?
No. Kubelet running on the host must see mounts mounted by a pod, therefore we need shared mount propagation from the pod to the host. With slave propagation in the pod, the mount would be visible only in the pod and not on the host.
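For reference, the host-side prerequisite is the equivalent of `mount --make-rshared /var/lib/kubelet`. A minimal sketch using `golang.org/x/sys/unix` is below; whether an installer does this in Go, in a shell script, or via the container runtime's own propagation settings is an implementation detail and only an assumption here.

```go
// Sketch: mark /var/lib/kubelet as a recursively shared mount on the host so
// that mounts created inside a privileged pod are visible to the kubelet.
package main

import (
	"log"

	"golang.org/x/sys/unix"
)

func main() {
	// Equivalent to: mount --make-rshared /var/lib/kubelet
	// Assumes /var/lib/kubelet is already a mount point (a bind mount of itself is enough).
	if err := unix.Mount("", "/var/lib/kubelet", "", unix.MS_SHARED|unix.MS_REC, ""); err != nil {
		log.Fatalf("making /var/lib/kubelet rshared: %v", err)
	}
}
```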
* E.g. `mkfs.ext4` is likely to be available on all hosts, but a pod with mount utilities should not depend on that nor use it.
* The only exception are kernel modules. They are not portable across distros and they *should* be on the host.
* It is expected that these daemon sets will run privileged pods that will see host's `/proc`, `/dev`, `/sys`, `/var/lib/kubelet` and such. Especially `/var/lib/kubelet` must be mounted with shared mount propagation so kubelet can see mounts created by the pods.
* The pods with mount utilities should run some simple init as PID 1 that reaps zombies of potential fuse daemons.
> that reaps zombies of potential fuse daemons.
What does this mean? I believe the zombie process issue was fixed in 1.6 (kubernetes/kubernetes#36853).
@yujuhong says in #589 (comment) that the infrastructure pod ("pause") is an implementation detail of the docker integration and other container engines may not use it.
Yes, but if zombie processes are an issue for other runtimes, they should have a built-in way of dealing with them. It shouldn't be necessary to implement reaping in the pod, unless it's expected to generate a lot of zombie processes, I believe. (@yujuhong does this sound right?)
Could it be an option to provide a base container for these mount util containers, which has a sane pid 1?
I'd like to stay distro-agnostic here and let the DaemonSet authors use anything they want. For NFS, simple Alpine Linux + busybox init could be enough; for Gluster and Ceph a more powerful distro is needed.
* The only exception are kernel modules. They are not portable across distros and they *should* be on the host.
* It is expected that these daemon sets will run privileged pods that will see host's `/proc`, `/dev`, `/sys`, `/var/lib/kubelet` and such. Especially `/var/lib/kubelet` must be mounted with shared mount propagation so kubelet can see mounts created by the pods.
* The pods with mount utilities should run some simple init as PID 1 that reaps zombies of potential fuse daemons.
* The pods with mount utilities run a daemon with gRPC server that implements `VolumExecService` defined below.
> VolumExecService
nit: I'd prefer `VolumePluginService`, or some other variation. I think `Exec` in this case is a bit unclear.
### gRPC API
`VolumeExecService` is a simple gRPC service that allows executing anything via gRPC:
Is there a CSI API proposal out? Does this align with that? It might be worth using the CSI API in its current state, if it's sufficient.
CSI is too complicated. Also, this would require a completely new implementation of at least the gluster, NFS, CephFS, Ceph RBD, git volume, iSCSI, FC and ScaleIO volume plugins, which is IMO too much. Keeping the plugins as they are and just using an interface that defers `os.Exec` to a pod where appropriate is IMO much simpler and without the risk of breaking existing (and tested!) volume plugins.
message ExecRequest {
    // Command to execute
    string cmd = 1;
This should be abstracted so that the Kubelet doesn't need to understand the specifics of the volume type. I believe this is what the volume interfaces defined in https://github.com/kubernetes/kubernetes/blob/4a73f19aed1f95b3fde1177074aee2a8bec1196e/pkg/volume/volume.go do? In that case, this API should probably mirror those interfaces.
Again, that would require me to rewrite the volume plugins. Volume plugins need e.g. access to the CloudProvider or SecretManager; I can't put them into pods easily. And such a pod would have access to all Kubernetes secrets...
The whole idea of ExecRequest/Response is to take existing and tested volume plugins and replace all `os.Exec` calls with `<abstract exec interface>.Exec`. Kubelet would provide the right interface implementation, leading to `os.Exec` or gRPC. No big changes in the volume plugins*, simple changes in kubelet, one common VolumeExec server daemon for all pods with mount utilities.
It does not leak any volume-specific knowledge to kubelet / controller-manager. It's a dumb exec interface, common to all volume plugins.
*) One or two plugins would still need nontrivial refactoring to pass the interface from the place where it's available to the place where it's needed, but that's another story.
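As an illustration of that abstract exec interface, here is a minimal sketch with a host-backed implementation. The interface and type names are hypothetical (they are not the exact `mount.Exec` interface that was merged); a gRPC-backed implementation would satisfy the same interface by forwarding the command to the daemon in the pod.

```go
// Sketch: an exec abstraction that volume plugins call instead of using os/exec
// directly. Kubelet decides which implementation to hand to each plugin.
package volumeexec

import osexec "os/exec"

// Exec runs a command with arguments and returns its combined output.
type Exec interface {
	Run(cmd string, args ...string) ([]byte, error)
}

// HostExec runs utilities directly on the host - today's behaviour.
type HostExec struct{}

func (HostExec) Run(cmd string, args ...string) ([]byte, error) {
	return osexec.Command(cmd, args...).CombinedOutput()
}

// A gRPC-backed implementation would have the same Run signature and forward
// cmd and args to the VolumeExec daemon listening on a socket under
// /var/lib/kubelet/plugin-sockets/, so the volume plugins stay unchanged.
```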
* Authors of container images with mount utilities can then add this `volume-exec` daemon to their image; they don't need to care about anything else.
### Upgrade
Upgrade of the DaemonSet with pods with mount utilities needs to be done node by node and with extra care. The pods may run fuse daemons, and killing such a pod with a glusterfs fuse daemon would kill all pods that use glusterfs on the same node.
Would it kill the pods, or just cause IO errors?
IO errors. I guess the health probe should fail and the pod should be rescheduled (or the deployment / replica set will create a new one).
* Authors of container images with mount utilities can then add this `volume-exec` daemon to their image; they don't need to care about anything else.
### Upgrade
What happens if the kubelet can't reach the pod serving a volume plugin (either due to an update, or some other error) when a pod with a volume is deleted? Will the kubelet keep retrying until it is able to unmount the volume? What are the implications of being unable to unmount the volume?
Yes, kubelet retries indefinitely.
And if the pod with mount utilities is not available for a longer time... I checked the volume plugins; most (if not all) run `umount` on the host. So the volume gets unmounted cleanly and data won't be corrupted. Detaching an iSCSI/FC/Ceph RBD disk may be a different story. The disk may stay attached forever, and then it depends on the backend whether it supports attaching the volume to a different node.
As I wrote, update of the daemon set is a very tricky operation and the node should be drained first.
Does unmounting block pod deletion? I.e. will the pod be stuck in a terminated state until the volume utility pod is able to be reached?
No, unmounting happens after a pod is deleted.
-> User does not need to configure anything and sees the pod Running as usual.
-> Admin needs to set up the templates.
Similarly, when attaching/detaching a volume, attach/detach controller would spawn a pod on a random node and the controller would then use `kubectl exec <the pod> <any attach/detach utility>` to attach/detach the volume. E.g. Ceph RBD volume plugin needs to execute things during attach/detach. After the volume is attached, the controller would kill the pod.
Would the pod need to be a highly privileged pod, likely with hostPath volume mount privileges?
yes
This is the first instance I'm aware of where a controller would be required to have the ability to create privileged pods. Not necessarily a blocker, but that is a significant change.
IIC, for mounts we only need `CAP_SYS_ADMIN`; however, if we export `/dev/` we need privileged pods.
I misunderstood the original intent of this proposal. I thought the goal was to get much closer to the desired end-state of true CSI plugins. However, I now see that this is just providing the binary utilities for the existing (hard-coded) plugins. Given that, I'm afraid I want to go back on my original suggestions. Since this really is an exec interface, I think the original proposal of using the native CRI exec (specifically,
@tallclair I'd like to revisit the UNIX sockets. We still need a way to run stuff in pods with mount utilities from controller-manager, which cannot use UNIX sockets. So there must be a way (namespaces, labels) to find these pods. Why can't kubelet use the same mechanism instead of UNIX sockets? It's easy to do
@tallclair I just had a meeting with @saad-ali and @thockin and we agreed that UNIX sockets are better for now, we care about So,
Trying to resurrect the discussion, I am still interested in this proposal. @tallclair, looking at the device plugin gRPC API, it looks better to me to follow that approach and introduce a "container exec API" with
The device plugin API is a higher-level abstraction than just arbitrary exec. I wasn't a part of the meeting where it was decided to stick with a socket interface, but I don't see the value in implementing an alternative arbitrary command exec interface rather than relying on
Force-pushed from 6802ff5 to 72189ee.
Reworked according to the result of the latest discussion:
Force-pushed from a02d582 to f003e82.
Implementation of this proposal is at kubernetes/kubernetes#53440 - it's quite small and well contained.
To sum it up, it's just a daemon set that spawns privileged pods, running a simple init and registering itself into Kubernetes by placing a file into a well-known location.
**Note**: It may be quite difficult to create a pod that sees the host's `/dev` and `/sys`, contains the necessary kernel modules, does the initialization right and reaps zombies. We're going to provide a template with all this. During alpha, it is expected that this template will be polished as we encounter new bugs, corner cases, and systemd / udev / docker weirdness.
> During alpha,
Is this expected to ever leave alpha? I thought this was a temporary hack while we wait for CSI?
I removed all notes about alpha in the text and added a note about the feature gate and that it's going to be alpha forever.
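For the "registering itself by placing a file into a well-known location" part of the summary above, here is a minimal sketch of what the daemon inside a pod with mount utilities could do, assuming the `/var/lib/kubelet/plugin-sockets/` directory mentioned earlier in the thread; the socket file name and the omitted service registration are hypothetical.

```go
// Sketch: the daemon in a pod with mount utilities serves gRPC on a UNIX socket
// in a host-mounted, well-known directory; the socket file is the registration.
package main

import (
	"log"
	"net"
	"os"

	"google.golang.org/grpc"
)

func main() {
	const socketPath = "/var/lib/kubelet/plugin-sockets/glusterfs.sock" // hypothetical file name
	_ = os.Remove(socketPath) // clean up a stale socket from a previous run
	lis, err := net.Listen("unix", socketPath)
	if err != nil {
		log.Fatalf("listen on %s: %v", socketPath, err)
	}
	srv := grpc.NewServer()
	// Registration of the generated exec service server would go here; the
	// service definition is only sketched in the proposal, so it is omitted.
	log.Fatal(srv.Serve(lis))
}
```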
Force-pushed from d797c65 to 69c780f.
I squashed all the commits; the PR is ready to be merged. In a personal meeting with @tallclair and @saad-ali we agreed that all volume plugins are going to be moved to CSI eventually, so this proposal has a limited lifetime. CSI drivers will have a different discovery mechanism and all the kubelet changes proposed here won't be needed. I still think this PR is useful, as it allows us to create tests for internal volume plugins so we can check their CSI counterparts for regressions in e2e tests. Wherever the CSI drivers will live, Kubernetes still needs to keep its backward compatibility and make sure that old PVs keep working.
/assign @tallclair
/lgtm
Why is this not merged? "pull-community-verify — Waiting for status to be reported"
/test all [submit-queue is verifying that this PR is safe to merge]
Automatic merge from submit-queue.
Automatic merge from submit-queue (batch tested with PRs 54005, 55127, 53850, 55486, 53440). If you want to cherry-pick this change to another branch, please follow the instructions [here](https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md).

Containerized mount utilities

This is the implementation of kubernetes/community#589

@tallclair @vishh @dchen1107 PTAL
@kubernetes/sig-node-pr-reviews

**Release note**:
```release-note
Kubelet supports running mount utilities and the final mount in a container instead of running them on the host.
```
Automatic merge from submit-queue. Proposal: containerized mount utilities in pods @kubernetes/sig-storage-proposals @kubernetes/sig-node-proposals