Rework to use gRPC between kubelet and pod
jsafrane committed Jul 28, 2017
1 parent 1b1d02d commit 85b39a3
Showing 1 changed file with 83 additions and 32 deletions.
115 changes: 83 additions & 32 deletions contributors/design-proposals/containerized-mounter-pod.md
@@ -5,23 +5,21 @@ Kubernetes should be able to run all utilities that are needed to provision/atta

## Secondary objectives
These are not requirements per se, just things to consider before drawing the final design.
* CNCF designs Container Storage Interface (CSI). So far, this CSI expects that "volume plugins" on each host are long-running processes with a fixed (gRPC?) API. We should aim the same direction, using exec instead of gRPC, hoping to switch to CSI when it's ready. In other words, there should be one long-running container for a volume plugin that serves all volumes of given type on a host.
* CNCF is designing the Container Storage Interface (CSI). So far, CSI expects that "volume plugins" on each host are long-running processes with a fixed gRPC API. We should aim in the same direction, hoping to switch to CSI when it's ready. In other words, there should be one long-running container for a volume plugin that serves all volumes of a given type on a host.
* We should try to avoid complicated configuration. The system should work out of the box or with very limited configuration.

## Terminology

**Mount utilities** for a volume pluigin are all tools that are necessary to use a volume plugin. This includes not only utilities needed to *mount* the filesystem (e.g. `mount.glusterfs` for Gluster), but also utilities needed to attach, detach, provision or delete the volume, such as `/usr/bin/rbd` for Ceph RBD.
**Mount utilities** for a volume plugin are all tools that are necessary to use a volume plugin. This includes not only utilities needed to *mount* the filesystem (e.g. `mount.glusterfs` for Gluster), but also utilities needed to attach, detach, provision or delete the volume, such as `/usr/bin/rbd` for Ceph RBD.

## User story
Admin wants to run Kubernetes on a distro that does not ship `mount.glusterfs` that's needed for GlusterFS volumes.
1. Admin installs Kubernetes in any way.
2. Admin runs Kubernetes as usual. There are new command line options described below, but they will have sane defaults so no configuration is necessary in most cases.
* During alpha incubation, kubelet command line option `--experimental-mount-namespace=kube-mount` **must be used** to enable this feature and to tell Kubernetes where to looks for pods with mount utilities. This option will default to `kube-mount` after alpha.
3. Admin deploys a DaemonSet that runs a pod with `mount.glusterfs` on each node in namespace `kube-mount`. In future, this could be done by installer.
4. User creates a pod that uses a GlusterFS volume. Kubelet finds a pod with mount utilities on the node and uses it to mount the volume instead of expecting that `mount.glusterfs` is available on the host.
1. Admin installs and runs Kubernetes in any way.
1. Admin deploys a DaemonSet that runs a pod with `mount.glusterfs` on each node. In the future, this could be done by an installer.
1. User creates a pod that uses a GlusterFS volume. Kubelet finds a pod with mount utilities on the node and uses it to mount the volume instead of expecting that `mount.glusterfs` is available on the host.

- User does not need to configure anything and sees the pod Running as usual.
- Admin needs to deploy the DaemonSet and configure Kubernetes a bit.
- Admin just needs to deploy the DaemonSet.
- It's quite hard to update the DaemonSet, see below.

## Alternatives
@@ -67,17 +65,16 @@ Disadvantages:
## Requirements on DaemonSets with mount utilities
These are rules that need to be followed by DaemonSet authors:
* One DaemonSet can serve mount utilities for one or more volume plugins. We expect that one volume plugin per DaemonSet will be the most popular choice.
* One DaemonSet must provide *all* utilities that are needed to provision, attach, mount, unmount, detach and delete a volume for one volume plugin, including `mkfs` and `fsck` utilities if they're needed.
* One DaemonSet must provide *all* utilities that are needed to provision, attach, mount, unmount, detach and delete a volume for a volume plugin, including `mkfs` and `fsck` utilities if they're needed.
* E.g. `mkfs.ext4` is likely to be available on all hosts, but a pod with mount utilities should not depend on that nor use it.
* The only exception are kernel modules. They are not portable across distros and they should be on the host.
* The only exception is kernel modules. They are not portable across distros and they *should* be on the host.
* It is expected that these daemon sets will run privileged pods that will see host's `/proc`, `/dev`, `/sys`, `/var/lib/kubelet` and such. Especially `/var/lib/kubelet` must be mounted with shared mount propagation so kubelet can see mounts created by the pods.
* The pods with mount utilities should run some simple init as PID 1 that reaps zombies of potential fuse daemons.
* To allow Kubernetes to discover these pods with mount utilities:
* All DaemonSets for all chosen volume plugins must run in one dedicated namespace.
* All pods with mount utilities for a volume plugin `kubernetes.io/foo` must have label `mount.kubernetes.io/foo=true`.
* All pods with mount utilities for a flex volume with driver `bar` must have label `mounter.kubernetes.io/flexvolume/bar=true` so there can be different DaemonSets for different flex drivers instead of one monolithic DaemonSet with drivers for all flex volumes.
* The pods with mount utilities run a daemon with a gRPC server that implements `VolumeExecService` defined below.
* Upon starting, this daemon puts a UNIX domain socket into `/var/lib/kubelet/plugin-sockets/` directory on the host. This way, kubelet is able to discover all pods with mount utilities on a node.
* Kubernetes will ship an implementation of this daemon that creates the socket in the right place and simply executes anything that kubelet asks for.

To sum it up, it's just a daemon set that spawns privileged pods with some labels, running a simple init and waiting for Kubernetes to do `kubectl exec <the pod> <some utility> <args>`.
To sum it up, it's just a daemon set that spawns privileged pods, running a simple init + a daemon that executes mount utilities as requested by kubelet via gRPC.

## Design

@@ -97,30 +94,65 @@ We propose:
This ensures that kubelet runs out of the box on any distro without any configuration done by the cluster admin.

### Volume plugins
* All volume plugins need to be updated to use a new `VolumeExec` interface to call external utilities like `mount`, `mkfs`, `rbd lock` and such. Implementation of the interface will be provided by caller and will lead either to `exec` on the host or `kubectl exec` or `docker exec` in a remote or local pod with utilities for appropriate volume plugin (or docker-exec-like command if another container engine is used).
* All volume plugins need to be updated to use a new `mount.Exec` interface to call external utilities like `mount`, `mkfs`, `rbd lock` and such. Implementation of the interface will be provided by caller and will lead either to simple `os.exec` on the host or a gRPC call to a socket in `/var/lib/kubelet/plugin-sockets/` directory.

### Controller
* There will be new parameter to kube-controller-manager and kubelet:
* `--experimental-mount-namespace`, which specifies a dedicated namespace where all pods with mount utilities reside. It would default to `kube-mount`.
* Whenever PV or attach/detach controller needs to call a volume plugin, it looks for *any* running pod in the specified namespace with label `mount.kubernetes.io/foo=true` (or `mount.kubernetes.io/flexvolume/foo=true` for flex volumes) and calls the volume plugin so it all mount utilities are executed as `kubectl exec <pod> xxx` (of course, we'll use clientset interface instead of executing `kubectl`).
* If such pod does not exist, it executes the mount utilities on the host as usual.
* During alpha, no controller-manager changes will be done. That means that Ceph RBD provisioner will still require `/usr/bin/rbd` installed on the master. All other volume plugins will work without any problem, as they don't execute any utility when attaching/detaching/provisioning/deleting a volume.
### Controllers
TODO: how will controller-manager talk to a remote pod? It's relatively easy to do something like `kubectl exec <mount pod>` from controller-manager, however it's harder to *discover* the right pod.

### Kubelet
* kubelet will get the same parameters as described above, `--experimental-mount-namespace`.
* When kubelet talks to a volume plugin *foo*, it finds a pod in the dedicated namespace running on the node with label `mount.kubernetes.io/foo=true` (or `mount.kubernetes.io/flexvolume/foo=true` for flex volumes) and calls the volume plugin with `VolumeExec` pointing to the pod. All utilities that are executed by the volume plugin for mount/unmount/waitForAttach are executed in the pod running on the node.
* In such pod does not exist, it executes the mount utilities on the host as usual.
* When kubelet talks to a volume plugin, it looks for a socket named `/var/lib/kubelet/plugin-sockets/<plugin-name>`. This allows for easier discovery of flex volume drivers - probe in https://github.com/kubernetes/community/pull/833 needs to scan `/var/lib/kubelet/plugin-sockets/` too and find sockets in any new subdirectories.
* If the socket does not exist, kubelet gives the volume plugin plain `os.Exec` as implementation of `mount.Exec` interface and all mount utilities are executed on the host.
* If the socket exists, kubelet gives the volume plugin `GRPCExec` as implementation of `mount.Exec` and all mount utilities are executed via gRPC on the socket which presumably leads to a pod with mount utilities running a gRPC server.
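
The selection rule in the list above can be sketched in Go. This is only an illustration under stated assumptions: the `Exec` interface, the `hostExec` and `grpcExec` types and the `execFor` function are hypothetical names, not the proposed `mount.Exec` API, and the gRPC call itself is left out.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"path/filepath"
)

// Exec is a hypothetical stand-in for the proposed mount.Exec interface.
type Exec interface {
	Run(cmd string, args ...string) ([]byte, error)
}

// hostExec runs mount utilities directly on the host.
type hostExec struct{}

func (hostExec) Run(cmd string, args ...string) ([]byte, error) {
	return exec.Command(cmd, args...).CombinedOutput()
}

// grpcExec would forward the command to the gRPC server behind the socket.
// The actual gRPC call is omitted in this sketch.
type grpcExec struct{ socket string }

func (g grpcExec) Run(cmd string, args ...string) ([]byte, error) {
	return nil, fmt.Errorf("gRPC call to %s is not implemented in this sketch", g.socket)
}

// execFor returns grpcExec when <socketDir>/<pluginName> exists and is a
// UNIX domain socket; otherwise it falls back to execution on the host.
func execFor(socketDir, pluginName string) Exec {
	path := filepath.Join(socketDir, pluginName)
	info, err := os.Stat(path)
	if err == nil && info.Mode()&os.ModeSocket != 0 {
		return grpcExec{socket: path}
	}
	return hostExec{}
}

func main() {
	e := execFor("/var/lib/kubelet/plugin-sockets", "kubernetes.io/glusterfs")
	fmt.Printf("selected implementation: %T\n", e)
}
```

Because the check happens on every call to `execFor`, a pod that starts later is picked up as soon as its socket appears, which matches the retry behaviour described below.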

As consequence, kubelet will try to run mount utilities on the host when it starts and has not received pods with mount utilities yet. This is likely to fails with a cryptic error:
As a consequence, kubelet will try to run mount utilities on the host when it starts and has not received pods with mount utilities yet (and thus `/var/lib/kubelet/plugin-sockets/` is empty). This is likely to fail with a cryptic error:
```
mount: wrong fs type, bad option, bad superblock on 192.168.0.1:/test_vol,
missing codepage or helper program, or other error
```

Kubelet will periodically retry mounting the volume and it will eventually succeed when a pod with mount utilities is scheduled and running on the node.

### VolumePluginMgr
Volume plugin manager runs in attach/detach controller, PV controller and in kubelet and holds a list of all volume plugins. This list of volume plugins is discovered during process startup. Especially for flex volumes, the list is read from `/usr/libexec/kubernetes/...` and it is never updated. We need to update VolumePluginMgr to add flex volumes from running pods.
### gRPC API

`VolumeExecService` is a simple gRPC service that executes arbitrary commands on behalf of the caller:

```protobuf
service VolumeExecService {
    // Exec executes a command and returns its output.
    rpc Exec(ExecRequest) returns (ExecResponse) {}
}

message ExecRequest {
    // Command to execute
    string cmd = 1;
    // Command arguments
    repeated string args = 2;
}

message ExecResponse {
    // Exit code of the command.
    int32 exit_code = 1;
    // Capture of combined stdout + stderr.
    string output = 2;
    // not_found signalizes that the executed cmd was not found.
    // It helps caller to construct proper ErrExecutableNotFound error.
    bool not_found = 3;
}
```
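
On the caller side, the `not_found` flag and `exit_code` in the message above have to be turned back into a Go error. The following is a minimal sketch of one way to do that. The hand-written `ExecResponse` struct, `errExecutableNotFound`, `exitError` and `toResult` are all hypothetical names introduced for illustration; real code would use the generated gRPC types and whatever error values the volume plugins already expect.

```go
package main

import (
	"errors"
	"fmt"
)

// ExecResponse mirrors the proto message by hand (assumption: the generated
// type would have equivalent fields).
type ExecResponse struct {
	ExitCode int32
	Output   string
	NotFound bool
}

// errExecutableNotFound is a hypothetical stand-in for ErrExecutableNotFound.
var errExecutableNotFound = errors.New("executable file not found")

// exitError is a hypothetical error type carrying a non-zero exit code.
type exitError struct{ code int32 }

func (e exitError) Error() string { return fmt.Sprintf("exit status %d", e.code) }

// toResult converts a response into the (output, error) pair that a caller
// of a local exec would normally see.
func toResult(resp ExecResponse) ([]byte, error) {
	out := []byte(resp.Output)
	switch {
	case resp.NotFound:
		return out, errExecutableNotFound
	case resp.ExitCode != 0:
		return out, exitError{code: resp.ExitCode}
	default:
		return out, nil
	}
}

func main() {
	_, err := toResult(ExecResponse{NotFound: true})
	fmt.Println("missing helper:", err)

	out, err := toResult(ExecResponse{ExitCode: 32, Output: "mount failed"})
	fmt.Printf("failed mount: %q, %v\n", out, err)
}
```

Checking `NotFound` before `ExitCode` lets the caller distinguish "the utility is missing in the pod" from "the utility ran and failed", regardless of what exit code the server reports for a missing binary.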

* Both `ExecRequest` and `ExecResponse` are tailored for mount utilities, which need no stdin and typically produce short stdout and stderr output. Therefore these file descriptors are not streamed.
* No authentication / authorization is done on the server side, anyone who connects to the socket can execute anything. It is expected that only root has access to `/var/lib/kubelet/plugin-sockets/`.
* `.proto` file for this API will be stored in `k8s.io/kubernetes/pkg/version/apis/exec/v1alpha1`.
* `hack/update-generated-runtime.sh` will be updated to generate go files for this API.
* Should it be renamed to `update-generated-grpc-apis.sh`?
* Kubernetes will ship a daemon with server implementation of this API in `cmd/volume-exec`. This implementation simply calls `os.Exec` for each `ExecRequest` it gets and returns the right response.
  * Authors of container images with mount utilities can then add this `volume-exec` daemon to their image; they don't need to care about anything else.
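
The core of such a daemon could look roughly like the sketch below, which shows only how one request is executed and mapped to a response. The gRPC server wiring and socket creation are omitted. The `ExecRequest` and `ExecResponse` structs are hand-written stand-ins for the generated types, `handleExec` is a hypothetical name, and using `-1` as the exit code when the command could not be started is an assumption of this sketch, not something the proposal specifies.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"os/exec"
)

// Hand-written stand-ins for the generated proto types (assumption).
type ExecRequest struct {
	Cmd  string
	Args []string
}

type ExecResponse struct {
	ExitCode int32
	Output   string
	NotFound bool
}

// handleExec runs the requested command and fills in the response fields.
func handleExec(req ExecRequest) ExecResponse {
	// CombinedOutput captures stdout and stderr together, as the API expects.
	out, err := exec.Command(req.Cmd, req.Args...).CombinedOutput()
	resp := ExecResponse{Output: string(out)}

	var exitErr *exec.ExitError
	switch {
	case err == nil:
		// Command succeeded; ExitCode stays 0.
	case errors.Is(err, exec.ErrNotFound), errors.Is(err, os.ErrNotExist):
		// Not in $PATH, or an explicit path that does not exist.
		resp.NotFound = true
		resp.ExitCode = -1
	case errors.As(err, &exitErr):
		// Command ran and returned a non-zero exit status.
		resp.ExitCode = int32(exitErr.ExitCode())
	default:
		// Any other failure to start the command.
		resp.ExitCode = -1
		resp.Output = err.Error()
	}
	return resp
}

func main() {
	resp := handleExec(ExecRequest{Cmd: "mount.glusterfs", Args: []string{"--version"}})
	fmt.Printf("%+v\n", resp)
}
```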

### Upgrade
Upgrade of the DaemonSet with pods with mount utilities needs to be done node by node and with extra care. The pods may run fuse daemons, and killing such a pod with a glusterfs fuse daemon would kill all pods that use glusterfs on the same node.
@@ -133,13 +165,32 @@ In order to update the DaemonSet, admin must do for every node:
Is there a way to do this with a DaemonSet rolling update? Is there any better way to do this upgrade?


## Open items

* How will controller-manager talk to pods with mount utilities?

1. Mount pods expose a gRPC service.
* controller-manager must be configured with the service namespace + name.
* Some authentication must be implemented (=additional configuration of certificates and whatnot).
* -> seems to be complicated.

2. Mount pods run in a dedicated namespace and have labels that tell which volume plugins they can handle.
* controller manager scans a namespace with a labelselector and does `kubectl exec <pod>` to execute anything in the pod.
* Needs configuration of the namespace.
* Admin must make sure that nothing else can run in the namespace (e.g. rogue pods that would steal volumes).
* Admin must configure access to the namespace so only pv-controller and attach-detach-controller can do `exec` there.

3. We allow pods to run on hosts that run controller-manager.

* Usual socket in `/var/lib/kubelet/plugin-sockets` will work.
* Can it work on GKE?

## Implementation notes

* During alpha, only kubelet will be updated and all volume plugins except flex will be updated.
* During alpha, `kubelet --experimental-mount-namespace=<ns>` must be used to enable this feature so it does not break anything accidentally if this feature is buggy. In beta and GA, this feature will be enabled by default and `--experimental-mount-namespace=` could be used to explicitly disable this feature or change the namespace.
* During alpha, only kubelet will be updated.
* Depending on flex dynamic probing in https://github.com/kubernetes/community/pull/833, flex may or may not be supported during alpha.

Consequences:

* Ceph RBD dynamic provisioning will still need `/usr/bin/rbd` installed on master(s). All other volume plugins will work without any problem, as they don't execute any utility when attaching/detaching/provisioning/deleting a volume.
* Flex still needs `/usr/libexec` scripts deployed to all nodes and master(s).
* Flex still needs `/usr/libexec` scripts deployed to master(s) and maybe to nodes.
