From 42bb7b5a32bbb01bca528eb6af85ccc969afbe0f Mon Sep 17 00:00:00 2001 From: Jim Ramsay Date: Tue, 19 Jan 2021 15:57:16 -0500 Subject: [PATCH] Proposal for hiding container mountpoints from systemd --- enhancements/hide_container_mountpoints.md | 269 +++++++++++++++++++++ 1 file changed, 269 insertions(+) create mode 100644 enhancements/hide_container_mountpoints.md diff --git a/enhancements/hide_container_mountpoints.md b/enhancements/hide_container_mountpoints.md new file mode 100644 index 00000000000..6f559f219fb --- /dev/null +++ b/enhancements/hide_container_mountpoints.md @@ -0,0 +1,269 @@ +--- +title: hide-container-mountpoints +authors: + - "@lack" +reviewers: + - "@haircommander" + - "@mrunalp" + - "@umohnani8" +approvers: + - "@haircommander" + - "@mrunalp" + - "@umohnani8" +creation-date: 2021-01-18 +last-updated: 2021-02-08 +status: implementable +--- + +# Hide Container Mountponts + +## Release Signoff Checklist + +- [X] Enhancement is `implementable` +- [ ] Design details are appropriately documented from clear requirements +- [ ] Test plan is defined +- [ ] Operational readiness criteria is defined +- [ ] Graduation criteria for dev preview, tech preview, GA +- [ ] User-facing documentation is created in [openshift-docs](https://github.com/openshift/openshift-docs/) + +## Summary + +The current implementation of Kubelet and CRI-O both use the top-level +namespace for all container and Kubelet mountpoints. However, moving these +container-specific mountpoints into a private namespace reduced systemd +overhead with no difference in functionality. + +## Motivation + +Systemd scans and re-scans mountpoints many times, adding a lot to the CPU +utilization of systemd and overall overhead of the host OS running OpenShift. +Changing systemd to reduce its scanning overhead is tracked in [BZ +1819868](https://bugzilla.redhat.com/show_bug.cgi?id=1819868), but we can work +around this exclusively within OpenShift. Using a separate mountpoint to for +both CRI-O and Kubelet mounts can completely segregate all container-specific +mounts away from any systemd interaction whatsoever. + +### Goals + +- CRI-O places all container mountpoints in a separate mount namespace from systemd +- Kubelet places all container mountpoints in this same separate mount namespace from systemd +- systemd does not see any CRI-O or Kubelet mountpoints +- CRI-O containers still see all appropriate Kubelet mountpoints +- Restarting either crio.service or kubelet.service does not result in the namespaces getting out-of-sync + +### Non-Goals + +- Fix systemd mountpoint scanning + +## Proposal + +We can create a separate mount namespace and cause both CRI-O and Kubelet to +launch within it to hide their many many mounts from systemd by creating: + +- A service called container-mount-namespace.service which spawns a separate + namespace and pins it to a well-known location. We don't want to create the + namespace in crio.service or kubelet.service, since if either one restarts they + would lose each other's namespaces. + +- An override file for each of crio.service and kubelet.service which wrap the + original command under `nsenter` so they both use the mount namespace created + by 'container-mount-namespace.service' + +With these in place, both Kubelet and CRI-O create their mounts in the new +shared (with each other) but private (from systemd) namespace. + +### User Stories + +The end-user experience should not be affected in any way by this proposal, as +there is no outward API changes. There is some supportability change though, +since anyone attempting to inspect the CRI-O or Kubelet mountpoints externally +would need to be aware that these are now available in a different namespace +than the default top-level systemd mount namespace. + +### Implementation Details/Notes/Constraints + +A working proof-of-concept implementation based on +[MCO](https://github.com/openshift/enhancements/pull/github.com/openshift/machine-config-operator) +configuration is available +[here](https://github.com/lack/redhat-notes/tree/main/crio_unshare_mounts) + +Here is the example container-mount-namespace.service: + + [Unit] + Description=Manages a mount namespace that both kubelet and crio can use to share their container-specific mounts + + [Service] + Type=oneshot + RemainAfterExit=yes + RuntimeDirectory=container-mount-namespace + Environment=RUNTIME_DIRECTORY=%t/container-mount-namespace + Environment=BIND_POINT=%t/container-mount-namespace/mnt + ExecStartPre=bash -c "findmnt ${RUNTIME_DIRECTORY} || mount --make-unbindable --bind ${RUNTIME_DIRECTORY} ${RUNTIME_DIRECTORY}" + ExecStartPre=touch ${BIND_POINT} + ExecStart=unshare --mount=${BIND_POINT} --propagation slave + ExecStop=umount -R ${RUNTIME_DIRECTORY} + +This needs to be managed separately from either CRI-O or Kubelet to avoid the +namespaces getting out of sync if either of those services restart. This +example no longer uses the systemd `PrivateMounts=on` facility as a previous +proof-of-concept did, but uses `unshare` to create the namespace with slave +propagation and additionally pins the mount namespace to a filesystem location +(`/run/container-mount-namespace/mnt` in the example). + +Both CRI-O and Kubelet can then find the associated namespace via +`nsenter` as follows: + + nsenter --mount=/run/container-mount-namespace/mnt $ORIGINAL_EXECSTART + +This will also necessitate adding `Requires=container-mount-namespace.service` +and `After=container-mount-namespace.service` to both crio.service and +kubelet.service as well to ensure the namespace pin is available when they +start. + +The proof-of-concept does this by injecting this override file for both +crio.service and kubelet.service, but a final implementation could do this by +editing the service files directly as well. + +### Risks and Mitigations + +The current OpenShift and Kubernetes implementations guarantee 3 things about +mountpoint visibility: +1. Mounts originating in the host OS are visible to CRI-O, OCI hooks, Kubelet, + and containers with `MountPropagation: HostToContainer` (or Bidirectional) +2. Mounts originating in CRI-O, OCI hooks, Kubelet, and containers with + `mountPropagation: Bidirectional` are visible to each other and containers + with `MountPropagation: HostToContainer` (or Bidirectional) +3. Mounts originating in CRI-O, OCK hooks, Kubelet, and containers with + `mountPropagation: Bidirectional` are visible to the host OS + +The first 2 guarantees are not changed by this proposal: +1. The new mount namespace uses 'slave' propagation, so any mounts originating + in the host OS top-level mount namespace are still propagated down into the + new 2nd-level namespace where they are visible to CRIO-O, OCI hooks, + Kubelet, and containers with `MountPropagation: HostToContainer` (or + Bidirectional), just as before. +2. CRI-O, OCI hooks, Kubelet, and any containers created by CRI-O, are all + within the same 2nd-level namespace, so they any mountpoints made by any of + these entities are visible to all others within that same mount namespace. + Additionally, any 3rd-level namespaces created below this point will have + the same relationship with the 2nd-level namespace that they previously had + with the higher-level namespace. + +The 3rd guarantee is explicitly broken by this proposal. + +This means that: +- Administrators who have connected to the host OS and want to inspect the + mountpoints originating from CRI-O, OCI hooks, Kubelet, or containers will + not be able to see them unless they explicitly enter the 2nd-level namespace. +- Any external or 3rd-party tools which run in the host mount namespace but + expect to see mountpoints created by CRI-O, OCI hooks, Kubelet, or containers + would need to be changed to enter the specific container namespace in order + to see them. + +We can mitigate this by adding a helper application to easily enter the right +mount namespace. The proof-of-concept work adds a helper to do just this. + +## Design Details + +### Open Questions + +Is the breaking of this "host OS sees CRI-O, Kubelet, and container +mountpoints" guarantee a deal-breaker? + +The implementation could take a few different forms. The proof-of-concept work +that preceded this work was a set of +[MCO](https://github.com/openshift/enhancements/pull/github.com/openshift/machine-config-operator) +changes. This could be carried forward into the new implementation, or we could +approach it another way: + > 1. Should these services and overrides be installed by MCO? + > 2. If so, should they be part of the main `00-worker` profile, or in a + > separate `02-worker-container-mount-namespace` object? + +For testing, the fact that there is a Kubernetes test that explicitly tests +that the container mountpoints are available to the parent operating system +implies that this may have been desirable to someone at some level at one time +in the past. + > 1. What is the reason for the Kubernetes test? Is it okay to just skip or + > disable the test in OpenShift? + > 2. Are there any external utilities or 3rd-party tools that assume they can + > have access to the CRI-O or Kubelet mountpoints in the top-level mount + > namespace? + +### Test Plan + +- Modify the existing Kubernetes e2e test that checks that all + mountpoints are in the parent mount namespace to check: + - All CRI-O and Kubelet mountpoints are visible only in the child + namespace and not the parent namespace + - Pod mountpoints in the 'master' 'slave' and 'private' categories still + propagate to eachother and the child namespace as expected + - Proof-of-concept e2e test changes + [here](https://github.com/openshift/kubernetes/compare/master...lack:hide_container_mountpoint) +- Pass all e2e tests at a similar rate. + +## Implementation History + +Initial proof-of-concept example is +[here](https://github.com/lack/redhat-notes/tree/main/crio_unshare_mounts). +It has MC objects that create: +- The new `container-mount-namespace.service' service +- Override files for both `crio.service` and `kubelet.service` which add the + appropriate systemd dependencies upon the `container-mount-namespace.service` + and wrap the ExecStart inside of `nsenter` +- A convenienc utility called `/usr/local/bin/nsenterCms` which can be used by + administrators or other software on the host to enter the new namespace. + +It also passed e2e tests at a fairly high rate on a 4.6.4 cluster: + +Parallel (Full output [here](https://raw.githubusercontent.com/lack/redhat-notes/main/crio_unshare_mounts/test_results/parallel.out)) + + Flaky tests: + + [k8s.io] [sig-node] Events should be sent by kubelets and the scheduler about pods scheduling and running [Conformance] [Suite:openshift/conformance/parallel/minimal] [Suite:k8s] + [sig-network] Conntrack should be able to preserve UDP traffic when server pod cycles for a ClusterIP service [Suite:openshift/conformance/parallel] [Suite:k8s] + + Failing tests: + + [k8s.io] [sig-node] Mount propagation should propagate mounts to the host [Suite:openshift/conformance/parallel] [Suite:k8s] + [sig-arch] Managed cluster should ensure control plane pods do not run in best-effort QoS [Suite:openshift/conformance/parallel] + [sig-network] Networking should provide Internet connection for containers [Feature:Networking-IPv4] [Skipped:azure] [Suite:openshift/conformance/parallel] [Suite:k8s] + + error: 5 fail, 951 pass, 1457 skip (20m27s) + +Serial (Full output [here](https://raw.githubusercontent.com/lack/redhat-notes/main/crio_unshare_mounts/test_results/serial.out)) + + Failing tests: + + [sig-auth][Feature:OpenShiftAuthorization][Serial] authorization TestAuthorizationResourceAccessReview should succeed [Suite:openshift/conformance/serial] + [sig-cluster-lifecycle][Feature:Machines][Serial] Managed cluster should grow and decrease when scaling different machineSets simultaneously [Suite:openshift/conformance/serial] + + error: 2 fail, 58 pass, 227 skip (53m14s) + +The only significant risk here is that of the `Mount propagation should +propagate mounts to the host` which is explictly and intentionally broken by +this proposal. See /Risks and Mitigations/ above for an in-depth description +of the changes. + +## Drawbacks + +- Requires re-wrapping CRI-O and Kubelet services in bash and `nsenter`, and + this is a little fragile. Could be mitigated by altering CRI-O and Kubelet to + take the mount namespace PID and/or file on the command line. On the other + hand, wrapping via `nsenter` could be reused upstream for all compatible + container runtimes. +- If the namespace service restarts and then either CRI-O or Kubelet restarts, + there will be a mismatch between the mount namespaces and containers will + start to fail. Could be mitigated by changing the namespace service to NOT + cleanup its pinned namespace but instead idempotently re-use a + previously-created namespace. However, given that the namespace service as + implemented today is a systemd `oneshot` with no actual process that needs + keeping alive, the risk of this terminating unexpectedly is very low. + +## Alternatives + +- Enhance systemd to support unsharing namespaces at the slice level, then put + crio.service and kubelet.service in the same slice +- Alter both CRI-O and Kubelet executables to take a mount namespace via + command line instead of requiring `nsenter` +- Do this work upstream in Kubernetes as opposed to OpenShift +