diff --git a/enhancements/hide_container_mountpoints.md b/enhancements/hide_container_mountpoints.md
new file mode 100644
index 00000000000..e1efe592841
--- /dev/null
+++ b/enhancements/hide_container_mountpoints.md
@@ -0,0 +1,148 @@
+---
+title: hide-container-mountpoints
+authors:
+  - "@lack"
+reviewers:
+  - "@haircommander"
+  - "@mrunalp"
+  - "@umohnani8"
+approvers:
+  - TBD
+creation-date: 2021-01-18
+last-updated: 2021-01-18
+status: implementable
+---
+
+# Hide Container Mountpoints
+
+## Release Signoff Checklist
+
+- [ ] Enhancement is `implementable`
+- [ ] Design details are appropriately documented from clear requirements
+- [ ] Test plan is defined
+- [ ] Operational readiness criteria is defined
+- [ ] Graduation criteria for dev preview, tech preview, GA
+- [ ] User-facing documentation is created in [openshift-docs](https://github.com/openshift/openshift-docs/)
+
+## Summary
+
+Kubelet and CRI-O currently both use the top-level mount namespace for all
+container and Kubelet mountpoints. Moving these container-specific mountpoints
+into a private mount namespace reduces systemd overhead with no difference in
+functionality.
+
+## Motivation
+
+Systemd scans and re-scans mountpoints many times, adding significantly to the
+CPU utilization of systemd and to the overall overhead of the host OS running
+OpenShift. Changing systemd to reduce its scanning overhead is tracked in [BZ
+1819868](https://bugzilla.redhat.com/show_bug.cgi?id=1819868), but we can work
+around this exclusively within OpenShift. Using a separate mount namespace for
+both CRI-O and Kubelet mounts can completely segregate all container-specific
+mounts away from any systemd interaction whatsoever.
+
+### Goals
+
+- CRI-O places all container mountpoints in a separate mount namespace from systemd
+- Kubelet places all container mountpoints in this same separate mount namespace from systemd
+- systemd does not see any CRI-O or Kubelet mountpoints
+- CRI-O containers still see all appropriate Kubelet mountpoints
+- Restarting either crio.service or kubelet.service does not result in the namespaces getting out-of-sync
+
+### Non-Goals
+
+- Fix systemd mountpoint scanning
+- Change CRI-O pinns to support pinning mount namespaces
+
+## Proposal
+
+We can create a separate mount namespace and cause both CRI-O and Kubelet to
+launch within it, hiding their many mounts from systemd, by creating:
+
+- A service called container-mount-namespace.service which spawns a separate
+  namespace (via systemd's 'PrivateMounts' mechanism), and then sleeps forever.
+  We don't want to create the namespace in crio.service or kubelet.service,
+  since a restart of either one would leave the two in different namespaces.
+
+- An override file for each of crio.service and kubelet.service which wraps the
+  original command under 'nsenter' so they both use the mount namespace created
+  by 'container-mount-namespace.service'
+
+With these in place, both Kubelet and CRI-O create their mounts in the new
+shared (with each other) but private (from systemd) namespace.
+
+### User Stories
+
+The end-user experience should not be affected in any way by this proposal, as
+there are no outward API changes. There is some supportability change though,
+since anyone attempting to inspect the CRI-O or Kubelet mountpoints externally
+would need to be aware that these are now available in a different namespace
+than the default top-level systemd mount namespace.
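
As a rough illustration of the supportability change, here is a minimal sketch of how someone might inspect these mounts from the host. It assumes the `container-mount-namespace.service` unit proposed here is running, and it reuses the same `systemctl show` and `nsenter` invocations that this proposal uses to wrap CRI-O and Kubelet. The helper name `in_container_mountns` is hypothetical and not part of the proposal.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the proposal): run a command inside the
# mount namespace held open by container-mount-namespace.service.
in_container_mountns() {
    local pid
    # Look up the PID of the long-running namespace holder process.
    pid=$(systemctl show --property MainPID --value container-mount-namespace.service) || return 1
    # systemctl reports MainPID=0 when the service has no running main process.
    if [ -z "$pid" ] || [ "$pid" = "0" ]; then
        echo "container-mount-namespace.service is not running" >&2
        return 1
    fi
    # Enter only the mount namespace (-m) of that PID and run the given command.
    nsenter -m -t "$pid" "$@"
}
```

For example, `in_container_mountns findmnt` would list the mounts visible inside the shared namespace, whereas a plain `findmnt` on the host would show only the top-level namespace. Entering another process's mount namespace generally requires root privileges.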
+
+### Implementation Details/Notes/Constraints
+
+Here is an example of the container-mount-namespace.service:
+
+    [Unit]
+    Description=Manages a mount namespace that both Kubelet and CRI-O can use to share their container-specific mounts
+
+    [Service]
+    PrivateMounts=on
+    ExecStart=bash -c 'while :; do sleep infinity; done'
+
+This needs to be managed separately from both CRI-O and Kubelet to avoid the
+namespaces getting out of sync if either of those services restarts. The systemd
+'PrivateMounts=on' facility creates a separate mount namespace for the service,
+and both CRI-O and Kubelet can enter that namespace via 'nsenter' as
+follows:
+
+    nsenter -m -t $(systemctl show --property MainPID --value container-mount-namespace.service) $ORIGINAL_EXECSTART
+
+This also requires adding `Requires=container-mount-namespace.service` and
+`After=container-mount-namespace.service` to both crio.service and
+kubelet.service, to ensure the PID is available before they start.
+
+### Risks and Mitigations
+
+Any external or third-party tools that expect to see these mountpoints in the
+top-level mount namespace would need to change to match.
+
+## Design Details
+
+### Open Questions
+
+The fact that there is a Kubernetes test that explicitly checks that the
+container mountpoints are available to the parent operating system implies that
+this visibility was desirable to someone at some point in the past.
+ > 1. What is the reason for the Kubernetes test? Is it okay to just skip or
+ > disable the test in OpenShift?
+ > 2. Are there any external utilities or 3rd-party tools that assume they can
+ > have access to the CRI-O or Kubelet mountpoints in the top-level mount
+ > namespace?
+
+### Test Plan
+
+- Disable the existing Kubernetes e2e test that checks that all mountpoints are
+  in the parent mount namespace
+- Ensure all CRI-O and Kubelet mountpoints are visible only in the child
+  namespace and not the parent namespace
+
+## Implementation History
+
+- Initial proof-of-concept example [here](https://github.com/lack/redhat-notes/tree/main/crio_unshare_mounts)
+
+## Drawbacks
+
+- Requires re-wrapping CRI-O and Kubelet services in bash and nsenter.
+- Requires reading and rewriting the ExecStart setting of both of those services.
+- If the namespace service restarts and then either CRI-O or Kubelet restarts,
+  there will be a mismatch between the mount namespaces and containers will
+  start to fail. Could be mitigated by enhancing the CRI-O pinns utility to pin
+  and reuse mount namespaces.
+
+## Alternatives
+
+- Enhance systemd to support unsharing namespaces at the slice level, then put
+  crio.service and kubelet.service in the same slice
+- Alter both CRI-O and Kubelet executables to take a mount namespace via
+  the command line instead of requiring `nsenter`
+- Do this work upstream in Kubernetes as opposed to OpenShift