Add proposal for hiding container mountpoints from systemd

openshift · Jan 19, 2021 · 1026bc1
1 parent c22bbce
commit 1026bc1
Showing 1 changed file with 148 additions and 0 deletions.
diff --git a/enhancements/hide_container_mountpoints.md b/enhancements/hide_container_mountpoints.md
@@ -0,0 +1,148 @@
+---
+title: hide-container-mountpoints
+authors:
+  - "@lack"
+reviewers:
+  - "@haircommander"
+  - "@mrunalp"
+  - "@umohnani8"
+approvers:
+  - TBD
+creation-date: 2021-01-18
+last-updated: 2021-01-18
+status: implementable
+---
+
+# Hide Container Mountpoints
+
+## Release Signoff Checklist
+
+- [ ] Enhancement is `implementable`
+- [ ] Design details are appropriately documented from clear requirements
+- [ ] Test plan is defined
+- [ ] Operational readiness criteria is defined
+- [ ] Graduation criteria for dev preview, tech preview, GA
+- [ ] User-facing documentation is created in [openshift-docs](https://github.com/openshift/openshift-docs/)
+
+## Summary
+
+Kubelet and CRI-O currently both use the top-level mount namespace for all
+container and Kubelet mountpoints. Moving these container-specific mountpoints
+into a private namespace reduces systemd overhead with no difference in
+functionality.
+
+## Motivation
+
+Systemd scans and re-scans mountpoints many times, which adds significantly to
+systemd's CPU utilization and to the overall overhead of the host OS running
+OpenShift. Changing systemd to reduce its scanning overhead is tracked in [BZ
+1819868](https://bugzilla.redhat.com/show_bug.cgi?id=1819868), but we can work
+around this exclusively within OpenShift. Using a separate mount namespace for
+both CRI-O and Kubelet mounts can completely segregate all container-specific
+mounts away from any systemd interaction whatsoever.
+
+### Goals
+
+- CRI-O places all container mountpoints in a separate mount namespace from systemd
+- Kubelet places all container mountpoints in this same separate mount namespace from systemd
+- systemd does not see any CRI-O or Kubelet mountpoints
+- CRI-O containers still see all appropriate Kubelet mountpoints
+- Restarting either crio.service or kubelet.service does not result in the namespaces getting out-of-sync
+
+### Non-Goals
+
+- Fix systemd mountpoint scanning
+- Change CRI-O pinns to support pinning mount namespaces
+
+## Proposal
+
+We can create a separate mount namespace and have both CRI-O and Kubelet launch
+within it, hiding their many mounts from systemd, by creating:
+
+- A service called container-mount-namespace.service which spawns a separate
+  namespace (via systemd's 'PrivateMounts' mechanism), and then sleeps forever.
+  We don't want to create the namespace in crio.service or kubelet.service,
+  since if either one restarts they would lose each other's namespaces.
+
+- An override file for each of crio.service and kubelet.service which wraps
+  the original command under 'nsenter' so they both use the mount namespace
+  created by 'container-mount-namespace.service'
+
+With these in place, both Kubelet and CRI-O create their mounts in the new
+shared (with each other) but private (from systemd) namespace.
+
+### User Stories
+
+The end-user experience should not be affected in any way by this proposal, as
+there are no outward API changes. There is some supportability change though,
+since anyone attempting to inspect the CRI-O or Kubelet mountpoints externally
+would need to be aware that these are now available in a different namespace
+than the default top-level systemd mount namespace.
+
+### Implementation Details/Notes/Constraints
+
+Here is an example of the container-mount-namespace.service:
+
+    [Unit]
+    Description=Manages a mount namespace that both Kubelet and CRI-O can use to share their container-specific mounts
+
+    [Service]
+    PrivateMounts=on
+    ExecStart=bash -c 'while :; do sleep infinity; done'
+
+This needs to be managed separately from either CRI-O or Kubelet to avoid the
+namespaces getting out of sync if either of those services restarts. The
+systemd 'PrivateMounts=on' facility does a nice job of creating a separate
+mount namespace, and both CRI-O and Kubelet can find the associated namespace
+via 'nsenter' as follows:
+
+    nsenter -m -t $(systemctl show --property MainPID --value container-mount-namespace.service) $ORIGINAL_EXECSTART
+
+This will necessitate adding `Requires=container-mount-namespace.service` and
+`After=container-mount-namespace.service` to both crio.service and
+kubelet.service as well to ensure the PID is available.
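+
+As a rough sketch of how these pieces could fit together, a drop-in override
+for crio.service might look like the following. The drop-in path and file name
+are hypothetical, and `/usr/bin/crio` is an assumed stand-in for the unit's
+original `ExecStart` line, which would have to be copied from the shipped unit.
+Because systemd does not perform shell command substitution itself, the sketch
+wraps the command in bash and writes `$$` so that systemd passes a literal `$`
+through to the shell.
+
+```ini
+# Hypothetical path: /etc/systemd/system/crio.service.d/20-container-mount-namespace.conf
+[Unit]
+Requires=container-mount-namespace.service
+After=container-mount-namespace.service
+
+[Service]
+# An empty assignment clears the inherited ExecStart before it is redefined.
+ExecStart=
+# /usr/bin/crio is a placeholder for the original ExecStart command line.
+ExecStart=/bin/bash -c 'exec nsenter -m -t "$$(systemctl show --property MainPID --value container-mount-namespace.service)" /usr/bin/crio'
+```
+
+An equivalent drop-in for kubelet.service would follow the same pattern.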
+
+### Risks and Mitigations
+
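+Any external or third-party tools that expect to find the CRI-O or Kubelet
+mountpoints in the top-level mount namespace would need to change to look in
+the new namespace instead.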
+
+## Design Details
+
+### Open Questions
+
+The fact that there is a Kubernetes test that explicitly tests that the
+container mountpoints are available to the parent operating system implies that
+this may have been desirable to someone at some level at one time in the past.
+ > 1. What is the reason for the Kubernetes test? Is it okay to just skip or
+ >    disable the test in OpenShift?
+ > 2. Are there any external utilities or 3rd-party tools that assume they can
+ >    have access to the CRI-O or Kubelet mountpoints in the top-level mount
+ >    namespace?
+
+### Test Plan
+
+- Disable the existing Kubernetes e2e test that checks that all mountpoints are
+  in the parent mount namespace
+- Ensure all CRI-O and Kubelet mountpoints are visible only in the child
+  namespace and not the parent namespace
+
+## Implementation History
+
+- Initial proof-of-concept example [here](https://github.com/lack/redhat-notes/tree/main/crio_unshare_mounts)
+
+## Drawbacks
+
+- Requires re-wrapping CRI-O and Kubelet services in bash and nsenter.
+- Requires get-and-set of the ExecStart stanza from both of those services.
+- If the namespace service restarts and then either CRI-O or Kubelet restarts,
+  there will be a mismatch between the mount namespaces and containers will
+  start to fail. This could be mitigated by enhancing the CRI-O pinns utility
+  to pin and reuse mount namespaces.
+
+## Alternatives
+
+- Enhance systemd to support unsharing namespaces at the slice level, then put
+  crio.service and kubelet.service in the same slice
+- Alter both the CRI-O and Kubelet executables to take a mount namespace via
+  the command line instead of requiring `nsenter`
+- Do this work upstream in Kubernetes as opposed to OpenShift