---
title: hide-container-mountpoints
authors:
  - "@lack"
reviewers:
  - "@haircommander"
  - "@mrunalp"
  - "@umohnani8"
approvers:
  - TBD
creation-date: 2021-01-18
last-updated: 2021-01-18
status: implementable
---

# Hide Container Mountpoints

## Release Signoff Checklist

- [ ] Enhancement is `implementable`
- [ ] Design details are appropriately documented from clear requirements
- [ ] Test plan is defined
- [ ] Operational readiness criteria are defined
- [ ] Graduation criteria for dev preview, tech preview, GA
- [ ] User-facing documentation is created in [openshift-docs](https://github.com/openshift/openshift-docs/)

## Summary

The current implementations of Kubelet and CRI-O both use the top-level mount
namespace for all container and Kubelet mountpoints. Moving these
container-specific mountpoints into a private namespace reduces systemd
overhead with no difference in functionality.

## Motivation

Systemd scans and re-scans mountpoints many times, adding substantially to the
CPU utilization of systemd and to the overall overhead of the host OS running
OpenShift. Changing systemd to reduce its scanning overhead is tracked in [BZ
1819868](https://bugzilla.redhat.com/show_bug.cgi?id=1819868), but we can work
around the problem entirely within OpenShift. Using a separate mount namespace
for both CRI-O and Kubelet mounts segregates all container-specific mounts
away from any systemd interaction whatsoever.

### Goals

- CRI-O places all container mountpoints in a mount namespace separate from systemd's
- Kubelet places all of its container mountpoints in that same namespace
- systemd does not see any CRI-O or Kubelet mountpoints
- CRI-O containers still see all appropriate Kubelet mountpoints
- Restarting either crio.service or kubelet.service does not result in the namespaces getting out of sync

### Non-Goals

- Fix systemd mountpoint scanning
- Change CRI-O's pinns utility to support pinning mount namespaces

## Proposal

We can create a separate mount namespace and launch both CRI-O and Kubelet
within it, hiding their many mounts from systemd, by creating:

- A service called container-mount-namespace.service which spawns a separate
  mount namespace (via systemd's `PrivateMounts=` mechanism) and then sleeps
  forever. We don't want to create the namespace in crio.service or
  kubelet.service, since if the creating service restarted it would get a new
  namespace and leave the other service behind in the old one.

- An override file for each of crio.service and kubelet.service which wraps
  the original command under `nsenter`, so that both services use the mount
  namespace created by container-mount-namespace.service.

With these in place, both Kubelet and CRI-O create their mounts in the new
namespace, which is shared with each other but private from systemd.

### User Stories

The end-user experience should not be affected in any way by this proposal, as
there are no outward API changes. There is a supportability change, though:
anyone attempting to inspect the CRI-O or Kubelet mountpoints externally needs
to be aware that these are now available in a different namespace than the
default top-level systemd mount namespace.

### Implementation Details/Notes/Constraints

Here is an example of container-mount-namespace.service:

```
[Unit]
Description=Manages a mount namespace that both Kubelet and CRI-O can use to share their container-specific mounts

[Service]
PrivateMounts=on
ExecStart=bash -c 'while :; do sleep infinity; done'
```

This needs to be managed separately from either CRI-O or Kubelet to avoid the
namespaces getting out of sync if either of those services restarts. The
systemd `PrivateMounts=on` facility does a nice job of creating a separate
mount namespace, and both CRI-O and Kubelet can enter the associated namespace
via `nsenter` as follows:

```
nsenter -m -t $(systemctl show --property MainPID --value container-mount-namespace.service) $ORIGINAL_EXECSTART
```

This will necessitate adding `Requires=container-mount-namespace.service` and
`After=container-mount-namespace.service` to both crio.service and
kubelet.service as well, to ensure the PID is available.

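As a sketch, the override drop-in for crio.service could combine the dependency and the wrapping like this. The file path, the `bash -c` form, and the `$ORIGINAL_EXECSTART` placeholder are illustrative assumptions; the real override must substitute the actual ExecStart line fetched from the installed unit:

```
# Hypothetical drop-in: /etc/systemd/system/crio.service.d/10-mount-namespace.conf
[Unit]
Requires=container-mount-namespace.service
After=container-mount-namespace.service

[Service]
# An empty ExecStart= clears the inherited command before re-adding it.
# systemd does not expand $(...) itself, hence the bash -c wrapper.
ExecStart=
ExecStart=/bin/bash -c 'exec nsenter -m -t "$(systemctl show --property MainPID --value container-mount-namespace.service)" -- $ORIGINAL_EXECSTART'
```

An analogous drop-in would be installed for kubelet.service.
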
### Risks and Mitigations

Any other external or 3rd-party tools that inspect the CRI-O or Kubelet
mountpoints would need to change to match.

## Design Details

### Open Questions

The implementation could take a few different forms. The proof-of-concept work
that preceded this proposal was a set of MCO changes. This could be carried
forward into the new implementation, or we could approach it another way:

> 1. Should these services and overrides be installed when we create
>    kubelet.service in `configure-helper.sh`? Or should we keep with an
>    MCO-based approach (see the original proof-of-concept discussion in
>    Implementation History below)?
> 2. Should this rely on the pragmatic current solution of a systemd service
>    creating the mount namespace, or should we alter CRI-O's pinns to create
>    the namespace?
> 3. Should this rely on the pragmatic current solution of using `nsenter` to
>    wrap the command invocation for CRI-O and Kubelet, or should both CRI-O
>    and Kubelet be altered to accept a mount namespace PID or file as a
>    commandline option?

For testing, the fact that there is a Kubernetes test that explicitly checks
that the container mountpoints are available to the parent operating system
implies that this may have been desirable to someone at some level at one time
in the past.

> 1. What is the reason for the Kubernetes test? Is it okay to just skip or
>    disable the test in OpenShift?
> 2. Are there any external utilities or 3rd-party tools that assume they can
>    have access to the CRI-O or Kubelet mountpoints in the top-level mount
>    namespace?

### Test Plan

- Skip or modify the existing Kubernetes e2e test that checks that all
  mountpoints are in the parent mount namespace
- Ensure all CRI-O and Kubelet mountpoints are visible only in the child
  namespace and not the parent namespace
- Pass all other e2e tests at a rate similar to today's

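The second bullet can also be spot-checked by hand. The commands below are an illustrative sketch, not an automated test; they assume root access on a running host with container-mount-namespace.service installed:

```shell
# Mount-namespace identity of systemd (PID 1) vs. the pinning service;
# differing mnt:[...] inode numbers confirm the namespaces are distinct.
MAINPID=$(systemctl show --property MainPID --value container-mount-namespace.service)
readlink /proc/1/ns/mnt
readlink "/proc/${MAINPID}/ns/mnt"

# Container mounts (e.g. overlayfs) should be absent from the host namespace
# but visible inside the shared namespace:
findmnt -t overlay
nsenter -m -t "${MAINPID}" findmnt -t overlay
```
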
## Implementation History

The initial proof-of-concept example is
[here](https://github.com/lack/redhat-notes/tree/main/crio_unshare_mounts).
It consisted of some makefile and python scripting to fetch the current
crio.service and kubelet.service ExecStart lines, then craft MachineConfig
objects to create the new service and the overrides for crio.service and
kubelet.service.

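The "fetch and rewrite ExecStart" step can be sketched in Python as below. This is not the actual proof-of-concept script; the regex, the `bash -c` wrapping, and the sample unit text are illustrative assumptions:

```python
import re

# systemd does not perform $(...) command substitution in ExecStart=,
# so the rewritten command is launched through `bash -c` (matching the
# "re-wrapping ... in bash and nsenter" drawback noted in this proposal).
WRAPPER = ("/bin/bash -c 'exec nsenter -m -t "
           '"$(systemctl show --property MainPID --value '
           'container-mount-namespace.service)" -- {cmd}\'')

def wrap_execstart(unit_text: str) -> str:
    """Rewrite every ExecStart= line of a systemd unit so that the
    original command is launched inside the shared mount namespace."""
    def _wrap(m: re.Match) -> str:
        return "ExecStart=" + WRAPPER.format(cmd=m.group(1))
    return re.sub(r"^ExecStart=(.+)$", _wrap, unit_text, flags=re.MULTILINE)

# Example: rewriting a minimal, hypothetical crio.service [Service] section
print(wrap_execstart("[Service]\nExecStart=/usr/bin/crio\n"))
```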
It also passed e2e tests at a fairly high rate on a 4.6.4 cluster:

- Parallel (full output [here](https://raw.githubusercontent.com/lack/redhat-notes/main/crio_unshare_mounts/test_results/parallel.out))

  Flaky tests:

  ```
  [k8s.io] [sig-node] Events should be sent by kubelets and the scheduler about pods scheduling and running [Conformance] [Suite:openshift/conformance/parallel/minimal] [Suite:k8s]
  [sig-network] Conntrack should be able to preserve UDP traffic when server pod cycles for a ClusterIP service [Suite:openshift/conformance/parallel] [Suite:k8s]
  ```

  Failing tests:

  ```
  [k8s.io] [sig-node] Mount propagation should propagate mounts to the host [Suite:openshift/conformance/parallel] [Suite:k8s]
  [sig-arch] Managed cluster should ensure control plane pods do not run in best-effort QoS [Suite:openshift/conformance/parallel]
  [sig-network] Networking should provide Internet connection for containers [Feature:Networking-IPv4] [Skipped:azure] [Suite:openshift/conformance/parallel] [Suite:k8s]
  ```

  `error: 5 fail, 951 pass, 1457 skip (20m27s)`

- Serial (full output [here](https://raw.githubusercontent.com/lack/redhat-notes/main/crio_unshare_mounts/test_results/serial.out))

  Failing tests:

  ```
  [sig-auth][Feature:OpenShiftAuthorization][Serial] authorization TestAuthorizationResourceAccessReview should succeed [Suite:openshift/conformance/serial]
  [sig-cluster-lifecycle][Feature:Machines][Serial] Managed cluster should grow and decrease when scaling different machineSets simultaneously [Suite:openshift/conformance/serial]
  ```

  `error: 2 fail, 58 pass, 227 skip (53m14s)`

## Drawbacks

- Requires re-wrapping the CRI-O and Kubelet services in bash and nsenter.
- Requires a get-and-set of the ExecStart stanza from both of those services.
- If the namespace service restarts and then either CRI-O or Kubelet restarts,
  there will be a mismatch between the mount namespaces and containers will
  start to fail. This could be mitigated by enhancing CRI-O's pinns utility to
  pin and reuse mount namespaces.

## Alternatives

- Enhance systemd to support unsharing namespaces at the slice level, then put
  crio.service and kubelet.service in the same slice
- Alter both the CRI-O and Kubelet executables to take a mount namespace via a
  command-line option instead of requiring `nsenter`
- Do this work upstream in Kubernetes as opposed to OpenShift