
Add proposal for hiding container mountpoints from systemd #593

Closed
wants to merge 1 commit into from

Conversation

@lack (Contributor) commented Jan 19, 2021

This is my first attempt at an openshift enhancement, all comments and suggestions are welcome!


> ### Risks and Mitigations
>
> Any other external or 3rd-party tools would need to change to match.
Member

A field should be added in open questions to investigate who may be affected by this.

Member

Yeah, the big worry is about anything external that cares about the mounts.

Member

I thought we hit a perf issue with these mounts being inspected in the systemd cadvisor handler.

google/cadvisor#1573

I am not aware of who actually cares about the mounts in question at this point...

Member

Podman/Skopeo/Buildah will also need to run in the same mount namespace; otherwise they will end up setting up the storage in the host mount namespace, possibly causing corruption of the storage.

Contributor

CSI driver pods / containers typically mount /var/lib/kubelet into their pods with rshared mount propagation. They must mount volumes there and kubelet + the container runtime must see them there. Any change in this behavior will likely break storage in OCP clusters.

What will a pod with a /var/lib/kubelet HostPath volume actually see? The systemd or the kubelet mount namespace?
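
For illustration, a minimal sketch of the pattern described above (the pod name and image are placeholders, not taken from any particular CSI driver): a privileged node-plugin container mounting /var/lib/kubelet from the host with Bidirectional ("rshared") propagation, so the volumes it mounts there are also seen by kubelet and the container runtime:

    oc apply -f - <<'EOF'
    apiVersion: v1
    kind: Pod
    metadata:
      name: example-csi-node-plugin        # hypothetical name
    spec:
      containers:
      - name: node-plugin
        image: example.com/csi-node-plugin:latest   # placeholder image
        securityContext:
          privileged: true                 # Bidirectional propagation requires privileged
        volumeMounts:
        - name: kubelet-dir
          mountPath: /var/lib/kubelet
          mountPropagation: Bidirectional  # "rshared" in Linux terminology
      volumes:
      - name: kubelet-dir
        hostPath:
          path: /var/lib/kubelet
          type: Directory
    EOF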

Contributor Author

@giuseppe : Can you give me a little more context or an example of how Podman/Skopeo/Buildah normally run in conjunction with OpenShift? If they set up storage in the host mount namespace this will be propagated down to the container mount namespace (because it's a slave of the host mount namespace); so I'm unsure how much of a concern this is.

Member

I think a potential issue is with /var/lib/containers/storage/overlay, as it is always mounted MS_PRIVATE and it is not propagated to different namespaces.

Member

This idea is very similar to rootless podman unshare, which executes the specified command in a different mount (and, in that case, user) namespace where all the containers are running. For rootless it is a necessity, because it needs a user namespace to gain capabilities.

I think in the future we could teach other container tools to check whether there are images/containers mounted in a different mount namespace and try to join it, or at least raise a warning.
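
As a rough sketch of that "check and join" idea (the pinned-namespace path /run/container-mount-namespace/mnt is purely an assumption for illustration, not something any tool looks for today):

    # Hypothetical wrapper: if container storage lives in a separate, pinned
    # mount namespace, re-exec the tool inside it rather than run in the host
    # (systemd) namespace; the check is a no-op once we are already inside.
    PINNED_NS=/run/container-mount-namespace/mnt    # assumed pin location

    if [ -e "$PINNED_NS" ] && \
       [ "$(readlink /proc/self/ns/mnt)" != "$(nsenter --mount="$PINNED_NS" readlink /proc/self/ns/mnt)" ]; then
        echo "warning: container storage is mounted in a different mount namespace; joining it" >&2
        exec nsenter --mount="$PINNED_NS" -- "$0" "$@"
    fi

    # ... normal tool logic continues here, now in the correct namespace ...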

Contributor Author

@giuseppe : When you mention /var/lib/containers/storage/overlay, I'd love to learn more about that filesystem and how it's used and shared between various tools, especially since it's mounted MS_PRIVATE.

If you don't mind taking the time to explain, I'd like to understand more about which process mounts the filesystem there, and which other processes expect to be able to act on it.

Member

Any container tool based on containers/storage (so all of CRI-O, Podman, Skopeo, Buildah).

The first access to the storage makes sure that /var/lib/containers/storage/$DRIVER is a mount point and it is made MS_PRIVATE. All containers and images are mounted under this directory. There is also some cleanup performed when there are no more containers and images mounted.

The reason for MS_PRIVATE is to prevent each mountpoint from being propagated into the mount namespace of every newly created container, avoiding the cost of unmounting them there.
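
For anyone who wants to see this on a node, a quick way to inspect it (assuming the overlay driver and the default storage root):

    # Propagation flag of the containers/storage mount point itself:
    findmnt -o TARGET,FSTYPE,PROPAGATION /var/lib/containers/storage/overlay

    # Recursively list the per-container overlay mounts created beneath it:
    findmnt -R -o TARGET,FSTYPE,PROPAGATION /var/lib/containers/storage/overlay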

@romfreiman commented Jan 19, 2021

+1, we saw in the single-node OpenShift flow that systemd consumes significant CPU resources.

> 1819868](https://bugzilla.redhat.com/show_bug.cgi?id=1819868), but we can work
> around this exclusively within OpenShift. Using a separate mountpoint for
> both CRI-O and kubelet mounts can completely segregate all container-specific
> mounts away from any systemd interaction whatsoever.
Member

Kubelet uses systemd to set up mounts. Any impact to that?

Contributor Author

Can you expand a bit more on what you mean by this? What is the current kubelet/systemd interaction like?

Contributor

In OCP 3.x we allowed kubelet to run in a docker container. Therefore it called nsenter to escape the container's mount namespace (mount propagation was not used at that time) and systemd-run --scope -- mount -t <type> <what> <where> to run any potential fuse mount daemons outside of kubelet's pid namespace / kubelet service cgroup.

nsenter was removed, but we still carry systemd-run ... mount <what> <where> to run fuse daemons outside of kubelet's cgroup; otherwise systemctl restart kubelet.service would kill them.
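
For concreteness, the pattern looks roughly like this (the fstype and target path are made-up examples, not what kubelet actually passes):

    # Run the mount helper in a transient scope so that any long-lived fuse
    # daemon it spawns lands outside kubelet's service cgroup and therefore
    # survives "systemctl restart kubelet.service":
    systemd-run --scope -- mount -t fuse.sshfs user@example.com:/export /var/lib/kubelet/plugins/example/mount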

@jsafrane (Contributor) Jan 21, 2021

Btw, this means that there is a short-lived systemd service that does the mount and dies. We do not support any fuse-based storage in 4.x. Upstream kubernetes does though. I don't think it affects how systemd handles mounts.


@giuseppe Could this be related to BZ https://bugzilla.redhat.com/show_bug.cgi?id=1903553 and how systemd handles mount points?

Contributor Author

@zvonkok : Can you talk to me a bit more about the need for the bidirectional mount in /run/nvidia/driver? I get the sense that this is mounted by a container and intended to be seen outside that container. Are the contents needed by other containers? Or by some piece of software running on the host, outside of the container runtime?

Member

@zvonkok Hiding such mounts from systemd could probably help with the issue you are seeing.

@zvonkok Feb 10, 2021

@lack The situation I have is a Pod running privileged + rbind + systemd in the container. Now if I delete the container, systemd stops all services in the container but also on the node, and the node is rendered NotReady; only a reboot can "save" it. If I use the Pod without the rbind, meaning privileged + systemd, it works as expected. It looks like the rbind is introducing something that throws off systemd. I will test your POC on my cluster to see if it helps or not and update the BZ. /cc @giuseppe

Contributor Author

Ah, that's interesting... If the cause is an instance of systemd beneath the container runtime namespace that gets over-eager because it sees state from the top-level systemd, my proposal may not be the help you need; I think the rbind will still see the state that's properly owned by the parent systemd, so the container systemd would still do too much.

If the cause is the top-level systemd seeing state it shouldn't from mountpoints owned by the container runtime or the container itself, then my fix might be helpful, since systemd won't see anything under mountpoints created by the container runtime or the containers themselves.

Contributor Author

@jsafrane : Coming back to your original comment regarding systemd-run: if kubelet uses systemd-run to mount a filesystem, that mount happens in the top-level systemd mount namespace and is then propagated down and seen by CRI-O/kubelet/etc., because their namespace is a 'slave' of that top-level mount namespace. So this proposal doesn't interfere with that working; in fact, that method of mounting a filesystem circumvents this proposal's mechanism and makes such mounts visible to systemd and the host OS again.
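
To make the propagation direction concrete, a throwaway demonstration (run as root on a disposable machine; the paths and the pin location are arbitrary examples, not part of the proposal):

    # Pin a new mount namespace whose mounts are slaved to the host's. The pin
    # file must sit on a private mount, otherwise the kernel refuses to bind
    # the namespace file (it would propagate into the namespace itself).
    mkdir -p /run/demo-ns
    mount --bind /run/demo-ns /run/demo-ns
    mount --make-private /run/demo-ns
    touch /run/demo-ns/mnt
    unshare --mount=/run/demo-ns/mnt --propagation=slave true

    # A mount made in the host (systemd) namespace propagates down into the slave copy...
    mkdir -p /mnt/from-host /mnt/from-container
    mount -t tmpfs tmpfs /mnt/from-host
    nsenter --mount=/run/demo-ns/mnt findmnt /mnt/from-host        # visible inside

    # ...but a mount made inside the slave namespace never reaches the host.
    nsenter --mount=/run/demo-ns/mnt mount -t tmpfs tmpfs /mnt/from-container
    findmnt /mnt/from-container                                    # empty on the host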

@lack lack force-pushed the hide_container_mountpoints branch 3 times, most recently from 1026bc1 to 26a30b7 on January 20, 2021 15:15
@lack (Contributor Author) commented Jan 20, 2021

@romfreiman : Some ad-hoc testing we did with my workaround shows a noticeable reduction in systemd CPU usage. No hard numbers yet.

@lack lack force-pushed the hide_container_mountpoints branch from 26a30b7 to a2a86f0 on January 20, 2021 17:25
@rphillips (Contributor)

cc @jsafrane

@rphillips (Contributor)

cc @kolyshkin @giuseppe

Comment on lines +131 to +197
> 1. What is the reason for the Kubernetes test? Is it okay to just skip or
> disable the test in OpenShift?
Contributor

Is it the Mount propagation test? I wrote the test, together with the mount propagation support in Kubernetes.

Documentation says:

    // MountPropagationBidirectional means that the volume in a container will
    // receive new mounts from the host or other containers, and its own mounts
    // will be propagated from the container to the host or other containers.
    // Note that this mode is recursively applied to all mounts in the volume
    // ("rshared" in Linux terminology).

It's not specified there what the "host" is. The test defines it as something that's available under nsenter --mount=/rootfs/proc/1/ns/mnt ... and that's what Kubernetes has guaranteed so far. Changing this behavior may have consequences that we do not know about. I mentioned CSI drivers above, but we never know what people use mount propagation for.

Contributor Author

Thank you for that information!

Yes, that is the test I was concerned about, because it implied some kind of relationship between that host mount namespace and the container's. Basically, instead of being under /rootfs/proc/1/ns/mnt, kubelet and the container runtime would be under a slightly different namespace, one level below that one.

So yes, this proposal would intentionally change that guarantee, but not all of it.

What would stay the same:

  • The volume in a container will receive new mounts from other containers
  • Its own mounts will be propagated from the container to other containers

What I'm pretty sure we can keep (though I have to double-check whether my current proof-of-concept does this today):

  • The volume in a container will receive new mounts from the host

What would necessarily change:

  • Its own mounts would no longer be propagated up to the host

I'll have to do some further experiments with my proof-of-concept to be sure, but I suspect the CSI case you mentioned should be okay, since it's in a container and therefore already under the lower-level umbrella namespace, thus will continue to interoperate fine with any other containers also rooted in that namespace.

Contributor

What would necessarily change:

  • Its own mounts would no longer be propagated up to the host

As you wrote, CSI drivers might be fine with that - we can test the ones that we ship and we could give 3rd party vendors some time to test theirs.

Still, it's a significant change from upstream. Everyone who uses mount propagation for anything other than /var/lib/kubelet will be broken. cc @openshift/openshift-architects

Contributor

Its own mounts would no longer be propagated up to the host

Can you give me some hypothetical examples of things that go into mounts that admins on the host wouldn't be able to see?

  • Mounted cloud disks?
  • Empty dirs?
  • Host level containers that create a directory in a host dir? such as an agent that creates logs on the host?

The first two, if true, would be frustrating but not the end of the world for admins. The last one would be a blocker if true. If it is, is there a way to a) identify workloads where sharing with the host is important and b) separate those pods from regular workload pods when mounting?

Contributor

I don't think that's a big problem:

  1. we can document for admins how to nsenter into the right namespace to see the mounts (see the sketch after this list)
  2. any pod created by CRI-O would be in the same mount namespace, so it will see the mounts (incl. oc debug, etc.)
  3. if in the future we ever want to start each pod in its own mount namespace, we can add a privileged flag to inherit the CRI-O namespace, so privileged pods would still be able to see them
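
A sketch of what item 1 could look like for an admin on the host (reaching the namespace via CRI-O's PID here; the final implementation might instead expose a pinned namespace file to enter):

    # Enter the container runtime's mount namespace and list the kubelet and
    # container mounts that the host systemd no longer sees:
    CRIO_PID=$(pgrep -o -x crio)
    nsenter --mount --target "$CRIO_PID" -- findmnt | grep -E 'kubelet|containers'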

Contributor Author

After some conversation with @zvonkok: in the case of the special-resource-operator, the rshared mount of /run exists specifically so that OCI hooks (like prestart) can mount things under /run that are then consumed by other containers. And because OCI hooks run inside the container runtime namespace (i.e., as a process somewhere within the process tree of CRI-O), under my proposal the mount will end up in the lower-level namespace and still be visible to all containers. Based on my understanding so far, the host doesn't actually need access to the mountpoints, as long as the OCI prestart hook and the containers have common access to them.
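
For context on how such hooks get registered, a hedged example in the oci-hooks(5) format (the hook path and annotation are hypothetical, not the special-resource-operator's actual configuration):

    # Drop a hook definition where CRI-O looks for them; the hook binary runs
    # at the prestart stage, inside the container runtime's mount namespace.
    cat > /etc/containers/oci/hooks.d/99-example-prestart.json <<'EOF'
    {
      "version": "1.0.0",
      "hook": {
        "path": "/usr/local/bin/example-prestart-hook"
      },
      "when": {
        "annotations": {
          "example.com/enable-hook": "true"
        }
      },
      "stages": ["prestart"]
    }
    EOF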

Contributor Author

I've updated the proposal to explicitly mention that OCI hooks will run inside this container-mount-namespace.

Contributor Author

I have also just posted a summary of this proposal to a few mailing lists to see if there are any specific use-cases or solutions people know about that this proposal would interfere with.

Member

@lack I agree with @jsafrane re: opening an enhancement upstream. We can then discuss it in sigs and mailing list to make sure we aren't breaking any use cases unintentionally.

Contributor Author

Understood! I am now taking this discussion to sig-node, sig-storage, and the upstream community for comments.

@russellb (Member)

/priority important-soon

@openshift-ci-robot openshift-ci-robot added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Jan 22, 2021
@lack lack force-pushed the hide_container_mountpoints branch from a2a86f0 to d14c951 on January 28, 2021 16:00
@openshift-ci-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign markmc after the PR has been reviewed.
You can assign the PR to them by writing /assign @markmc in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@lack lack force-pushed the hide_container_mountpoints branch 3 times, most recently from 6a0b8c9 to 12e0f50 on January 28, 2021 16:37
@lack lack force-pushed the hide_container_mountpoints branch 4 times, most recently from d7eaede to 42bb7b5 on February 11, 2021 14:17
@kolyshkin

I very much like the overall idea, and the implementation seems to be decent, too.

It might (and will) break some setups in which people expect the container mounts to be visible on the host (which seems trivial to fix -- use the same nsenter wrapper as for k8s and cri-o). For this reason (of breakage) perhaps it is better to make this change in the upstream k8s and cri-o, rather than by using an operator.

> approach it another way:
> 1. Should these services and overrides be installed by MCO?
> 2. If so, should they be part of the main `00-worker` profile, or in a
> separate `02-worker-container-mount-namespace` object?
Contributor

For the MCO part, this would depend on whether this is going to be the default behavior for OpenShift. If it is, then we should use templates for the various service files and it will be a part of the 00-worker MC. If we are giving the user a knob to enable/disable this, then it should be a separate MC like 02-worker-container-mount-namespace.

Contributor Author

I would suggest that it should be enabled by default but can be disabled (in case there are use-cases where this mount hiding is undesirable).

In the MCO world, what's the best way to have a feature toggle for something like this? Is it enough for MCO to create this 02-worker-container-mount-namespace object by default that an admin could manually delete?

Contributor

Yes, the MCO can create the 02-worker-container-mount-namespace MC and an admin could manually delete it (for example, as sketched below). But once deleted, this MC will not be recreated unless the admin decides to manually recreate it. If you want the MCO to automatically create and delete it based on a toggle, then we will have to add logic to one of the controllers in the MCO to watch for the toggle and create/delete accordingly.
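
For example, the opt-out could be as simple as this (assuming the feature ships as the hypothetical 02-worker-container-mount-namespace MachineConfig discussed in this thread):

    # Confirm the optional MachineConfig exists, then delete it to turn the
    # mount-hiding behavior off for the worker pool; the MCO rolls the change
    # out on its next sync.
    oc get machineconfig | grep container-mount-namespace
    oc delete machineconfig 02-worker-container-mount-namespace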

Contributor Author

That's great! Do you have any suggestion of which controller in MCO would be best suited to have this toggle? I'd be happy to try my hand at coding it up.

@lack (Contributor Author) commented Feb 19, 2021

I very much like the overall idea, and the implementation seems to be decent, too.

It might (and will) break some setups in which people expect the container mounts to be visible on the host (which seems trivial to fix -- use the same nsenter wrapper as for k8s and cri-o). For this reason (of breakage) perhaps it is better to make this change in the upstream k8s and cri-o, rather than by using an operator.

@kolyshkin: Thanks for the comments; and I have two questions for you here:

  • What setups do you know about where people expect the container mounts to be visible on the host? I would expect that this would be extremely uncommon and an anti-pattern in OpenShift (though perhaps less so in upstream k8s)
  • Without a centralized top-level installer or configuration management system like MCO, how do you think we would package something like this, which requires coordination between the init system, kubelet and whatever container runtime a user chooses for k8s?


> - Enhance systemd to support unsharing namespaces at the slice level, then put
> `crio.service` and `kubelet.service` in the same slice
> - Alter both CRI-O and Kubelet executables to take a mount namespace via
Member

I prefer that for CRI-O we support joining the mount namespace as an option. We should be able to make it work with an init in C similar to runc.

Contributor Author

The benefit of sticking with the 'nsenter' mechanism is that it would be the same for Kubelet and CRI-O, and easily adaptable to any other container runtime, but I am happy to consider natively supporting mount-namespace-entering for CRI-O as well. It's just more work.
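
To illustrate what the nsenter mechanism could look like for CRI-O (a sketch only: the pin path is an assumption, the stock crio.service command line is simplified, and the real enhancement would generate this via MCO rather than by hand):

    # Hypothetical systemd drop-in that starts CRI-O inside the pinned
    # container mount namespace instead of the host (systemd) namespace.
    mkdir -p /etc/systemd/system/crio.service.d
    cat > /etc/systemd/system/crio.service.d/10-container-mount-namespace.conf <<'EOF'
    [Service]
    ExecStart=
    ExecStart=/usr/bin/nsenter --mount=/run/container-mount-namespace/mnt -- /usr/bin/crio
    EOF
    systemctl daemon-reload
    systemctl restart crio.service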

@mrunalp (Member) left a comment

I think that if we do proceed with this, once we know that we don't have edge cases to worry about, it should still be treated as something that we back out once the kernel / systemd have the fixes. Could that be captured in the enhancement?

@lack (Contributor Author) commented Mar 4, 2021

@mrunalp : I am not sure that making this temporary is necessarily the right option. Even once systemd/kernel are fixed and this isn't needed for CPU-utilization purposes, I think there's a good case to be made for it, both for tidiness/encapsulation reasons (i.e., the mounts owned by OpenShift shouldn't be needed by the host) and for security (if unprivileged users aren't allowed to see the container-specific mountpoint locations, this removes one source of potentially damaging knowledge about container contents).

That said, I do think it would make sense for this to be optional within OpenShift, perhaps via some trigger in MCO which would be on by default, but could be turned off to restore the older system behavior at the behest of an administrator. Would this be a reasonable compromise to your request?

@mrunalp (Member) commented Mar 5, 2021

That said, I do think it would make sense for this to be optional within OpenShift, perhaps via some trigger in MCO which would be on by default, but could be turned off to restore the older system behavior at the behest of an administrator. Would this be a reasonable compromise to your request?

That sounds good to have an escape hatch if needed 👍

@lack lack force-pushed the hide_container_mountpoints branch from 789dd1a to f73d626 on March 5, 2021 17:09
@openshift-bot

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 3, 2021
@openshift-bot

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci openshift-ci bot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jul 3, 2021
@lack (Contributor Author) commented Jul 7, 2021

I will be reviving this shortly.

I've been working with the upstream sig-node team to make sure this idea doesn't break any present or future assumptions in upstream Kubernetes.

@lack (Contributor Author) commented Jul 7, 2021

/remove-lifecycle rotten

@openshift-ci openshift-ci bot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Jul 7, 2021
@lack (Contributor Author) commented Jul 21, 2021

So it turns out this proposed mechanism likely prevents containers (even privileged containers) from exploiting CVE-2021-33910!

@openshift-bot

Inactive enhancement proposals go stale after 28d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Mark the proposal as fresh by commenting /remove-lifecycle stale.
Stale proposals rot after an additional 30d of inactivity and eventually close.
Exclude this proposal from closing by commenting /lifecycle frozen.

If this proposal is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 21, 2021
@openshift-bot

Stale enhancement proposals rot after 7d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Mark the proposal as fresh by commenting /remove-lifecycle rotten.
Rotten proposals close after an additional 7d of inactivity.
Exclude this proposal from closing by commenting /lifecycle frozen.

If this proposal is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci openshift-ci bot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Sep 29, 2021
@openshift-bot

Rotten enhancement proposals close after 7d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Reopen the proposal by commenting /reopen.
Mark the proposal as fresh by commenting /remove-lifecycle rotten.
Exclude this proposal from closing again by commenting /lifecycle frozen.

/close

@openshift-ci openshift-ci bot closed this Oct 6, 2021
@openshift-ci bot commented Oct 6, 2021

@openshift-bot: Closed this PR.

In response to this:

Rotten enhancement proposals close after 7d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Reopen the proposal by commenting /reopen.
Mark the proposal as fresh by commenting /remove-lifecycle rotten.
Exclude this proposal from closing again by commenting /lifecycle frozen.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@lack (Contributor Author) commented May 10, 2022

/reopen

I'm reviving this given the request from customers for this feature, as well as some time spent revamping the proposal to be a bit clearer.

@openshift-ci bot commented May 10, 2022

@lack: Failed to re-open PR: state cannot be changed. The hide_container_mountpoints branch was force-pushed or recreated.

In response to this:

/reopen

I'm reviving this given the request from customers for this feature, as well as some time spent revamping the proposal to be a bit clearer.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
