Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runc run: fix mount leak #4417

Merged
merged 1 commit into from
Oct 4, 2024
Merged

Conversation

kolyshkin
Copy link
Contributor

@kolyshkin kolyshkin commented Sep 26, 2024

When preparing to mount container root, we need to make its parent mount private (i.e. disable mount propagation), otherwise the new in-container mounts are leaked to the host.

To find a parent mount, we use to read mountinfo and find the longest entry which can be a parent of the container root directory.

Unfortunately, due to kernel bug in all Linux kernels older than v5.8 (see 1, 2), sometimes mountinfo can't be read in its entirety. In this case, getParentMount may occasionally return a wrong parent mount.

As a result, we do not change the mount propagation to private, and container mounts are leaked.

Alas, we can not fix the kernel, and reading mountinfo a few times to ensure its consistency (like it's done in, say, Kubernetes) does not look like a good solution for performance reasons.

Fortunately, we don't need mountinfo. Let's just traverse the directory tree, trying to remount it private until we find a mount point (any error other than EINVAL means we just found it).

Fixes: #2404.

@cyphar
Copy link
Member

cyphar commented Sep 27, 2024

What happens if the parent mount is a bind-mount and the container rootfs is also a bind-mount from the same filesystem? Unless I'm mistaken, this device-checking logic will skip those mounts and produce errors when we pivot_root.

There are a few ways we could handle this AFAICS:

  1. openat2(current, "..", RESOLVE_NO_XDEV) to find the crossing -- Linux 5.6.
  2. statx(STATX_MNT_ID) to get the mount ID and use that to find the crossing -- Linux 5.8. (Sadly these mount IDs are not completely unique -- that was fixed in Linux 6.8 with STATX_MNT_ID_UNIQUE. But that should still work for us.)
  3. name_to_handle_at with a dummy fhandle to get the mount ID -- available for a long time but only works on filesystems that can do NFS exports (at least until AT_HANDLE_FID in Linux 6.5).

Maybe we can try to do (1) for newer kernels and fall back to (3) for older kernels, with a final fallback to st_dev if nothing else works...

@kolyshkin kolyshkin force-pushed the fix-mount-leak branch 2 times, most recently from dc0ca3a to f3f7aa4 Compare October 2, 2024 01:29
@kolyshkin
Copy link
Contributor Author

I just came up with a stupid but simpler alternative.

When preparing to mount container root, we need to make its parent mount
private (i.e. disable propagation), otherwise the new in-container
mounts are leaked to the host.

To find a parent mount, we use to read mountinfo and find the longest
entry which can be a parent of the container root directory.

Unfortunately, due to kernel bug in all Linux kernels older than v5.8
(see [1], [2]), sometimes mountinfo can't be read in its entirety. In
this case, getParentMount may occasionally return a wrong parent mount.

As a result, we do not change the mount propagation to private, and
container mounts are leaked.

Alas, we can not fix the kernel, and reading mountinfo a few times to
ensure its consistency (like it's done in, say, Kubernetes) does not
look like a good solution for performance reasons.

Fortunately, we don't need mountinfo. Let's just traverse the directory
tree, trying to remount it private until we find a mount point (any
error other than EINVAL means we just found it).

Fixes issue 2404.

[1]: https://github.com/kolyshkin/procfs-test
[2]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=9f6c61f96f2d97cbb5f
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
@kolyshkin kolyshkin marked this pull request as ready for review October 2, 2024 21:17
@kolyshkin
Copy link
Contributor Author

@cyphar WDYT?

Ah, I just found crun does the same thing. This probably means the approach is sound (yet I'm a tad unhappy I'm not the first one who came with this idea 😿).

@kolyshkin
Copy link
Contributor Author

I also think we can make a 1.1 backport since the issue, if rare, is quite nasty, and the fix seems to be simple.

@cyphar
Copy link
Member

cyphar commented Oct 2, 2024

I checked and do_change_type (the propagation change logic) actually does a very cheap check for whether the path is a mountpoint before taking namespace_sem or mount_lock, so this approach is basically what you'd get from openat2(RESOLVE_NO_XDEV) except it works on older kernels.

@kolyshkin kolyshkin added the backport/1.1-done A PR in main branch which has been backported to release-1.1 label Oct 3, 2024
@kolyshkin
Copy link
Contributor Author

1.1 backport: #4425

@AkihiroSuda AkihiroSuda merged commit db25439 into opencontainers:main Oct 4, 2024
42 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport/1.1-done A PR in main branch which has been backported to release-1.1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

runc has problems due to leaked mount information
3 participants