This repository has been archived by the owner on Oct 2, 2024. It is now read-only.

add --write-fake via unprivileged overlayfs #1793

Merged
merged 22 commits into from
Jan 5, 2024

Conversation

reidpr
Collaborator

@reidpr reidpr commented Dec 7, 2023

This pull request implements ch-run --write-fake, which uses overlayfs to place a writeable tmpfs atop a read-only image. This makes the image appear writeable when it really is not (all writes are discarded on container exit). It needs a recent kernel: upstream 5.11 for the feature to work at all, and upstream 6.6 to avoid some weirdness (distros vary).
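
As a rough illustration of the two kernel thresholds mentioned above, the following sketch classifies a kernel release string against upstream 5.11 and 6.6. The function names `kernel_at_least` and `write_fake_outlook` are hypothetical and not part of Charliecloud. Because distributions backport features, a version comparison like this is only a heuristic.

```shell
#!/bin/sh
# Hypothetical helpers (not part of Charliecloud): compare a kernel release
# string such as "6.1.0-13-amd64" against the upstream versions named above.

# kernel_at_least RELEASE MAJOR MINOR: succeed if RELEASE >= MAJOR.MINOR.
kernel_at_least () {
    release=$1; want_major=$2; want_minor=$3
    major=${release%%.*}          # text before the first dot
    rest=${release#*.}            # text after the first dot
    minor=${rest%%[!0-9]*}        # leading digits of that text
    [ "$major" -gt "$want_major" ] ||
        { [ "$major" -eq "$want_major" ] && [ "$minor" -ge "$want_minor" ]; }
}

# write_fake_outlook RELEASE: print a one-word guess for that release.
write_fake_outlook () {
    if kernel_at_least "$1" 6 6; then
        echo full            # 6.6 or later
    elif kernel_at_least "$1" 5 11; then
        echo partial         # 5.11 up to, but not including, 6.6
    else
        echo unsupported     # older than 5.11
    fi
}

write_fake_outlook "$(uname -r)"
```

A distro kernel reported as "unsupported" here may still work if the relevant patches were backported, so an actual attempt is more trustworthy than this guess.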

Closes #96, but I'm tagging it as a stand-alone PR because I think its scope exceeds that of #96.

@reidpr reidpr added this to the 0.36 milestone Dec 7, 2023
@reidpr reidpr self-assigned this Dec 7, 2023
@reidpr
Collaborator Author

reidpr commented Dec 7, 2023

@olifre @nschan, can you try this branch?

May be relevant to nextflow-io/nextflow#3367 and nextflow-io/nextflow#4463.

@olifre
Contributor

olifre commented Dec 7, 2023

@reidpr Thanks, I just gave this a short spin, and it looks promising!
I only have systems with kernel 6.1 at hand at the moment, so no overlayfs-xattr support yet.

I encountered a few problems:

  • Many images will have /home owned by root (or on a read-only FS not owned by the user starting the container) and with 755 permissions, so --home with overlayfs2 (as-is) will fail with an error like:

    ch-run[22328]: error: can't mkdir: /mnt/merged/home/olifre: Permission denied (ch_misc.c:461 13)
    

    This is "kind of" expected. One way to work around it would be to rebuild the top level with bind mounts for each directory / file on the top level and recreate /home with different permissions (as outlined in #96, but also with the downsides described there). But of course, this could happen at any directory level and lead to loads of bind mounts.

  • I ran into an unexpected complication when trying to run this through Gentoo package management. Gentoo uses a Sandbox during packaging, which catches read/write access to unexpected locations during the configure / build / install phases and bails out. This sandbox currently does not seem to differentiate between "regular" writes and writes within user / mount namespaces, which is why it explodes when the new CH_OVERLAY_C check is executed (both when writing to /proc/self/* and when working with /mnt later on). Some more technical details on how the sandbox operates are given in their README. I can think of three approaches to solve this:

    1. The Sandbox used by Gentoo could treat harmless writes in namespaces as safe. That is likely not a small development effort to get right, so I don't see it happening in the near future (it seems no other package in Gentoo does this during configure / build).
    2. I could "whitelist" all those writes as "expected writes" during the configure phase. Of course, this carries the risk that actual writes to these locations could happen outside of the namespaces without being caught by the sandbox, so I'm not sure this would pass Gentoo's review (I checked other package recipes, and there does not seem to be a precedent for anything similar).
    3. Charliecloud could add configure flags (only to be used by packagers / experts) which can be set to signal that overlayfs2 support / xattr support is present. The usual way used in Gentoo during packaging is to base this on the configuration of the running kernel (which can be accessed easily in the packaging environment with existing helpers).

    @wiene Are there similar concerns in the Debian build tooling?
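
The third approach relies on the configuration of the running kernel. Below is a minimal sketch of such a lookup. It assumes the configuration is available as a plain-text file (for example an uncompressed copy of /proc/config.gz) and that CONFIG_OVERLAY_FS is the option of interest. The function name `kconfig_enabled` is hypothetical, and this is not how Gentoo's helpers or Charliecloud's configure script are implemented.

```shell
#!/bin/sh
# kconfig_enabled FILE OPTION: succeed if OPTION is built in (=y) or a
# module (=m) according to the plain-text kernel configuration in FILE.
kconfig_enabled () {
    grep -Eq "^$2=(y|m)\$" "$1"
}

# Example: write a tiny stand-in configuration and query it.
cfg=$(mktemp)
printf '%s\n' 'CONFIG_OVERLAY_FS=m' '# CONFIG_EXAMPLE_OFF is not set' > "$cfg"
if kconfig_enabled "$cfg" CONFIG_OVERLAY_FS; then
    echo "overlayfs enabled in kernel configuration"
fi
rm -f "$cfg"
```

An enabled option only says that overlayfs was built. It does not show that unprivileged mounts or xattr support behave as needed, so a packager-only flag would trade some accuracy for sandbox compatibility.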

@reidpr
Collaborator Author

reidpr commented Dec 8, 2023

Many images will have /home owned by root

Thank you @olifre; I hadn’t thought of running images owned by others. I can reproduce:

$ ls -ld /var/tmp/foo/home
drwxr-xr-x 2 root root 40 Aug  7 07:11 /var/tmp/foo/home
$ ch-run -W --home /var/tmp/foo -- true
ch-run[6103]: error: can't mkdir: /mnt/merged/home/reidpr: Permission denied (ch_misc.c:461 13)

Is this a regression? That is, if we merge this PR with /home dealt with as-is, does it make things worse for you?

@reidpr
Collaborator Author

reidpr commented Dec 8, 2023

Gentoo ... explodes when the new CH_OVERLAY_C check is executed

This check is (at present) only used to fill in the report configure presents at its conclusion. It would be simple to add a configure flag that disables the check; the only impact would be the report is slightly less informative.

The reasoning to include it is that many folks have no idea whether their kernel supports these things (kernel version is not reliable due to distro backports), so let’s try it and tell them.
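
The "try it and tell them" idea can be sketched outside of configure as follows. This is not the CH_OVERLAY_C check itself. It assumes util-linux `unshare` is installed and that /mnt exists, and the function name `probe_unpriv_overlay` is hypothetical.

```shell
#!/bin/sh
# probe_unpriv_overlay: print "yes" if an overlayfs with a tmpfs upper layer
# can be mounted inside an unprivileged user+mount namespace, else "no".
# All mounts live in the private namespace and vanish when it exits.
probe_unpriv_overlay () {
    if unshare --user --map-root-user --mount sh -c '
        mount -t tmpfs none /mnt &&
        mkdir /mnt/lower /mnt/upper /mnt/work /mnt/merged &&
        mount -t overlay overlay \
            -o lowerdir=/mnt/lower,upperdir=/mnt/upper,workdir=/mnt/work \
            /mnt/merged
    ' >/dev/null 2>&1; then
        echo yes
    else
        echo no
    fi
}

echo "unprivileged overlayfs: $(probe_unpriv_overlay)"
```

A "no" can also mean that user namespaces are disabled or that `unshare` is missing, so the result describes the whole environment rather than the kernel alone.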

@olifre
Contributor

olifre commented Dec 8, 2023

Is this a regression? That is, if we merge this PR with /home dealt with as-is, does it make things worse for you?

It is not, and it does not make things worse 😉. My actual use case relies on a read-only FS which contains the unpacked container. Hence, with Charliecloud without this PR, I do (expectedly) get:

ch-run[1156]: error: can't mkdir: /cvmfs/container.physik.uni-bonn.de/Debian12/default/1701962369/home/olifre: Read-only file system (ch_misc.c:461 30)

Hence, it's certainly an improvement for me, and it also fixes #96 (but in fact, I currently have images with 755 for /home on the very same read-only FS).

@olifre
Contributor

olifre commented Dec 8, 2023

It would be simple to add a configure flag that disables the check; the only impact would be the report is slightly less informative.

That sounds like a good solution — I also don't think that this flag should ever be used by the regular user, but it could be helpful for packagers (at least for Gentoo, it would be the easiest solution).

The reasoning to include it is that many folks have no idea whether their kernel supports these things (kernel version is not reliable due to distro backports), so let’s try it and tell them.

I fully and wholeheartedly agree that these checks are a great default approach, and the output is very helpful (and no standard user should need to override / disable them).

@reidpr
Collaborator Author

reidpr commented Dec 13, 2023

Is this a regression? That is, if we merge this PR with /home dealt with as-is, does it make things worse for you?

It is not, and it does not make things worse 😉. My actual use case relies on a read-only FS which contains the unpacked container.

One thing we could do is retain the existing behavior as a backup: overmount a new tmpfs on container /home, then mkdir the appropriate home directory in there. I'd expect this to work for your use case, so I'm a little surprised it doesn't.
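
The fallback described here can be sketched with standard tools. The sketch assumes util-linux `unshare` and a kernel that permits unprivileged user namespaces. The function name `overmount_home` is hypothetical, and this is not ch-run's actual code.

```shell
#!/bin/sh
# overmount_home IMAGE USER: in a private user+mount namespace, cover
# IMAGE/home with an empty tmpfs and create IMAGE/home/USER inside it.
# The mount is visible only inside the namespace, so IMAGE itself is
# never written to.
overmount_home () {
    unshare --user --map-root-user --mount sh -c '
        mount -t tmpfs none "$1/home" &&
        mkdir "$1/home/$2" &&
        ls -ld "$1/home/$2"
    ' sh "$1" "$2"
}

# Example with a throwaway directory standing in for an unpacked image.
img=$(mktemp -d)
mkdir "$img/home"
overmount_home "$img" "$(id -un)" || echo "namespace setup not permitted here"
rm -rf "$img"
```

Only the mount point has to exist in the image, so this works even when /home is root-owned with 755 permissions or lives on a read-only file system.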

@olifre
Contributor

olifre commented Dec 13, 2023

Is this a regression? That is, if we merge this PR with /home dealt with as-is, does it make things worse for you?

It is not, and it does not make things worse 😉. My actual use case relies on a read-only FS which contains the unpacked container.

One thing we could do is retain the existing behavior as a backup: overmount a new tmpfs on container /home, then mkdir the appropriate home directory in there. I'd expect this to work for your use case, so I'm a little surprised it doesn't.

You are of course correct, my bad. I compared:

ch-run -w -b /home/olifre -vv  /cvmfs/container.physik.uni-bonn.de/singularity/Debian12/default/1701962369 bash

for both the previous version and the version with the unprivileged overlayfs patch. What I forgot was that --home has a special case using a tmpfs, mainly because I do not rely on this special case myself (in fact, I think it would be useful to have a workable solution for general overmounting for such images on read-only file systems).

So you are completely correct: --home is broken in this case when using unprivileged overlayfs, whereas the previous solution via an overmounted tmpfs worked.

@reidpr reidpr requested a review from lucaudill December 22, 2023 16:35
@reidpr
Collaborator Author

reidpr commented Dec 22, 2023

@olifre, this is very close if you want to have another look.

@olifre
Contributor

olifre commented Dec 22, 2023

@reidpr Many thanks! I did several more tests using containers stored on a read-only FS. It works exceptionally well. Expectedly, findmnt looks quite interesting inside the containers 😉 .

In some cases, I hit the "15 directories" limit, but I did not hit that for actual use cases (I'm not sure if a real use case exists in which someone wants to add, for example, a sub-directory to /etc/ without actually rebuilding the container). At least from my point of view the limit seems reasonable, and the functionality is working very well! 👍

Thanks a lot (and Merry Christmas!).

@reidpr
Collaborator Author

reidpr commented Jan 4, 2024

@olifre, thinking about that limit, I tried a different approach that was more symlink-based (similar to your suggestion I believe) in commit 1b775bd. Will you try it and compare/contrast for your use case?

@olifre
Contributor

olifre commented Jan 5, 2024

@reidpr Many thanks! Yes, indeed this was a possibility I was thinking of. This approach works great, and the symlink ranch (in this way) seems much more lightweight than the overmount farm 😉 . It removes the limitation without the risk of getting an explosive number of bind mounts, which is great!

I did not observe any issues with any use case I had at hand, but of course some differences are seen:

  • Symlinks can naturally be deleted by the user in the container, as the links themselves don't inherit the read-only permissions of the directories from the underlying read-only image.
  • In the "ranched" directory, things are symlinks and not "directories" or "files" anymore, as expected.
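
To make the two observations above concrete, here is a minimal sketch of the general idea as described in this thread. It is not the code from commit 1b775bd. A writeable directory is filled with symlinks to the entries of the original directory, and the new mount point is created next to them. The function name `make_ranch`, its arguments, and the directory layout are assumptions for illustration.

```shell
#!/bin/sh
# make_ranch ORIG RANCH NEW: populate the writeable directory RANCH with one
# symlink per entry of ORIG (an absolute path), then create RANCH/NEW as a
# real directory that could serve as a bind-mount point.
# Names starting with two dots are not handled in this sketch.
make_ranch () {
    orig=$1; ranch=$2; new=$3
    for entry in "$orig"/* "$orig"/.[!.]*; do
        # An unmatched glob stays literal; skip it.
        [ -e "$entry" ] || [ -L "$entry" ] || continue
        ln -s "$entry" "$ranch/${entry##*/}"
    done
    mkdir "$ranch/$new"
}

# Example with throwaway directories.
top=$(mktemp -d)
mkdir -p "$top/orig/ssh" "$top/ranch"
echo demo > "$top/orig/hostname"
make_ranch "$top/orig" "$top/ranch" newdir
ls -l "$top/ranch"
rm -rf "$top"
```

In the listing, `ssh` and `hostname` show up as symlinks while `newdir` is a real directory. A tool that inspects entries without following links therefore sees a different type than in the original image, and the links can be removed because the ranch directory itself is writeable.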

I don't have any use case at hand which has a problem with that, but wanted to mention it, since some tooling is sensitive to such things (e.g. sshd, but I don't think running sshd inside a software stack container and overmounting directory components it uses is something anybody should be doing).

Personally, I would say that if a real-world program breaks because it checks permissions / type before resolving links instead of following them, that is likely not wanted behaviour but a bug in the program.

So I think this approach has big advantages, e.g. it allows binding into crowded directories such as /etc/ on read-only images. On the other hand, I don't believe it will break anything (famous last words 😉).

@reidpr reidpr requested a review from kchilleri January 5, 2024 16:43
Collaborator

@lucaudill lucaudill left a comment

LGTM

Contributor

@kchilleri kchilleri left a comment

Looks good

@reidpr reidpr merged commit 63f77b1 into master Jan 5, 2024
@reidpr reidpr deleted the overlayfs2 branch January 5, 2024 21:35
Development

Successfully merging this pull request may close these issues.

Add support for bind mounts to directories not existing within a container on read-only FS
4 participants