Nsexec spring cleaning part I #3982

kolyshkin · 2023-08-15T01:42:07Z

This is a partial carry of #3953, containing more-or-less simple changes from it.

The only differences from the commits in the original PR are:

ported to current HEAD (a few conflicts due to recently merged PRs);
fixed a misspelled word (syncrhonisation) in a commit message;
fixed a small issue in an intermediate commit (see dbdc562#r1294110056);
fixed a few cases of referring to old function names in doc and error messages;
fixed error handling in RecvFile (err -> Err);
added a commit that fixes the parseSync handling issue;
added my Signed-off-by to some commits.

kolyshkin · 2023-08-15T04:40:28Z

OK, this times out in CI the same way as #3953.

kolyshkin · 2023-08-16T02:26:15Z

OK, I fixed the CI timeout. It was caused by incorrect error-handling logic in (p *initProcess) start(), which was exposed by the following optimization in the "libcontainer: sync: cleanup synchronisation code" commit:

+                       // We have a copy, the child can keep working. We don't need to
+                       // wait for the seccomp notify listener to get the fd before we
+                       // permit the child to continue because the child will happily wait
+                       // for the listener if it hits SCMP_ACT_NOTIFY.
+                       if err := writeSync(p.messageSockPair.parent, procSeccompDone); err != nil {
                                return err
                        }
-                       defer unix.Close(seccompFd)
 
                        bundle, annotations := utils.Annotations(p.config.Config.Labels)
                        containerProcessState := &specs.ContainerProcessState{
@@ -199,15 +213,10 @@ func (p *setnsProcess) start() (retErr error) {
                                containerProcessState, seccompFd); err != nil {
                                return err
                        }
-
-                       // Sync with child.
-                       if err := writeSync(p.messageSockPair.parent, procSeccompDone); err != nil {
-                               return err
-                       }

kolyshkin · 2023-08-16T02:44:25Z

@cyphar @lifubang PTAL

kolyshkin · 2023-08-16T02:49:37Z

libcontainer/utils/cmsg.go

+			if i == 0 && err == nil {
+				// Only close the first one on error.
+				continue
+			}
+			// Always close extra ones.
+			_ = unix.Close(fd)


@cyphar I slightly changed the code here; the original was this:

			// Only close 0 if err != nil, and close everything else.
			if i != 0 || err != nil {
				_ = unix.Close(fd)
			}

I feel that my version is less compact but more readable. Feel free to 👎🏻 and I will revert :)

Ah! I wanted to fix the error but forgot (the error is that we should check Err, not err). Fixed now.

cyphar · 2023-08-16T02:51:49Z

Can you drop the first procfs patch? This was needed for one of the patches you haven't included here, and I need to figure out why that is necessary with the cloned_binary change (the issue is that GHA masks procfs which trips the mount_too_revealing code, but weirdly this doesn't cause issues with current runc). I will include it in the follow-up PRs.

Otherwise TESTFLAGS="-run FooBar" will result in TESTFLAGS=-run being executed in the container.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>

The code in this function became quite complicated and not entirely correct over time. As a result, if an error is returned from parseSync, it might end up stuck waiting for the child to finish.

1. Let's not wait() for the child twice. We already do it in the defer statement (call p.terminate()) when we are returning an error.
2. Remove sentResume and sentRun since we do not want to check if these were sent or not. Instead, introduce and check seenProcReady, as procReady is always expected from runc init.
3. Eliminate the possibility of wrapping nil as an error.
4. Make sure we always call shutdown on the sync socket, and do not let the shutdown error shadow ierr.

This fixes the issue of a stuck `runc run` with the optimization patch (sending procSeccompDone earlier) applied.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>

This includes quite a few cleanups and improvements to the way we do synchronisation. The core behaviour is unchanged, but switching to embedding json.RawMessage into the synchronisation structure will allow us to do more complicated synchronisation operations in future patches.

The feature for passing file descriptors through the synchronisation system will be used as part of the idmapped-mount and bind-mount-source features when switching that code to use the new mount API outside of nsexec.c.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
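To illustrate why embedding json.RawMessage helps, here is a minimal sketch of a sync message whose payload is decoded only after the message type is known. The struct layout, the `procExample` type and the `examplePayload` structure are assumptions made for this example and are not taken from the runc source; only `procReady` is a name that appears in this PR.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// syncMsg is a hypothetical message shape: a type tag plus an optional
// payload that stays as raw JSON until the receiver knows how to decode it.
type syncMsg struct {
	Type string           `json:"type"`
	Arg  *json.RawMessage `json:"arg,omitempty"`
}

// examplePayload is an illustrative payload, not a runc structure.
type examplePayload struct {
	Source string `json:"source"`
}

// decode parses one message and decodes the payload based on its type.
func decode(data []byte) (string, error) {
	var msg syncMsg
	if err := json.Unmarshal(data, &msg); err != nil {
		return "", err
	}
	switch msg.Type {
	case "procReady":
		return "ready", nil
	case "procExample":
		if msg.Arg == nil {
			return "", fmt.Errorf("%s requires an argument", msg.Type)
		}
		var p examplePayload
		if err := json.Unmarshal(*msg.Arg, &p); err != nil {
			return "", err
		}
		return "payload source=" + p.Source, nil
	}
	return "", fmt.Errorf("unknown sync type %q", msg.Type)
}

func main() {
	fmt.Println(decode([]byte(`{"type":"procReady"}`)))
	fmt.Println(decode([]byte(`{"type":"procExample","arg":{"source":"/tmp"}}`)))
}
```

New message types can then carry their own payloads without changing the outer structure.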

*os.File is correctly tracked by the garbage collector, and there's no need to use raw file descriptors for this code.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
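One practical difference between *os.File and a raw descriptor can be sketched like this. The example uses a pipe in place of the seccomp notify fd, and `useNotifyFile` and `doubleCloseIsDetected` are hypothetical names. With *os.File, an accidental second Close is reported as an error; closing a raw integer descriptor twice could instead hit an unrelated descriptor that reused the number.

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// useNotifyFile is a hypothetical consumer that takes ownership of f and
// closes it when done.
func useNotifyFile(f *os.File) error {
	defer f.Close()
	_, err := f.Write([]byte("ping"))
	return err
}

// doubleCloseIsDetected reports whether a second Close on an already closed
// *os.File fails with os.ErrClosed.
func doubleCloseIsDetected() (bool, error) {
	r, w, err := os.Pipe()
	if err != nil {
		return false, err
	}
	defer r.Close()
	if err := useNotifyFile(w); err != nil {
		return false, err
	}
	// w was already closed by useNotifyFile; this Close fails cleanly.
	err = w.Close()
	return errors.Is(err, os.ErrClosed), nil
}

func main() {
	fmt.Println(doubleCloseIsDetected())
}
```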

The kernel ignores these arguments, and passing them can lead to confusing error messages (the old source is irrelevant for MS_REMOUNT), as well as causing issues for a future patch where we switch to move_mount(2).

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>

The original implementation of cgroupns had additional synchronisation to "ensure" that the process is in the correct cgroup before unsharing the cgroupns. This behaviour was actually never necessary, and after commit 5110bd2 ("nsenter: remove cgroupns sync mechanism") there is no synchronisation at all, meaning that CLONE_NEWCGROUP should not get any special treatment.

Fixes: 5110bd2 ("nsenter: remove cgroupns sync mechanism")
Fixes: df3fa11 ("Add support for cgroup namespace")
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>

In the runc state JSON we always use snake_case. This is a no-op change, but it will cause any existing container state files to be incorrectly parsed. Luckily, commit fbf183c ("Add uid and gid mappings to mounts") has never been in a runc release so we can change this before a 1.2.z release.

Fixes: fbf183c ("Add uid and gid mappings to mounts")
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
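The "incorrectly parsed" part can be demonstrated with a small sketch. The struct names and the exact JSON keys (`uidMappings` and `uid_mappings`) are assumptions for illustration; the point is only that encoding/json silently ignores keys it does not recognise, so state written with the old key loses the field when read with the new tag.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// oldMount and newMount are hypothetical; they differ only in the JSON key.
type oldMount struct {
	UIDMappings []int `json:"uidMappings,omitempty"`
}

type newMount struct {
	UIDMappings []int `json:"uid_mappings,omitempty"`
}

// reparseOldState writes state with the old key and reads it back with the
// new struct, returning the raw JSON and the number of mappings recovered.
func reparseOldState() (string, int, error) {
	raw, err := json.Marshal(oldMount{UIDMappings: []int{1000}})
	if err != nil {
		return "", 0, err
	}
	var m newMount
	if err := json.Unmarshal(raw, &m); err != nil {
		return "", 0, err
	}
	// No error is reported, but the mapping is gone.
	return string(raw), len(m.UIDMappings), nil
}

func main() {
	fmt.Println(reparseOldState())
}
```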

kolyshkin · 2023-08-16T02:54:41Z

> Can you drop the first procfs patch? This was needed for one of the patches you haven't included here, and I need to figure out why that is necessary with the cloned_binary change (the issue is that GHA masks procfs which trips the mount_too_revealing code, but weirdly this doesn't cause issues with current runc). I will include it in the follow-up PRs.

done

cyphar

LGTM, thanks for carrying this! I'll split the other features in #3953 into separate PRs based on this.

AkihiroSuda · 2024-06-28T05:57:03Z

Commit 20b95f2 ("libcontainer: seccomp: pass around *os.File for notifyfd") caused a regression:

[v1.2 regression] SCMP_ACT_NOTIFY rule for fcntl causes runc to hang, before connecting to the seccomp listener agent #4328

kolyshkin mentioned this pull request Aug 15, 2023

nsexec: spring cleaning #3953

Closed

4 tasks

kolyshkin marked this pull request as draft August 15, 2023 01:48

kolyshkin force-pushed the nsexec-spring-cleaning-p1 branch from c53ef6e to 2d8e8e7 Compare August 16, 2023 02:19

kolyshkin mentioned this pull request Aug 16, 2023

ci/gha: add job timeouts #3984

Merged

kolyshkin added the kind/refactor refactoring label Aug 16, 2023

kolyshkin marked this pull request as ready for review August 16, 2023 02:43

kolyshkin added this to the 1.2.0 milestone Aug 16, 2023

kolyshkin force-pushed the nsexec-spring-cleaning-p1 branch from 2d8e8e7 to 33a1bfb Compare August 16, 2023 02:51

kolyshkin requested review from cyphar and lifubang August 16, 2023 02:53

cyphar and others added 7 commits August 15, 2023 19:54

makefile: quote TESTFLAGS when passing to containerised make

b0c7ce5

Otherwise TESTFLAGS="-run FooBar" will result in TESTFLAGS=-run being executed in the container.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>

libcontainer: seccomp: pass around *os.File for notifyfd

20b95f2

*os.File is correctly tracked by the garbage collector, and there's no need to use raw file descriptors for this code.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>

kolyshkin force-pushed the nsexec-spring-cleaning-p1 branch from 33a1bfb to 1f25724 Compare August 16, 2023 02:54

cyphar approved these changes Aug 16, 2023

lifubang approved these changes Aug 16, 2023

lifubang merged commit fe5e2b3 into opencontainers:main Aug 16, 2023

AkihiroSuda mentioned this pull request Jun 28, 2024

[v1.2 regression] SCMP_ACT_NOTIFY rule for fcntl causes runc to hang, before connecting to the seccomp listener agent #4328

Closed
