-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement seccomp notify support #2682
Implement seccomp notify support #2682
Conversation
98e32e0
to
5a620a8
Compare
5a620a8
to
c8e2d6f
Compare
CI failure is a known (but rare) flake, see #2425
(restarted the job) |
c8e2d6f
to
98b5487
Compare
98b5487
to
1751ec3
Compare
1751ec3
to
c25bb2b
Compare
opencontainers/runtime-spec#1074 merged now |
Hi! I'm a coworker of @mauriciovasquezbernal, I'll coordinate with him to fix the conflicts in this PR and move it to a regular PR (not draft) when ready! :) |
c25bb2b
to
e7d0927
Compare
Hi! I’ve rebased the PR and fixed all the conflicts, all tests should be green too. The OCI runtime-spec change this patch implements was recently merged and I see that the spec change is already merged in runc too. Should we mark this PR for 1.0 milestone? Currently it is in the post 1.0 milestone. I’m keeping the PR as a draft for now, as the libseccomp-golang PR is not yet merged and we need it for this PR (not only for tests, we use some consts in core runc too). However, changes on the libseccomp PR will have minimal (most likely none) consequences to this code. So, we keep it as draft until that PR is merged, but please review whenever you can :) |
e7d0927
to
30ae27b
Compare
30ae27b
to
ee8e1cb
Compare
Updated, fixed a conflict in go.sum too |
ee8e1cb
to
15343c8
Compare
Rebased on top of latest master |
libcontainer/process_linux.go
Outdated
// receive seccomp-fd | ||
pidfd, _, errno := unix.Syscall(unix.SYS_PIDFD_OPEN, uintptr(childPid), 0, 0) | ||
if errno != 0 { | ||
return newSystemErrorWithCause(errno, "performing SYS_PIDFD_OPEN syscall") | ||
} | ||
defer unix.Close(int(pidfd)) | ||
|
||
seccompFd, _, errno := unix.Syscall(unix.SYS_PIDFD_GETFD, pidfd, uintptr(sync.Fd), 0) | ||
if errno != 0 { | ||
return newSystemErrorWithCause(errno, "performing SYS_PIDFD_GETFD syscall") | ||
} | ||
defer unix.Close(int(seccompFd)) | ||
|
||
if p.config.Config.Seccomp.ListenerPath == "" { | ||
return newSystemError(errors.New("listenerPath is not set")) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps it makes sense to consolidate this code since it's repeated twice. Something like
seccompFd, err := recvSeccompFd(childPid, sync.Fd)
maybe?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeap, good call. I'm moving only these bits to the new function and not the if p.config.Config.Seccomp.ListenerPath == ""
part as we need p
for that and it would be called with p
being different structs. Also, IMHO, semantically seems better to keep both things separate
libcontainer/process_linux.go
Outdated
func sendContainerProcessState(listenerPath string, state *specs.ContainerProcessState, fd int) error { | ||
conn, err := net.Dial("unix", listenerPath) | ||
if err != nil { | ||
return fmt.Errorf("cannot connect to %q: %v\n", listenerPath, err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
\n
should be removed.- Please to use %w for error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- I think
net.Dial
already includes path/address in the error message, as well as the error (dial
), so there's no need to wrap it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool, returning err then
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think net.Dial already includes path/address in the error message, as well as the error (dial), so there's no need to wrap it.
Correct:
$ cat main.go
package main
import (
"fmt"
"net"
)
func main() {
_, err := net.Dial("unix", "/fdsjkfhdjk")
if err != nil {
fmt.Printf("%s\n", err)
}
}
$ go run main.go
dial unix /fdsjkfhdjk: connect: no such file or directory
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
still LGTM :)
@AkihiroSuda this needs your still-LGTM after a rebase. @cyphar PTAL |
@AkihiroSuda if it helps, the rebase were to solve very simple conflicts (the makefile has a new target now, needed to adjust for that, some bats helpers conflicts but trivial, and things like that) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looked through it and it all seems very reasonable -- kolyshkin has probably reviewed this several times more thoroughly than I have. 😉
I only have one question about the seemingly arbitrary write restriction.
if call.Name == "write" { | ||
return -1, errors.New("SCMP_ACT_NOTIFY cannot be used for the write syscall") | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason for this restriction? I can imagine how it might be unsafe (or unwise) to do this on paper, but is it really necessary to disallow this within runc?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, basically it will hang for ever if we allow it. We need the write syscall to write the seccomp fd plain number over the pipe to the parent (we write to the pipe here), then the parent gets the fd with pifd_getfd and sends it over the unix socket to the seccomp agent.
If the write syscall has the notify action, we can't never send a msg to the parent, which will mean the seccomp agent won't get the seccomp fd and the write will be blocked forever.
However read/close (functions that we also use to sync with parent via the pipe) are not restricted as it is possible to send the seccomp fd to the parent, the seccomp agent will get the seccomp fd and the first few notifications it will get if read/close are handled are for the ones done in syncParentSeccomp()
(and quite trivially allow to proceed syscalls on fds that are not known by the agent, for example).
Please note the parent is able to proceed, while we are blocked in read/close, because the parent is another process and not subject to the seccomp filter.
I will add a comment to explain this in the code, if that sounds good to you @cyphar
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah of course. Yeah that's a bit sad :/ -- but we can always figure out some workaround for this in the future. (I wonder how the LXC folks worked around this problem -- though I think they don't have a similar write-based synchronisation model since they're written in C which means you don't need to do so much busy work.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cyphar I've asked Christian Brauner in the past. It seems they use clone3 sharing the fds table and use a fixed number for the seccomp notify fd. Then, they get it with pidfd_getfd().
Not sure how LXC uses a fixed fd number for seccomp notify, though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could just pick an arbitrarily high fd and use dup3
to clone it to the right fd number. It might be interesting to look into whether we could do it with pidfd_getfd
though since we'd need to have a fallback for quite a while, it'd probably complicate the code a bit too much...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, right, that was easy. Yeah, for runc I think is simpler to use the pipe, that is the main sync mechanism used tree-wide. We are using pidfd_getfd
here, though: we sync with the parent to let them know when the fd is ready and send the fd plain number, then the parent gets it with pidfd_getfd
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I forgot that we are requiring Linux 5.7 or newer for the feature -- pidfd_getfd isn't available for the earlier kernels with seccomp notify support.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And 5.7 is also enforced by libseccomp-golang, that only supports notification with tsync (requires libseccomp api level 6). This is as the vast majority of go programs will need tsync too. Here is the conversation we had with upstream about this when writing the patches for libseccomp-golang
Before 5.7, notify and tsync where mutually exclusive (torvalds/linux@7a0df7f). From Linux 5.7 introduced a new API to enable seccomp notification with tsync together in this commit torvalds/linux@5189149.
The notify support has been merged in libseccomp-golang in this PR: seccomp/libseccomp-golang#59 Also, we update to new API of libseccomp-golang so code doesn't break. Signed-off-by: Alban Crequy <alban@kinvolk.io> Signed-off-by: Rodrigo Campos <rodrigo@kinvolk.io> Co-authored-by: Rodrigo Campos <rodrigo@kinvolk.io>
SendFds is a helper function for sending a set of file descriptors and a message over a unix domain socket. Signed-off-by: Mauricio Vásquez <mauricio@kinvolk.io>
This commit implements support for the SCMP_ACT_NOTIFY action. It requires libseccomp-2.5.0 to work but runc still works with older libseccomp if the seccomp policy does not use the SCMP_ACT_NOTIFY action. A new synchronization step between runc[INIT] and runc run is introduced to pass the seccomp fd. runc run fetches the seccomp fd with pidfd_get from the runc[INIT] process and sends it to the seccomp agent using SCM_RIGHTS. As suggested by @kolyshkin, we also make writeSync() a wrapper of writeSyncWithFd() and wrap the error there. To avoid pointless errors, we made some existing code paths just return the error instead of re-wrapping it. If we don't do it, error will look like: writing syncT <act>: writing syncT: <err> By adjusting the code path, now they just look like this writing syncT <act>: <err> Signed-off-by: Alban Crequy <alban@kinvolk.io> Signed-off-by: Rodrigo Campos <rodrigo@kinvolk.io> Co-authored-by: Rodrigo Campos <rodrigo@kinvolk.io>
Extend the SetupSeccomp tests by adding the following cases: - Test nil config - Test empty config - Test bad action and architecture - Test all possible actions Signed-off-by: Mauricio Vásquez <mauricio@kinvolk.io>
Implement sample seccomp agent. It's also used in integration tests in the following commit. Instructions how to use it in contrib/cmd/seccompagent/README.md Signed-off-by: Alban Crequy <alban@kinvolk.io> Signed-off-by: Rodrigo Campos <rodrigo@kinvolk.io> Co-authored-by: Rodrigo Campos <rodrigo@kinvolk.io>
Test KILL and ERRNO actions. Signed-off-by: Mauricio Vásquez <mauricio@kinvolk.io>
Add functional test to check seccomp notify end-to-end. This test uses the sample seccomp agent from the contrib/cmd folder. Signed-off-by: Mauricio Vásquez <mauricio@kinvolk.io> Signed-off-by: Rodrigo Campos <rodrigo@kinvolk.io> Co-authored-by: Rodrigo Campos <rodrigo@kinvolk.io>
00772ca
bcadefb
to
00772ca
Compare
Rebased to fix conflicts (a simple conflict with 64358c4 that moved some code we are touching) and added a comment to explain the write restriction @cyphar commented on (see #2682 (comment)). @kolyshkin @AkihiroSuda @dqminh can you please re-LGTM this? :) @cyphar thanks for the comment! I guess this is ready to merge? Or is there anything else I can do? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(I wish github had a way to see the difference between old and new commits, e.g. similar to what gerrit does)
Trusting the changes are mostly cosmetical, still LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks for working on this for so long!
@cyphar @kolyshkin @AkihiroSuda qminh @vbatts thanks for your time and reviews! :) |
@kolyshkin I've updated the release notes for this. I realized I forgot to mention the seccomp agent example in the runc source code! Updated to mention that now. |
Add support for notify seccomp actions. Seccomp profiles can now specify the SCMP_ACT_NOTIFY action that notifies a user space application (seccomp agent) when such syscall is performed. The seccomp agent can, in turn, decide to perform the syscall on behalf of the container and set the syscall return values or even inject file descriptors to the container. This, for example, allows for unprivileged containers to run some privileged syscalls that the seccomp agent performs on its behalf.
The format and more details can be found in the OCI runtime specification, in the seccomp section. If you are a developer and want more information, check out the example agent in runc, the linux kernel documentation, libseccomp-golang and Christian Brauner's blog post. To use this feature you need Linux >= 5.7 and libseccomp >= 2.5.0.