cgroup2: devices filtering cleanup #2951
A typo in the second commit message: |
We also can't test the |
Yeah, maybe we can at least add a simple test case to libcontainer/integration that removes and re-adds a device allow rule, say, 100 times (and checks that device access works as expected). Without this PR it should fail on the 64th iteration because the eBPF program limit is reached. Same for runc update -- we could have a stupid test calling |
LGTM except for the missing integration test(s).
Yeah looking at it I think we would need to rethink the "default allow" device list behaviour if we ever want to add devices support to |
Seems as though the Microsoft PPA is down? |
CI restarted (looks like it's working today). |
CI restarted; GHA still has some issues :(
CI passes again now. 😸 |
LGTM
@AkihiroSuda @mrunalp PTAL |
	runc run -d --console-socket "$CONSOLE_SOCKET" test_update
	[ "$status" -eq 0 ]

	for new_limit in $(seq 300); do
seq might cause an extra alloc compared to for ((i = 0; i < 300; i++)), but negligible
@@ -648,3 +648,29 @@ EOF
	runc resume test_update
	[ "$status" -eq 0 ]
}

@test "runc update replaces devices cgroup program" {
	[[ "$ROOTLESS" -ne 0 ]] && requires rootless_cgroup
The test is a no-op in rootless mode, right? Can we have a comment line that explains the expected behavior?
It's also a no-op for cgroupv1... Yeah, I'll add a comment.
		retries++
		continue
	}
	return nil, fmt.Errorf("bpf_prog_query(BPF_CGROUP_DEVICE) failed: %w", errno)
We may get EINTR and want to retry?
bpf(2) is considered EINTR-safe. Cannot quote any sources though :(
Seems that you can get EINTR from bpf (especially in Go thanks to green thread preemption). cilium/ebpf#24
This commit says that it is only about BPF_PROG_TEST_RUN (and so I was wrong about bpf being EINTR-safe), and this is the only bpf call that cilium/ebpf wraps in a retry-on-EINTR loop.
In some other place (cilium/ebpf#207 (comment)) they deduce that the bpf syscall with some other parameters could not return EINTR, so maybe this behavior is limited to BPF_PROG_TEST_RUN.
OTOH it won't hurt to add a retry-on-EINTR loop, something like:
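	// Declared outside the loop so the final errno can be checked afterwards.
	var errno unix.Errno
	for {
		_, _, errno = unix.Syscall(unix.SYS_BPF,
			uintptr(unix.BPF_PROG_QUERY),
			uintptr(unsafe.Pointer(&query)),
			unsafe.Sizeof(query))
		// Retry only when the syscall was interrupted.
		if errno != unix.EINTR {
			break
		}
	}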
Found another case -- they retry on bpf(BPF_PROG_LOAD), but on EAGAIN, not EINTR.
So I guess we don't need an EINTR handler here.
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
When running inside a Docker container, systemd is not available. The new TestFdLeaksSystemd forgot to include the relevant t.Skip section.
Fixes: a7feb42 ("libct/int: add TestFdLeaksSystemd")
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
The devices cgroup emulator is also useful for removing unneeded rules as well as computing what the final default-allow state of the filter will be (allow-list or deny-list).
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
There were several issues with the previous cgroupv2 devices filter generator implementation, stemming from the previous implementation using a few too many tricks to implement the correct cgroup behaviour (rules were handled in reverse order, with wildcards having particularly special interpretations). As a result, some slightly odd configurations with rules in specific orders could result in incorrect filters being generated.
By switching to the emulator which is already used by cgroupv1, we can guarantee that the behaviour of filters in both cgroup versions will be identical, as well as making use of the hardenings in the emulator (not allowing users to add deny rules the kernel will ignore).
(Note that because the ordering of the devices emulator rules is deterministic and based on the rule value, the existing test rules had to be reordered slightly.)
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
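To illustrate what "computing the final default-allow state" and "removing unneeded rules" can mean, here is a heavily simplified model. It is not runc's devices emulator: the `rule` and `emulator` types are hypothetical, rules are either a full wildcard or one exact device key, and partial wildcards and access permissions are ignored.

```go
package main

import "fmt"

// rule is a hypothetical, simplified device rule: either a wildcard
// ("all devices") or one concrete device key such as "c 1:3".
type rule struct {
	wildcard bool
	key      string
	allow    bool
}

// emulator tracks the default policy plus the exceptions to it.
type emulator struct {
	defaultAllow bool
	exceptions   map[string]struct{}
}

func newEmulator() *emulator {
	return &emulator{exceptions: map[string]struct{}{}}
}

// apply processes rules in the order they are given.
func (e *emulator) apply(r rule) {
	switch {
	case r.wildcard:
		// A wildcard resets the state and decides allow-list vs deny-list.
		e.defaultAllow = r.allow
		e.exceptions = map[string]struct{}{}
	case r.allow == e.defaultAllow:
		// Already covered by the default: drop any exception, store nothing.
		delete(e.exceptions, r.key)
	default:
		e.exceptions[r.key] = struct{}{}
	}
}

func main() {
	e := newEmulator()
	for _, r := range []rule{
		{wildcard: true, allow: false}, // deny everything by default
		{key: "c 1:3", allow: true},
		{key: "c 1:5", allow: true},
		{key: "c 1:5", allow: false}, // cancels the previous rule
	} {
		e.apply(r)
	}
	fmt.Println("default allow:", e.defaultAllow, "exceptions:", len(e.exceptions))
}
```

In this toy model a filter generator only needs the final default and the remaining exceptions, not the original rule list.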
In the normal cases (only one existing filter or no existing filters), just make use of BPF_F_REPLACE if there is one existing filter. However, if there is more than one filter applied, we should probably remove all other filters since the alternative is that we will never remove our old filters.
The only two other viable ways of solving this problem would be to use BPF pins to either pin the eBPF program using a predictable name (so we can always only replace *our* programs) or to switch away from custom programs and instead use eBPF maps (which are pinned) and thus we just update the map contents to update the ruleset. Unfortunately these both would add a hard requirement of bpffs and would require at least a minor rewrite of the eBPF filtering code -- which is better left for another time.
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
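The three cases described in this commit message can be sketched as a pure decision function. `attachPlan` and `planAttach` are hypothetical names for illustration, program IDs are plain ints, and detaching the old programs only after the new one is attached is an assumption of this sketch rather than something the message states.

```go
package main

import "fmt"

// attachPlan describes how to install a new device filter.
// All names here are hypothetical; program IDs are plain ints.
type attachPlan struct {
	replaceID   int   // program to swap out via BPF_F_REPLACE (valid if useReplace)
	useReplace  bool  // true only when exactly one filter already exists
	detachAfter []int // old programs to detach once the new one is attached
}

func planAttach(existing []int) attachPlan {
	switch len(existing) {
	case 0:
		// No filter yet: a plain attach is enough.
		return attachPlan{}
	case 1:
		// Exactly one filter: replace it in place.
		return attachPlan{replaceID: existing[0], useReplace: true}
	default:
		// Several filters: attach the new one, then remove all old ones.
		return attachPlan{detachAfter: append([]int(nil), existing...)}
	}
}

func main() {
	for _, existing := range [][]int{nil, {7}, {7, 8, 9}} {
		fmt.Printf("%v -> %+v\n", existing, planAttach(existing))
	}
}
```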
This is to ensure that we aren't leaking eBPF programs after "runc update". Unfortunately we cannot directly test the behaviour of cgroup program updates in an integration test because "runc update" doesn't support that behaviour at the moment. So instead we rely on the fact that each "runc update" implicitly triggers the devices rules to be updated.
Without the previous patches applied, this new test will fail with errors (on cgroupv2 systems).
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
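Why repeated updates expose the leak can be shown with a toy simulation. Everything here is a stand-in: `fakeCgroup` just counts attached programs, the limit of 64 is the figure quoted in the review discussion, and one program is assumed to be attached already when the container is created.

```go
package main

import (
	"errors"
	"fmt"
)

// maxProgs mirrors the limit quoted in the review discussion; treat the
// exact value as an assumption of this sketch.
const maxProgs = 64

var errTooManyProgs = errors.New("too many programs attached")

// fakeCgroup counts attached device programs; it stands in for the kernel.
type fakeCgroup struct{ progs int }

// attach adds a program without removing anything (the leaky behaviour).
func (c *fakeCgroup) attach() error {
	if c.progs >= maxProgs {
		return errTooManyProgs
	}
	c.progs++
	return nil
}

// replace swaps the single attached program in place, so the count
// stays the same (or attaches if nothing is there yet).
func (c *fakeCgroup) replace() error {
	if c.progs == 0 {
		return c.attach()
	}
	return nil
}

// runUpdates applies n updates and returns the 1-based index of the first
// failing update, or 0 if all of them succeed.
func runUpdates(c *fakeCgroup, n int, update func(*fakeCgroup) error) int {
	for i := 1; i <= n; i++ {
		if err := update(c); err != nil {
			return i
		}
	}
	return 0
}

func main() {
	leaky := &fakeCgroup{progs: 1} // one program from container creation
	fixed := &fakeCgroup{progs: 1}
	fmt.Println("leaky fails at update:", runUpdates(leaky, 300, (*fakeCgroup).attach))
	fmt.Println("fixed fails at update:", runUpdates(fixed, 300, (*fakeCgroup).replace))
}
```

Under these assumptions the leaky variant stops working long before 300 updates, which is why a loop of a few hundred "runc update" calls is enough to catch a regression.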
@@ -11,6 +11,7 @@ import (
	"strconv"

	"github.com/cilium/ebpf/asm"
	devicesemulator "github.com/opencontainers/runc/libcontainer/cgroups/devices"
nit: devicesEmulator
LGTM
This is my attempt at solving both #2797 and #2366. I'm not entirely sure that the behaviour I've implemented for a cgroup with more than one filter installed (detach all old filters, so that only the new one is effective) is quite correct, but it's necessary to fix the program leak problem (short of switching to BPF pins entirely).
NOTE: I wanted to add some unit tests for our devicefilter code, but unfortunately BPF_PROG_TEST_RUN doesn't support cgroup programs at the moment. I would have to push some code upstream for that to happen (though I'll look into that next week).
Changelog entry:
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>