cgroupv2: ebpf: check for BPF_F_REPLACE support and degrade gracefully #2986
Conversation
@AkihiroSuda Is it possible to do a test moby CI run with this PR to make sure I'm actually fixing the original issue? Thanks.
@cyphar Yes, being tested in moby/moby#42450
Still failing 😢
Oh, my bad -- the cilium library doesn't automatically pass BPF_F_REPLACE, so we have to pass it explicitly.
I tried using the fix from this PR together with the test case in #3000 (on my Fedora 34 with the kernel 5.12.8-300.fc34.x86_64). It fixes the "found more than one filter (2) attached to a cgroup" warning for the non-systemd case (i.e. when using fs2 cgroup driver), but for systemd it's still there (this is with some additional debug for the test case itself):
I did not take a look at what happens, but I think this happens because we let systemd apply the device configuration, then apply it again using the fs2 driver. This fs[2].Set() call is and always was a problem, not just for devices but in general -- we let systemd set everything, then re-apply it using the fs[2] driver, which is against what the systemd docs tell us to do. OTOH, removing this entirely can be problematic, as not all resources can be set using systemd. Ultimately we should solve this by only using the fs drivers for those parameters that systemd can't set -- this is something I was thinking about but never got to implement.
Yeah the reason why we set it with both was because:
So while fixing this might seem like a good idea given (1) and (2), (3) makes this quite a bit more difficult and a bit worrying -- if tomorrow systemd added support for ControllerFoo, and we didn't set ControllerFoo in the fs controller when running under systemd, then neither runc nor systemd would set up ControllerFoo on older systemd versions. You could probably fix this with version or feature detection, though I'm not sure how expensive that would be.

However, with regards to this particular issue -- I guess this means systemd doesn't remove existing device policies when it adds its own (if it did then we would just
Can you elaborate?
Since systemd errors out when we try to set a parameter which it does not know about, we had to introduce and use a check for the systemd version. So we already know which parameters we do and do not set using systemd (at least in theory). Surely this can be coded much better; currently it looks like a band-aid (which I guess is OK, since it is only applied in two places).
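To make the idea above concrete, a per-parameter version gate could look roughly like the sketch below. This is a stand-alone illustration only: the property names, version cutoffs, and the `supportedProps` helper are made up for the example and are not runc's actual table or code.

```go
package main

import "fmt"

// minVersion maps a systemd resource property to the first systemd version
// that understands it. Illustrative values only, not runc's real table.
var minVersion = map[string]int{
	"TasksMax":           227,
	"MemoryMax":          231,
	"CPUQuotaPeriodUSec": 242,
}

// supportedProps splits the requested properties into those we can safely
// hand to a systemd of the given version, and those that would have to be
// applied via the fs driver instead (unknown or too-new properties make
// systemd error out, so they must not be sent to it).
func supportedProps(systemdVer int, want []string) (viaSystemd, viaFs []string) {
	for _, p := range want {
		if min, ok := minVersion[p]; ok && systemdVer >= min {
			viaSystemd = append(viaSystemd, p)
		} else {
			viaFs = append(viaFs, p)
		}
	}
	return viaSystemd, viaFs
}

func main() {
	sd, fs := supportedProps(230, []string{"MemoryMax", "TasksMax", "CPUQuotaPeriodUSec"})
	fmt.Println(sd, fs) // [TasksMax] [MemoryMax CPUQuotaPeriodUSec]
}
```

With a table like this, the fs driver would only be used for the `viaFs` remainder, instead of unconditionally re-applying everything after systemd.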
It seems that this warning has become a common case -- maybe demote it to debug?
Sure. systemd has (at least in the past; I don't know if this is still the case today) had a habit of writing to the cgroup files of running services (with the settings it thinks the service should have) -- I remember there was a systemd service on SLES which would trigger this every once in a while, but we couldn't really pin down the original cause. It's possible this was a bug of some kind, but we never managed to figure out what was causing it. I also don't know how much this is or is not true today.

I appreciate this is kind of vague, but it happened a while ago and was a really frustrating bug to try to nail down, so I'm just a little bit paranoid about it coming back. 😅
Ah sorry, you're quite right. I was mixing up systemd's behaviour here. Using a version check like you suggested is probably an okay solution.
Sure. It was mostly a warning because it indicates that some other process is messing with our cgroup policies (and that those policies will be deleted by us), but since it appears to be common under systemd I'll make it debug (it sucks that INFO doesn't go to stderr, since INFO is probably a better log level to use...).
I think it does (AFAIR everything from logrus goes to stderr).
Ah, I was going off what you said earlier. I expected it to go to stderr, but let me double-check.
Ah, past me strikes back 🤦🏻 I did check it this time:

```console
[kir@kir-rhat x]$ cat a.go
package main

import "github.com/sirupsen/logrus"

func main() {
	logrus.SetLevel(logrus.TraceLevel)
	logrus.Trace("trace")
	logrus.Debug("debug")
	logrus.Info("info")
	logrus.Warn("warn")
	logrus.Error("err")
	logrus.Fatal("fatal")
}
[kir@kir-rhat x]$ go run a.go
TRAC[0000] trace
DEBU[0000] debug
INFO[0000] info
WARN[0000] warn
ERRO[0000] err
FATA[0000] fatal
exit status 1
[kir@kir-rhat x]$ go run a.go 2>/dev/null
```
So, other than log level nuances, is this PR ready? Looks like moby/moby#42450 is still failing for some reason :(
If not, maybe we need to do rc96 instead -- mostly due to #2997, thanks to yours truly 🤦🏻, but also as a way to do one more test before GA. |
Yeah, I'll push the log-level change. It doesn't appear to fix the Moby CI issue, but it does fix several real issues so we should probably merge it anyway and I can work on figuring out what's going wrong on Moby's CI (I expect it's a kernel version issue -- Docker works perfectly fine on my machine with the 1.0.0 GA release I prepared).
It turns out that the cilium eBPF library doesn't degrade gracefully if BPF_F_REPLACE is not supported, so we need to work around it by treating that case as we treat the more-than-one-program case. It also turns out that we weren't passing BPF_F_REPLACE explicitly, but this is required by the cilium library (causing EINVALs).

Fixes: d0f2c25 ("cgroup2: devices: replace all existing filters when attaching")
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
It seems that we are triggering the multi-attach fallback in the Fedora CI, but we don't have enough debugging information to really know what's going on, so add some. Unfortunately the amount of information we have available for eBPF programs in general is fairly limited (we can't get their bytecode, for instance).

We also demote the "more than one filter" warning to an info message, because it happens very often under the systemd cgroup driver (likely when systemd configures the cgroup it isn't deleting our old program, so when our apply code runs after the systemd one there are two running programs).

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
LGTM
FYI, Moby CI uses Ubuntu 20.04.2 LTS with kernel 5.4.0-1048-aws, which might be older than you expect.
Opened a new issue for tracking: #3008
@AkihiroSuda Yeah, I looked at that a few days ago, and I've been trying to get LXD VMs to work again on openSUSE so I can try to reproduce it locally...