Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

libct/cg/sd: use systemd version when generating device properties #3842

Merged
merged 1 commit into from
Apr 25, 2023

Conversation

kolyshkin
Copy link
Contributor

This is a continuation of a saga that started in PR #3498 and continued in PR #3623.

Commit 343951a added a call to os.Stat for the device path when generating systemd device properties, to avoid systemd warning for non-existing devices. The idea was, since systemd uses stat(2) to look up device properties for a given path, it will fail anyway. In addition, this allowed to suppress a warning like this from systemd:

Couldn't stat device /dev/char/10:200

NOTE that this was done because:

  • systemd could not add the rule anyway;
  • runs puts its own set of rules on top of what systemd does.

Apparently, the above change broke some setups, resulting in inability to use e.g. /dev/null inside a container. My guess is this is because in cgroup v2 we add a second eBPF program, which is not used if the first one (added by systemd) returns "access denied".

Next, commit 3b95828 fixed that by adding a call to os.Stat for "/sys/"+path (meaning, if "/dev/char/10:200" does not exist, we retry with "/sys/dev/char/10:200", and if it exists, proceed with adding a device rule with the original (non-"/sys") path).

How that second fix ever worked was a mystery, because the path we gave to systemd still doesn't exist.

Well, I think now I know.

Since systemd v240 (systemd commit 74c48bf5a8005f20) device access rules specified as /dev/{block|char}/MM:mm are no longer looked up on the filesystem, instead, if possible, those are parsed from the string.

So, we need to do different things, depending on systemd version:

  • for systemd >= v240, use the /dev/{char,block}/MM:mm as is, without doing stat() -- since systemd doesn't do stat() either;
  • for older version, check if the path exists, and skip passing it on to systemd otherwise.
  • the check for /sys/dev/{block,char}/MM:mm is not needed in either case.

Pass the systemd version to the function that generates the rules, and fix it accordingly.

@kolyshkin
Copy link
Contributor Author

My guess is this is because in cgroup v2 we add a second eBPF program, which is not used if the first one (added by systemd) returns "access denied".

@cyphar if the above is true, maybe systemd cgroup v2 driver should not add any eBPF rules at all, as they are not used anyway, instead relying on systemd to set those. Since in case of cgroup v2 we'll surely use sufficiently new systemd, there is no problem of inability to add a rule to systemd because of a non-existing device file.

@kolyshkin
Copy link
Contributor Author

This should also fix #3708 (when a systemd >= v240 is used, that is).

@kolyshkin kolyshkin added backport/1.1-pr A backport PR to release-1.1 backport/1.1-todo A PR in main branch which needs to be backported to release-1.1 and removed backport/1.1-pr A backport PR to release-1.1 labels Apr 24, 2023
@kolyshkin kolyshkin requested a review from cyphar April 24, 2023 23:23
@kolyshkin
Copy link
Contributor Author

CI failure is a network glitch; CI restarted

not ok 138 spec validator
# (in test file tests/integration/spec.bats, line 32)
#   `git clone https://github.com/opencontainers/runtime-spec.git' failed with status 128
# runc spec (status=0):
#
# Cloning into 'runtime-spec'...
# fatal: unable to access 'https://github.com/opencontainers/runtime-spec.git/': Failed to connect to github.com port 443: Connection timed out

Commit 343951a added a call to os.Stat for the device path
when generating systemd device properties, to avoid systemd warning for
non-existing devices. The idea was, since systemd uses stat(2) to look
up device properties for a given path, it will fail anyway. In addition,
this allowed to suppress a warning like this from systemd:

> Couldn't stat device /dev/char/10:200

NOTE that this was done because:
 - systemd could not add the rule anyway;
 - runs puts its own set of rules on top of what systemd does.

Apparently, the above change broke some setups, resulting in inability
to use e.g. /dev/null inside a container. My guess is this is because
in cgroup v2 we add a second eBPF program, which is not used if the
first one (added by systemd) returns "access denied".

Next, commit 3b95828 fixed that by adding a call to os.Stat for
"/sys/"+path (meaning, if "/dev/char/10:200" does not exist, we retry
with "/sys/dev/char/10:200", and if it exists, proceed with adding a
device rule with the original (non-"/sys") path).

How that second fix ever worked was a mystery, because the path we gave
to systemd still doesn't exist.

Well, I think now I know.

Since systemd v240 (commit 74c48bf5a8005f20) device access rules
specified as /dev/{block|char}/MM:mm are no longer looked up on the
filesystem, instead, if possible, those are parsed from the string.

So, we need to do different things, depending on systemd version:

 - for systemd >= v240, use the /dev/{char,block}/MM:mm as is, without
   doing stat() -- since systemd doesn't do stat() either;
 - for older version, check if the path exists, and skip passing it on
   to systemd otherwise.
 - the check for /sys/dev/{block,char}/MM:mm is not needed in either
   case.

Pass the systemd version to the function that generates the rules, and
fix it accordingly.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
@mrunalp mrunalp merged commit bf6a78c into opencontainers:main Apr 25, 2023
@kolyshkin kolyshkin added backport/1.1-done A PR in main branch which has been backported to release-1.1 and removed backport/1.1-todo A PR in main branch which needs to be backported to release-1.1 labels Apr 26, 2023
@cyphar
Copy link
Member

cyphar commented Apr 26, 2023

LGTM, fwiw.

@vissible
Copy link

Apparently, the above change broke some setups, resulting in inability to use e.g. /dev/null inside a container. My guess is this is because in cgroup v2 we add a second eBPF program, which is not used if the first one (added by systemd) returns "access denied".

Where is the second eBPF program added? I met an issue related to the second eBPF program, the details please refer to NVIDIA/nvidia-container-toolkit#192

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/systemd backport/1.1-done A PR in main branch which has been backported to release-1.1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants