Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[remote] error creating cgroup [...]: dial unix /run/systemd/private: connect: EPERM #12362

Closed
edsantiago opened this issue Nov 19, 2021 · 18 comments · Fixed by #12388
Closed
Labels
flakes Flakes from Continuous Integration locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. remote Problem is in podman-remote

Comments

@edsantiago
Copy link
Member

New podman-remote flake. Two triggers, both in the command completion system test:

[+0573s] not ok 278 podman shell completion test
...
# # podman-remote --url unix:/tmp/podman_tmp_ykpP pod create --name created-5suGEc5kKDV9H0xfZLyMj77He0nHz9
# Error: failed to make pod: unable to create pod cgroup for pod 45b229948ea8dfc8b9fb6aa4c4e16e8eb3e03b6b7d0726054138d701b2d7a759: error creating cgroup machine.slice/machine-libpod_pod_45b229948ea8dfc8b9fb6aa4c4e16e8eb3e03b6b7d0726054138d701b2d7a759.slice: dial unix /run/systemd/private: connect: permission denied
# [ rc=125 (** EXPECTED 0 **) ]

Seen November 17 and 18 on different PRs, both times sys remote f34 root.

@edsantiago edsantiago added flakes Flakes from Continuous Integration remote Problem is in podman-remote labels Nov 19, 2021
@edsantiago
Copy link
Member Author

This one is getting nasty. It took me three or four re-runs to get CI passing in my PR.

[sys] 278 podman shell completion test

@edsantiago
Copy link
Member Author

And I think this is the same symptom, in unit tests. The EPERM is on /proc/self/fd/X instead of /run/systemd/private, but it still looks pretty similar to me:

Running: podman-remote [options] exec 3effc3275cba6b2a0c4415a0f389bb54f69bec9cda358d52cfbc26f4e3778297 cat /volume0/test.output
time="2021-11-20T08:35:38-06:00" level=error msg="error attaching to container 3effc3275cba6b2a0c4415a0f389bb54f69bec9cda358d52cfbc26f4e3778297 exec session f38e439288313359a83ab30e543e24a37ce6a0193e136b4aae4c82c1da43f027: dial unixpacket /proc/self/fd/17: connect: permission denied"
Error: dial unixpacket /proc/self/fd/17: connect: permission denied

@rhatdan
Copy link
Member

rhatdan commented Nov 22, 2021

@Luap99 PTAL

@Luap99
Copy link
Member

Luap99 commented Nov 22, 2021

The failure has nothing to do with the shell completion test IMO. It fails at a simple pod create. I cannot help to debug this.

@edsantiago
Copy link
Member Author

I cannot reproduce this on a 1minutetip VM. I tried this and let it run for 4 hours, no failures:

[build podman from sources @ main]
# bin/podman system service -t 0 &
[...]
# while :;do env PODMAN=$(pwd)/bin/podman-remote hack/bats completion || break;done

I'm seeing a weird EPERM in a completely unrelated test (Static Build):

...
copying path '/nix/store/3lqc09mn31yd05mcy6gcd9nsbcwwphic-cachix-0.6.1' from 'https://cache.nixos.org'...
building '/nix/store/q67vi5mmd5ir8k5xmnsbbqssc4prnmrb-user-environment.drv'...
created 25 symlinks in user environment
Configured https://podman.cachix.org binary cache in /etc/nix/nix.conf
unpacking 'https://github.com/nixos/nixpkgs/archive/2a96414d7e350160a33ed0978449c9ff5b5a6eb3.tar.gz'...
error: getting status of '/var/tmp/go/src/github.com/containers/podman/contrib/systemd/user': Permission denied

I've restarted that test, operating on the assumption that it's a flake. I don't know if it's related, and if it is, I can't imagine what it would mean.

@edsantiago
Copy link
Member Author

In the last two days (as in: Saturday, Sunday, Monday up to now). This is a nasty one.

[sys] 278 podman shell completion test

@Luap99
Copy link
Member

Luap99 commented Nov 23, 2021

@giuseppe Can you look at this, it looks like podman cannot create the pod crgoup via systemd.

@rhatdan
Copy link
Member

rhatdan commented Nov 23, 2021

Something is giving permission denied. The process running Podman is running as root and unconfined_t, and the socket is 777.
To me this indicates that this is not DAC (ownership permissions) or SELinux causing the issue. But perhaps systemd returning permission denied when podman tries to talk to the socket.

@giuseppe
Copy link
Member

I see the following AVC in the Cirrus logs:

Nov 22 10:53:11 cirrus-task-5335687059734528 audit[4369]: AVC avc:  denied  { connectto } for  pid=4369 comm="podman" path="/run/systemd/private" scontext=system_u:system_r:container_t:s0:c331,c658 tcontext=system_u:system_r:init_t:s0 tclass=unix_stream_socket permissive=0

@edsantiago
Copy link
Member Author

Oh, this is interesting. This time the failure happens near the end of the test, not (as usual) in the beginning:

# # podman-remote --url unix:/tmp/podman_tmp_5Gy1 image untag quay.io/libpod/testimage:20210610 ruxi3jhsy9llrmm4tgjgp3xnbnv9vb:3uBMd
# # podman-remote --url unix:/tmp/podman_tmp_5Gy1 pod rm -t 0 --force created-OH66XoYcPXs8h81WbhYzwyIl4Tm7zP
# 62e2fa62631c532b8472336c4e735dc81a711f10b0bfe6302ee7def215080952
# # podman-remote --url unix:/tmp/podman_tmp_5Gy1 pod rm -t 0 --force running-OH66XoYcPXs8h81WbhYzwyIl4Tm7zP
# Error: error removing pod a68d7a7bb5c7d09dff295a71d7ee98339db586c27d19b8bce0804fc0eb7aa205 cgroup: dial unix /run/systemd/private: connect: permission denied
# [ rc=125 (** EXPECTED 0 **) ]
# #/vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
# #| FAIL: exit code is 125; expected 0
# #\^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
# # [teardown]
# # podman-remote --url unix:/tmp/podman_tmp_5Gy1 pod rm -t 0 --all --force
# Error: error removing pod 45a182b1c1ec5f1b4443c27387a375d506b133b6b940fa6484a45a0b0108ae9e cgroup: dial unix /run/systemd/private: connect: permission denied
# Error: error removing pod f657fb3dfe58a9a1db278893230903b18c299440d637886a2c9e2dd18763351b cgroup: dial unix /run/systemd/private: connect: permission denied
# [ rc=125 ]

What I find interesting:

  • One pod rm succeeds.
  • Immediately afterward, another pod rm fails.
  • teardown is invoked, and pod rm fails therein, which indicates that the EPERM is not transient

I know I shouldn't bash systemd, but I'm wondering, is it completely impossible that this is a systemd bug?

@rhatdan
Copy link
Member

rhatdan commented Nov 23, 2021

The AVC indicates that podman inside of a container is attempting to connect to systemd.

Nov 22 10:53:11 cirrus-task-5335687059734528 audit[4369]: AVC avc:  denied  { connectto } for  pid=4369 comm="podman" 
path="/run/systemd/private" scontext=system_u:system_r:container_t:s0:c331,c658 tcontext=system_u:system_r:init_t:s0 tclass=unix_stream_socket permissive=0

This is being blocked by SELinux and never getting to systemd. The question is why is podman running within a locked down container when this happens.

@Luap99
Copy link
Member

Luap99 commented Nov 23, 2021

The common factor are remote tests, can the server change the label of itself instead of the container by accident?

@rhatdan
Copy link
Member

rhatdan commented Nov 23, 2021

Theoretically yes, but I do not see how, The likely candidate was right here:
https://github.com/containers/podman/blob/main/libpod/oci_conmon_linux.go#L1436-L1450
But this code should have changed the label to "s0", but the avc Giuseppe shows above is s0:c331,c658

@rhatdan
Copy link
Member

rhatdan commented Nov 23, 2021

selinux.SetTaskLabel would change the label of the service process, but no one calls that.
SetProcessLabel() is setting the label of executables exec'd by podman.
@edsantiago Any chance we know the PID of the service. If we new the server was running as pid=4369 then we would know that it is the process causing the issue. (Which is most likely the case).

@rhatdan
Copy link
Member

rhatdan commented Nov 23, 2021

There are actually more AVCs.

type=AVC msg=audit(1637692993.270:5093): avc:  denied  { connectto } for  pid=4376 comm="podman" path="/run/dbus/system_bus_socket" scontext=system_u:system_r:container_t:s0:c305,c326 tcontext=system_u:system_r:system_dbusd_t:s0-s0:c0.c1023 tclass=unix_stream_socket permissive=0
type=AVC msg=audit(1637692993.270:5094): avc:  denied  { connectto } for  pid=4376 comm="podman" path="/run/systemd/private" scontext=system_u:system_r:container_t:s0:c305,c326 tcontext=system_u:system_r:init_t:s0 tclass=unix_stream_socket permissive=0
type=AVC msg=audit(1637692993.332:5095): avc:  denied  { connectto } for  pid=4376 comm="podman" path="/run/dbus/system_bus_socket" scontext=system_u:system_r:container_t:s0:c305,c326 tcontext=system_u:system_r:system_dbusd_t:s0-s0:c0.c1023 tclass=unix_stream_socket permissive=0
type=AVC msg=audit(1637692993.332:5096): avc:  denied  { connectto } for  pid=4376 comm="podman" path="/run/systemd/private" scontext=system_u:system_r:container_t:s0:c305,c326 tcontext=system_u:system_r:init_t:s0 tclass=unix_stream_socket permissive=0

@edsantiago
Copy link
Member Author

There are provisions for cat'ing the server log, but I don't see them triggered anywhere. If you're still playing with your reproducer PR, you could try:

diff --git a/Makefile b/Makefile
index ceecda274..8cef891a4 100644
--- a/Makefile
+++ b/Makefile
@@ -586,6 +586,7 @@ remotesystem:
                rc=$$?;\
                kill %1;\
                rm -f $$SOCK_FILE;\
+               echo "------------";echo "server log";cat $(PODMAN_SERVER_LOG);\
        else \
                echo "Skipping $@: 'timeout -v' unavailable'";\
        fi;\

@rhatdan
Copy link
Member

rhatdan commented Nov 23, 2021

@rhatdan
Copy link
Member

rhatdan commented Nov 23, 2021

We just turned on remote checkpoint testing, and the previous tests before the failing test is a checkpoint test.
I am no Hercule Poirot, but I think this is the issue.

rhatdan added a commit to rhatdan/podman that referenced this issue Nov 23, 2021
This should fix the SELinux issue we are seeing with talking to
/run/systemd/private.

Fixes: containers#12362

Also unset the XDG_RUNTIME_DIR if set, since we don't know when running
as a service if this will cause issue.s

Signed-off-by: Daniel J Walsh <dwalsh@redhat.com>
@github-actions github-actions bot added the locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. label Sep 21, 2023
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Sep 21, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
flakes Flakes from Continuous Integration locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. remote Problem is in podman-remote
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants