podman exec: periodic 30-second CPU-hogging hangs #10701
Comments
One would think we are leaking something. Perhaps mount points. Have you tried this as root? If this works fine as root and not as rootless, that might point to something with fuse-overlayfs.

Yep, tried root and rootless (see final line of OP). No leaked mounts that I can see (I even checked while the exec hang was in place; no new mounts).

Can't reproduce. Had it running for the last 5 minutes without any effect. F34, latest main branch.

Oh, weird. It reproduces within seconds for me as root, and within ~1 minute rootless. Also F34, kernel 5.11.17-300.fc34. With root I can't even get three iterations of the loop without a hang.
$ sudo ./bin/podman info
host:
arch: amd64
buildahVersion: 1.21.1
cgroupControllers:
- cpuset
- cpu
- io
- memory
- hugetlb
- pids
cgroupManager: systemd
cgroupVersion: v2
conmon:
package: conmon-2.0.27-2.fc34.x86_64
path: /usr/bin/conmon
version: 'conmon version 2.0.27, commit: '
cpus: 8
distribution:
distribution: fedora
version: "34"
eventLogger: journald
hostname: f.edsantiago.com
idMappings:
gidmap: null
uidmap: null
kernel: 5.11.17-300.fc34.x86_64
linkmode: dynamic
memFree: 2526560256
memTotal: 33506283520
ociRuntime:
name: crun
package: crun-0.20.1-1.fc34.x86_64
path: /usr/bin/crun
version: |-
crun version 0.20.1
commit: 0d42f1109fd73548f44b01b3e84d04a279e99d2e
spec: 1.0.0
+SYSTEMD +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +CRIU +YAJL
os: linux
remoteSocket:
path: /run/podman/podman.sock
security:
apparmorEnabled: false
capabilities: CAP_CHOWN,CAP_DAC_OVERRIDE,CAP_FOWNER,CAP_FSETID,CAP_KILL,CAP_NET_BIND_SERVICE,CAP_SETFCAP,CAP_SETGID,CAP_SETPCAP,CAP_SETUID,CAP_SYS_CHROOT
rootless: false
seccompEnabled: true
seccompProfilePath: /usr/share/containers/seccomp.json
selinuxEnabled: true
serviceIsRemote: false
slirp4netns:
executable: ""
package: ""
version: ""
swapFree: 8543531008
swapTotal: 8589930496
uptime: 1053h 44m 24.35s (Approximately 43.88 days)
registries:
search:
- registry.fedoraproject.org
- registry.access.redhat.com
- docker.io
- quay.io
store:
configFile: /etc/containers/storage.conf
containerStore:
number: 1
paused: 0
running: 1
stopped: 0
graphDriverName: overlay
graphOptions:
overlay.mountopt: nodev,metacopy=on
graphRoot: /var/lib/containers/storage
graphStatus:
Backing Filesystem: btrfs
Native Overlay Diff: "false"
Supports d_type: "true"
Using metacopy: "true"
imageStore:
number: 1
runRoot: /run/containers/storage
volumePath: /var/lib/containers/storage/volumes
version:
APIVersion: 3.3.0-dev
Built: 1623872418
BuiltTime: Wed Jun 16 13:40:18 2021
GitCommit: 2509a81c343314fa452a0215d05e9d74ab4ec15c
GoVersion: go1.16.3
OsArch: linux/amd64
Version: 3.3.0-dev
Well, drat. I brought up a 1mt VM, 5.11.15-300.fc34, and can't reproduce in almost an hour of looping (using test-and-break so I could leave it unattended):

# while :; do t=$SECONDS; podman exec foo date; if [[ $(($SECONDS - $t)) -gt 5 ]]; then break; fi; done
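The test-and-break one-liner above can also be written as a small reusable function. This is a sketch: the name probe_exec_hang and its argument handling are illustrative, the 5-second threshold comes from the one-liner, and foo is the container name used in the report.

```shell
#!/usr/bin/env bash
# probe_exec_hang CMD [LIMIT]: run CMD in a loop and stop as soon as a
# single invocation takes longer than LIMIT seconds (default 5), i.e.
# as soon as the hang described in this issue occurs.
probe_exec_hang() {
    local cmd=$1 limit=${2:-5} t
    while :; do
        t=$SECONDS
        $cmd
        if (( SECONDS - t > limit )); then
            echo "hang: '$cmd' took $((SECONDS - t))s"
            return 1
        fi
    done
}

# Usage from the thread:
#   probe_exec_hang "podman exec foo date"
```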
Hypothesis: it might be the systemd journal GC.

# journalctl --disk-usage
Archived and active journals take up 4.0G in the file system.

I have a 1mt VM running, and am trying my best to fill up the journal by piping binaries through
Does journald do throttling?
@msekletar, have you seen such symptoms where writes to the journal are substantially delayed?
@vrothberg I don't recall any recent issue regarding delayed writes. In general, writing to journald may cause clients to block (as with any other syslog daemon), hence logging code should be able to deal with this condition. The question "Is the message we are about to log critical enough to stall the entire application in case the logging subsystem is busy or otherwise not responsive?" should be answered (and logging implemented accordingly) by the application developers. Btw, I know it is easy to ask the question and much harder to come up with the correct implementation of the answer.
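For reference, journald does rate-limit on a per-service basis, but messages over the limit are dropped rather than delayed; the relevant knobs live in /etc/systemd/journald.conf. The values below are a sketch showing the defaults as documented in journald.conf(5):

```ini
[Journal]
# Per-service rate limiting: if a service logs more than RateLimitBurst
# messages within RateLimitIntervalSec, further messages in that
# interval are dropped (not queued or delayed).
RateLimitIntervalSec=30s
RateLimitBurst=10000
```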
A friendly reminder that this issue had no activity for 30 days.
I can still reproduce this on my laptop, which has been rebooted since the last time I checked in here.
I bisected a bit, and it looks like it started with commit 341e6a1.
Commit 341e6a1 made sure that all exec sessions are getting cleaned up. But it also came with a performance penalty. Fix that penalty by spawning the cleanup process to really only clean up the exec session without attempting to remove the container. [NO TESTS NEEDED] since we have no means to test such performance issues in CI. Fixes: containers#10701 Signed-off-by: Valentin Rothberg <rothberg@redhat.com>
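To check whether a build with the fix still shows the stalls, one can time each exec invocation and watch for outliers. This is a sketch: time_runs is a hypothetical helper, and "podman exec foo date" is the command from the report.

```shell
#!/usr/bin/env bash
# time_runs CMD N: run CMD N times and print each run's wall-clock
# duration in whole seconds; periodic ~30s outliers indicate the bug
# described in this issue is still present.
time_runs() {
    local cmd=$1 n=$2 i t
    for ((i = 1; i <= n; i++)); do
        t=$SECONDS
        $cmd
        echo "run $i: $((SECONDS - t))s"
    done
}

# e.g.: time_runs "podman exec foo date" 20
```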
Now just keep watching; it might take a minute or so. When you hear your CPU fan go loud, hit RETURN so you can mark the spot where it hangs. The next entry will be timestamped 30 seconds after the previous. Example:
master @ 2509a81, root and rootless.