
podman exec: periodic 30-second CPU-hogging hangs #10701

Closed
edsantiago opened this issue Jun 16, 2021 · 13 comments · Fixed by #11055
Labels
flakes (Flakes from Continuous Integration)
locked - please file new issue/PR (Assist humans wanting to comment on an old issue or PR with locked comments)

Comments

@edsantiago
Member

$ ./bin/podman run -d --name foo quay.io/libpod/testimage:20210610 top
0af6b438059a10b47bc78d15226433c3a794cdb32754c608dc1756ad9dd5901d
$ while :;do ./bin/podman exec foo date;done
Wed Jun 16 20:20:57 UTC 2021
Wed Jun 16 20:20:58 UTC 2021
Wed Jun 16 20:20:58 UTC 2021

Now just keep watching; it might take a minute or so. When you hear your CPU fan go loud, hit RETURN so you can mark the spot where it hangs. The next entry will be timestamped 30 seconds after the previous one. Example:

Wed Jun 16 20:21:13 UTC 2021
Wed Jun 16 20:21:13 UTC 2021

Wed Jun 16 20:21:41 UTC 2021

Wed Jun 16 20:22:11 UTC 2021
Wed Jun 16 20:22:12 UTC 2021

master @ 2509a81, root and rootless.

@edsantiago added the flakes (Flakes from Continuous Integration) label Jun 16, 2021
@rhatdan
Member

rhatdan commented Jun 16, 2021

One would think we are leaking something. Perhaps mount points.

Have you tried this as root? If it works fine as root but not as rootless, that might point to something with fuse-overlayfs.

@edsantiago
Member Author

Yep, tried root and rootless (see final line of OP). No leaked mounts that I can see (I even checked while an exec hang was in progress; no new mounts).

@mheon
Member

mheon commented Jun 16, 2021

Can't reproduce. Had it running for the last 5 minutes without any effect. F34, latest main branch.

@edsantiago
Member Author

Oh, weird. It reproduces within seconds for me as root, and ~1 minute rootless. Also f34, 5.11.17-300.fc34. With root I can't even get three iterations of the loop without a hang.

@edsantiago
Member Author

$ sudo ./bin/podman info
host:
  arch: amd64
  buildahVersion: 1.21.1
  cgroupControllers:
  - cpuset
  - cpu
  - io
  - memory
  - hugetlb
  - pids
  cgroupManager: systemd
  cgroupVersion: v2
  conmon:
    package: conmon-2.0.27-2.fc34.x86_64
    path: /usr/bin/conmon
    version: 'conmon version 2.0.27, commit: '
  cpus: 8
  distribution:
    distribution: fedora
    version: "34"
  eventLogger: journald
  hostname: f.edsantiago.com
  idMappings:
    gidmap: null
    uidmap: null
  kernel: 5.11.17-300.fc34.x86_64
  linkmode: dynamic
  memFree: 2526560256
  memTotal: 33506283520
  ociRuntime:
    name: crun
    package: crun-0.20.1-1.fc34.x86_64
    path: /usr/bin/crun
    version: |-
      crun version 0.20.1
      commit: 0d42f1109fd73548f44b01b3e84d04a279e99d2e
      spec: 1.0.0
      +SYSTEMD +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +CRIU +YAJL
  os: linux
  remoteSocket:
    path: /run/podman/podman.sock
  security:
    apparmorEnabled: false
    capabilities: CAP_CHOWN,CAP_DAC_OVERRIDE,CAP_FOWNER,CAP_FSETID,CAP_KILL,CAP_NET_BIND_SERVICE,CAP_SETFCAP,CAP_SETGID,CAP_SETPCAP,CAP_SETUID,CAP_SYS_CHROOT
    rootless: false
    seccompEnabled: true
    seccompProfilePath: /usr/share/containers/seccomp.json
    selinuxEnabled: true
  serviceIsRemote: false
  slirp4netns:
    executable: ""
    package: ""
    version: ""
  swapFree: 8543531008
  swapTotal: 8589930496
  uptime: 1053h 44m 24.35s (Approximately 43.88 days)
registries:
  search:
  - registry.fedoraproject.org
  - registry.access.redhat.com
  - docker.io
  - quay.io
store:
  configFile: /etc/containers/storage.conf
  containerStore:
    number: 1
    paused: 0
    running: 1
    stopped: 0
  graphDriverName: overlay
  graphOptions:
    overlay.mountopt: nodev,metacopy=on
  graphRoot: /var/lib/containers/storage
  graphStatus:
    Backing Filesystem: btrfs
    Native Overlay Diff: "false"
    Supports d_type: "true"
    Using metacopy: "true"
  imageStore:
    number: 1
  runRoot: /run/containers/storage
  volumePath: /var/lib/containers/storage/volumes
version:
  APIVersion: 3.3.0-dev
  Built: 1623872418
  BuiltTime: Wed Jun 16 13:40:18 2021
  GitCommit: 2509a81c343314fa452a0215d05e9d74ab4ec15c
  GoVersion: go1.16.3
  OsArch: linux/amd64
  Version: 3.3.0-dev

@edsantiago
Member Author

Well, drat. I brought up a 1mt VM, 5.11.15-300.fc34, and can't reproduce in almost an hour of looping (using test-and-break so I could leave it unattended):

# while :;do t=$SECONDS;podman exec foo date;if [[ $(($SECONDS - $t)) -gt 5 ]]; then break;fi;done

@edsantiago
Member Author

Hypothesis: it might be the systemd journal GC.

# journalctl --disk-usage
Archived and active journals take up 4.0G in the file system.
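
If journal size really is the trigger, shrinking the journal should make the hang disappear for a while. Untested idea, just noting it as a possible cross-check:

# untested: shrink archived journals, then re-run the exec loop
journalctl --vacuum-size=200M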

If I add --events-backend=file to my reproducer (both podman-run and podman-exec), the problem goes away. I cannot reproduce the hang, even over the course of many minutes.
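
Concretely, the variant that does not hang looks roughly like this (a sketch; --events-backend is passed here as podman's global option so it applies to both commands):

$ ./bin/podman --events-backend=file run -d --name foo quay.io/libpod/testimage:20210610 top
$ while :;do ./bin/podman --events-backend=file exec foo date;done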

I have a 1mt VM running, and am trying my best to fill up the journal by piping binaries through logger. It's not filling up fast enough. Any suggestions?
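
For the record, the crude filler loop I'm running (a sketch; any large file will do):

# spam the journal as fast as logger will accept it
while :; do base64 /usr/bin/podman | logger; done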

@rhatdan
Member

rhatdan commented Jun 18, 2021

Does journald do throttling?

@vrothberg
Member

@msekletar, have you seen such symptoms where writes to the journal are substantially delayed?

@msekletar
Contributor

@vrothberg I don't recall any recent issue regarding delayed writes. In general, writing to journald may cause clients to block (as with any other syslog daemon), so logging code should be able to deal with this condition. The question "Is the message we are about to log critical enough to stall the entire application if the logging subsystem is busy or otherwise not responsive?" should be answered (and the logging implemented accordingly) by the application developers. Btw, I know it is easy to ask the question and much harder to come up with the correct implementation of the answer.
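
To make that concrete, here is a minimal, illustrative Go sketch (not podman's actual code; it assumes the github.com/coreos/go-systemd/v22/journal package) of treating a non-critical journal write as best-effort with a timeout, instead of letting a busy journald stall the caller:

package main

import (
	"fmt"
	"time"

	"github.com/coreos/go-systemd/v22/journal"
)

// logWithTimeout attempts to write msg to the journal but gives up after
// timeout, so a busy or unresponsive journald cannot stall the caller.
// Note: if journal.Send blocks forever, the goroutine is leaked; for a
// best-effort, non-critical message that may be an acceptable trade-off.
func logWithTimeout(msg string, timeout time.Duration) error {
	done := make(chan error, 1)
	go func() {
		// journal.Send can block if journald's sockets are backed up.
		done <- journal.Send(msg, journal.PriInfo, nil)
	}()
	select {
	case err := <-done:
		return err
	case <-time.After(timeout):
		return fmt.Errorf("journal write timed out after %s; message dropped", timeout)
	}
}

func main() {
	if err := logWithTimeout("exec session event", 2*time.Second); err != nil {
		// Non-critical message: report and keep going rather than hang.
		fmt.Println("warning:", err)
	}
}

Whether dropping the message or blocking is correct is exactly the per-message decision described above.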

@github-actions

A friendly reminder that this issue had no activity for 30 days.

@edsantiago
Member Author

I can still reproduce this on my laptop, which has been rebooted since the last time I checked in here.

...
Thu Jul 22 01:07:38 UTC 2021

Thu Jul 22 01:08:04 UTC 2021
Thu Jul 22 01:08:04 UTC 2021
Thu Jul 22 01:08:05 UTC 2021

Thu Jul 22 01:08:31 UTC 2021

Thu Jul 22 01:08:59 UTC 2021
Thu Jul 22 01:08:59 UTC 2021
...

@vrothberg
Member

I bisected a bit and it looks like this started with commit 341e6a1.

vrothberg added a commit to vrothberg/libpod that referenced this issue Jul 27, 2021
Commit 341e6a1 made sure that all exec sessions are getting cleaned up.
But it also came with a performance penalty.  Fix that penalty by
spawning the cleanup process to really only clean up the exec session
without attempting to remove the container.

[NO TESTS NEEDED] since we have no means to test such performance
issues in CI.

Fixes: containers#10701
Signed-off-by: Valentin Rothberg <rothberg@redhat.com>
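
For illustration only (my reading of the commit message, not the actual diff): the exit command spawned for an exec session conceptually becomes an exec-session-only cleanup, something along the lines of

# hypothetical shape of the exec cleanup invocation, not the real command line
podman container cleanup --exec <exec-session-id> <container-id>

rather than a full container cleanup that also attempts to remove the container.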
github-actions bot added the locked - please file new issue/PR label Sep 21, 2023
github-actions bot locked as resolved and limited conversation to collaborators Sep 21, 2023