Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

flake: unable to create pod cgroup: slice was already loaded or has a fragment file #1560

Open
edsantiago opened this issue Sep 11, 2024 · 7 comments

Comments

@edsantiago
Copy link
Member

So far it's a one-off, only on my laptop:

not ok 180 |200| podman pod create - hashtag AllTheOptions in 1798ms
# tags: ci:parallel 
# (from function `bail-now' in file test/system/helpers.bash, line 192,
#  from function `die' in file test/system/helpers.bash, line 958, 
#  from function `run_podman' in file test/system/helpers.bash, line 560,
#  in test file test/system/200-pod.bats, line 255)
#   `run_podman pod create --name=$podname                \' failed 
#
...
# Successfully tagged localhost/infra_image_t180-z1vymisr:latest
# 86bbdac7ba77e0f3c5ec60dbd4bd2483f06413b792dcb6f2e7de89df5a0f40c1
#
# [08:35:22.656994340] $ bin/podman pod create --name=pod-t180-z1vymisr --pod-id-file=/dev/shm/podman_bats.j48uzw/pod-id-file --hostname=eaijuxkkq6.hostt180-z1vymisr.net --add-host kfam2ieko4.qqloyvhwed.xyz:172.20.201.132 --dns 172.20.62.167 --dns-search nnh0i0kyxmesh0y.abc --dns-option ndots:219 --publish 6035:5080 --label label-frw9BFgOv6q=labelvalue-XsxLIK0sqSZymXGJ933nsX --infra-image infra_image_t180-z1vymisr --infra-command /pause_t180-z1vymisr --infra-name infra_container_t180-z1vymisr
# [08:35:23.082958461] Error: unable to create pod cgroup for pod 18e6e06291e0a3bcf40769a6359fa5c45bcf6978cd67ea20e02c7fa9f31c28f1: creating cgroup user.slice/user-14904.slice/user@14904.service/user.slice/user-libpod_pod_18e6e06291e0a3bcf40769a6359fa5c45bcf6978cd67ea20e02c7fa9f31c28f1.slice: Unit user-libpod_pod_18e6e06291e0a3bcf40769a6359fa5c45bcf6978cd67ea20e02c7fa9f31c28f1.slice was already loaded or has a fragment file.

I don't have privs on this repo for adding the flakes tag

I'm assuming this is a 1.17 issue, because I've never seen it before in gazillions of hours of testing.

@edsantiago
Copy link
Member Author

Seen again, different test:

not ok 192 |200| podman pod cleans cgroup and keeps limits in 924ms
# tags: ci:parallel
# (from function `bail-now' in file test/system/helpers.bash, line 192,
#  from function `die' in file test/system/helpers.bash, line 958,
#  from function `run_podman' in file test/system/helpers.bash, line 560,
#  in test file test/system/200-pod.bats, line 768)
#   `run_podman pod create --infra=$infra --memory=256M' failed
#
# [09:08:20.190255676] $ bin/podman pod create --infra=true --memory=256M
# [09:08:20.601836522] Error: unable to create pod cgroup for pod 35e64b5202197234d9b5d1633a7fdbeae3f5cba98fe90d2e68f1c09f66a8c1bc: creating cgroup user.slice/user-14904.slice/user@14904.service/user.slice/user-libpod_pod_35e64b5202197234d9b5d1633a7fdbeae3f5cba98fe90d2e68f1c09f66a8c1bc.slice: Unit user-libpod_pod_35e64b5202197234d9b5d1633a7fdbeae3f5cba98fe90d2e68f1c09f66a8c1bc.slice was already loaded or has a fragment file.

@edsantiago
Copy link
Member Author

not ok 191 |200| podman pod ps --filter in 1494ms
# tags: ci:parallel
# (from function `bail-now' in file test/system/helpers.bash, line 192,
#  from function `die' in file test/system/helpers.bash, line 958,
#  from function `run_podman' in file test/system/helpers.bash, line 560,
#  from function `thingy_with_unique_id' in file test/system/200-pod.bats, line 654,
#  in test file test/system/200-pod.bats, line 686)
#   `thingy_with_unique_id "pod" "--infra=false --name $podname" \' failed
#
# [11:05:02.208657911] $ bin/podman pod create --infra=false --name p-1-t191-say7jqkn
# [11:05:02.370536337] afb9854847372de9ceee76bedfea700a555c9c3de375eaa214fb400daab52c81
#
# [11:05:02.381064307] $ bin/podman container create --pod p-1-t191-say7jqkn --name p-1-t191-say7jqkn-c1 quay.io/libpod/testimage:20240123 true
# [11:05:02.449486889] ca6bea2d0e611db24b9b303d0a504e440199420eb0df451c270d27750cfcb4f9
#
# [11:05:02.459926846] $ bin/podman container create --pod p-1-t191-say7jqkn --name p-1-t191-say7jqkn-c2 quay.io/libpod/testimage:20240123 true
# [11:05:02.538244015] 7516d2be79a05427c1f0f79e8ef86c50317c8398a6b743795ea37f7f23f66029
#
# [11:05:02.553599237] $ bin/podman container create --pod p-1-t191-say7jqkn --name p-1-t191-say7jqkn-c3 quay.io/libpod/testimage:20240123 true
# [11:05:02.693390469] 6583b505dc0a26986747ca9246e54f3eace1312c869cf8724d9ff717d95b7cc1
#
# [11:05:02.714082921] $ bin/podman pod create --infra=false --name p-2-t191-say7jqkn
# [11:05:03.071879542] Error: unable to create pod cgroup for pod 16d27df565b8a21a20e6fa5469897cbb02f4e9294137b1bae2b82b321ca6252c: creating cgroup user.slice/user-14904.slice/user@14904.service/user.slice/user-libpod_pod_16d27df565b8a21a20e6fa5469897cbb02f4e9294137b1bae2b82b321ca6252c.slice: Unit user-libpod_pod_16d27df565b8a21a20e6fa5469897cbb02f4e9294137b1bae2b82b321ca6252c.slice was already loaded or has a fragment file.

@kwilczynski
Copy link
Member

kwilczynski commented Sep 11, 2024

@edsantiago, there aren't a lot of changes in 1.17 per:

Would you be able to bisect to see which commit causes this? Unless, perhaps, someone has an idea what it might be...

I also assume that when you use 1.16.1, for example, there aren't any issues?

@edsantiago
Copy link
Member Author

There was never a 1.16 for Fedora (any Fedora), so I jumped from 1.15.1 to 1.17.1.

I can try to bisect tomorrow, but can't promise anything. The bug is hard to reproduce, manifesting only about once in every three to four 15-minute test runs. If you (or anyone) feel like trying on your end:

  1. cherrypick WIP: system test parallelization: two-pass approach podman#23275
  2. hack/bats -T --tag='ci:parallel' &> some-log-file-you-can-examine-later

@edsantiago
Copy link
Member Author

It is possible that the culprit is c6ecb3b

My process: for each bisect step, run at least 10 iterations of my "reproducer". If I did not see the fragment slice failure in ten or more runs, I marked as good. That failure typically manifests within 2-4 runs, but it's still possible that I marked a bad commit as good. Anyhow, I hope this narrows it down.

@edsantiago
Copy link
Member Author

ARGH. I dnf downgraded back to 1.15-1, and now I'm seeing it there too:

# [08:49:43.881619178] $ bin/podman pod start 76a36ccaf2ebd1fe4ae1b0ff142006714edda1b9b91d0cd8c3ade3c5301433f4
# [08:49:44.400350904] Error: starting container 77f23dcb48b247a39687b133b3b51d0972b2b3481066689518a9c3c1f135484f: unable to create pod cgroup for pod 76a36ccaf2ebd1fe4ae1b0ff142006714edda1b9b91d0cd8c3ade3c5301433f4: creating cgroup user.slice/user-14904.slice/user@14904.service/user.slice/user-libpod_pod_76a36ccaf2ebd1fe4ae1b0ff142006714edda1b9b91d0cd8c3ade3c5301433f4.slice: Unit user-libpod_pod_76a36ccaf2ebd1fe4ae1b0ff142006714edda1b9b91d0cd8c3ade3c5301433f4.slice was already loaded or has a fragment file.
# Error: starting container 34dc713c0e058d41027e9b4ecb10186b01cbd723dc67e6033b6ce803bb31a75f: a dependency of container 34dc713c0e058d41027e9b4ecb10186b01cbd723dc67e6033b6ce803bb31a75f failed to start: container state improper

@edsantiago
Copy link
Member Author

This is NOT new in 1.17. It reproduces fairly reliably with 1.15-1.fc40, and the factor seems to be enabling 250-systemd tests in parallel. In particular, the generate systemd - envar test:

# bats test_tags=ci:parallel    <<<<<<<<< THIS IS WHAT MAKES THINGS FAIL
@test "podman generate systemd - envar" {

This test itself never fails, or at least not yet. What it does is cause other tests to fail. I can find nothing in this short test that would not play well with others, unless it's all the systemctl daemon-reloads?

Done for today. I don't know if this is a crun bug, or podman, or systemd.

@edsantiago edsantiago changed the title crun-1.17: new flake: unable to create pod cgroup: slice was already loaded or has a fragment file flake: unable to create pod cgroup: slice was already loaded or has a fragment file Sep 16, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants