podman stop: rootless netns ref counter out of sync, counter is at -1, resetting it back to 0 #21569

edsantiago · 2024-02-08T17:08:40Z

<+012ms> # $ podman pod rm test_pod
<+328ms> # time="2024-02-08T08:28:16-06:00" level=error msg="rootless netns ref counter out of sync, counter is at -1, resetting it back to 0"
         # 2ef9c5c99f00484285a5a7394e9d276d97c436311f197e7d2a0fa05ea78e69b5

Seen mostly in system tests, BUT ALSO IN E2E TESTS WHERE IT IS IMPOSSIBLE TO DETECT. Because it happens in test cleanup, where rm does not trigger a test failure. So I caught all these by accident. THIS IS HAPPENING A LOT MORE THAN WE ARE AWARE OF.

fedora-38 : sys podman fedora-38 rootless host boltdb
- PR Upgrade tests: reenable, but revamped #21535
  - 02-08 09:30 in [sys] [700] podman play --service-container
fedora-39 : int podman fedora-39 rootless host sqlite
- 02-06 08:21 in Podman kube play test with reserved init annotation in yaml
- 01-30 09:16 in Podman kube play override with tcp should keep udp from YAML file
- 01-18 08:40 in Podman kube play secret as volume support - multiple volumes
fedora-39β : sys podman fedora-39β rootless host sqlite
- PR CI: test overlay and vfs #20161
  - 10-17-2023 15:39 in [sys] podman play --service-container
rawhide : int podman rawhide rootless host sqlite
- 02-06 11:21 in Podman kube play secret as volume support - multiple volumes
- 02-04 12:39 in Podman kube play test restartPolicy
rawhide : sys podman rawhide rootless host sqlite
- PR AppleHV: update LastUp time #21246
  - 01-12 13:11 in [sys] [700] podman play

x	x	x	x	x	x
int(5)	podman(8)	rawhide(3)	rootless(8)	host(8)	sqlite(7)
sys(3)		fedora-39(3)			boltdb(1)
		fedora-39β(1)
		fedora-38(1)

The text was updated successfully, but these errors were encountered:

Luap99 · 2024-02-08T17:21:58Z

This error is new with my c/common network rework. Although I am very sure the bug is not there, I suspect for a very long time (also hinting from other flakes) that it is possible that we cleanup twice.
I just assume that the ENOENT flakes from #19721 can now look like this.

https://api.cirrus-ci.com/v1/artifact/task/6507554541404160/html/int-podman-fedora-39-rootless-host-sqlite.log.html#t--Podman-kube-play-test-with-reserved-init-annotation-in-yaml--1
also shows it together with the other errors that indicate the double cleanup.

edsantiago · 2024-02-29T13:13:15Z

All quiet since initial filing, but just saw this one, f38. The "No such device" error is new:

<+025ms> # $ podman pod rm test_pod
<+316ms> # time="2024-02-27T14:03:06-06:00" level=error msg="rootless netns ref counter out of sync, counter is at -1, resetting it back to 0"
         # time="2024-02-27T14:03:06-06:00" level=error msg="Unable to clean up network for container e8108e2498248fbcea80089edf4917ba133c8c2f4579212fde2509c9382f12cd: \"1 error occurred:\\n\\t* netavark: netavark encountered multiple errors:\\n\\t- remove aardvark entries: IO error: No such file or directory (os error 2)\\n\\t- failed to delete container veth eth0: Netlink error: No such device (os error 19)\\n\\n\""

edsantiago · 2024-03-04T14:26:04Z

After the merge of #18442, this is now happening in AfterEach

Luap99 · 2024-03-04T14:33:21Z

yeah once I have a bit more time I give you some diff for your flake retry PR to capture the full stack traces from within the rootless netns code to see how we end up there in cleanup twice by adding a ton of debugging code that should not be merged into main.

edsantiago · 2024-03-25T19:58:10Z

Checking in. Still rootless-only. Happening even with last week's VMs (pasta 2024-03-20).

fedora-38 : sys podman fedora-38 rootless host boltdb
- PR systests: kube play URL: workaround for ECONNREFUSED #21846
  - 02-27 15:05 in [sys] [700] podman play --service-container
- PR Upgrade tests: reenable, but revamped #21535
  - 02-08-2024 09:30 in [sys] [700] podman play --service-container
  - 03-05 21:31 in [sys] [700] podman play --service-container
fedora-39 : int podman fedora-39 rootless host sqlite
- 03-20 11:37 in TOP-LEVEL [AfterEach] Podman kube play test volumes-from annotation with source containers external
- 03-04 09:06 in TOP-LEVEL [AfterEach] Podman kube play test volumes-from annotation with source container in pod
- 02-06-2024 08:21 in Podman kube play test with reserved init annotation in yaml
- 01-30-2024 09:16 in Podman kube play override with tcp should keep udp from YAML file
- 01-18-2024 08:40 in Podman kube play secret as volume support - multiple volumes
fedora-39β : sys podman fedora-39β rootless host sqlite
- PR CI: test overlay and vfs #20161
  - 10-17-2023 15:39 in [sys] podman play --service-container
rawhide : int podman rawhide rootless host sqlite
- 03-23 21:49 in TOP-LEVEL [AfterEach] Podman kube generate with volume
- 02-06-2024 11:21 in Podman kube play secret as volume support - multiple volumes
- 02-04-2024 12:39 in Podman kube play test restartPolicy
rawhide : sys podman rawhide rootless host sqlite
- PR AppleHV: update LastUp time #21246
  - 01-12-2024 13:11 in [sys] [700] podman play

x	x	x	x	x	x
int(8)	podman(13)	fedora-39(5)	rootless(13)	host(13)	sqlite(10)
sys(5)		rawhide(4)			boltdb(3)
		fedora-38(3)
		fedora-39β(1)

edsantiago · 2024-05-15T13:26:27Z

Ping, hi. Here's one with two failure messages:

<+010ms> # $ podman pod rm test_pod
<+235ms> # time="2024-05-15T13:02:45Z" level=error msg="rootless netns ref counter out of sync, counter is at -1, resetting it back to 0"
         # time="2024-05-15T13:02:45Z" level=error msg="Unable to clean up network for container 65a4e407b7384fbc341a6c0bf538968e2c1aae528d0c5d635ac2c17cf4397203: \"1 error occurred:\\n\\t* netavark: netavark encountered multiple errors:\\n\\t- IO error: No such file or directory (os error 2)\\n\\t- failed to delete container veth eth0: Netlink error: No such device (os error 19)\\n\\n\""

The real failure is probably "unable to clean up network", which could be #19721 or could be yet another something new.

Current flake list (may be incomplete, because some of these may be filed under other issues)

debian-13 : sys podman debian-13 rootless host sqlite
- 05-15 09:06 in [sys] [700] podman play --service-container
fedora-38 : sys podman fedora-38 rootless host boltdb
- PR systests: kube play URL: workaround for ECONNREFUSED #21846
  - 02-27-2024 15:05 in [sys] [700] podman play --service-container
- PR Upgrade tests: reenable, but revamped #21535
  - 02-08-2024 09:30 in [sys] [700] podman play --service-container
  - 03-05-2024 21:31 in [sys] [700] podman play --service-container
fedora-39 : int podman fedora-39 rootless host sqlite
- 03-26-2024 14:31 in TOP-LEVEL [AfterEach] Podman kube play multi doc yaml with multiple services, pods and deployments
- 03-20-2024 11:37 in TOP-LEVEL [AfterEach] Podman kube play test volumes-from annotation with source containers external
- 03-04-2024 09:06 in TOP-LEVEL [AfterEach] Podman kube play test volumes-from annotation with source container in pod
- 02-06-2024 08:21 in Podman kube play test with reserved init annotation in yaml
- 01-30-2024 09:16 in Podman kube play override with tcp should keep udp from YAML file
- 01-18-2024 08:40 in Podman kube play secret as volume support - multiple volumes
fedora-39 : sys podman fedora-39 rootless host sqlite
- PR podman --runroot: remove 50 char length restriction #22277
  - 04-05-2024 09:42 in [sys] [700] podman play --service-container
rawhide : int podman rawhide rootless host sqlite
- 03-23-2024 21:49 in TOP-LEVEL [AfterEach] Podman kube generate with volume
- 02-06-2024 11:21 in Podman kube play secret as volume support - multiple volumes
- 02-04-2024 12:39 in Podman kube play test restartPolicy
rawhide : sys podman rawhide rootless host sqlite
- PR AppleHV: update LastUp time #21246
  - 01-12-2024 13:11 in [sys] [700] podman play

x	x	x	x	x	x
int(9)	podman(15)	fedora-39(7)	rootless(15)	host(15)	sqlite(12)
sys(6)		rawhide(4)			boltdb(3)
		fedora-38(3)
		debian-13(1)

Luap99 · 2024-05-15T13:51:34Z

The real error is something calls network cleanup twice AFAICT, which also applies to ##19721
I see all kube play test linked here so that at least gives me some point to start with I guess.

Any I really need full stack traces of both process calling cleanup but there is no good way of getting this as likely only one of them errors out like that so I cannot include in the error message.

edsantiago · 2024-07-09T13:35:07Z

This one is really blowing up lately. Maybe the new CI VMs, or maybe the tmpfs PR, or maybe something else entirely. Last 30 days:

debian-13 : int podman debian-13 rootless host sqlite
- PR CI: Use local cache registry #22726
  - 07-08 13:44 in Podman kube play test with reserved init annotation in yaml
debian-13 : sys podman debian-13 rootless host sqlite
- PR fix(deps): update module golang.org/x/term to v0.22.0 #23200
  - 07-04 19:02 in [sys] [700] podman play --service-container
- PR Fix hang boot podman #22985
  - 07-09 05:59 in [sys] [700] podman play --service-container
- PR CI: Use local cache registry #22726
  - 07-08 11:38 in [sys] [700] podman play --service-container
fedora-39 : sys podman fedora-39 rootless host boltdb
- PR libpod: first delete container then cidfile #23203
  - 07-05 05:14 in [sys] [700] podman play --service-container
- PR Fix hang boot podman #22985
  - 07-09 05:58 in [sys] [700] podman play --service-container
- PR test/system: Add test for journald log check in quadlet #22736
  - 06-24 14:19 in [sys] [700] podman play --service-container
- PR CI: Use local cache registry #22726
  - 07-02 15:27 in [sys] [700] podman play --service-container
fedora-40 : sys podman fedora-40 rootless host sqlite
- PR CI: Use local cache registry #22726
  - 07-09 09:11 in [sys] [700] podman play --service-container
rawhide : sys podman rawhide rootless host sqlite
- PR cirrus: add missing test/tools to danger files #23220
  - 07-08 05:40 in [sys] [700] podman play --service-container
- PR CI: Use local cache registry #22726
  - 07-08 18:12 in [sys] [700] podman play --service-container

x	x	x	x	x	x
sys(10)	podman(11)	debian-13(4)	rootless(11)	host(11)	sqlite(7)
int(1)		fedora-39(4)			boltdb(4)
		rawhide(2)
		fedora-40(1)

edsantiago · 2024-07-24T11:05:33Z

This one always seems to happen along with "Unable to clean up network" (#19721). For example:

<+011ms> # $ podman pod rm p-t645-r4ya3vwa
<+322ms> # time="2024-07-23T19:42:19Z" level=error msg="rootless netns ref counter out of sync, counter is at -1, resetting it back to 0"
         # time="2024-07-23T19:42:20Z" level=error msg="Unable to clean up network for container b60e5ea24ca3b2ddba89ffaf606c7dce5defe6dfc3a20f4c2fa8c836969a24aa: \"1 error occurred:\\n\\t* netavark: IO error: No such file or directory (os error 2)\\n\\n\""
         # b7c966087f24a711938bcf3cb8e4851dec813f14be5b5973d8da5fd91f4ff6bc
         # #/vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
         # #| FAIL: Command succeeded, but issued unexpected warnings
         # #\^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Are they the same bug? Should I close one in favor of the other, and assign them all to the same issue?

debian-13 : int podman debian-13 rootless host sqlite
- PR CI: Use local cache registry #22726
  - 07-08 13:44 in Podman kube play test with reserved init annotation in yaml
  - 07-11 13:43 in TOP-LEVEL [AfterEach] Podman kube play test with reserved privileged annotation in yaml
debian-13 : sys podman debian-13 rootless host sqlite
- PR CI: system tests: instrument to allow failure analysis #23378
  - 07-23 15:44 in [sys] [700] podman play --service-container
- PR WIP: system test parallelization: two-pass approach #23275
  - 07-23 17:30 in [sys] [700] podman play --service-container
- PR Using Visual Studio BuildTools as a MinGW alternative #23212
  - 07-08 12:04 in [sys] [700] podman play --service-container
- PR fix(deps): update module golang.org/x/term to v0.22.0 #23200
  - 07-04 19:02 in [sys] [700] podman play --service-container
- PR Fix hang boot podman #22985
  - 07-09 05:59 in [sys] [700] podman play --service-container
- PR CI: Use local cache registry #22726
  - 07-08 11:38 in [sys] [700] podman play --service-container
  - 07-11 23:28 in [sys] [700] podman play --service-container
  - 05-15-2024 09:06 in [sys] [700] podman play --service-container
fedora-38 : sys podman fedora-38 rootless host boltdb
- PR systests: kube play URL: workaround for ECONNREFUSED #21846
  - 02-27-2024 15:05 in [sys] [700] podman play --service-container
- PR Upgrade tests: reenable, but revamped #21535
  - 02-08-2024 09:30 in [sys] [700] podman play --service-container
  - 03-05-2024 21:31 in [sys] [700] podman play --service-container
fedora-39 : int podman fedora-39 rootless host boltdb
- 07-11 13:44 in TOP-LEVEL [AfterEach] Podman kube play test get all key-value pairs from optional secret as envs
fedora-39 : int podman fedora-39 rootless host sqlite
- 03-26-2024 14:31 in TOP-LEVEL [AfterEach] Podman kube play multi doc yaml with multiple services, pods and deployments
- 03-20-2024 11:37 in TOP-LEVEL [AfterEach] Podman kube play test volumes-from annotation with source containers external
- 03-04-2024 09:06 in TOP-LEVEL [AfterEach] Podman kube play test volumes-from annotation with source container in pod
- 02-06-2024 08:21 in Podman kube play test with reserved init annotation in yaml
- 01-30-2024 09:16 in Podman kube play override with tcp should keep udp from YAML file
fedora-39 : sys podman fedora-39 rootless host boltdb
- PR libpod: first delete container then cidfile #23203
  - 07-05 05:14 in [sys] [700] podman play --service-container
- PR Fix hang boot podman #22985
  - 07-09 05:58 in [sys] [700] podman play --service-container
- PR test/system: Add test for journald log check in quadlet #22736
  - 06-24-2024 14:19 in [sys] [700] podman play --service-container
- PR CI: Use local cache registry #22726
  - 07-02 15:27 in [sys] [700] podman play --service-container
  - 06-03-2024 13:36 in [sys] [700] podman play --service-container in 4856ms
fedora-39 : sys podman fedora-39 rootless host sqlite
- PR podman --runroot: remove 50 char length restriction #22277
  - 04-05-2024 09:42 in [sys] [700] podman play --service-container
fedora-40 : int podman fedora-40 rootless host sqlite
- 07-11 18:35 in TOP-LEVEL [AfterEach] Podman kube play support container startup probe
- 07-11 09:13 in TOP-LEVEL [AfterEach] Podman kube play test with reserved autoremove annotation in yaml
fedora-40 : sys podman fedora-40 rootless host sqlite
- PR System tests: safe container/image/volume/etc names #23280
  - 07-15 14:21 in [sys] [700] podman play --service-container
- PR WIP: system test parallelization: two-pass approach #23275
  - 07-23 20:39 in [sys] [700] podman play --service-container
- PR Make podman-compose refer to podman-compose(1) when using an external provider #23074
  - 07-12 02:03 in [sys] [700] podman play --service-container
- PR CI: Use local cache registry #22726
  - 07-09 09:11 in [sys] [700] podman play --service-container
  - 07-11 17:03 in [sys] [700] podman kube play --replace external storage
rawhide : int podman rawhide rootless host sqlite
- 07-12 10:22 in TOP-LEVEL [AfterEach] Podman kube play should not rename pod if container in pod has same name
- 07-12 10:22 in TOP-LEVEL [AfterEach] Podman kube play test with reserved Label annotation in yaml
- 07-11 16:52 in TOP-LEVEL [AfterEach] Podman kube play with configmap in multi-doc yaml uses env value
- 03-23-2024 21:49 in TOP-LEVEL [AfterEach] Podman kube generate with volume
- 02-06-2024 11:21 in Podman kube play secret as volume support - multiple volumes
- 02-04-2024 12:39 in Podman kube play test restartPolicy
rawhide : sys podman rawhide rootless host sqlite
- PR cirrus: add missing test/tools to danger files #23220
  - 07-08 05:40 in [sys] [700] podman play --service-container
- PR CI: Use local cache registry #22726
  - 07-08 18:12 in [sys] [700] podman play --service-container

x	x	x	x	x	x
sys(24)	podman(40)	fedora-39(12)	rootless(40)	host(40)	sqlite(31)
int(16)		debian-13(10)			boltdb(9)
		rawhide(8)
		fedora-40(7)
		fedora-38(3)

Luap99 · 2024-07-24T11:10:18Z

My guess is that they have the same cause, that said I also don't understand what is going on other than we are trying to cleanup the networks twice so I cannot be sure about that.

Anyhow looks like this has gotten much worse so I put this on my priority list for next week.

edsantiago · 2024-07-29T11:57:00Z

Still happening frequently, even in non-Ed PRs. Here's one new error message I hadn't seen before (the middle one, aardvark netns):

<+024ms> # $ podman pod rm p-t115-2c0yuhvq
<+294ms> # time="2024-07-19T07:08:01-05:00" level=error msg="rootless netns ref counter out of sync, counter is at -1, resetting it back to 0"
         # [ERROR netavark::dns::aardvark] aardvark-dns runs in a different netns, dns will not work for this container. To resolve please stop all containers, kill the aardvark-dns process, remove the /run/user/2415/containers/networks/aardvark-dns directory and then start the containers again
         # time="2024-07-19T07:08:01-05:00" level=error msg="Unable to clean up network for container 6b7b957f005d00b8d66cfd93b3fe7393c871ce858708411cdba8e77d394855c4: \"1 error occurred:\\n\\t* netavark: failed to get bridge interface: Netlink error: No such device (os error 19)\\n\\n\""

This was in my parallel-bats PR

Luap99 · 2024-07-29T12:19:58Z

I have been looking at the errors for hours today without having any solid idea what is causing this stuff.

I found one issue around podman unshare --rootless-netns handling whent he command there exits > 0

ERRO[0005] rootless netns ref counter out of sync, counter is at -1, resetting it back to 0 
ERRO[0005] Failed to decrement ref count: rootless netns: write ref counter: open /run/user/1000/containers/networks/rootless-netns/ref-count: no such file or directory

But this is not directly related to the symptoms here. It is very clear from the error sequence that something is calling cleanupNetwork() twice but this should practically be impossible unless the cleanup process was killed after it called netavark but before it wrote the netns to the db (I don't see that happening though).

Your new error looks even weirder. Because netavark tries to delete the veth interface first (and there is no error logged about it so it worked) and only then the bridge (which failed).
Given the aardvark-dns warning what happened is that something deleted the rootless-netns even though it was still in use then the cleanup created a new one and then no longer finding the expected interfaces there... So in short something decrementing the ref counter more than it should which well is what we are seeing in the other errors as well...

Luap99 · 2024-07-29T12:21:30Z

It would be great if we manage to get a local reproducer, I tried running the commands from the failed tests without luck.

edsantiago · 2024-07-29T12:32:58Z

It only fails in kube-related tests. Eliminating int tests and my bats-parallel PR, I see failures only in 700.bats in the podman play --service-container test. When I look there, I see a timeout $PODMAN ... &>/dev/null &. I will look into capturing and displaying that log.

Luap99 · 2024-07-29T12:38:25Z

That sounds like a good idea

edsantiago · 2024-07-29T19:35:36Z

Well, that was useless:

         # cat /tmp/podman-kube-bg.log:
         # Pod:
         # 8b00d6fa17b684c234e804ba462894bee356649d3ea82e4c3289a85f62b3f063
         # Container:
         # 66f79a26bb2ba4110fc21dfd0bfac8fe5a07bda18e2c6aaf7524952a33d14152
         #
         #
<+015ms> # $ podman pod rm p-t645-xt1p2kpr
<+125ms> # time="2024-07-29T19:15:19Z" level=error msg="rootless netns ref counter out of sync, counter is at -1, resetting it back to 0"
         # time="2024-07-29T19:15:19Z" level=error msg="Unable to clean up network for container 38ac0841b08c00a8ab849f1fb84dd496ebfc7534a8879bda57c6c3796673ab73: \"1 error occurred:\\n\\t* netavark: netavark encountered multiple errors:\\n\\t- IO error: No such file or directory (os error 2)\\n\\t- failed to delete container veth eth0: Netlink error: No such device (os error 19)\\n\\n\""
         # 8b00d6fa17b684c234e804ba462894bee356649d3ea82e4c3289a85f62b3f063

Looking at the journal (search in-page for 8b00d6) I see this sequence:

Jul 29 19:15:19 cirrus-task-4883090584109056 kernel: podman1: port 1(veth0) entered disabled state
Jul 29 19:15:19 cirrus-task-4883090584109056 kernel: veth0 (unregistering): left allmulticast mode
Jul 29 19:15:19 cirrus-task-4883090584109056 kernel: veth0 (unregistering): left promiscuous mode
Jul 29 19:15:19 cirrus-task-4883090584109056 kernel: podman1: port 1(veth0) entered disabled state
Jul 29 19:15:19 cirrus-task-4883090584109056 systemd[3945]: Started podman-187478.scope.
....
Jul 29 19:15:19 cirrus-task-4883090584109056 podman[187507]: 2024-07-29 19:15:19.708555188 +0000 UTC m=+0.071896479 container remove 66f79a26bb2ba4110fc21dfd0bfac8fe5a07bda18e2c6aaf7524952a33d14152 (image=quay.io/libpod/testimage:20240123, name=p-t645-xt1p2kpr-c-t645-xt1p2kpr, pod_id=8b00d6fa17b684c234e804ba462894bee356649d3ea82e4c3289a85f62b3f063, app=test, created_at=2024-01-24T18:38:28Z, created_by=test/system/build-testimage, io.buildah.version=1.34.1-dev)
Jul 29 19:15:19 cirrus-task-4883090584109056 podman[187507]: 2024-07-29 19:15:19.731089306 +0000 UTC m=+0.094430598 container remove 38ac0841b08c00a8ab849f1fb84dd496ebfc7534a8879bda57c6c3796673ab73 (image=localhost/podman-pause:5.2.0-dev-1722278401, name=8b00d6fa17b6-infra, pod_id=8b00d6fa17b684c234e804ba462894bee356649d3ea82e4c3289a85f62b3f063, io.buildah.version=1.37.0-dev)
Jul 29 19:15:19 cirrus-task-4883090584109056 podman[187507]: 2024-07-29 19:15:19.738002418 +0000 UTC m=+0.101343711 pod remove 8b00d6fa17b684c234e804ba462894bee356649d3ea82e4c3289a85f62b3f063 (image=, name=p-t645-xt1p2kpr)

...that is, shut down veth0 before the pod-rm. And some systemd stuff nearby. Is it possible that systemd is engaging in some unwanted cleanup/skullduggery?

Luap99 · 2024-07-30T08:43:12Z

well cleanup happens when the ctr stops not when it is removed so it is expected that interfaces are removed before the actual container/pod is removed.

What is interesting here is to look for 38ac0841b08c00a8ab849f1fb84dd496ebfc7534a8879bda57c6c3796673ab73 in the journal which is the infra ctr id.

podman[187012]: 2024-07-29 19:15:16.973147745 +0000 UTC m=+0.320669136 container create 38ac0841b08c00a8ab849f1fb84dd496ebfc7534a8879bda57c6c3796673ab73 (image=localhost/podman-pause:5.2.0-dev-1722278401, name=8b00d6fa17b6-infra, pod_id=8b00d6fa17b684c234e804ba462894bee356649d3ea82e4c3289a85f62b3f063, io.buildah.version=1.37.0-dev)
systemd[3945]: Started libpod-conmon-38ac0841b08c00a8ab849f1fb84dd496ebfc7534a8879bda57c6c3796673ab73.scope.
systemd[3945]: Started libpod-38ac0841b08c00a8ab849f1fb84dd496ebfc7534a8879bda57c6c3796673ab73.scope - libcrun container.
podman[187012]: 2024-07-29 19:15:17.273375993 +0000 UTC m=+0.620897395 container init 38ac0841b08c00a8ab849f1fb84dd496ebfc7534a8879bda57c6c3796673ab73 (image=localhost/podman-pause:5.2.0-dev-1722278401, name=8b00d6fa17b6-infra, pod_id=8b00d6fa17b684c234e804ba462894bee356649d3ea82e4c3289a85f62b3f063, io.buildah.version=1.37.0-dev)
podman[187012]: 2024-07-29 19:15:17.27937793 +0000 UTC m=+0.626899335 container start 38ac0841b08c00a8ab849f1fb84dd496ebfc7534a8879bda57c6c3796673ab73 (image=localhost/podman-pause:5.2.0-dev-1722278401, name=8b00d6fa17b6-infra, pod_id=8b00d6fa17b684c234e804ba462894bee356649d3ea82e4c3289a85f62b3f063, io.buildah.version=1.37.0-dev)
conmon[187127]: conmon 38ac0841b08c00a8ab84 <nwarn>: Failed to open cgroups file: /sys/fs/cgroup/user.slice/user-5174.slice/user@5174.service/user.slice/user-libpod_pod_8b00d6fa17b684c234e804ba462894bee356649d3ea82e4c3289a85f62b3f063.slice/libpod-38ac0841b08c00a8ab849f1fb84dd496ebfc7534a8879bda57c6c3796673ab73.scope/container/memory.events
podman[187198]: 2024-07-29 19:15:17.860726101 +0000 UTC m=+0.125663852 container died 38ac0841b08c00a8ab849f1fb84dd496ebfc7534a8879bda57c6c3796673ab73 (image=localhost/podman-pause:5.2.0-dev-1722278401, name=8b00d6fa17b6-infra, io.buildah.version=1.37.0-dev)
podman[187198]: 2024-07-29 19:15:18.132708377 +0000 UTC m=+0.397646117 container cleanup 38ac0841b08c00a8ab849f1fb84dd496ebfc7534a8879bda57c6c3796673ab73 (image=localhost/podman-pause:5.2.0-dev-1722278401, name=8b00d6fa17b6-infra, pod_id=8b00d6fa17b684c234e804ba462894bee356649d3ea82e4c3289a85f62b3f063, io.buildah.version=1.37.0-dev)
podman[187306]: 2024-07-29 19:15:18.647994364 +0000 UTC m=+0.109348535 container restart 38ac0841b08c00a8ab849f1fb84dd496ebfc7534a8879bda57c6c3796673ab73 (image=localhost/podman-pause:5.2.0-dev-1722278401, name=8b00d6fa17b6-infra, pod_id=8b00d6fa17b684c234e804ba462894bee356649d3ea82e4c3289a85f62b3f063, io.buildah.version=1.37.0-dev)
systemd[3945]: Started libpod-conmon-38ac0841b08c00a8ab849f1fb84dd496ebfc7534a8879bda57c6c3796673ab73.scope.
systemd[3945]: Started libpod-38ac0841b08c00a8ab849f1fb84dd496ebfc7534a8879bda57c6c3796673ab73.scope - libcrun container.
podman[187306]: 2024-07-29 19:15:18.856200067 +0000 UTC m=+0.317554231 container init 38ac0841b08c00a8ab849f1fb84dd496ebfc7534a8879bda57c6c3796673ab73 (image=localhost/podman-pause:5.2.0-dev-1722278401, name=8b00d6fa17b6-infra, pod_id=8b00d6fa17b684c234e804ba462894bee356649d3ea82e4c3289a85f62b3f063, io.buildah.version=1.37.0-dev)
podman[187306]: 2024-07-29 19:15:18.860850626 +0000 UTC m=+0.322204787 container start 38ac0841b08c00a8ab849f1fb84dd496ebfc7534a8879bda57c6c3796673ab73 (image=localhost/podman-pause:5.2.0-dev-1722278401, name=8b00d6fa17b6-infra, pod_id=8b00d6fa17b684c234e804ba462894bee356649d3ea82e4c3289a85f62b3f063, io.buildah.version=1.37.0-dev)
podman[187440]: 2024-07-29 19:15:19.342933574 +0000 UTC m=+0.070542825 container kill 38ac0841b08c00a8ab849f1fb84dd496ebfc7534a8879bda57c6c3796673ab73 (image=localhost/podman-pause:5.2.0-dev-1722278401, name=8b00d6fa17b6-infra, pod_id=8b00d6fa17b684c234e804ba462894bee356649d3ea82e4c3289a85f62b3f063, io.buildah.version=1.37.0-dev)
conmon[187374]: conmon 38ac0841b08c00a8ab84 <nwarn>: Failed to open cgroups file: /sys/fs/cgroup/user.slice/user-5174.slice/user@5174.service/user.slice/user-libpod_pod_8b00d6fa17b684c234e804ba462894bee356649d3ea82e4c3289a85f62b3f063.slice/libpod-38ac0841b08c00a8ab849f1fb84dd496ebfc7534a8879bda57c6c3796673ab73.scope/container/memory.events
podman[187440]: 2024-07-29 19:15:19.381872905 +0000 UTC m=+0.109482177 container died 38ac0841b08c00a8ab849f1fb84dd496ebfc7534a8879bda57c6c3796673ab73 (image=localhost/podman-pause:5.2.0-dev-1722278401, name=8b00d6fa17b6-infra, io.buildah.version=1.37.0-dev)
systemd[3945]: Stopping libpod-conmon-38ac0841b08c00a8ab849f1fb84dd496ebfc7534a8879bda57c6c3796673ab73.scope...
systemd[3945]: Stopped libpod-conmon-38ac0841b08c00a8ab849f1fb84dd496ebfc7534a8879bda57c6c3796673ab73.scope.
podman[187507]: 2024-07-29 19:15:19.731089306 +0000 UTC m=+0.094430598 container remove 38ac0841b08c00a8ab849f1fb84dd496ebfc7534a8879bda57c6c3796673ab73 (image=localhost/podman-pause:5.2.0-dev-1722278401, name=8b00d6fa17b6-infra, pod_id=8b00d6fa17b684c234e804ba462894bee356649d3ea82e4c3289a85f62b3f063, io.buildah.version=1.37.0-dev)

The important bit here is that there is no cleanup event after it got restarted and killed. Generally we expect

container died
container cleanup

But in this case there is no cleanup event at all suggesting that something went wrong during cleanup (possibly the process was killed or we hit some error but neither is logged anywhere in the journal so it is impossible to know)
The the pod rm command sees the container and thinks cleanup wasn't done yet and tries again triggering the error.

In theory when syslog is set the cleanup process should log its errors to syslog (journald) so we can have a look at the errors in CI. Without it podman container cleanup errors will never be logged anywhere. In order to rey to debug containers#21569 Signed-off-by: Paul Holzinger <pholzing@redhat.com>

edsantiago · 2024-07-30T16:36:22Z

Reproducer! Not a great one, complicated to set up yadda yadda, but here goes:

1mt with podman source tree, make, etc etc
edit test/system/helpers.bash, add return to the very top of leak_check() and clean_setup()
edit test/system/700*.bats, rename teardown() to cecinestpasteardown() or whatever. Just so it does not get called.
Two rootless terms. In each, run:

$ while :;do hack/bats --rootless 700:service-conta || break;done

Reproduces the out of sync error within minutes for me.

Then instrument podman to your heart's content.

HTH

When using service containers and play kube we create a complicated set of dependencies. First in a pod all conmon/container cgroups are part of one slice, that slice will be removed when the entire pod is stopped resulting in systemd killing all processes that were part in it. Now the issue here is around the working of stopPodIfNeeded() and stopIfOnlyInfraRemains(), once a container is cleaned up it will check if the pod should be stopped depending on the pod ExitPolicy. If this is the case it wil stop all containers in that pod. However in our flaky test we calle podman pod kill which logically killed all containers already. Thus the logic now thinks on cleanup it must stop the pod and calls into pod.stopWithTimeout(). Then there we try to stop but because all containers are already stopped it just throws errors and never gets to the point were it would call Cleanup(). So the code does not do cleanup and eventually calls removePodCgroup() which will cause all conmon and other podman cleanup processes of this pod to be killed. Thus the podman container cleanup process was likely killed while actually trying to the the proper cleanup which leaves us in a bad state. Following commands such as podman pod rm will try to the cleanup again as they see it was not completed but then fail as they are unable to recover from the partial cleanup state. Long term network cleanup needs to be more robust and ideally should be idempotent to handle cases were cleanup was killed in the middle. Fixes containers#21569 Signed-off-by: Paul Holzinger <pholzing@redhat.com>

In theory when syslog is set the cleanup process should log its errors to syslog (journald) so we can have a look at the errors in CI. Without it podman container cleanup errors will never be logged anywhere. In order to rey to debug containers#21569 Signed-off-by: Paul Holzinger <pholzing@redhat.com>

edsantiago added flakes Flakes from Continuous Integration rootless labels Feb 8, 2024

sbrivio-rh mentioned this issue Feb 14, 2024

Vendor c/common pasta branch for testing #21563

Merged

edsantiago mentioned this issue Jul 29, 2024

pkg/api: do not leak config pointers into specgen #23428

Merged

Luap99 mentioned this issue Jul 30, 2024

test/system: podman play --service-container slog to syslog #23450

Closed

Luap99 mentioned this issue Jul 31, 2024

fix network cleanup flake in play kube #23457

Merged

mheon closed this as completed in #23457 Jul 31, 2024

stale-locking-app bot added the locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. label Nov 11, 2024

stale-locking-app bot locked as resolved and limited conversation to collaborators Nov 11, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

podman stop: rootless netns ref counter out of sync, counter is at -1, resetting it back to 0 #21569

podman stop: rootless netns ref counter out of sync, counter is at -1, resetting it back to 0 #21569

edsantiago commented Feb 8, 2024

Luap99 commented Feb 8, 2024

edsantiago commented Feb 29, 2024

edsantiago commented Mar 4, 2024

Luap99 commented Mar 4, 2024

edsantiago commented Mar 25, 2024

edsantiago commented May 15, 2024

Luap99 commented May 15, 2024

edsantiago commented Jul 9, 2024

edsantiago commented Jul 24, 2024

Luap99 commented Jul 24, 2024

edsantiago commented Jul 29, 2024

Luap99 commented Jul 29, 2024

Luap99 commented Jul 29, 2024

edsantiago commented Jul 29, 2024

Luap99 commented Jul 29, 2024

edsantiago commented Jul 29, 2024

Luap99 commented Jul 30, 2024 •

edited

Loading

edsantiago commented Jul 30, 2024

podman stop: rootless netns ref counter out of sync, counter is at -1, resetting it back to 0 #21569

podman stop: rootless netns ref counter out of sync, counter is at -1, resetting it back to 0 #21569

Comments

edsantiago commented Feb 8, 2024

Luap99 commented Feb 8, 2024

edsantiago commented Feb 29, 2024

edsantiago commented Mar 4, 2024

Luap99 commented Mar 4, 2024

edsantiago commented Mar 25, 2024

edsantiago commented May 15, 2024

Luap99 commented May 15, 2024

edsantiago commented Jul 9, 2024

edsantiago commented Jul 24, 2024

Luap99 commented Jul 24, 2024

edsantiago commented Jul 29, 2024

Luap99 commented Jul 29, 2024

Luap99 commented Jul 29, 2024

edsantiago commented Jul 29, 2024

Luap99 commented Jul 29, 2024

edsantiago commented Jul 29, 2024

Luap99 commented Jul 30, 2024 • edited Loading

edsantiago commented Jul 30, 2024

Luap99 commented Jul 30, 2024 •

edited

Loading