
Tests are failing with timeout error #425

Closed
inknos opened this issue Sep 12, 2024 · 15 comments · Fixed by #428

@inknos
Contributor

inknos commented Sep 12, 2024

Reproduces in GitHub CI and locally:

  1. pull upstream
  2. make tests

Tests get stuck at podman/tests/integration/test_containers.py::ContainersIntegrationTest::test_container_crud

See https://github.com/containers/podman-py/pull/420/checks?check_run_id=28860969899

@inknos inknos self-assigned this Sep 12, 2024
@inknos inknos changed the title from "Tiests are failing in a timeout" to "Tiests are failing with timeout error" Sep 12, 2024
@inknos inknos changed the title from "Tiests are failing with timeout error" to "Tests are failing with timeout error" Sep 12, 2024
@inknos
Contributor Author

inknos commented Sep 16, 2024

After bisecting podman, this is the first commit where the podman-py tests hang: containers/podman@8a94331

@rhatdan
Member

rhatdan commented Sep 16, 2024

@Luap99 PTAL

@Luap99
Member

Luap99 commented Sep 16, 2024

I am not familiar with the Python code here, but if you do not call the wait endpoint then the linked commit should not change anything, as the code is only used there, unless I am overlooking something.

Looking at your reproducer, it doesn't even start the container, so there is no reason to call wait either, and I cannot reproduce with your small Python example either, unless I did something wrong? I just copied your reproducer to a file in this repo and ran it.

@inknos
Contributor Author

inknos commented Sep 16, 2024

Apologies, I already deleted the comment with the reproducer because it was incorrect.

if you do not call the wait endpoint then the linked commit should not change anything, as the code is only used there, unless I am overlooking something

Your comment shed some light; now I actually have something that can reproduce it:

from podman import PodmanClient

uri = "unix:///run/user/100/podman/podman.sock"
client = PodmanClient(base_url=uri)

c = client.containers.create('nginx:latest', name="test", detach=True)
c.start()

works

c = client.containers.run('nginx:latest', name="test", detach=True)

hangs

I think I know where to look now. run calls start, wait and reload in sequence
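
Roughly, the detached code path amounts to the following (a simplified sketch, not the actual podman-py source; run_detached_sketch is just an illustrative name):

from podman import PodmanClient

def run_detached_sketch(client: PodmanClient, image: str, **kwargs):
    # Approximation of what containers.run(..., detach=True) does:
    container = client.containers.create(image, detach=True, **kwargs)
    container.start()
    # This is the wait call that hangs against the affected podman commit,
    # even though the container is already running at this point.
    container.wait(condition=["running", "exited"])
    container.reload()
    return container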

@Luap99
Member

Luap99 commented Sep 16, 2024

Yes, that makes more sense. I was able to reproduce with the test case once I figured out the dependencies.

I see container.wait(condition=["running", "exited"]) being called there, but I really, really do not understand the purpose of this call. After start is done it is safe to assume the container is running (or it could be in any other state, since in the meantime the process could have exited or another podman command might have changed the state), so this call doesn't seem to do anything useful. In fact, the caller asked for detach and we still wait?

@inknos
Contributor Author

inknos commented Sep 16, 2024

Not sure about this line. I found new issues related to getting the right status: it seems the call will indeed wait forever, since the status looks to be created every time.

>>> c = client.containers.create('nginx:latest', detach=True)
>>> c.status
'created'
>>> c.start()
>>> c.status
'created'

@Luap99
Member

Luap99 commented Sep 16, 2024

In theory exited cannot happen in your reproducer; however, the wait condition should trigger on running as well, so something must be wrong with that. The logic should return as soon as any of the states is hit, yet it currently blocks forever.

I see what is causing this on the podman side and will look into a fix tomorrow.

However, I would recommend you drop the wait call there; it is not doing anything useful AFAICS.

the status looks to be created every time.

You will have to call reload(), because the Python code of course doesn't know about state changes on the podman side. The API has no real way to update the state fields on its own, so you must do an explicit reload(), which in turn does an inspect on the server and then updates the attrs field with the result. c.status itself just returns self.attrs["State"]["Status"].
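
For example (a minimal sketch; the socket path and container name are just placeholders):

from podman import PodmanClient

client = PodmanClient(base_url="unix:///run/user/1000/podman/podman.sock")

c = client.containers.create("nginx:latest", name="reload-demo")
c.start()
print(c.status)  # still 'created': attrs were cached when the object was built
c.reload()       # performs an inspect on the server and refreshes c.attrs
print(c.status)  # now reflects the server-side state, e.g. 'running'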

@jwhonce
Member

jwhonce commented Sep 16, 2024

@Luap99 The wait call blocks until the container on the server is in the correct state. So a reload() is not required. These semantics are meant to be compatible.

@Luap99
Member

Luap99 commented Sep 16, 2024

@Luap99 The wait call blocks until the container on the server is in the correct state.

Well, if you call start, the container is already running on the server; calling wait for running afterwards is just pointless.

@inknos
Contributor Author

inknos commented Sep 16, 2024

Should the run function then call wait only in certain cases? I can see that it does not make much sense for a detached container, but maybe it does for a container which is not detached. In my opinion, the behaviors should be consistent, so I would cut it, tbh.

I think wait makes a lot of sense for scripting purposes, but not within the run function context. After all, isn't run just create + start?

@Luap99
Member

Luap99 commented Sep 16, 2024

In the non-detached case it already calls wait() later (without args, so it waits for exit), which is correct: exit_status = container.wait(). I think run should largely follow what podman run is doing.
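
Roughly, the non-detached path amounts to this (again a simplified sketch, not the actual library code):

from podman import PodmanClient

def run_attached_sketch(client: PodmanClient, image: str, **kwargs):
    # Approximation of containers.run(..., detach=False):
    container = client.containers.create(image, **kwargs)
    container.start()
    exit_status = container.wait()  # no condition: block until the container exits
    logs = container.logs()         # collect output once the container is done
    return exit_status, logs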

@jwhonce
Member

jwhonce commented Sep 16, 2024

I plead very bad git history, sorry guys. I added that condition without explanation. I want to say the idea was to have the reload() be correct and that there was a race condition; I infer this because the wait() was added along with a comment about the integration tests.

Over the last three years the server semantics may have changed enough that the race condition no longer exists.

@inknos
Contributor Author

inknos commented Sep 17, 2024

@jwhonce, we recently had an update in our docs where reload() was documented explicitly as the way to get the status of a container. I believe wait can be removed, and maybe reload too.

Actually, to improve getting the correct status, what about calling reload directly within the status property?

@Luap99
Member

Luap99 commented Sep 17, 2024

Actually, to improve getting the correct status, what about calling reload directly within the status property?

Note that an inspect call is rather expensive. I am not sure how the code is used in general, but if one writes if c.status == "foo" || c.status == "bar", then having the status change in between might be a bad thing?
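
A minimal sketch of that concern, assuming status did an implicit reload() on each access (the socket path is a placeholder):

from podman import PodmanClient

client = PodmanClient(base_url="unix:///run/user/1000/podman/podman.sock")
c = client.containers.create("nginx:latest")
c.start()

# With an auto-reloading status property, each read below would be a separate
# inspect call on the server, so the two reads could observe different states.
if c.status == "created" or c.status == "running":
    # The container could exit between the two reads, so this branch would not
    # reflect a single consistent snapshot of the container state.
    print("container looked alive at the time of the check")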

@inknos
Contributor Author

inknos commented Sep 17, 2024

Yeah, that can be an issue. Let's not make the status property too smart.

Luap99 added a commit to Luap99/libpod that referenced this issue Sep 17, 2024
As it turns out, things are not so simple after all...
In podman-py it was reported[1] that waiting might hang. Per our docs, wait
on multiple conditions should exit once the first one is hit, not all
of them. However, because the new wait logic never checked whether the
context was cancelled, the goroutine kept running until conmon exited, and
because we used a waitgroup to wait for all of them to finish, it blocked
until that happened.

First, we can remove the waitgroup, as we only need to wait for one of
them anyway via the channel. While this alone fixes the hang, it would
still leak the other goroutine. As there is no way to cancel a goroutine
from the outside, the code must check for a cancelled context in the wait
loop to not leak.

Fixes 8a94331 ("libpod: simplify WaitForExit()")
[1] containers/podman-py#425

Signed-off-by: Paul Holzinger <pholzing@redhat.com>
inknos added a commit to inknos/podman-py that referenced this issue Sep 25, 2024
Fixes: containers#425

Signed-off-by: Nicola Sella <nsella@redhat.com>