podman-remote rmi: device or resource busy #3870
Thanks Ed. Thinking about this more, I'm not seeing those particular failures anywhere else. So there must be something special about PR #3754. That could simply be new/different VM images, including images with embedded bugs. Note: The two failures are on F29 and F30 (not Ubuntu) images:
I'll try your reproducer (above) using those images vs what we have on master... |
@edsantiago good news is, I was able to reproduce the error using your script above and the F30 VM image from my PR. Trying the VM image from master (expecting no error) now... |
...as expected, no error using the VM image from master (fedora-30-libpod-5751722641719296). Excellent, so now (hopefully) it's a matter of comparing differences. Since the libpod code is common to both VMs and they're running in the same cloud environment, there must be a configuration or packaging difference or something like that. |
Same on both VMs:
hrmmmmm. |
Ran reproducer again, and it worked fine for a while, failing after maybe 10 passes. So there's def. some race condition here that's somehow more likely with the images in my PR vs from master, even with the exact same libpod code 😖 Maybe I should try taking the podman and podman-remote binaries from the breaking-VM over to the working-VM to see if the behavior moves/changes? Not sure what else could be involved here, @mheon have any ideas? |
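To put a number on "failing after maybe 10 passes", a small loop harness can count how many iterations succeed before the first failure. This is only a sketch and is not Ed's reproducer script (which isn't shown here); `count_passes` and `one_pass` are hypothetical names, and what goes into one pass is left to the caller.

```shell
#!/bin/sh
# count_passes MAX CMD [ARGS...]
# Run CMD up to MAX times, stopping at the first failure.
# Prints the number of successful passes; returns 0 only if all MAX passed.
# (Hypothetical helper; the _cp_ variables are prefixed because plain sh
# has no portable "local".)
count_passes() {
    _cp_max=$1
    shift
    _cp_passes=0
    while [ "$_cp_passes" -lt "$_cp_max" ]; do
        # Discard the command's output; only its exit status matters here.
        "$@" >/dev/null 2>&1 || break
        _cp_passes=$((_cp_passes + 1))
    done
    echo "$_cp_passes"
    [ "$_cp_passes" -eq "$_cp_max" ]
}

# Hypothetical usage: one_pass would be a function wrapping a single
# build / run --rm / rmi -f cycle against podman-remote.
# count_passes 50 one_pass
```

Running the same harness on both VM images would give a rough failure rate to compare, rather than a yes/no result from a single run.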
confirmed: it's not the binaries, behavior remains the same after copying from breaking -> working VM. Noticed this difference though, the breaking VM is running |
JJF**($!!343$hg80*&#//**** 😠 updated both VMs to |
This could definitely cause a race. |
@mheon Is my assumption correct? Does podman container cleanup take care of the --rm as well? |
Thanks @rhatdan. If this is the case, would it be possible to have And if that's not possible, I'd like to request a clear documentation update stating something like:
Yes I agree it should be cleaned up, but I am not sure if my theory is correct. |
Well @edsantiago is seeing what looks like a race when using the varlink connection. |
Small correction: I've never seen it; only @cevich has seen it. |
@mheon wrong answer. The correct answer is: Oh, I saw this problem yesterday and just merged a fix 😀 @mheon remember this is podman-remote -> podman. Perhaps podman-remote is exiting before podman/conmon finish their job? With Ed's reproducer script, this very reliably reproduces for me (using hack/get_ci_vm.sh on the images mentioned). Would running in debug mode help at all? How can we prove/disprove Dan's theory? Or maybe is there a way we can tell the audit subsystem to log process-exits (with timestamps)? |
@mheon I managed to catch this in the act by instrumenting the system-test that reproduces it:
--- a/test/system/070-build.bats
+++ b/test/system/070-build.bats
@@ -26,9 +26,11 @@ EOF
 PODMAN_TIMEOUT=240 run_podman build -t build_test --format=docker $tmpdir
 is "$output" ".*STEP 4: COMMIT" "COMMIT seen in log"
+ run_podman ps -a
 run_podman run --rm build_test cat /$rand_filename
 is "$output" "$rand_content" "reading generated file in image"
+ run_podman ps -a
 run_podman rmi -f build_test
 }
You can see what the test is doing in the diff. When it reproduces, it happens on that last rmi -f line:
So clearly a race here is possible, but perhaps (somehow) more likely when podman-remote is being used? is it impossible for the |
Have we ever seen this without remote being involved? This should not be possible for local Podman. |
@mheon As best as I can recall, I have not seen it. Even with podman-remote it doesn't always happen. Is it possible for podman-remote to special-case this, and verify removal happens before exiting (assuming there was no networking-error)? |
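Until the remote client itself waits for removal, a caller could work around the race by polling until the container is actually gone before running rmi. A minimal sketch, not a fix in podman-remote itself: `wait_until` and `no_containers_left` are hypothetical helpers, and the usage comment assumes that an empty `podman-remote ps -a -q` listing means the `--rm` cleanup has finished.

```shell
#!/bin/sh
# wait_until ATTEMPTS DELAY CMD [ARGS...]
# Retry CMD until it succeeds, sleeping DELAY seconds between tries.
# Returns 0 on the first success, 1 if all ATTEMPTS fail.
# (Hypothetical helper; _wu_ prefixes stand in for "local" variables.)
wait_until() {
    _wu_attempts=$1
    _wu_delay=$2
    shift 2
    _wu_i=0
    while [ "$_wu_i" -lt "$_wu_attempts" ]; do
        "$@" && return 0
        _wu_i=$((_wu_i + 1))
        # Only sleep if another attempt is coming.
        if [ "$_wu_i" -lt "$_wu_attempts" ]; then
            sleep "$_wu_delay"
        fi
    done
    return 1
}

# Hypothetical usage: wait for the --rm container to disappear,
# then remove the image.
# no_containers_left() { [ -z "$(podman-remote ps -a -q)" ]; }
# wait_until 10 1 no_containers_left && podman-remote rmi -f build_test
```

Polling like this only hides the race from the caller; it does not say where in the remote path the wait is missing.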
Looking at the difference between pkg/adapter/containers.go and pkg/adapter/containers_remote.go.
And containers_remote.go does not. I would guess this is the issue. |
Oh. That sounds like it. |
Oh that's a HUGE relief, thanks @rhatdan ! |
Closing as #3934 is merged |
@cevich is seeing frequent system-test errors [1] with podman-remote
This is (I think) Ubuntu. I cannot reproduce on my Fedora laptop, even running the varlink server with nice 19 and even with cores disabled to emulate a slow system. My attempted reproducer is a simplified variant of the build system test:
My only guess is that this might be a race condition with podman-remote run --rm: does anyone know if it is guaranteed to block?
master @ 1ad8fe5. I am sorry for not being able to reproduce it but Chris seems to run into it constantly.