Kata-runtime getting stuck, causing CRI-O to hang #396
Interestingly, looking at the logs when the
I'm not sure why the CLI is hanging...
Hi @jamiehannaford - these logs don't appear to be from the same build - note the
Again, it would be helpful if you could re-run with full debug enabled as you're only seeing some of the messages:
@jodh-intel I originally installed using apt-get install (as documented here). The binary path which is set in I only cloned this repo and ran Both
I think I have turned on full debug:
Is that incorrect?
Yes - there should be four
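Roughly speaking, in a stock kata-runtime configuration.toml the four debug switches look like this (a sketch only; section names can differ between versions):

```toml
[hypervisor.qemu]
enable_debug = true

[proxy.kata]
enable_debug = true

[shim.kata]
enable_debug = true

[runtime]
enable_debug = true
```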
@jodh-intel Looking at the TOML file I just pasted, there are 4: Also, I think
Any other container ID works. Some questions:
It seems the qemu-lite process for that container isn't running. Here's what a healthy container looks like:
Here's the faulty container causing things to hang:
Is there anything which reaps the proxy, shim, and conmon processes? Because without that, I assume CRI-O will just keep on requesting the container's state and waiting forever.
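A quick way to spot that situation (qemu gone but the helper processes still around) is to check which per-container processes are still alive. A rough sketch, assuming each of these processes carries the container ID on its command line, which is typically the case:

```go
package main

import (
	"fmt"
	"os/exec"
)

func main() {
	sandboxID := "d4f1c3" // hypothetical container/sandbox ID
	for _, name := range []string{"qemu", "kata-proxy", "kata-shim", "conmon"} {
		// pgrep -f matches against the full command line, which for these
		// processes normally includes the sandbox ID; -a also prints it.
		out, err := exec.Command("pgrep", "-af", name+".*"+sandboxID).Output()
		if err != nil {
			fmt.Printf("%-11s not running\n", name)
			continue
		}
		fmt.Printf("%-11s %s", name, out)
	}
}
```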
Ah sorry! I asked if it was incorrect and you said Yes 😛 So it seems like this is happening: the qemu-lite process got killed for whatever reason, but all the other
Rebooting the instance fixed the issue. But I'd rather not have to do this every time a Kata container gets stuck 😄 So any tips here would be super appreciated.
That certainly should not be necessary. I think we were still waiting to see your debug logs though (paste in the output of |
@jodh-intel I'm a bit confused. If you look at the TOML file, it has
The runtime can look for its configuration file in multiple locations - see https://github.com/kata-containers/runtime/#configuration. I suspect you're looking at the "wrong" config file. Note also that the output of
The config file which is referenced in
Hey @jamiehannaford, we have to figure out some proper rollback for cases like this where things do not end up as expected.
@sboeuf Unfortunately restarting the cluster represents way more of a disruption than just rebooting a node. Would something like this work:
Or is it safer to just reboot? I could automate the detection (is kubelet/CRI-O alive, is kata-runtime responsive, are the processes okay) and the rebooting in some kind of controller, but I'd prefer to fix this at the root cause if possible. How difficult would it be for kata-runtime to detect this kind of situation? I'm wondering if I could submit something in a PR.
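As a rough illustration of the detection side (a sketch only, not a substitute for fixing the root cause), the probes could run each kata-runtime command under a hard timeout, so a wedged runtime shows up as an error instead of blocking the checker itself:

```go
package main

import (
	"context"
	"log"
	"os/exec"
	"time"
)

// probe runs a kata-runtime subcommand with a hard timeout, so a hung
// runtime is reported as an error instead of hanging the caller forever.
func probe(timeout time.Duration, args ...string) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	return exec.CommandContext(ctx, "kata-runtime", args...).Run()
}

func main() {
	if err := probe(10*time.Second, "list"); err != nil {
		// A real controller might cordon the node or trigger a reboot here;
		// this sketch only logs the failure.
		log.Printf("kata-runtime appears stuck or unhealthy: %v", err)
	}
}
```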
Yamux keepalives had been disabled on both the server and the client side, which causes a weird issue where the communication between the proxy and the agent hangs. The same issue was fixed in the kata proxy by kata-containers/proxy#91; this commit cherry-picks that patch to fix the same issue in the kata built-in proxy.

Fixes: kata-containers#396

Signed-off-by: fupan <lifupan@gmail.com>
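For context, the fix amounts to re-enabling yamux keepalives on the proxy-to-agent session. With the upstream hashicorp/yamux API that looks roughly like the sketch below (the socket path and interval are illustrative; the actual change is the one referenced in kata-containers/proxy#91):

```go
package main

import (
	"log"
	"net"
	"time"

	"github.com/hashicorp/yamux"
)

// dialAgent opens a yamux session with keepalives enabled, so a dead or
// wedged peer is detected instead of the session hanging silently.
func dialAgent(socketPath string) (*yamux.Session, error) {
	conn, err := net.Dial("unix", socketPath)
	if err != nil {
		return nil, err
	}
	cfg := yamux.DefaultConfig()
	cfg.EnableKeepAlive = true              // the setting that had been left disabled
	cfg.KeepAliveInterval = 5 * time.Second // illustrative value, not the one used upstream
	return yamux.Client(conn, cfg)
}

func main() {
	sess, err := dialAgent("/run/kata/example.sock") // hypothetical socket path
	if err != nil {
		log.Fatal(err)
	}
	defer sess.Close()
}
```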
Hi @jamiehannaford, is this still an issue for you? Or can we close it? We have tested the newest versions of CRI-O and they seem to work fine with kata-runtime.
Description of problem
CRI-O fails to boot sometimes when using Kata.
Expected result
After running `systemctl restart crio`, I expected CRI-O to restart, but it's currently stuck in the initialization procedure and doesn't seem to bind to its sock file. This causes Kubelet to hang too, since it needs `crio.sock` to launch its server.
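A quick way to check that symptom is to try connecting to the socket with a short timeout instead of going through a CRI client that will just hang. A sketch, assuming the default /var/run/crio/crio.sock path:

```go
package main

import (
	"log"
	"net"
	"time"
)

func main() {
	// Default CRI-O socket path; if CRI-O never finished initialising and
	// bound the socket, this dial fails quickly instead of hanging.
	const sock = "/var/run/crio/crio.sock"
	conn, err := net.DialTimeout("unix", sock, 5*time.Second)
	if err != nil {
		log.Fatalf("crio.sock is not accepting connections: %v", err)
	}
	conn.Close()
	log.Println("crio.sock is accepting connections")
}
```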
Actual result

Looking at the CRI-O logs, it's able to run `kata-runtime state <id>`. This also corresponds to the `kata-runtime` logs. However, it doesn't get much further than that. I can't seem to list any containers; the `kata-runtime list` command just hangs. I can't run any `crictl` command (they all just hang).

`kata-collect-data.sh` output here: https://gist.github.com/jamiehannaford/43a323ab0866d9db393c85536c41a366