clh: docker kill test hangs randomly #2141
In the current CI debug logs from: I get the following traces:
Based on that log, it seems the VM is created at start and then something fails for a reason we don't log. This suggests the bug may be in create, or that there may be two different bugs. The container also seems to have taken a long time to boot, around 18 seconds (probably because of the amount of logging I enabled). The other error is in the docker kill test: it runs a container but does not check that the `docker run` command returns exit code 0. I fixed the kill test error here; that may help us determine whether the bug is really in create. This makes me think it is related to kata-containers/runtime#2331. \cc @likebreath @sboeuf
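For illustration, a minimal sketch of the kind of check the kill-test fix adds (names here are assumptions, not the actual test code, which is shell-based): fail early if `docker run` itself exits non-zero, so a later `docker kill` failure is not misattributed to the kill path.

```go
// Sketch only: verify `docker run` succeeded before exercising `docker kill`.
package main

import (
	"fmt"
	"os/exec"
)

func runContainer(name string) error {
	cmd := exec.Command("docker", "run", "-d", "--name", name, "busybox", "sleep", "120")
	// CombinedOutput returns an *exec.ExitError when the command exits non-zero.
	out, err := cmd.CombinedOutput()
	if err != nil {
		return fmt.Errorf("docker run failed: %v: %s", err, out)
	}
	return nil
}

func main() {
	if err := runContainer("kill-test"); err != nil {
		fmt.Println(err)
	}
}
```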
Locally it now seems likely to reproduce; we run
Just to verify: it seems we hit two different bugs. Both are more likely to reproduce using the docker kill tests:
To add on reproducing the 'docker kill' test failure: we can have all the debug options enabled in Kata's configuration.toml file while still getting the test failure. The only problem is that I can't enable the serial console log from CLH, as enabling it would mask the test failure. One more
As advised by @sboeuf, we collected the output (stdout/stderr) of CLH while it runs with kata-runtime. We did see some error messages in CLH's output: However, when CLH got the above error message, the corresponding Detailed logs are here: CLH's output (here), and the (command-line) output from running the Note: these two logs can be cross-referenced based on container ID. For example, the /cc @jcvenegas @sboeuf
Here are more logs from running CLH with flag Note that CLH's output log only contains the log from the last execution of CLH (when the kata /cc @jcvenegas @sboeuf
@likebreath @jcvenegas I identified that the 15s delay corresponds to the dialer timeout defined here, which is applied here. I want to try adding some sleep before we try to connect, as we might be running into some weird behavior from the vsock implementation.
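To illustrate the idea, a minimal sketch (not the actual kata-runtime dialer; `dialVsock` and all timing values are assumptions): pause briefly before each connect attempt and retry, rather than relying on a single long dial timeout.

```go
// Sketch of a retry-with-pause dialer; TCP stands in for vsock so it compiles.
package main

import (
	"fmt"
	"net"
	"time"
)

// dialVsock is a hypothetical stand-in for the runtime's real vsock dialer.
func dialVsock(addr string, timeout time.Duration) (net.Conn, error) {
	return net.DialTimeout("tcp", addr, timeout)
}

func dialWithRetry(addr string, attempts int, perTry, pause time.Duration) (net.Conn, error) {
	var lastErr error
	for i := 0; i < attempts; i++ {
		// Give the guest's vsock implementation a moment before (re)connecting.
		time.Sleep(pause)
		conn, err := dialVsock(addr, perTry)
		if err == nil {
			return conn, nil
		}
		lastErr = err
	}
	return nil, fmt.Errorf("dial failed after %d attempts: %w", attempts, lastErr)
}

func main() {
	conn, err := dialWithRetry("127.0.0.1:1024", 5, 3*time.Second, 100*time.Millisecond)
	if err != nil {
		fmt.Println(err)
		return
	}
	conn.Close()
}
```

The design trade-off is that several short attempts with pauses surface transient vsock failures quickly, instead of blocking for the full 15s on one attempt.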
Thanks for sharing your findings, guys. I synced with @devimc and @chavafg and we pointed to the same changes; this may be the same thing that is causing problems with kata-containers/runtime#2397. kata-containers/runtime#2402 vendors an old version of the client to see if it works better.
Summary of the current state:
In the case of Cloud Hypervisor, I have tested with the latest code in master (https://github.com/jcvenegas/runtime/tree/clh-master), but the kill issue was still reproducible. Another thing to try is to get the agent logs in a different way (see the sketch below).
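One hedged option for that (socket path and tool name are assumptions, not the runtime's actual layout): stream the VM's console socket to a file out-of-band, so agent output survives even if the runtime later hangs.

```go
// Sketch only: copy everything from a console unix socket to a log file.
package main

import (
	"fmt"
	"io"
	"net"
	"os"
)

func main() {
	if len(os.Args) != 3 {
		fmt.Fprintln(os.Stderr, "usage: consoletail <console.sock> <out.log>")
		os.Exit(1)
	}

	// The real socket location depends on the runtime/hypervisor configuration.
	conn, err := net.Dial("unix", os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, "dial:", err)
		os.Exit(1)
	}
	defer conn.Close()

	f, err := os.Create(os.Args[2])
	if err != nil {
		fmt.Fprintln(os.Stderr, "create:", err)
		os.Exit(1)
	}
	defer f.Close()

	// Block and copy until the socket closes (e.g. the VM shuts down).
	if _, err := io.Copy(f, conn); err != nil {
		fmt.Fprintln(os.Stderr, "copy:", err)
		os.Exit(1)
	}
}
```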
Thanks for summarizing the status @jcvenegas
Could be FUD. In the failure I see: in the failing case, per the logs, we end up starting doing
In the passing case, we send several requests (successfully):
@egernst
Versions I use to keep debugging; master seems to work well
1st issue (
Folks, please find below the list of TODOs we discussed over the phone just now. Please ask if I missed anything. Thanks a lot.
@sboeuf @jcvenegas I haven't followed too closely, but is this issue only seen in a nested environment?
TODO update:
This issue is being fixed with a patch to the virtio_vsock driver.
Description of problem
The docker integration tests fail randomly
Expected result
All the tests pass
Actual result
After a kill signal, the container process seems to receive it, but the agent request appears to fail after that.
Logs found in a local test.
kill
agent logs