Skip to content
This repository has been archived by the owner on Jun 28, 2024. It is now read-only.

clh: docker kill test hangs randomly #2141

Closed
jcvenegas opened this issue Dec 4, 2019 · 18 comments
Closed

clh: docker kill test hangs randomly #2141

jcvenegas opened this issue Dec 4, 2019 · 18 comments
Labels
bug Incorrect behaviour needs-review Needs to be assessed by the team.

Comments

@jcvenegas
Copy link
Member

Description of problem

The integration docker test fail randomly

Expected result

All the test passes

Actual result

After a kill signal the container process seems that got it but the agnet request seems to fail after that.

Logs found in a local test.

kill

Dec 04 19:06:34 jcvenega kata-runtime[69931]: time="2019-12-04T19:06:34.790995976Z" level=info msg="creating Sandbox" arch=amd64 command=kill container=9c6f2b7693df66fb8cba43e909208ebf12c9ce424f40bb
9cb721bda213ef9409 function=createSandbox name=kata-runtime pid=69931 source=virtcontainers subsystem=cloudHypervisor
Dec 04 19:06:34 jcvenega kata-runtime[69931]: time="2019-12-04T19:06:34.791152376Z" level=debug msg="converting /run/containerd/io.containerd.runtime.v1.linux/moby/9c6f2b7693df66fb8cba43e909208ebf12
c9ce424f40bb9cb721bda213ef9409/config.json" source=virtcontainers subsystem=compatoci
Dec 04 19:06:34 jcvenega kata-runtime[69931]: time="2019-12-04T19:06:34.792353079Z" level=debug msg="New filesystem store backend for /var/lib/vc/sbs/9c6f2b7693df66fb8cba43e909208ebf12c9ce424f40bb9c
b721bda213ef9409/9c6f2b7693df66fb8cba43e909208ebf12c9ce424f40bb9cb721bda213ef9409" arch=amd64 backend=filesystem command=kill container=9c6f2b7693df66fb8cba43e909208ebf12c9ce424f40bb9cb721bda213ef94
09 name=kata-runtime path=/var/lib/vc/sbs/9c6f2b7693df66fb8cba43e909208ebf12c9ce424f40bb9cb721bda213ef9409/9c6f2b7693df66fb8cba43e909208ebf12c9ce424f40bb9cb721bda213ef9409 pid=69931 source=virtconta
iners/store subsystem=store
Dec 04 19:06:34 jcvenega kata-runtime[69931]: time="2019-12-04T19:06:34.792465779Z" level=debug msg="New filesystem store backend for /run/vc/sbs/9c6f2b7693df66fb8cba43e909208ebf12c9ce424f40bb9cb721
bda213ef9409/9c6f2b7693df66fb8cba43e909208ebf12c9ce424f40bb9cb721bda213ef9409" arch=amd64 backend=filesystem command=kill container=9c6f2b7693df66fb8cba43e909208ebf12c9ce424f40bb9cb721bda213ef9409 n
ame=kata-runtime path=/run/vc/sbs/9c6f2b7693df66fb8cba43e909208ebf12c9ce424f40bb9cb721bda213ef9409/9c6f2b7693df66fb8cba43e909208ebf12c9ce424f40bb9cb721bda213ef9409 pid=69931 source=virtcontainers/st
ore subsystem=store
Dec 04 19:06:34 jcvenega kata-runtime[69931]: time="2019-12-04T19:06:34.79278088Z" level=info msg="release sandbox" arch=amd64 command=kill container=9c6f2b7693df66fb8cba43e909208ebf12c9ce424f40bb9c
b721bda213ef9409 name=kata-runtime pid=69931 sandbox=9c6f2b7693df66fb8cba43e909208ebf12c9ce424f40bb9cb721bda213ef9409 source=virtcontainers subsystem=sandbox
Dec 04 19:06:34 jcvenega kata-runtime[69931]: time="2019-12-04T19:06:34.79285058Z" level=info msg="Disconnecting Sandbox Console" arch=amd64 command=kill container=9c6f2b7693df66fb8cba43e909208ebf12
c9ce424f40bb9cb721bda213ef9409 function=disconnect name=kata-runtime pid=69931 source=virtcontainers subsystem=cloudHypervisor
Dec 04 19:06:34 jcvenega kata-runtime[69931]: time="2019-12-04T19:06:34.79293918Z" level=info msg=kill arch=amd64 command=kill container=9c6f2b7693df66fb8cba43e909208ebf12c9ce424f40bb9cb721bda213ef9
409 container state=running name=kata-runtime pid=69931 sandbox=9c6f2b7693df66fb8cba43e909208ebf12c9ce424f40bb9cb721bda213ef9409 signal=21 source=runtime
Dec 04 19:06:34 jcvenega kata-runtime[69931]: time="2019-12-04T19:06:34.79300928Z" level=info msg="fetch sandbox" arch=amd64 command=kill container=9c6f2b7693df66fb8cba43e909208ebf12c9ce424f40bb9cb7
21bda213ef9409 name=kata-runtime pid=69931 sandbox=9c6f2b7693df66fb8cba43e909208ebf12c9ce424f40bb9cb721bda213ef9409 source=virtcontainers
Dec 04 19:06:34 jcvenega kata-runtime[69931]: time="2019-12-04T19:06:34.793513881Z" level=info msg="endpoint unmarshalled" arch=amd64 command=kill container=9c6f2b7693df66fb8cba43e909208ebf12c9ce424
f40bb9cb721bda213ef9409 endpoint="&{{{83ae6427-ae34-4bce-8e42-7923588b0c6a br0_kata {tap0_kata 02:42:ac:11:00:02 []} [] []} {eth0 0a:65:fd:ee:50:05 []} 2} {{{584 1500 0 eth0 02:42:ac:11:00:02 up|bro
adcast|multicast 69699 585 0 <nil>  0xc0003e0780 0 0xc000382940 ether <nil> up 0 0 0 65536 65535 []} veth} [172.17.0.2/16 eth0] [{Ifindex: 584 Dst: <nil> Src: <nil> Gw: 172.17.0.1 Flags: [] Table: 2
54} {Ifindex: 584 Dst: 172.17.0.0/16 Src: 172.17.0.2 Gw: <nil> Flags: [] Table: 254}] {[]  [] []}} virtual }" endpoint-type=virtual name=kata-runtime pid=69931 sandbox=9c6f2b7693df66fb8cba43e909208e
bf12c9ce424f40bb9cb721bda213ef9409 source=virtcontainers subsystem=network
Dec 04 19:06:34 jcvenega kata-runtime[69931]: time="2019-12-04T19:06:34.793701482Z" level=info msg="creating Sandbox" arch=amd64 command=kill container=9c6f2b7693df66fb8cba43e909208ebf12c9ce424f40bb
9cb721bda213ef9409 function=createSandbox name=kata-runtime pid=69931 sandbox=9c6f2b7693df66fb8cba43e909208ebf12c9ce424f40bb9cb721bda213ef9409 source=virtcontainers subsystem=cloudHypervisor
Dec 04 19:06:34 jcvenega kata-runtime[69931]: time="2019-12-04T19:06:34.793830982Z" level=debug msg="converting /run/containerd/io.containerd.runtime.v1.linux/moby/9c6f2b7693df66fb8cba43e909208ebf12
c9ce424f40bb9cb721bda213ef9409/config.json" source=virtcontainers subsystem=compatoci
Dec 04 19:06:34 jcvenega kata-runtime[69931]: time="2019-12-04T19:06:34.794805784Z" level=info msg="New client" arch=amd64 command=kill container=9c6f2b7693df66fb8cba43e909208ebf12c9ce424f40bb9cb721
bda213ef9409 name=kata-runtime pid=69931 proxy=69678 sandbox=9c6f2b7693df66fb8cba43e909208ebf12c9ce424f40bb9cb721bda213ef9409 source=virtcontainers subsystem=kata_agent url="hvsock:///run/vc/vm/9c6f
2b7693df66fb8cba43e909208ebf12c9ce424f40bb9cb721bda213ef9409/clh.sock:1024"
Dec 04 19:06:34 jcvenega kata-runtime[69931]: time="2019-12-04T19:06:34.860275229Z" level=debug msg="sending request" arch=amd64 command=kill container=9c6f2b7693df66fb8cba43e909208ebf12c9ce424f40bb
9cb721bda213ef9409 name=grpc.SignalProcessRequest pid=69931 req="container_id:\"9c6f2b7693df66fb8cba43e909208ebf12c9ce424f40bb9cb721bda213ef9409\" exec_id:\"9c6f2b7693df66fb8cba43e909208ebf12c9ce424
f40bb9cb721bda213ef9409\" signal:21 " sandbox=9c6f2b7693df66fb8cba43e909208ebf12c9ce424f40bb9cb721bda213ef9409 source=virtcontainers subsystem=kata_agent
Dec 04 19:06:34 jcvenega kata-runtime[69931]: time="2019-12-04T19:06:34.87891257Z" level=info msg="release sandbox" arch=amd64 command=kill container=9c6f2b7693df66fb8cba43e909208ebf12c9ce424f40bb9c
b721bda213ef9409 name=kata-runtime pid=69931 sandbox=9c6f2b7693df66fb8cba43e909208ebf12c9ce424f40bb9cb721bda213ef9409 source=virtcontainers subsystem=sandbox
Dec 04 19:06:34 jcvenega kata-runtime[69931]: time="2019-12-04T19:06:34.879034471Z" level=info msg="Disconnecting Sandbox Console" arch=amd64 command=kill container=9c6f2b7693df66fb8cba43e909208ebf1
2c9ce424f40bb9cb721bda213ef9409 function=disconnect name=kata-runtime pid=69931 sandbox=9c6f2b7693df66fb8cba43e909208ebf12c9ce424f40bb9cb721bda213ef9409 source=virtcontainers subsystem=cloudHypervis
or

agent logs

time="2019-12-04T19:06:34.283629714Z" level=debug msg="request end" debug_console=false duration=242.353644ms name=kata-agent pid=39 request=/grpc.AgentService/ReadStdout resp="data:\"TRAP_R[0/7872]
r\\n\" " sandbox=9c6f2b7693df66fb8cba43e909208ebf12c9ce424f40bb9cb721bda213ef9409 source=agent
time="2019-12-04T19:06:34.292036265Z" level=debug msg="new request" debug_console=false name=kata-agent pid=39 req="container_id:\"9c6f2b7693df66fb8cba43e909208ebf12c9ce424f40bb9cb721bda213ef9409\"
exec_id:\"9c6f2b7693df66fb8cba43e909208ebf12c9ce424f40bb9cb721bda213ef9409\" len:32768 " request=/grpc.AgentService/ReadStdout sandbox=9c6f2b7693df66fb8cba43e909208ebf12c9ce424f40bb9cb721bda213ef940
9 source=agent
time="2019-12-04T19:06:34.622367268Z" level=debug msg="new request" debug_console=false name=kata-agent pid=39 req="container_id:\"9c6f2b7693df66fb8cba43e909208ebf12c9ce424f40bb9cb721bda213ef9409\"
exec_id:\"9c6f2b7693df66fb8cba43e909208ebf12c9ce424f40bb9cb721bda213ef9409\" signal:21 " request=/grpc.AgentService/SignalProcess sandbox=9c6f2b7693df66fb8cba43e909208ebf12c9ce424f40bb9cb721bda213ef
9409 source=agent
time="2019-12-04T19:06:34.631770981Z" level=debug msg="request end" debug_console=false duration="526.236µs" name=kata-agent pid=39 request=/grpc.AgentService/SignalProcess resp="&Empty{XXX_unrecogn
ized:[],}" sandbox=9c6f2b7693df66fb8cba43e909208ebf12c9ce424f40bb9cb721bda213ef9409 source=agent
time="2019-12-04T19:06:35.304953045Z" level=debug msg="request end" debug_console=false duration=1.000723988s name=kata-agent pid=39 request=/grpc.AgentService/ReadStdout resp= sandbox=9c6f2b7693df6
6fb8cba43e909208ebf12c9ce424f40bb9cb721bda213ef9409 source=agent
time="2019-12-04T19:06:35.310675839Z" level=debug msg="process exited" debug_console=false name=kata-agent pid=54 sandbox=9c6f2b7693df66fb8cba43e909208ebf12c9ce424f40bb9cb721bda213ef9409 source=agen
t status=21
time="2019-12-04T19:06:35.315042997Z" level=info msg="ignoring unexpected signal" debug_console=false name=kata-agent pid=39 sandbox=9c6f2b7693df66fb8cba43e909208ebf12c9ce424f40bb9cb721bda213ef9409
signal="child exited" source=agent
systemd-tmpfiles-clean.timer: Timer elapsed.
systemd-tmpfiles-clean.service: Trying to enqueue job systemd-tmpfiles-clean.service/start/replace
Added job systemd-tmpfiles-clean.service/start to transaction.
Pulling in system.slice/start from systemd-tmpfiles-clean.service/start
Added job system.slice/start to transaction.
Pulling in -.slice/start from system.slice/start
Added job -.slice/start to transaction.
Pulling in shutdown.target/stop from systemd-tmpfiles-clean.service/start
Added job shutdown.target/stop to transaction.
Found redundant job shutdown.target/stop, dropping from transaction.
Found redundant job -.slice/start, dropping from transaction.
Found redundant job system.slice/start, dropping from transaction.
systemd-tmpfiles-clean.service: Installed new job systemd-tmpfiles-clean.service/start as 38
systemd-tmpfiles-clean.service: Enqueued job systemd-tmpfiles-clean.service/start as 38
systemd-tmpfiles-clean.timer: Changed waiting -> running
systemd-tmpfiles-clean.service: Passing 0 fds to service
systemd-tmpfiles-clean.service: About to execute: /usr/bin/systemd-tmpfiles --clean
systemd-tmpfiles-clean.service: Forked /usr/bin/systemd-tmpfiles as 60
systemd-tmpfiles-clean.service: Changed dead -> start
systemd-tmpfiles-clean.service: Failed to connect stdout to the journal socket, ignoring: No such file or directory
systemd-tmpfiles-clean.service: Kernel keyring not supported, ignoring.
systemd-tmpfiles-clean.service: Executing: /usr/bin/systemd-tmpfiles --clean
Received SIGCHLD from PID 60 (systemd-tmpfile).
Child 60 (systemd-tmpfile) died (code=exited, status=0/SUCCESS)
systemd-tmpfiles-clean.service: Failed to read oom_kill field of memory.events cgroup attribute: No such file or directory
systemd-tmpfiles-clean.service: Child 60 belongs to systemd-tmpfiles-clean.service.
systemd-tmpfiles-clean.service: Main process exited, code=exited, status=0/SUCCESS
systemd-tmpfiles-clean.service: Succeeded.
systemd-tmpfiles-clean.service: Changed start -> dead
systemd-tmpfiles-clean.service: Job 38 systemd-tmpfiles-clean.service/start finished, result=done
systemd-tmpfiles-clean.timer: Got notified about unit deactivation.
systemd-tmpfiles-clean.timer: Monotonic timer elapses in 23h 59min 59.954725s.
systemd-tmpfiles-clean.timer: Changed running -> waiting
systemd-tmpfiles-clean.timer: Monotonic timer elapses in 23h 59min 59.950491s.

@jcvenegas jcvenegas added bug Incorrect behaviour needs-review Needs to be assessed by the team. labels Dec 4, 2019
@jcvenegas
Copy link
Member Author

In the current CI debug logs from:
kata-containers/runtime#2369

I get the next traces:

1:14:27 kata-agent.service: Kernel keyring not supported, ignoring.
11:14:27 kata-agent.service: Job 32 kata-agent.service/start finished, result=done
11:14:27 kata-agent.service: Executing: /usr/bin/kata-agent
11:14:27 [  OK  ] Started Kata Containers Agent.
11:14:27 kata-containers.target changed dead -> active
11:14:27 kata-containers.target: Job 1 kata-containers.target/start finished, result=done
11:14:27 [  OK  ] Reached target Kata Containers Agent Target.
11:14:27 Startup finished in 979ms (kernel) + 3.644s (userspace) = 4.623s.
11:14:27 time="2020-01-09T17:14:27.039236015Z" level=info msg=announce debug_console=false device-handlers="mmioblk,blk,blk-ccw,scsi,nvdimm" name=kata-agent pid=39 source=agent storage-handlers="local,9p,virtio-fs,blk,blk-ccw,mmioblk,scsi,ephemeral" system-memory="2045332 kB" version=1.10.0-rc0-1b3628c43660cb198ec055d78e5c7b03809cba60
11:14:27 time="2020-01-09T17:14:27.056976992Z" level=debug msg="Vsock channel type detected" debug_console=false name=kata-agent pid=39 source=agent
11:14:27 time="2020-01-09T17:14:27.066753004Z" level=info msg="Started listening for uevents" debug_console=false name=kata-agent pid=39 source=agent subsystem=udevlistener
11:14:27 time="2020-01-09T17:14:27.103406519Z" level=info msg="agent grpc server starts" debug_console=false name=kata-agent pid=39 source=agent
11:14:27 time="2020-01-09T17:14:27.116930894Z" level=info msg="Waiting for stopServer signal..." debug_console=false name=kata-agent pid=39 source=agent subsystem=stopserverwatcher
11:14:37 Jan 09 17:14:37 ubuntu1804-azure3600b0 kata-runtime[67769]: time="2020-01-09T17:14:37.202525566Z" level=info msg="Stop Sandbox" arch=amd64 command=create container=f5f2258347a1122c3491a4e994dba2db8bb52beb98907702e5be39ea8b04572f function=stopSandbox name=kata-runtime pid=67769 source=virtcontainers subsystem=cloudHypervisor
11:14:37 Jan 09 17:14:37 ubuntu1804-azure3600b0 kata-runtime[67769]: time="2020-01-09T17:14:37.202706878Z" level=info msg="Stopping Cloud Hypervisor" PID=67801 arch=amd64 command=create container=f5f2258347a1122c3491a4e994dba2db8bb52beb98907702e5be39ea8b04572f name=kata-runtime pid=67769 source=virtcontainers subsystem=cloudHypervisor
11:14:37 Jan 09 17:14:37 ubuntu1804-azure3600b0 kata-runtime[67769]: time="2020-01-09T17:14:37.210943326Z" level=debug msg="virtio_loop: Unexpected poll revents 11" arch=amd64 command=create container=f5f2258347a1122c3491a4e994dba2db8bb52beb98907702e5be39ea8b04572f name=kata-runtime pid=67769 source=virtiofsd
11:14:37 Jan 09 17:14:37 ubuntu1804-azure3600b0 kata-runtime[67769]: time="2020-01-09T17:14:37.212524831Z" level=debug msg="virtio_loop: Exit" arch=amd64 command=create container=f5f2258347a1122c3491a4e994dba2db8bb52beb98907702e5be39ea8b04572f name=kata-runtime pid=67769 source=virtiofsd
11:14:40 Jan 09 17:14:40 ubuntu1804-azure3600b0 kata-runtime[67769]: time="2020-01-09T17:14:40.277562783Z" level=warning msg="VM still running after waiting 3s" arch=amd64 command=create container=f5f2258347a1122c3491a4e994dba2db8bb52beb98907702e5be39ea8b04572f name=kata-runtime pid=67769 source=virtcontainers subsystem=cloudHypervisor
11:14:40 Jan 09 17:14:40 ubuntu1804-azure3600b0 kata-runtime[67769]: time="2020-01-09T17:14:40.277678291Z" level=debug msg="removing vm sockets" arch=amd64 command=create container=f5f2258347a1122c3491a4e994dba2db8bb52beb98907702e5be39ea8b04572f name=kata-runtime pid=67769 source=virtcontainers subsystem=cloudHypervisor
11:14:40 Jan 09 17:14:40 ubuntu1804-azure3600b0 kata-runtime[67769]: time="2020-01-09T17:14:40.277770897Z" level=info msg="cleanup vm path" arch=amd64 command=create container=f5f2258347a1122c3491a4e994dba2db8bb52beb98907702e5be39ea8b04572f dir=/run/vc/vm/f5f2258347a1122c3491a4e994dba2db8bb52beb98907702e5be39ea8b04572f link=/run/vc/vm/f5f2258347a1122c3491a4e994dba2db8bb52beb98907702e5be39ea8b04572f name=kata-runtime pid=67769 source=virtcontainers subsystem=cloudHypervisor
11:14:40 command failed error 'exit status 125'
11:14:40 [docker run --cidfile /tmp/cid717289646/BGPQlIfJ0G7djCbDUiwHXoz08PdPh1 --runtime kata-runtime --name BGPQlIfJ0G7djCbDUiwHXoz08PdPh1 -dt busybox sh -c trap "exit 26" 26; echo TRAP_RUNNING; while :; do sleep 1; done]
11:14:40 Timeout: 120 seconds
11:14:40 Exit Code: 125
11:14:40 Stdout: f5f2258347a1122c3491a4e994dba2db8bb52beb98907702e5be39ea8b04572f
11:14:40 
11:14:40 Stderr: docker: Error response from daemon: ttrpc: client shutting down: ttrpc: closed: unknown.
11:14:40 
11:14:40 Running command '/usr/bin/docker [docker logs BGPQlIfJ0G7djCbDUiwHXoz08PdPh1]'
11:14:40 [docker logs BGPQlIfJ0G7djCbDUiwHXoz08PdPh1]
11:14:40 Timeout: 120 seconds
11:14:40 Exit Code: 0
11:14:40 Stdout: 
11:14:40 Stderr: 
11:14:41 Running command '/usr/bin/docker [docker logs ...

Based in that log seems that the VM is being created at start and then for some reason that we dont log something fails.

This spot that the bug may be at create or may be two different bugs also seems that the container took a lot to boot (probably because the amount of log information I enabled but took around 18 seconds.

The other error is at kill test in docker tests, it runs a container but not checks if the docker run command returns an exit code with 0.

I fixed the kill tests error, here, that may improve to help us to know that maybe is create and not really create real bug that makes me think it is related to kata-containers/runtime#2331.

\cc @likebreath @sboeuf

@jcvenegas
Copy link
Member Author

Locally is being likely to be reproduced now, we run while make docker FOCUS="docker kill"; do echo next; done sadly is just easier to repduce without enable debug logs, so make it more difficult to understand what is failing.

Jan 13 16:16:51 jcvenega-debug kata-runtime[55576]: time="2020-01-13T16:16:51.65840287Z" level=info msg="Sandbox already exist, loading from state" arch=amd64 command=state container=88f1f500231c53c63c173d94d2fea4e89b70cb5d2c53ae26719ce54e8e80c53f function=createSandbox name=kata-runtime pid=55576 source=virtcontainers subsystem=cloudHypervisor
Jan 13 16:16:51 jcvenega-debug kata-runtime[55576]: time="2020-01-13T16:16:51.659944274Z" level=info msg="release sandbox" arch=amd64 command=state container=88f1f500231c53c63c173d94d2fea4e89b70cb5d2c53ae26719ce54e8e80c53f name=kata-runtime pid=55576 sandbox=88f1f500231c53c63c173d94d2fea4e89b70cb5d2c53ae26719ce54e8e80c53f source=virtcontainers subsystem=sandbox
Jan 13 16:16:51 jcvenega-debug kata-runtime[55576]: time="2020-01-13T16:16:51.660003074Z" level=info msg="Disconnecting Sandbox Console" arch=amd64 command=state container=88f1f500231c53c63c173d94d2fea4e89b70cb5d2c53ae26719ce54e8e80c53f function=disconnect name=kata-runtime pid=55576 source=virtcontainers subsystem=cloudHypervisor
Jan 13 16:16:51 jcvenega-debug kata-runtime[55607]: time="2020-01-13T16:16:51.863214832Z" level=info msg="loaded configuration" arch=amd64 command=kill file=/usr/share/defaults/kata-containers/configuration-clh.toml format=TOML name=kata-runtime pid=55607 source=katautils
Jan 13 16:16:51 jcvenega-debug kata-runtime[55607]: time="2020-01-13T16:16:51.863372732Z" level=info msg="VSOCK supported, configure to not use proxy" arch=amd64 command=kill name=kata-runtime pid=55607 source=katautils
Jan 13 16:16:51 jcvenega-debug kata-runtime[55607]: time="2020-01-13T16:16:51.863407332Z" level=info arch=amd64 arguments="\"kill 88f1f500231c53c63c173d94d2fea4e89b70cb5d2c53ae26719ce54e8e80c53f 18\"" command=kill commit=d34d66099f1ef94542e4096580ca9c4f598a7d00-dirty name=kata-runtime pid=55607 source=runtime version=1.10.0-rc0
Jan 13 16:16:51 jcvenega-debug kata-runtime[55607]: time="2020-01-13T16:16:51.863625233Z" level=info msg="fetch sandbox" arch=amd64 command=kill container=88f1f500231c53c63c173d94d2fea4e89b70cb5d2c53ae26719ce54e8e80c53f name=kata-runtime pid=55607 source=virtcontainers
Jan 13 16:16:51 jcvenega-debug kata-runtime[55607]: time="2020-01-13T16:16:51.864922635Z" level=info msg="endpoint unmarshalled" arch=amd64 command=kill container=88f1f500231c53c63c173d94d2fea4e89b70cb5d2c53ae26719ce54e8e80c53f endpoint="&{{{957e62d0-a128-45e7-a992-2a10bfe823cc br0_kata {tap0_kata 02:42:ac:11:00:03 []} [] []} {eth0 52:2e:a4:e7:80:fd []} 2} {{{1378 1500 0 eth0 02:42:ac:11:00:03 up|broadcast|multicast 69699 1379 0 <nil>  0xc00012a300 0 0xc000433180 ether <nil> up 0 0 0 65536 65535 []} veth} [172.17.0.3/16 eth0] [{Ifindex: 1378 Dst: <nil> Src: <nil> Gw: 172.17.0.1 Flags: [] Table: 254} {Ifindex: 1378 Dst: 172.17.0.0/16 Src: 172.17.0.3 Gw: <nil> Flags: [] Table: 254}] {[]  [] []}} virtual }" endpoint-type=virtual name=kata-runtime pid=55607 source=virtcontainers subsystem=network
Jan 13 16:16:51 jcvenega-debug kata-runtime[55607]: time="2020-01-13T16:16:51.866516639Z" level=info msg="creating Sandbox" arch=amd64 command=kill container=88f1f500231c53c63c173d94d2fea4e89b70cb5d2c53ae26719ce54e8e80c53f function=createSandbox name=kata-runtime pid=55607 source=virtcontainers subsystem=cloudHypervisor
Jan 13 16:16:51 jcvenega-debug kata-runtime[55607]: time="2020-01-13T16:16:51.866680639Z" level=info msg="Sandbox already exist, loading from state" arch=amd64 command=kill container=88f1f500231c53c63c173d94d2fea4e89b70cb5d2c53ae26719ce54e8e80c53f function=createSandbox name=kata-runtime pid=55607 source=virtcontainers subsystem=cloudHypervisor
Jan 13 16:16:51 jcvenega-debug kata-runtime[55607]: time="2020-01-13T16:16:51.867971442Z" level=info msg="release sandbox" arch=amd64 command=kill container=88f1f500231c53c63c173d94d2fea4e89b70cb5d2c53ae26719ce54e8e80c53f name=kata-runtime pid=55607 sandbox=88f1f500231c53c63c173d94d2fea4e89b70cb5d2c53ae26719ce54e8e80c53f source=virtcontainers subsystem=sandbox
Jan 13 16:16:51 jcvenega-debug kata-runtime[55607]: time="2020-01-13T16:16:51.868036043Z" level=info msg="Disconnecting Sandbox Console" arch=amd64 command=kill container=88f1f500231c53c63c173d94d2fea4e89b70cb5d2c53ae26719ce54e8e80c53f function=disconnect name=kata-runtime pid=55607 source=virtcontainers subsystem=cloudHypervisor
Jan 13 16:16:51 jcvenega-debug kata-runtime[55607]: time="2020-01-13T16:16:51.868131843Z" level=info msg=kill arch=amd64 command=kill container=88f1f500231c53c63c173d94d2fea4e89b70cb5d2c53ae26719ce54e8e80c53f container state=running name=kata-runtime pid=55607 sandbox=88f1f500231c53c63c173d94d2fea4e89b70cb5d2c53ae26719ce54e8e80c53f signal=18 source=runtime
Jan 13 16:16:51 jcvenega-debug kata-runtime[55607]: time="2020-01-13T16:16:51.868199343Z" level=info msg="fetch sandbox" arch=amd64 command=kill container=88f1f500231c53c63c173d94d2fea4e89b70cb5d2c53ae26719ce54e8e80c53f name=kata-runtime pid=55607 sandbox=88f1f500231c53c63c173d94d2fea4e89b70cb5d2c53ae26719ce54e8e80c53f source=virtcontainers
Jan 13 16:16:51 jcvenega-debug kata-runtime[55607]: time="2020-01-13T16:16:51.869040645Z" level=info msg="endpoint unmarshalled" arch=amd64 command=kill container=88f1f500231c53c63c173d94d2fea4e89b70cb5d2c53ae26719ce54e8e80c53f endpoint="&{{{957e62d0-a128-45e7-a992-2a10bfe823cc br0_kata {tap0_kata 02:42:ac:11:00:03 []} [] []} {eth0 52:2e:a4:e7:80:fd []} 2} {{{1378 1500 0 eth0 02:42:ac:11:00:03 up|broadcast|multicast 69699 1379 0 <nil>  0xc00012a540 0 0xc0004b6720 ether <nil> up 0 0 0 65536 65535 []} veth} [172.17.0.3/16 eth0] [{Ifindex: 1378 Dst: <nil> Src: <nil> Gw: 172.17.0.1 Flags: [] Table: 254} {Ifindex: 1378 Dst: 172.17.0.0/16 Src: 172.17.0.3 Gw: <nil> Flags: [] Table: 254}] {[]  [] []}} virtual }" endpoint-type=virtual name=kata-runtime pid=55607 sandbox=88f1f500231c53c63c173d94d2fea4e89b70cb5d2c53ae26719ce54e8e80c53f source=virtcontainers subsystem=network
Jan 13 16:16:51 jcvenega-debug kata-runtime[55607]: time="2020-01-13T16:16:51.870270748Z" level=info msg="creating Sandbox" arch=amd64 command=kill container=88f1f500231c53c63c173d94d2fea4e89b70cb5d2c53ae26719ce54e8e80c53f function=createSandbox name=kata-runtime pid=55607 sandbox=88f1f500231c53c63c173d94d2fea4e89b70cb5d2c53ae26719ce54e8e80c53f source=virtcontainers subsystem=cloudHypervisor
Jan 13 16:16:51 jcvenega-debug kata-runtime[55607]: time="2020-01-13T16:16:51.870386748Z" level=info msg="Sandbox already exist, loading from state" arch=amd64 command=kill container=88f1f500231c53c63c173d94d2fea4e89b70cb5d2c53ae26719ce54e8e80c53f function=createSandbox name=kata-runtime pid=55607 sandbox=88f1f500231c53c63c173d94d2fea4e89b70cb5d2c53ae26719ce54e8e80c53f source=virtcontainers subsystem=cloudHypervisor
Jan 13 16:16:51 jcvenega-debug kata-runtime[55607]: time="2020-01-13T16:16:51.87131425Z" level=info msg="New client" arch=amd64 command=kill container=88f1f500231c53c63c173d94d2fea4e89b70cb5d2c53ae26719ce54e8e80c53f name=kata-runtime pid=55607 proxy=55389 sandbox=88f1f500231c53c63c173d94d2fea4e89b70cb5d2c53ae26719ce54e8e80c53f source=virtcontainers subsystem=kata_agent url="hvsock:///run/vc/vm/88f1f500231c53c63c173d94d2fea4e89b70cb5d2c53ae26719ce54e8e80c53f/clh.sock:1024"


Jan 13 16:17:06 jcvenega-debug kata-runtime[55607]: time="2020-01-13T16:17:06.871628241Z" level=warning msg="DEBUG: ERROR: context deadline exceeded" arch=amd64 command=kill container=88f1f500231c53c63c173d94d2fea4e89b70cb5d2c53ae26719ce54e8e80c53f name=kata-runtime pid=55607 sandbox=88f1f500231c53c63c173d94d2fea4e89b70cb5d2c53ae26719ce54e8e80c53f source=virtcontainers subsystem=kata_agent
Jan 13 16:17:06 jcvenega-debug kata-runtime[55607]: time="2020-01-13T16:17:06.871801642Z" level=info msg="release sandbox" arch=amd64 command=kill container=88f1f500231c53c63c173d94d2fea4e89b70cb5d2c53ae26719ce54e8e80c53f name=kata-runtime pid=55607 sandbox=88f1f500231c53c63c173d94d2fea4e89b70cb5d2c53ae26719ce54e8e80c53f source=virtcontainers subsystem=sandbox
Jan 13 16:17:06 jcvenega-debug kata-runtime[55607]: time="2020-01-13T16:17:06.871914442Z" level=info msg="Disconnecting Sandbox Console" arch=amd64 command=kill container=88f1f500231c53c63c173d94d2fea4e89b70cb5d2c53ae26719ce54e8e80c53f function=disconnect name=kata-runtime pid=55607 sandbox=88f1f500231c53c63c173d94d2fea4e89b70cb5d2c53ae26719ce54e8e80c53f source=virtcontainers subsystem=cloudHypervisor
Jan 13 16:17:06 jcvenega-debug kata-runtime[55607]: time="2020-01-13T16:17:06.871979742Z" level=error msg="context deadline exceeded" arch=amd64 command=kill container=88f1f500231c53c63c173d94d2fea4e89b70cb5d2c53ae26719ce54e8e80c53f name=kata-runtime pid=55607 sandbox=88f1f500231c53c63c173d94d2fea4e89b70cb5d2c53ae26719ce54e8e80c53f source=runtime

@jcvenegas
Copy link
Member Author

jcvenegas commented Jan 13, 2020

Just to verify seems that we hit two different bugs:

Both often are more likely to reproduce using docker kill tests:

docker kill killing container
  with '6'(aborted) signal
  /home/jcvenega/repos/go/src/github.com/kata-containers/tests/vendor/github.com/onsi/ginkgo/extensions/table/table_entry.go:43
Running command '/usr/bin/docker [docker run --cidfile /tmp/cid056754573/coVyKV5cJPjL9EcGZ2EBYs73W2WEB1 --runtime kata-runtime --name coVyKV5cJPjL9EcGZ2EBYs73W2WEB1 -dt busybox sh -c trap "e
xit 6" 6; echo TRAP_RUNNING; while :; do sleep 1; done]'
tail: /tmp/serial-clh.log: file truncated
time="2020-01-13T17:59:47.915647501Z" level=info msg=announce debug_console=false device-handlers="nvdimm,mmioblk,blk,blk-ccw,scsi" name=kata-agent pid=38 source=agent storage-handlers="blk-
ccw,mmioblk,scsi,ephemeral,local,9p,virtio-fs,blk" system-memory="2045332 kB" version=1.10.0-rc0-1b3628c43660cb198ec055d78e5c7b03809cba60
time="2020-01-13T17:59:47.933960084Z" level=info msg="Started listening for uevents" debug_console=false name=kata-agent pid=38 source=agent subsystem=udevlistener
time="2020-01-13T17:59:47.96570154Z" level=info msg="agent grpc server starts" debug_console=false name=kata-agent pid=38 source=agent
time="2020-01-13T17:59:47.975888802Z" level=info msg="Waiting for stopServer signal..." debug_console=false name=kata-agent pid=38 source=agent subsystem=stopserverwatcher
command failed error 'exit status 125'
[docker run --cidfile /tmp/cid056754573/coVyKV5cJPjL9EcGZ2EBYs73W2WEB1 --runtime kata-runtime --name coVyKV5cJPjL9EcGZ2EBYs73W2WEB1 -dt busybox sh -c trap "exit 6" 6; echo TRAP_RUNNING; while :; do sleep 1; done]
Timeout: 35 seconds
Exit Code: 125
Stdout: e826b75eb960742c97f2774d02645ef9bd2fa3758e68522b62084ad4d453dde1

Stderr: docker: Error response from daemon: ttrpc: client shutting down: ttrpc: closed: unknown.

Running command '/usr/bin/docker [docker rm -f coVyKV5cJPjL9EcGZ2EBYs73W2WEB1]'
[docker rm -f coVyKV5cJPjL9EcGZ2EBYs73W2WEB1]
Timeout: 35 seconds
Exit Code: 0
Stdout: coVyKV5cJPjL9EcGZ2EBYs73W2WEB1

  • kill fails with context dead line, in other logs I see the create and start commands seems to work fine but then in the next command of docker kill tests is send a signal, the signal is sent to the agent via grpc but when the runtime tries to connect to the agent a context deadline happens, it not clear if the all the VM hangs or just the agent is handling connections (so problable hangs)

@likebreath
Copy link
Contributor

@jcvenegas

To add on reproducing the 'docker kill' test failure, we can have all the debug options enabled in kata's configuration.toml file while getting the test failure. The only problem was I can't enable the serial console log from CLH, as enabling it would mask the test failure.

One more interesting observation to mention: reproducing the test failure is much faster on a QEMU-based VM hosted on my own dev machine, comparing to a Azure VM (where the CI is running on). The QEMU VM takes ~5 minutes to reproduce the test failure, while the Azure VM takes at least 2 hours.

@likebreath
Copy link
Contributor

likebreath commented Jan 15, 2020

As advised by @sboeuf, we collected the output (stdout/stderr) of CLH when it is running with kata-runtime. We did see some error message from the CLH's output: cloud-hypervisor: 1.053427589s: ERROR:vm-virtio/src/vsock/csm/connection.rs:246 -- vsock: error reading from backing stream: lp=1073744420, pp=1024, err=Os { code: 104, kind: ConnectionReset, message: "Connection reset by peer" }.

However, when CLH got the above error message, the corresponding docker kill test did not fail. Interestingly, when docker kill test failed on docker run ( same as @jcvenegas shared in the previous comment), the corresponding execution of CLH does not complain anything.

Detailed logs are here: CLH's output (here), and the (command-line) output from running the docker kill test (here).

Note: these two logs can be cross-referred to each other based on container ID. For example, the docker kill test failed with contianer d41fd4e67a74d8ba24868014871a1f40a0ec2f63713bbb9739fd0a30d77ade37, while there is no error message from the corresponding CLH's output.

/cc @jcvenegas @sboeuf

@likebreath
Copy link
Contributor

Here are more logs when running CLH with flag -vvv (where we got a lot more logs from CLH). CLH's output is here, and the output from running the docker kill test is here.

Note that the CLH's output log only contains the log from the last execution of CLH (when the kata docker kill test failed), as the complete log file is too big (~145M). I can share if you found it interesting.

/cc @jcvenegas @sboeuf

@sboeuf
Copy link

sboeuf commented Jan 17, 2020

@likebreath @jcvenegas
Ok so based on the latest logs 0116-2-vvv-hcv/, I identified that when the test is failing, we can see the agent is running, and the grpc server is being started, but nothing happens after that until the VM is killed after 15s. So I looked into the kata-runtime logs and I found out that the first communication check with the agent is failing. Basically when virtcontainers/kata_agent.go tries to k.check() which invokes k.connect(), the connect fails, and after that the VM is destroyed.

I identified that 15s delay correspond to the dialer timeout defined here, and which is applied here.
So my understanding is that the connection to the unix socket (yes it's a unix socket on the host side) never succeeds for some reasons.

I want to try adding some sleep before we try to connect, as we might run into some weird behavior from the vsock implementation.
Another idea would be to add more logs to the agent client protocol as we need to identify exactly where this is failing.

@jcvenegas
Copy link
Member Author

Thanks for share your finding guys, I sync with @devimc and @chavafg and pointed to the same changes, and may be the same that is causing problems with kata-containers/runtime#2397

kata-containers/runtime#2402 vendors an old version of the client to see if works better.

@jcvenegas
Copy link
Member Author

Summary current state:

In the case of cloud hypervisor I have tested with the latest code in master, https://github.com/jcvenegas/runtime/tree/clh-master but still the kill issue was reproducible.

Another thing to try is to try to get agent logs in a different way

  • with vsocks via agent collect logs server (but if the bug is at vsocks may not work)
  • try to add an ssh server in the guest VM and connect to it (as we have seen add more or less overhead to the container may affect if can be reproduced the bug so could happed that is not reproducible after implement it.)

@sboeuf
Copy link

sboeuf commented Jan 25, 2020

Thanks for summarizing the status @jcvenegas
I would add that it's mandatory for us to find a way to increase the reproducibility as this will greatly facilitate further debugging.

@egernst
Copy link
Member

egernst commented Jan 28, 2020

could be fud. In the failure I see:

Failing case in logs, we end up starting doing stop-sandbox:

612 Jan 28 22:11:56 chenbdebug kata-runtime[110721]: time="2020-01-28T22:11:56.911057183Z" level=info msg="New client" arch=amd64 command=create container=d2200b9781bc640e8e9eb21c9e8bf973937f1347e5efb954cb1fa572ee9ca3db name=kata-runtime pid=110721 proxy=110767 source=virtcontainers subsystem=kata_agent url="hvsock:///run/vc/vm/d2200b9781bc640e8e9eb21c9e8bf973937f1347e5efb954cb1    fa572ee9ca3db/clh.sock:1024"
613 Jan 28 22:11:57 chenbdebug kata-runtime[110721]: time="2020-01-28T22:11:57.520355179Z" level=debug msg="fv_queue_set_started: qidx=0 started=1" arch=amd64 command=create container=d2200b9781bc640e8e9eb21c9e8bf973937f1347e5efb954cb1fa572ee9ca3db name=kata-runtime pid=110721 source=virtiofsd
614 Jan 28 22:11:57 chenbdebug kata-runtime[110721]: time="2020-01-28T22:11:57.520491179Z" level=debug msg="fv_queue_set_started: qidx=1 started=1" arch=amd64 command=create container=d2200b9781bc640e8e9eb21c9e8bf973937f1347e5efb954cb1fa572ee9ca3db name=kata-runtime pid=110721 source=virtiofsd
615 Jan 28 22:11:57 chenbdebug kata-runtime[110721]: time="2020-01-28T22:11:57.525592686Z" level=debug msg="fv_queue_thread: Start for queue 1 kick_fd 11" arch=amd64 command=create container=d2200b9781bc640e8e9eb21c9e8bf973937f1347e5efb954cb1fa572ee9ca3db name=kata-runtime pid=110721 source=virtiofsd
616 Jan 28 22:11:57 chenbdebug kata-runtime[110721]: time="2020-01-28T22:11:57.530156292Z" level=debug msg="fv_queue_thread: Start for queue 0 kick_fd 8" arch=amd64 command=create container=d2200b9781bc640e8e9eb21c9e8bf973937f1347e5efb954cb1fa572ee9ca3db name=kata-runtime pid=110721 source=virtiofsd
617 Jan 28 22:12:11 chenbdebug kata-runtime[110721]: time="2020-01-28T22:12:11.911715062Z" level=info msg="Stop Sandbox" arch=amd64 command=create container=d2200b9781bc640e8e9eb21c9e8bf973937f1347e5efb954cb1fa572ee9ca3db function=stopSandbox name=kata-runtime pid=110721 source=virtcontainers subsystem=cloudHypervisor
618 Jan 28 22:12:11 chenbdebug kata-runtime[110721]: time="2020-01-28T22:12:11.911828063Z" level=info msg="Stopping Cloud Hypervisor" PID=110767 arch=amd64 command=create container=d2200b9781bc640e8e9eb21c9e8bf973937f1347e5efb954cb1fa572ee9ca3db name=kata-runtime pid=110721 source=virtcontainers subsystem=cloudHypervisor

In the passing case, we send several requests (successfully):
Passing

355 Jan 28 22:11:43 chenbdebug kata-runtime[110264]: time="2020-01-28T22:11:43.254266444Z" level=info msg="proxy started" arch=amd64 command=create container=309ec410b1f9a41e1d30b804b2f5f4f1cc2c00aab34c95361945cec78057259b name=kata-runtime pid=110264 proxy-pid=110315 proxy-url="hvsock:///run/vc/vm/309ec410b1f9a41e1d30b804b2f5f4f1cc2c00aab34c95361945cec78057259b/clh.sock:1024" s    andbox=309ec410b1f9a41e1d30b804b2f5f4f1cc2c00aab34c95361945cec78057259b source=virtcontainers subsystem=kata_agent
356 Jan 28 22:11:43 chenbdebug kata-runtime[110264]: time="2020-01-28T22:11:43.254467344Z" level=info msg="New client" arch=amd64 command=create container=309ec410b1f9a41e1d30b804b2f5f4f1cc2c00aab34c95361945cec78057259b name=kata-runtime pid=110264 proxy=110315 source=virtcontainers subsystem=kata_agent url="hvsock:///run/vc/vm/309ec410b1f9a41e1d30b804b2f5f4f1cc2c00aab34c9536194    5cec78057259b/clh.sock:1024"
357 Jan 28 22:11:43 chenbdebug kata-runtime[110264]: time="2020-01-28T22:11:43.837983106Z" level=debug msg="fv_queue_set_started: qidx=0 started=1" arch=amd64 command=create container=309ec410b1f9a41e1d30b804b2f5f4f1cc2c00aab34c95361945cec78057259b name=kata-runtime pid=110264 source=virtiofsd
358 Jan 28 22:11:43 chenbdebug kata-runtime[110264]: time="2020-01-28T22:11:43.838132407Z" level=debug msg="fv_queue_set_started: qidx=1 started=1" arch=amd64 command=create container=309ec410b1f9a41e1d30b804b2f5f4f1cc2c00aab34c95361945cec78057259b name=kata-runtime pid=110264 source=virtiofsd
359 Jan 28 22:11:43 chenbdebug kata-runtime[110264]: time="2020-01-28T22:11:43.844744915Z" level=debug msg="fv_queue_thread: Start for queue 0 kick_fd 8" arch=amd64 command=create container=309ec410b1f9a41e1d30b804b2f5f4f1cc2c00aab34c95361945cec78057259b name=kata-runtime pid=110264 source=virtiofsd
360 Jan 28 22:11:43 chenbdebug kata-runtime[110264]: time="2020-01-28T22:11:43.849393621Z" level=debug msg="fv_queue_thread: Start for queue 1 kick_fd 11" arch=amd64 command=create container=309ec410b1f9a41e1d30b804b2f5f4f1cc2c00aab34c95361945cec78057259b name=kata-runtime pid=110264 source=virtiofsd
361 Jan 28 22:11:45 chenbdebug kata-runtime[110264]: time="2020-01-28T22:11:45.696898135Z" level=debug msg="sending request" arch=amd64 command=create container=309ec410b1f9a41e1d30b804b2f5f4f1cc2c00aab34c95361945cec78057259b name=grpc.CheckRequest pid=110264 req= source=virtcontainers subsystem=kata_agent
362 Jan 28 22:11:45 chenbdebug kata-runtime[110264]: time="2020-01-28T22:11:45.705715747Z" level=info msg="New client" arch=amd64 command=create container=309ec410b1f9a41e1d30b804b2f5f4f1cc2c00aab34c95361945cec78057259b name=kata-runtime pid=110264 proxy=110315 source=virtcontainers subsystem=kata_agent url="hvsock:///run/vc/vm/309ec410b1f9a41e1d30b804b2f5f4f1cc2c00aab34c9536194    5cec78057259b/clh.sock:1024"
363 Jan 28 22:11:45 chenbdebug kata-runtime[110264]: time="2020-01-28T22:11:45.717443762Z" level=debug msg="sending request" arch=amd64 command=create container=309ec410b1f9a41e1d30b804b2f5f4f1cc2c00aab34c95361945cec78057259b name=grpc.UpdateInterfaceRequest pid=110264 req="interface:<device:\"eth0\" name:\"eth0\" IPAddresses:<address:\"172.17.0.2\" mask:\"16\" > mtu:1500 hwAddr    :\"02:42:ac:11:00:02\" > " source=virtcontainers subsystem=kata_agent
364 Jan 28 22:11:45 chenbdebug kata-runtime[110264]: time="2020-01-28T22:11:45.742323495Z" level=info msg="New client" arch=amd64 command=create container=309ec410b1f9a41e1d30b804b2f5f4f1cc2c00aab34c95361945cec78057259b name=kata-runtime pid=110264 proxy=110315 source=virtcontainers subsystem=kata_agent url="hvsock:///run/vc/vm/309ec410b1f9a41e1d30b804b2f5f4f1cc2c00aab34c9536194    5cec78057259b/clh.sock:1024"

@sboeuf
Copy link

sboeuf commented Jan 29, 2020

@egernst
Yes that's exactly what we've been observing and that's the reason why we think it is related to a problem with hybrid vsock.
In the failing case, grpc.CheckRequest is never printed on the logs because it never got acknowledged by the agent. And because it failed, this led the runtime to trigger the StopSandbox that you can see on the next line.

@jcvenegas
Copy link
Member Author

jcvenegas commented Jan 29, 2020

versions I use to keep debugging, master seems that works well
kata-runtime: kata-containers/runtime@01beb2f + reverted c26ce1867253a17a708ae461f2571765d13068d6
test
tests: 1346395 (master)
agent: f8f6b6754809fa2af926d3bbfaaccd6c4f9981bd (master)

@sboeuf
Copy link

sboeuf commented Feb 4, 2020

1st issue (docker kill)

Everytime the connection hangs up, the read() returns an error, which gets printed by the error!() macro on the stderr. Problem is, if there's no stderr, the print will fail with a panic, ending up killing the virtio_vsock thread.

==> Fixed by cloud-hypervisor/cloud-hypervisor#703. This simply doesn't do anything if there's an error from the print.

==> Kata should be fixed to make sure the process has stderr open until the end, which means it needs to properly wait() for the child (the VMM process).

2nd issue (docker run)

Very sporadically, launching the VM and connecting to it fails. The VM runs nested, which means it might take quite some time for the VM to boot and the agent to be fully ready and start listening onto vsock. Also, looking at straces, it appears clearly that even when the boot succeeds, it takes a lot of tries for the runtime before it gets connected to the agent. That's because hybrid vsock manages the connection between the UNIX and VSOCK sockets. That means when the runtime connects to the UNIX socket and sends CONNECT 1024, the connection will be established by the hybrid vsock device only if the agent already registered himself as listening.

==> One first thing is to limit potential boot slowness that might be caused by the Linux kernel writing to the serial port when it's not there. For this reason, it's better if Kata uses --serial null by default instead of --serial off. Fixed by kata-containers/runtime#2436

==> The global fix is simple, we need to let more time to the connection to retry since we know the agent might take some time to be ready and listen onto vsock. Maybe 60s instead of 15s is acceptable since it will still make the container fails after 1 minute if things are really broken for some other reasons. Fixed by kata-containers/agent#733

==> One additional fix, not directly related to this problem but helpful for global stability, is to handle EINTR error from epoll_wait() syscall. By making sure we don't generate an error from this, but instead retry, we don't miss some data that might have been received on a file descriptor. Fixed by cloud-hypervisor/cloud-hypervisor#707

@likebreath
Copy link
Contributor

likebreath commented Feb 4, 2020

@sboeuf @jcvenegas

Folks, please find below the list of TODOs we discussed over the phone just now. Please ask, if I missed anything. Thanks a lot.

  1. Test and Merge the two PRs (protocols: Fix Cloud-Hypervisor CI regarding hvsock agent#733, virtcontainers: clh: Set the serial to NULL instead of OFF runtime#2436)
    • Bump to latest kata-agent and related agent protocol code in kata-runtime
    • Bump to latest CLH in kata-runtime
  2. Next Steps:
    5.1 Performance analysis of CLH in kata (with comparison to QEMU)

@amshinde
Copy link
Member

amshinde commented Feb 6, 2020

@sboeuf @jcvenegas I havent followed too closely, but is this issue only seen in a nested environment.
Have we run tests for this on bare metal as well, just asking so that we know for sure that this is more of a timing issue. Lets discuss this in the sxt tomorrow.

@jcvenegas
Copy link
Member Author

jcvenegas commented Feb 6, 2020

TODO update:

  1. Test and Merge the two PRs

    • Bump to latest kata-agent and related agent protocol code in kata-runtime
  2. Next Steps:
    5.1 Performance analysis of CLH in kata (with comparison to QEMU)
    Based in nested (azure) enviroment the boot time of a busybox workload is
    3 seconds for qemu
    10 for clh

@likebreath
Copy link
Contributor

This issue is being fixed with a patch to the virtio_vscok driver.
PR is here: kata-containers/packaging#933

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
bug Incorrect behaviour needs-review Needs to be assessed by the team.
Projects
None yet
Development

No branches or pull requests

5 participants