use_vsock=true gives "context deadline exceeded" error with Ubuntu bionic kernel #1203
I'm struggling to see how this happened, given we have explicit vsock Jenkins slaves in our CI. It's possible, though, that something broke at the packaging stage? We must get the following setup ASAP to avoid any chance of that:
But then it seems to work for me. I did a fresh F28 server install today with devicemapper, installed only the Kata 1.5 static tarball, and have just run a bunch of kata-fc tests, and tried a kata-qemu as well... Here is my kata-env from the qemu run:
[Meta]
Version = "1.0.20"
[Runtime]
Debug = false
Trace = false
DisableGuestSeccomp = true
DisableNewNetNs = false
Path = "/opt/kata/bin/kata-runtime"
[Runtime.Version]
Semver = "1.5.0"
Commit = "5f7fcd773089a615b776862f92217e987f06df0a"
OCI = "1.0.1-dev"
[Runtime.Config]
Path = "/opt/kata/share/defaults/kata-containers/configuration-qemu.toml"
[Hypervisor]
MachineType = "pc"
Version = "QEMU emulator version 2.11.2(kata-static)\nCopyright (c) 2003-2017 Fabrice Bellard and the QEMU Project developers"
Path = "/opt/kata/bin/qemu-system-x86_64"
BlockDeviceDriver = "virtio-scsi"
EntropySource = "/dev/urandom"
Msize9p = 8192
MemorySlots = 10
Debug = false
UseVSock = true
[Image]
Path = "/opt/kata/share/kata-containers/kata-containers-image_clearlinux_1.5.0_agent_a581aebf473.img"
[Kernel]
Path = "/opt/kata/share/kata-containers/vmlinuz-4.14.67-22"
Parameters = "init=/usr/lib/systemd/systemd systemd.unit=kata-containers.target systemd.mask=systemd-networkd.service systemd.mask=systemd-networkd.socket"
[Initrd]
Path = ""
[Proxy]
Type = "noProxy"
Version = ""
Path = ""
Debug = false
[Shim]
Type = "kataShim"
Version = "kata-shim version 1.5.0-efbf3bb25065ce89099630b753d218cdc678e758"
Path = "/opt/kata/libexec/kata-containers/kata-shim"
Debug = false
[Agent]
Type = "kata"
[Host]
Kernel = "4.16.3-301.fc28.x86_64"
Architecture = "amd64"
VMContainerCapable = true
SupportVSocks = true
[Host.Distro]
Name = "Fedora"
Version = "28"
[Host.CPU]
Vendor = "GenuineIntel"
Model = "Intel(R) Core(TM) i7-6770HQ CPU @ 2.60GHz"
[Netmon]
Version = "kata-netmon version 1.5.0"
Path = "/opt/kata/libexec/kata-containers/kata-netmon"
Debug = false
Enable = false
and
$ docker info --format "{{json .Runtimes}}"
{"kata-fc":{"path":"/opt/kata/bin/kata-fc"},"kata-qemu":{"path":"/opt/kata/bin/kata-qemu"},"runc":{"path":"docker-runc"}}
$ docker run --rm -ti --runtime=kata-qemu busybox sh
/ # uname -a
Linux 6f41cd0e986a 4.14.67 #1 SMP Tue Jan 22 23:49:04 CST 2019 x86_64 GNU/Linux
/ # exit
$ uname -a
Linux skull 4.16.3-301.fc28.x86_64 #1 SMP Mon Apr 23 21:59:58 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
Let me know if you want more info, or for me to try running something else.
What distro? Bare metal, or?
I'm on Ubuntu 18.04, bare metal.
Host kernel 4.15.0-45-generic.
Same here: 4.15.0-44-generic, Ubuntu 18.04 bare metal.
For more reference then: it looks like the Jenkins CI vsock jobs run on Fedora.
@devimc is also on 18.04.
F29 works. Ubuntu 18.04 doesn't:
F29:
The problem first appeared in
This appears to be an Ubuntu-specific kernel bug. Taking this further will require raising the issue with the Ubuntu kernel team on launchpad.net, installing an upstream kernel and trying to recreate with that, plus a few other bits, I think. I'm not in a position to do that currently, so I'm throwing it out to the team to help with this one... I've removed the P2 label, as this is not a Kata issue and only affects a single distro as far as we know at this point.
Looks like the issue may relate to a fix for CVE-2018-14625:
Nice hunting, @jodh-intel.
Thanks for doing this, @jodh-intel.
Maybe Kamal from Canonical can give us some clues about this issue.
Adding the security label as it looks like it might apply.
A fix has been committed to the kernel tree:
@sboeuf I created a simple test application to run on the host that can trigger the bug:
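The linked test program isn't reproduced above; for context, a minimal sketch of a host-side AF_VSOCK client in the same spirit, using golang.org/x/sys/unix. The CID and port are command-line placeholders and should be the guest CID/port the runtime assigned:

package main

import (
	"fmt"
	"log"
	"os"
	"strconv"

	"golang.org/x/sys/unix"
)

func main() {
	if len(os.Args) != 3 {
		log.Fatalf("usage: %s <cid> <port>", os.Args[0])
	}
	cid, err := strconv.ParseUint(os.Args[1], 10, 32)
	if err != nil {
		log.Fatalf("bad cid: %v", err)
	}
	port, err := strconv.ParseUint(os.Args[2], 10, 32)
	if err != nil {
		log.Fatalf("bad port: %v", err)
	}

	// AF_VSOCK stream socket, as used for host-to-guest agent traffic.
	fd, err := unix.Socket(unix.AF_VSOCK, unix.SOCK_STREAM, 0)
	if err != nil {
		log.Fatalf("socket: %v", err)
	}
	defer unix.Close(fd)

	// On the affected Ubuntu kernels the connect reportedly never
	// completes, which the runtime surfaces as "context deadline exceeded".
	if err := unix.Connect(fd, &unix.SockaddrVM{CID: uint32(cid), Port: uint32(port)}); err != nil {
		log.Fatalf("connect: %v", err)
	}
	fmt.Println("connected")
}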
Launchpad bug: https://bugs.launchpad.net/bugs/1813934
Folks, can we flag this higher? I wasted a lot of time today until I realized this was the issue.
We actually started to face this issue on the vsocks job.
From http://jenkins.katacontainers.io/job/kata-containers-tests-fedora-vsocks-PR/821/console
I updated my system to the latest available mainline kernel on Ubuntu and I still see this:
The Launchpad bug shows Ubuntu plans to update its kernel on February 25th. I haven't checked the other distros. @grahamwhaley, @chavafg - could you add something about the vsock tests failing to the dashboard maybe (http://jenkins.katacontainers.io/view/CI%20Status/)? From what I can see, the only affected CI distro version is going to be Fedora, since Ubuntu 16.04 and CentOS 7 have much older kernels.
I would add it to the status page (anybody with Jenkins admin rights can do it, btw), but looking at the Jenkins Fedora vsock build job trend... it seems we have started passing pretty steadily? So I'm going to hold off and wait for some input from @chavafg.
@grahamwhaley you are right; I'll add it if it happens again.
From looking at the actual patch upstream, I think we can mitigate by just masking the CID to the lower 32 bits (or using u32). I think this is a viable workaround, as we already have a collision check anyway. Can someone try this (a sketch follows below)? /cc @devimc
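A sketch of that proposal, with hypothetical names (pickGuestCID is illustrative, not the runtime's actual helper); note that a later comment reports switching to uint32 did not fix the issue:

package main

import (
	"fmt"
	"math/rand"
)

// pickGuestCID illustrates the idea: generate a random context ID,
// keep only the lower 32 bits (here by using a 32-bit type), skip
// the reserved CIDs, and rely on the existing collision check.
func pickGuestCID(inUse map[uint32]bool) uint32 {
	for {
		cid := rand.Uint32() // value already masked to 32 bits by its type
		if cid <= 2 {
			// CIDs 0-2 are reserved (hypervisor, local, host).
			continue
		}
		if !inUse[cid] { // the runtime already checks for collisions
			return cid
		}
	}
}

func main() {
	fmt.Println(pickGuestCID(map[uint32]bool{3: true}))
}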
The problem is s390x; see #960 (review). cc @alicefr
One workaround is to use the last known working kernel. For example, on Ubuntu 18.04:
$ sudo apt-get -y install linux-image-4.15.0-43-generic
Then reboot into that kernel (or tweak grub to select it by default until Canonical pushes out a kernel that fixes the issue).
I tried changing everything to uint32 and that didn't work. I also tried with a simple client-server test [1] that uses uint32 for CIDs, and I got the same result.
[1] https://github.com/mdlayher/vsock/tree/master/cmd/vscp
Thanks, @devimc. It sounds like "go to the last known working kernel" or "wait for an updated kernel" are our two options, then.
Updating to 4.15.0-55-generic works.
Setting use_vsock=true gives the following error with 1.5.0 (packaged or static):
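For reference, the option is toggled in the runtime's configuration.toml (path as shown in the kata-env output above); a minimal sketch assuming the Kata 1.x layout, where the switch sits under the QEMU hypervisor section:

# /opt/kata/share/defaults/kata-containers/configuration-qemu.toml
[hypervisor.qemu]
# Use a vsock device for host-to-agent communication instead of a
# serial port plus kata-proxy (section/key assumed per Kata 1.x defaults).
use_vsock = true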