This repository has been archived by the owner on May 12, 2021. It is now read-only.

use_vsock=true gives context deadline exceeded error with ubuntu bionic kernel #1203

Closed
jodh-intel opened this issue Jan 31, 2019 · 34 comments
Labels
security Potential or actual security issue

Comments

@jodh-intel
Contributor

Setting use_vsock=true gives the following error with 1.5.0 (packaged or static):

$ sudo docker run -ti --runtime kata-runtime busybox sh
docker: Error response from daemon: OCI runtime create failed: Failed to check if grpc server is working: context deadline exceeded: unknown.
@jodh-intel jodh-intel added the high-priority Very urgent issue (resolve quickly) label Jan 31, 2019
@jodh-intel
Contributor Author

I'm struggling to see how this happened, given that we have dedicated vsock Jenkins slaves in our CI.

It's possible, though, that something broke at the packaging stage. We should get the following set up ASAP to avoid any chance of that:

@grahamwhaley
Contributor

But then it seems to work for me. I did a fresh F28 server install today with devicemapper and installed only the Kata 1.5 static tarball. I have just run a bunch of kata-fc tests and tried kata-qemu as well. Here is my kata-env from the qemu run:

$ /opt/kata/bin/kata-qemu kata-env

[Meta]
  Version = "1.0.20"

[Runtime]
  Debug = false
  Trace = false
  DisableGuestSeccomp = true
  DisableNewNetNs = false
  Path = "/opt/kata/bin/kata-runtime"
  [Runtime.Version]
    Semver = "1.5.0"
    Commit = "5f7fcd773089a615b776862f92217e987f06df0a"
    OCI = "1.0.1-dev"
  [Runtime.Config]
    Path = "/opt/kata/share/defaults/kata-containers/configuration-qemu.toml"

[Hypervisor]
  MachineType = "pc"
  Version = "QEMU emulator version 2.11.2(kata-static)\nCopyright (c) 2003-2017 Fabrice Bellard and the QEMU Project developers"
  Path = "/opt/kata/bin/qemu-system-x86_64"
  BlockDeviceDriver = "virtio-scsi"
  EntropySource = "/dev/urandom"
  Msize9p = 8192
  MemorySlots = 10
  Debug = false
  UseVSock = true

[Image]
  Path = "/opt/kata/share/kata-containers/kata-containers-image_clearlinux_1.5.0_agent_a581aebf473.img"

[Kernel]
  Path = "/opt/kata/share/kata-containers/vmlinuz-4.14.67-22"
  Parameters = "init=/usr/lib/systemd/systemd systemd.unit=kata-containers.target systemd.mask=systemd-networkd.service systemd.mask=systemd-networkd.socket"

[Initrd]
  Path = ""

[Proxy]
  Type = "noProxy"
  Version = ""
  Path = ""
  Debug = false

[Shim]
  Type = "kataShim"
  Version = "kata-shim version 1.5.0-efbf3bb25065ce89099630b753d218cdc678e758"
  Path = "/opt/kata/libexec/kata-containers/kata-shim"
  Debug = false

[Agent]
  Type = "kata"

[Host]
  Kernel = "4.16.3-301.fc28.x86_64"
  Architecture = "amd64"
  VMContainerCapable = true
  SupportVSocks = true
  [Host.Distro]
    Name = "Fedora"
    Version = "28"
  [Host.CPU]
    Vendor = "GenuineIntel"
    Model = "Intel(R) Core(TM) i7-6770HQ CPU @ 2.60GHz"

[Netmon]
  Version = "kata-netmon version 1.5.0"
  Path = "/opt/kata/libexec/kata-containers/kata-netmon"
  Debug = false
  Enable = false

and

$ docker info --format "{{json .Runtimes}}"
{"kata-fc":{"path":"/opt/kata/bin/kata-fc"},"kata-qemu":{"path":"/opt/kata/bin/kata-qemu"},"runc":{"path":"docker-runc"}}
$ docker run --rm -ti --runtime=kata-qemu busybox sh
/ # uname -a
Linux 6f41cd0e986a 4.14.67 #1 SMP Tue Jan 22 23:49:04 CST 2019 x86_64 GNU/Linux
/ # exit
$ uname -a
Linux skull 4.16.3-301.fc28.x86_64 #1 SMP Mon Apr 23 21:59:58 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

Let me know if you want more info, or if you'd like me to run something else.

@egernst
Member

egernst commented Jan 31, 2019

What distro?

Baremetal, or?

@jodh-intel
Contributor Author

I'm on Ubuntu 18.04 baremetal.

@jodh-intel
Contributor Author

Host kernel 4.15.0-45-generic.

@sboeuf

sboeuf commented Jan 31, 2019

Same here: 4.15.0-44-generic with Ubuntu 18.04, bare metal.

@grahamwhaley
Contributor

For further reference, then: it looks like the Jenkins CI vsock jobs run on Fedora.

@jodh-intel
Contributor Author

@devimc is also on 18.04.

@jodh-intel
Contributor Author

jodh-intel commented Jan 31, 2019

F29 works. Ubuntu 18.04 doesn't:

| distro | host kernel version | vsock works? |
| --- | --- | --- |
| Ubuntu 18.04 | 4.15.0-45-generic | no |
| Ubuntu 18.04 | 4.15.0-44-generic | no |

@devimc

devimc commented Jan 31, 2019

@jodh-intel

| distro | host kernel version | vsock works? |
| --- | --- | --- |
| Ubuntu 18.04 | 4.18.0-1007-azure | yes |
| Ubuntu 18.04 | 4.18.0-14-generic | no |

@jodh-intel
Contributor Author

F29:

| distro | host kernel version | vsock works? |
| --- | --- | --- |
| Fedora 29 | 4.18.16-300 | yes |

@jodh-intel
Contributor Author

The problem first appeared in 4.15.0-44-generic for bionic:

| distro | host kernel version | vsock works? |
| --- | --- | --- |
| Ubuntu 18.04 | 4.15.0-45-generic | no |
| Ubuntu 18.04 | 4.15.0-44-generic | no |
| Ubuntu 18.04 | 4.15.0-43-generic | yes |
| Ubuntu 18.04 | 4.15.0-39-generic | yes |

@jodh-intel jodh-intel changed the title use_vsock=true gives context deadline exceeded error use_vsock=true gives context deadline exceeded error with ubuntu bionic kernel Feb 1, 2019
@jodh-intel jodh-intel removed the high-priority Very urgent issue (resolve quickly) label Feb 1, 2019
@jodh-intel
Contributor Author

This appears to be an Ubuntu-specific kernel bug. Taking it further will require raising the issue with the Ubuntu kernel team on launchpad.net, installing an upstream kernel and trying to recreate the problem with that, plus a few other bits, I think. I'm not in a position to do that currently, so I'm throwing it out to the team to help with this one...

I've removed the P2 label as this is not a Kata issue and only affects a single distro as far as we know at this point.

@jodh-intel
Contributor Author

Looks like the issue may relate to a fix for CVE-2018-14625:

  • CVE-2018-14625
    - vhost/vsock: fix use-after-free in network stack callers

@devimc

devimc commented Feb 1, 2019

nice hunting @jodh-intel

@sboeuf

sboeuf commented Feb 1, 2019

Thanks for doing this, @jodh-intel.
@egernst, is any further action needed? Maybe @devimc could get in touch with the Canonical folks about this.
We need to make sure this will be fixed in the next generic kernels.

@devimc

devimc commented Feb 1, 2019

Maybe Kamal from Canonical can give us some clues about this issue.

cc @kamalmostafa

@jodh-intel
Contributor Author

Adding the security label as it looks like it might apply.

@jodh-intel jodh-intel added the security Potential or actual security issue label Feb 4, 2019
@pmorjan

pmorjan commented Feb 4, 2019

@sboeuf

sboeuf commented Feb 4, 2019

Good! Thanks for the heads-up, @pmorjan.

@pmorjan, btw, do you have a list of all the Ubuntu kernels affected by this issue? It would be nice if we could list those for our users as not supporting Kata + vsock.

@pmorjan

pmorjan commented Feb 4, 2019

@sboeuf
I don't have that list. From my observations, 4.15.0-44-generic #47-Ubuntu is the first kernel that fails. But it's not limited to Ubuntu: I've seen this on other distros running 4.19 and 4.20 kernels, e.g. current Fedora 28/29 and Arch Linux are also affected.
The fix is in 5.0-rc3: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=7fbe078c37aba3088359c9256c1a1d0c3e39ee81

I created a simple test application to run on the host that can trigger the bug:
https://gist.github.com/pmorjan/9f49c667045f8c0abaf9be3a35f888ef
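
For context, host-to-guest communication with use_vsock=true goes over AF_VSOCK sockets. The snippet below is not the gist above, just a minimal Go illustration of a host process opening and listening on such a socket; the port number (1024) and the use of golang.org/x/sys/unix are my own assumptions:

package main

import (
	"fmt"
	"log"

	"golang.org/x/sys/unix"
)

func main() {
	// Create a host-side AF_VSOCK stream socket.
	fd, err := unix.Socket(unix.AF_VSOCK, unix.SOCK_STREAM, 0)
	if err != nil {
		log.Fatalf("socket: %v", err)
	}
	defer unix.Close(fd)

	// Bind to any local CID on port 1024 and start listening.
	sa := &unix.SockaddrVM{CID: unix.VMADDR_CID_ANY, Port: 1024}
	if err := unix.Bind(fd, sa); err != nil {
		log.Fatalf("bind: %v", err)
	}
	if err := unix.Listen(fd, 1); err != nil {
		log.Fatalf("listen: %v", err)
	}
	fmt.Println("listening on vsock port 1024")

	// Wait for a single incoming connection; this accept path is where
	// the failures reported in this thread show up.
	nfd, _, err := unix.Accept(fd)
	if err != nil {
		log.Fatalf("accept: %v", err)
	}
	unix.Close(nfd)
}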

@jodh-intel
Contributor Author

launchpad bug: https://bugs.launchpad.net/bugs/1813934.

@mcastelino
Contributor

Folks, can we flag this higher? I wasted a lot of time today before I realized this was the issue.

@chavafg
Contributor

chavafg commented Feb 7, 2019

We've actually started to face this issue on the vsocks job, e.g.:

19:35:57 Run kata-runtime: nginx:1.14: 
19:35:58 1dd544a859e7098ca5f4abf9c827492404fafe89673e4c83f883505315b0ac5a
19:36:14 docker: Error response from daemon: OCI runtime create failed: Failed to check if grpc server is working: context deadline exceeded: unknown.
19:36:14 Checking 18 containers have all relevant components
19:36:14 pgrep: no matching criteria specified
19:36:14 Try `pgrep --help' for more information.
19:36:14 Wrong number of shims running (18 != 17) - stopping
19:36:14 Wrong number of netmons running (18 != 17) - stopping
19:36:15 Wrong number of 'runtime list' containers running (18 != 17) - stopping
19:36:15 Wrong number of pods in /var/lib/vc/sbs (18 != 17) - stopping)

From http://jenkins.katacontainers.io/job/kata-containers-tests-fedora-vsocks-PR/821/console

@mcastelino
Contributor

I updated my system to the latest available mainline kernel on Ubuntu and I still see this:

$ uname -r
4.20.7-042007-generic

@jodh-intel
Contributor Author

The Launchpad bug shows that Ubuntu plans to update their kernel on February 25th. I haven't checked the other distros.

@grahamwhaley, @chavafg - could you maybe add something about the vsock tests failing to the dashboard (http://jenkins.katacontainers.io/view/CI%20Status/)? From what I can see, the only CI distro version affected is going to be Fedora, since Ubuntu 16.04 and CentOS 7 have much older kernels.

@grahamwhaley
Contributor

I would add it to the status page (anybody with Jenkins admin rights can do that, btw), but looking at the Jenkins Fedora vsock build-job trend, it seems we have started passing pretty steadily:
http://jenkins.katacontainers.io/job/kata-containers-tests-fedora-vsocks-PR/buildTimeTrend

So I'm going to hold off and wait for some input from @chavafg.

@chavafg
Contributor

chavafg commented Feb 7, 2019

@grahamwhaley you are right, I'll add it if it happens again.

@egernst
Member

egernst commented Feb 7, 2019

From looking at the actual patch upstream, I think we can mitigate by just masking the CID to the lower 32 bits (or using u32). I think this is a viable workaround, as we already have a collision check anyway. Can someone try this?

/cc @devimc
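
For illustration only, the masking being suggested would look roughly like this (hypothetical helper, not actual kata-runtime code; as @devimc reports below, it did not resolve the issue):

package main

import "fmt"

// maskCID keeps only the lower 32 bits of a randomly generated context ID,
// since vhost-vsock guest CIDs are 32-bit values; the existing collision
// check would then run on the masked value.
func maskCID(candidate uint64) uint32 {
	return uint32(candidate & 0xffffffff)
}

func main() {
	fmt.Println(maskCID(0x100000003)) // prints 3
}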

@devimc

devimc commented Feb 7, 2019

The problem is s390x; see #960 (review).

cc @alicefr

@jodh-intel
Contributor Author

One workaround is to use the last known working kernel. For example on Ubuntu 18.04:

$ sudo apt-get -y install linux-image-4.15.0-43-generic

Then reboot into that kernel (or tweak grub to select it by default until Canonical push out a kernel that fixes the issue).

@devimc

devimc commented Feb 7, 2019

I tried changing everything to uint32 and that didn't work. I also tried a simple client-server test [1] that uses uint32 for CIDs, and I got the same result:

2019/02/07 19:06:51 receive: creating file "cpuinfo.txt" for output
2019/02/07 19:06:51 receive: opening listener: 1024
2019/02/07 19:06:51 receive: listening: host(2):1024
2019/02/07 19:06:51 vscp: receive: failed to accept: resource temporarily unavailable

[1] https://github.com/mdlayher/vsock/tree/master/cmd/vscp

@egernst
Member

egernst commented Feb 7, 2019

Thanks, @devimc.

It sounds like "go to last known working kernel" or "wait for updated kernel" are our two options, then.

@caoruidong
Member

Updated to 4.15.0-55-generic and it works.
