Kata-Containers: Fix kata-containers runtime #8068
Conversation
Hi @cristicalin. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Force-pushed from 7271a20 to 7de9214 (compare)
For some strange reason our CI just times out on CentOS 8 jobs, even though the same molecule runs complete correctly in my own development environment. I'm currently trying with CentOS 7 instead and will fall back to Ubuntu 18 or Debian 11 if CentOS is a no-go.
Force-pushed from 7de9214 to 6f435a0 (compare)
@@ -5,5 +5,5 @@ kata_containers_containerd_bin_dir: /usr/local/bin

 kata_containers_qemu_default_memory: "{{ ansible_memtotal_mb }}"
 kata_containers_qemu_debug: 'false'
-kata_containers_qemu_sandbox_cgroup_only: 'true'
+kata_containers_qemu_sandbox_cgroup_only: 'false'
Why are we changing this default to false?
As far as I can tell, this breaks kata 2.2.0: the pods remain stuck in the ContainerCreating state unless I change this to false. The default in the upstream recommended configuration is also false, so I thought to revert to the upstream default, which seems to work.
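For anyone who still wants the previous behavior, a minimal sketch of an inventory override, assuming the standard Kubespray group_vars layout (the file path is an assumption; the variable name comes from the role defaults shown in the diff above):

```yaml
# group_vars/k8s_cluster/k8s-cluster.yml (path is an assumption)
# Re-enable sandbox-only cgroup handling if your containerd/cgroup setup supports it.
kata_containers_qemu_sandbox_cgroup_only: 'true'
```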
I tried upgrading to 2.2.2 in the hope that we could keep the old value for this setting, but I got the same error: the pods go into the ContainerCreating state and there is no extra logging from the kata runtime itself to explain the failure. I see the /opt/kata/bin/containerd-shim-kata-v2 process but not the auxiliary processes /opt/kata/libexec/kata-qemu/virtiofsd and /opt/kata/bin/qemu-system-x86_64, which to me means that the shim itself fails to start the sandbox even though the unit tests seem to pass.
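For reference, a rough Ansible sketch of the check described above (not part of this PR; the task names and structure are illustrative):

```yaml
# Illustrative only: list kata-related processes on a node where a pod is stuck,
# to see whether the shim spawned virtiofsd and the QEMU process.
- name: Collect kata runtime processes
  ansible.builtin.shell: |
    pgrep -fa containerd-shim-kata-v2 || true
    pgrep -fa virtiofsd || true
    pgrep -fa qemu-system-x86_64 || true
  changed_when: false
  register: kata_procs

- name: Show kata runtime processes
  ansible.builtin.debug:
    var: kata_procs.stdout_lines
```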
I can't tell for sure, since the runtime does not log any useful information, but I think this is caused by an issue currently being investigated upstream in kata-containers/kata-containers#2868 for kata 2.2.x.
I'm going to revert our default to 2.1.1, which appears to work fine (from my own testing at least).
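A hedged sketch of what pinning the runtime back would look like in an inventory (the variable name follows the role's kata_containers_* convention and is an assumption, not taken from this diff):

```yaml
# Assumed variable name; pin the runtime to the version that tested fine.
kata_containers_version: 2.1.1
```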
I've been working on fixing cgroupfs with containerd: #8123
It should work fine with kata_containers_qemu_sandbox_cgroup_only='true' and kata 2.2.x.
Is this a new requirement introduced by kata 2.2.x? I had kata 2.1.x and earlier working just fine with the systemd cgroup driver, and I did not see anything specific about this in the release notes.
Force-pushed from f489898 to 7946c6d (compare)
Sadly, running the molecule tests on Debian revealed a separate issue with the containerd role, for which I don't think a fix is worth pursuing since we are changing the way containerd is deployed in #7970.
Force-pushed from 180db26 to 3ae7b10 (compare)
/ok-to-test
Force-pushed from 3ae7b10 to 36968b4 (compare)
Kata-containes: Fix for ubuntu and centos sometimes kata containers fail to start because of access errors to /dev/vhost-vsock and /dev/vhost-net
Kata-Containers: adjust values for 2.2.0 defaults (Make CI tests actually pass)
Force-pushed from 6f74a90 to 62f28c7 (compare)
/cc @pasqualet @oomichi Can we try to merge this now?
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: cristicalin, floryut, oomichi, pasqualet. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
/lgtm
* Kata-containes: Fix for ubuntu and centos sometimes kata containers fail to start because of access errors to /dev/vhost-vsock and /dev/vhost-net
* Kata-containers: use similar testing strategy as gvisor
* Kata-Containers: adjust values for 2.2.0 defaults (Make CI tests actually pass)
* Kata-Containers: bump to 2.2.2 to fix sandbox_cgroup_only issue
What type of PR is this?
/kind bug
What this PR does / why we need it:
Due to changes in Kata-containers versions 2.1.1 and 2.2.0, we observed issues with starting kata-containers workloads on Ubuntu 20.04 and CentOS 8. This PR fixes missing host kernel modules that are not always loaded automatically and adjusts the kata-containers runtime configuration to bring it in line with recent versions; in particular, changes around cgroups were preventing pods from being created successfully.
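To illustrate the kernel-module part of the fix, a minimal Ansible sketch (the module names are inferred from the /dev/vhost-vsock and /dev/vhost-net device paths mentioned in the commits; the actual task layout in the role may differ):

```yaml
# Load the vhost modules that back /dev/vhost-vsock and /dev/vhost-net,
# which are not always auto-loaded on Ubuntu 20.04 / CentOS 8 hosts.
- name: Ensure vhost kernel modules are loaded
  community.general.modprobe:
    name: "{{ item }}"
    state: present
  loop:
    - vhost_vsock
    - vhost_net
```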
Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer:
Additionally to the bugfix, I updated the molecule tests to bring them in line with the ones performed for the gVisor runtime, which better simulate the way kubernetes uses the runtime; see the sketch below. This uncovered the configuration changes that were needed in this PR.
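As a rough illustration of that gVisor-style test strategy, the check boils down to scheduling a pod through the kata RuntimeClass and waiting for it to reach Running. A hedged sketch (the RuntimeClass name kata-qemu and the image are assumptions, not taken from this PR):

```yaml
# Smoke-test pod: if the runtime is healthy, this pod leaves ContainerCreating
# and reaches Running inside the kata QEMU sandbox.
apiVersion: v1
kind: Pod
metadata:
  name: kata-smoke-test
spec:
  runtimeClassName: kata-qemu   # assumed handler name
  containers:
    - name: nginx
      image: nginx:alpine
      ports:
        - containerPort: 80
```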
Does this PR introduce a user-facing change?: