Kata-Containers: Fix kata-containers runtime #8068
Conversation
Hi @cristicalin. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Force-pushed from 7271a20 to 7de9214 (compare)
For some strange reason our CI just times out on CentOS 8 jobs, even though the same molecule runs complete correctly in my own development environment. I'm currently trying with CentOS 7 instead and will fall back to Ubuntu 18 or Debian 11 if CentOS is a no-go.
Force-pushed from 7de9214 to 6f435a0 (compare)
@@ -5,5 +5,5 @@ kata_containers_containerd_bin_dir: /usr/local/bin

 kata_containers_qemu_default_memory: "{{ ansible_memtotal_mb }}"
 kata_containers_qemu_debug: 'false'
-kata_containers_qemu_sandbox_cgroup_only: 'true'
+kata_containers_qemu_sandbox_cgroup_only: 'false'
Why are we changing this default to false?
As far as I can tell, this breaks kata 2.2.0: the pods remain stuck in the ContainerCreating state unless I change this to false. The default in the upstream recommended configuration is also false, so I thought to revert to the upstream default, which seems to work.
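For anyone who still wants the previous behavior, a minimal sketch of an inventory override, assuming the standard Kubespray group_vars layout (the file path is an assumption; the variable name comes from the role defaults shown in the diff above):

```yaml
# group_vars/k8s_cluster/k8s-cluster.yml (path is an assumption)
# Re-enable sandbox-only cgroup handling if your containerd/cgroup setup supports it.
kata_containers_qemu_sandbox_cgroup_only: 'true'
```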
I tried upgrading to 2.2.2 in the hope that we could keep the old value for this setting, but I got the same error: the pods go into the ContainerCreating state and there is no extra logging from the kata runtime itself to explain the failure. I see the /opt/kata/bin/containerd-shim-kata-v2 process but not the auxiliary processes /opt/kata/libexec/kata-qemu/virtiofsd and /opt/kata/bin/qemu-system-x86_64, which to me means that the shim itself fails to start the sandbox even though the unit tests seem to pass.
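For reference, a rough Ansible sketch of the check described above (not part of this PR; the task names and structure are illustrative):

```yaml
# Illustrative only: list kata-related processes on a node where a pod is stuck,
# to see whether the shim spawned virtiofsd and the QEMU process.
- name: Collect kata runtime processes
  ansible.builtin.shell: |
    pgrep -fa containerd-shim-kata-v2 || true
    pgrep -fa virtiofsd || true
    pgrep -fa qemu-system-x86_64 || true
  changed_when: false
  register: kata_procs

- name: Show kata runtime processes
  ansible.builtin.debug:
    var: kata_procs.stdout_lines
```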
I can't tell for sure, since the runtime does not log any useful information, but I think this is caused by an issue currently being investigated upstream in kata-containers/kata-containers#2868 for kata 2.2.x.
I'm going to revert our default to 2.1.1, which appears to work fine (from my own testing at least).
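A hedged sketch of what pinning the runtime back would look like in an inventory (the variable name follows the role's kata_containers_* convention and is an assumption, not taken from this diff):

```yaml
# Assumed variable name; pin the runtime to the version that tested fine.
kata_containers_version: 2.1.1
```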
I've been working on fixing cgroupfs with containerd: #8123
It should work fine with kata_containers_qemu_sandbox_cgroup_only='true' and kata 2.2.x.
Is this a new requirement introduced by kata 2.2.x? I had kata 2.1.x and earlier working just fine with the systemd cgroup driver, and I did not see anything specific about this in the release notes.
Force-pushed from f489898 to 7946c6d (compare)
Sadly, running the molecule tests on Debian revealed a separate issue with the containerd role, for which I don't think a fix is worth pursuing since we are changing the way containerd is deployed in #7970.
Force-pushed from 180db26 to 3ae7b10 (compare)
/ok-to-test
Force-pushed from 3ae7b10 to 36968b4 (compare)
Kata-containes: Fix for ubuntu and centos sometimes kata containers fail to start because of access errors to /dev/vhost-vsock and /dev/vhost-net
Kata-Containers: adjust values for 2.2.0 defaults (Make CI tests actually pass)
Force-pushed from 6f74a90 to 62f28c7 (compare)
/cc @pasqualet @oomichi Can we try to merge this now?
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: cristicalin, floryut, oomichi, pasqualet. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
/lgtm
* Kata-containes: Fix for ubuntu and centos sometimes kata containers fail to start because of access errors to /dev/vhost-vsock and /dev/vhost-net
* Kata-containers: use similar testing strategy as gvisor
* Kata-Containers: adjust values for 2.2.0 defaults (Make CI tests actually pass)
* Kata-Containers: bump to 2.2.2 to fix sandbox_cgroup_only issue
What type of PR is this?
/kind bug
What this PR does / why we need it:
Due to changes in Kata-containers versions 2.1.1 and 2.2.0, we observed issues with starting kata-containers workloads on Ubuntu 20.04 and CentOS 8. This PR fixes missing host kernel modules that are not always loaded automatically and adjusts the kata-containers runtime configuration to bring it in line with recent versions; in particular, changes around cgroups were preventing pods from being created successfully.
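To illustrate the kernel-module part of the fix, a minimal Ansible sketch (the module names are inferred from the /dev/vhost-vsock and /dev/vhost-net device paths mentioned in the commits; the actual task layout in the role may differ):

```yaml
# Load the vhost modules that back /dev/vhost-vsock and /dev/vhost-net,
# which are not always auto-loaded on Ubuntu 20.04 / CentOS 8 hosts.
- name: Ensure vhost kernel modules are loaded
  community.general.modprobe:
    name: "{{ item }}"
    state: present
  loop:
    - vhost_vsock
    - vhost_net
```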
Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer:
Additionally to the bugfix, I updated the molecule tests to bring them in line with the ones performed for the gVisor runtime, which better simulate the way kubernetes uses the runtime; see the sketch below. This uncovered the configuration changes that were needed in this PR.
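As a rough illustration of that gVisor-style test strategy, the check boils down to scheduling a pod through the kata RuntimeClass and waiting for it to reach Running. A hedged sketch (the RuntimeClass name kata-qemu and the image are assumptions, not taken from this PR):

```yaml
# Smoke-test pod: if the runtime is healthy, this pod leaves ContainerCreating
# and reaches Running inside the kata QEMU sandbox.
apiVersion: v1
kind: Pod
metadata:
  name: kata-smoke-test
spec:
  runtimeClassName: kata-qemu   # assumed handler name
  containers:
    - name: nginx
      image: nginx:alpine
      ports:
        - containerPort: 80
```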
Does this PR introduce a user-facing change?: