Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Kata-Containers: Fix kata-containers runtime #8068

Merged
merged 4 commits into from
Nov 9, 2021

Conversation

cristicalin
Copy link
Contributor

@cristicalin cristicalin commented Oct 9, 2021

What type of PR is this?

/kind bug

What this PR does / why we need it:
Due to changes in the Kata-containers version 2.1.1 and 2.2.0 we observed issues with starting kata-containers workloads on Ubuntu 20.04 and CentOS 8. This PR fixes missing host kernel modules that are not always automatically loaded and adjusts the kata-containers runtime configuration to bring it in-line with recent versions, in particular changes around cgroups were preventing successful pod creations.

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

Aditonally to the bugfix, I updated the molecule tests to bring them in-line with the ones performed for the gVisor runtime which better simulate the way kubernetes uses the runtime. This uncovered the configuration changes that were needed in this PR.

Does this PR introduce a user-facing change?:

Fix kata-containers runtime with version 2.x

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Oct 9, 2021
@k8s-ci-robot
Copy link
Contributor

Hi @cristicalin. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Oct 9, 2021
@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Oct 9, 2021
@cristicalin
Copy link
Contributor Author

For some strange reason our CI just times out on Centos8 jobs even though the same molecule runs run correctly on my own development environment. I'm currently trying with centos7 instead and will fall back to ubuntu18 or debian11 if centos is a no-go.

@@ -5,5 +5,5 @@ kata_containers_containerd_bin_dir: /usr/local/bin

kata_containers_qemu_default_memory: "{{ ansible_memtotal_mb }}"
kata_containers_qemu_debug: 'false'
kata_containers_qemu_sandbox_cgroup_only: 'true'
kata_containers_qemu_sandbox_cgroup_only: 'false'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we changing this default to false?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as I can tell, this breaks kata 2.2.0, the pods remain stuck in Creating state unless I change this to false, also the default in the upstream recommended configuration is false so I thought to revert to the upstream default which seems to work.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried upgrading to 2.2.2 in the hope that we can keep the old value for this setting but I got the same error; the pods go into ContainerCreating state and there is no extra logging from the kata runtime itself to explain the failure.

I see the /opt/kata/bin/containerd-shim-kata-v2 but not the auxiliary processes: /opt/kata/libexec/kata-qemu/virtiofsd and /opt/kata/bin/qemu-system-x86_64, this to me means that the shim itself fails to start env though the unit tests seem to pass.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't tell for sure due to the behavior of not logging any useful information on behalf of the runtime but I think this is driven by an issue currently investigated upstream in kata-containers/kata-containers#2868 for kata 2.2.x.

I'm going to revert our default to 2.1.1 which appears to work fine (from my own testing at least).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've been working on fixing cgroupfs with containerd: #8123
It should works fine with kata_containers_qemu_sandbox_cgroup_only='true' and kata 2.2.x

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a new requirement introduced by kata 2.2.x? I managed to have working kata 2.1.x and prior with systemd cgroup driver just fine and I did not see any specific point in the release notes about it.

@cristicalin cristicalin force-pushed the fix_kata_ubuntu branch 2 times, most recently from f489898 to 7946c6d Compare October 10, 2021 13:29
@cristicalin
Copy link
Contributor Author

Sadly, running molecule testing on debian revealed a separate issue with the containerd role for which I don't think a fix is worth pursuing since we are modifying the way containerd is deployed in #7970

@cristicalin cristicalin force-pushed the fix_kata_ubuntu branch 2 times, most recently from 180db26 to 3ae7b10 Compare October 19, 2021 17:17
@cristicalin
Copy link
Contributor Author

The CI is now fixed.

/cc @floryut @champtar @oomichi

Copy link
Member

@floryut floryut left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. approved Indicates a PR has been approved by an approver from all required OWNERS files. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Oct 20, 2021
@cristicalin cristicalin marked this pull request as draft October 21, 2021 17:04
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 21, 2021
@cristicalin cristicalin marked this pull request as ready for review October 22, 2021 17:21
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 22, 2021
@cristicalin
Copy link
Contributor Author

/cc @pasqualet @oomichi

Can we try to merge this now?

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cristicalin, floryut, oomichi, pasqualet

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@oomichi
Copy link
Contributor

oomichi commented Nov 9, 2021

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 9, 2021
@k8s-ci-robot k8s-ci-robot merged commit b7ae4a2 into kubernetes-sigs:master Nov 9, 2021
@floryut floryut mentioned this pull request Dec 21, 2021
sakuraiyuta pushed a commit to sakuraiyuta/kubespray that referenced this pull request Apr 16, 2022
* Kata-containes: Fix for ubuntu and centos sometimes kata containers fail to start because of access errors to /dev/vhost-vsock and /dev/vhost-net

* Kata-containers: use similar testing strategy as gvisor

* Kata-Containers: adjust values for 2.2.0 defaults

Make CI tests actually pass

* Kata-Containers: bump to 2.2.2 to fix sandbox_cgroup_only issue
LuckySB pushed a commit to southbridgeio/kubespray that referenced this pull request Jun 28, 2023
* Kata-containes: Fix for ubuntu and centos sometimes kata containers fail to start because of access errors to /dev/vhost-vsock and /dev/vhost-net

* Kata-containers: use similar testing strategy as gvisor

* Kata-Containers: adjust values for 2.2.0 defaults

Make CI tests actually pass

* Kata-Containers: bump to 2.2.2 to fix sandbox_cgroup_only issue
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants