This repository has been archived by the owner on Oct 22, 2024. It is now read-only.

Kata Container support #500

Merged
merged 31 commits into from
May 14, 2020

Conversation

pohly
Contributor

@pohly pohly commented Dec 18, 2019

Contains support for creating image files, installing Kata Containers and running E2E tests with that. However, those tests need to be run manually on bare-metal hosts because triple-nested virtualization (like we would have to do in our Azure CI) is too slow.

pohly added 5 commits May 7, 2020 18:33
We cannot always vendor some upstream component. If we can't, then we
can fork it by copying the files into "third-party", either with plain
"cp" or "git subtree", and then add our own changes on top of it.
…d92e31ef976

Forked because upstream author seems inactive (last message is an
apology for being inactive:
frostschutz/go-fibmap#1 (comment)),
so we need to maintain this ourselves.

git-subtree-dir: third-party/go-fibmap
git-subtree-mainline: c451fa6
git-subtree-split: b32c231

The resulting file can be used as backing store for a QEMU nvdimm
device. This is based on the approach that is used for the Kata
Container rootfs
(https://github.com/kata-containers/osbuilder/blob/dbbf16082da3de37d89af0783e023269210b2c91/image-builder/image_builder.sh)
and reuses some of the same code, but also differs from that in some
regards:
- The start of the partition is aligned to a multiple of the 2MiB
  huge page size (kata-containers/runtime#2262 (comment)).
- The size of the QEMU object is the same as the nominal size of the
  file. In Kata Containers the size is a fixed 128MiB
  (kata-containers/osbuilder#391 (comment)).
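The alignment rule in the first bullet can be sketched as follows. This is only an illustration of rounding a partition start up to a 2MiB boundary, not the code from this PR; `align_up`, `partition_layout` and the `header_size` parameter are hypothetical names introduced here.

```python
from typing import Tuple

HUGE_PAGE_SIZE = 2 * 1024 * 1024  # 2MiB huge page size, as named in the commit message


def align_up(offset: int, alignment: int = HUGE_PAGE_SIZE) -> int:
    """Round offset up to the next multiple of alignment."""
    return (offset + alignment - 1) // alignment * alignment


def partition_layout(file_size: int, header_size: int) -> Tuple[int, int]:
    """Return (start, length) of one partition inside an image file.

    header_size is a hypothetical amount of space reserved at the front
    of the file (partition table and similar metadata). The partition
    starts at the first 2MiB boundary after it and, in this sketch,
    simply uses the rest of the file.
    """
    start = align_up(header_size)
    if start >= file_size:
        raise ValueError("image file too small for an aligned partition")
    return start, file_size - start
```

For example, `partition_layout(10 * 1024 * 1024, 4096)` places the partition at offset 2MiB and leaves 8MiB for it.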
The test needs files prepared as part of a cluster creation, without
running in that cluster itself.

Same change as inside the PMEM-CSI driver itself: we have to ensure
that reflink is off because it is incompatible with "-o dax".
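As a rough illustration of the reflink note above: `mkfs.xfs` accepts `-m reflink=0` to disable reflink when the filesystem is created. The helper below only builds the command line; the function name is hypothetical and the exact flags used by PMEM-CSI are not shown in this conversation, so treat it as a sketch.

```python
from typing import List


def mkfs_command(fs_type: str, device: str) -> List[str]:
    """Build a mkfs command line for a volume that will be mounted with -o dax.

    Assumption: for XFS, reflink is disabled explicitly because it is
    incompatible with DAX. ext4 needs no extra flag in this sketch.
    """
    if fs_type == "xfs":
        return ["mkfs.xfs", "-m", "reflink=0", device]
    if fs_type == "ext4":
        return ["mkfs.ext4", device]
    raise ValueError("unsupported filesystem type: " + fs_type)
```

The resulting list could be passed to `subprocess.run(..., check=True)` on a host where the device exists.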
The implementation already worked like that, it just wasn't documented
and thus it was unknown whether reusing the directory also for other
local state (like the upcoming extra volume mounts) is okay.
@pohly pohly force-pushed the kata-containers branch from d6c3b17 to 825d4aa Compare May 10, 2020 16:55
@pohly
Contributor Author

pohly commented May 10, 2020

@devimc: this PR now has testing against and instructions for Kata Containers 1.11.0-rc0. Can you perhaps check that the changes in docs look reasonable?

More feedback of course also welcome 😀

I'm pushing this while local tests are still running, but hopefully the new tests work now.

@pohly
Contributor Author

pohly commented May 10, 2020

kata-deploy fails in our CI (https://cloudnative-k8sci.southcentralus.cloudapp.azure.com/view/pmem-csi/job/pmem-csi/view/change-requests/job/PR-500/29/console):

kube-system pod/kata-deploy-prvzf 0/1 CrashLoopBackOff 4 4m31s 10.244.1.2

I'll check tomorrow why it fails there. It worked for me locally.

@pohly pohly force-pushed the kata-containers branch from 0b703d7 to 14b3e3f Compare May 11, 2020 07:12
@pohly
Contributor Author

pohly commented May 11, 2020

https://cloudnative-k8sci.southcentralus.cloudapp.azure.com/view/pmem-csi/job/pmem-csi/view/change-requests/job/PR-500/lastCompletedBuild/testReport/clear-32690-.lvm-production/E2E/Kata_Containers__Testpattern__Dynamic_PV__ext4___dax_should_support_MAP_SYNC/

May 11 09:07:37.136: INFO: At 2020-05-11 09:02:31 +0000 UTC - event for dax-volume-test-kata: {kubelet pmem-csi-govm-worker1} FailedCreatePodSandBox: Failed to create pod sandbox: rpc error: code = Unknown desc = container create failed: failed to launch qemu: exit status 1, error messages from qemu log: Could not access KVM kernel module: No such file or directory
qemu-system-x86_64: failed to initialize KVM: No such file or directory

Same error also for other nodes. It looks like we don't have nested virtualization enabled in the Azure VM. Let me see whether I can change that...
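A quick way to confirm this kind of failure on a node is to check for the KVM device before starting QEMU. The following is a minimal sketch, assuming the standard `/dev/kvm` device path; `kvm_available` is a hypothetical helper and not part of PMEM-CSI or Kata Containers.

```python
import os


def kvm_available(dev_path: str = "/dev/kvm") -> bool:
    """Return True if the KVM device exists and is readable and writable.

    A missing device usually means that (nested) virtualization is not
    enabled for the machine, which matches the QEMU error quoted above.
    """
    return os.path.exists(dev_path) and os.access(dev_path, os.R_OK | os.W_OK)
```

Running `kvm_available()` on each worker node would show whether the "Could not access KVM kernel module" error is expected there.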

@devimc devimc left a comment

thanks @pohly - lgtm

docs/design.md Outdated

This gets solved as follows:
- PMEM-CSI creates a volume as usual, either in direct mode or LVM mode.
- Inside that volume it sets up an ext4 filesystem.

jfyi - xfs is also supported

Contributor Author

Fixed.

@pohly
Contributor Author

pohly commented May 11, 2020

Progress (?): without the -vmx workaround in Jenkinsfile (i.e. plain -cpu host), /dev/kvm appears in the nodes, but it then fails with "Failed to create pod sandbox: rpc error: code = Unknown desc = container create failed: Failed to check if grpc server is working: rpc error: code = Unavailable desc = transport is closing".

@pohly
Contributor Author

pohly commented May 11, 2020

"Failed to create pod sandbox: rpc error: code = Unknown desc = container create failed: Failed to check if grpc server is working: rpc error: code = Unavailable desc = transport is closing".

@devimc explained on IRC that triple nesting of VMs makes the inner QEMU so slow that the kubelet -> CRI communication times out. This means we cannot test with Kata Containers in the current Azure CI.

I'll make the tests optional. Until we have automatic testing on real hardware (BMaaS!), we'll simply have to test them manually from time to time on real hardware to detect regressions.

@pohly pohly force-pushed the kata-containers branch 2 times, most recently from 2060858 to a6e0943 Compare May 12, 2020 07:49
@pohly pohly changed the title WIP: Kata Container support Kata Container support May 12, 2020
@pohly
Contributor Author

pohly commented May 12, 2020

Tests are clean now after disabling the Kata Containers tests in our Azure CI.

@avalluri: okay to merge?

Comment on lines 10 to 11
reclaimPolicy: Delete
volumeBindingMode: Immediate
Contributor

As per the [app yaml](https://github.com//pull/500/files#diff-17203e3a5882efb0c1944558564a9c52R10-R11), the application is expected to run only on nodes with katacontainers.io/kata-runtime: "true". So if we use immediate binding mode, we might end up creating a volume on the wrong node?

Contributor Author

True. I should better switch this to "late-binding".
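In Kubernetes, late binding corresponds to `volumeBindingMode: WaitForFirstConsumer`, which delays volume creation until a pod using the claim has been scheduled. The sketch below builds such a StorageClass as a Python dict and prints it as JSON, which `kubectl apply -f -` accepts. The class name, the provisioner string and the `kataContainers` parameter are assumptions made for illustration; they are not taken from the files in this PR.

```python
import json


def kata_storage_class(name: str = "pmem-csi-sc-kata-example") -> dict:
    """Return a StorageClass manifest that uses late binding.

    Assumptions (not confirmed by this conversation): the provisioner
    name and the "kataContainers" parameter.
    """
    return {
        "apiVersion": "storage.k8s.io/v1",
        "kind": "StorageClass",
        "metadata": {"name": name},
        "provisioner": "pmem-csi.intel.com",  # assumed driver name
        "reclaimPolicy": "Delete",
        # Late binding: wait until a pod is scheduled, so the volume is
        # created on the node that the pod actually landed on.
        "volumeBindingMode": "WaitForFirstConsumer",
        "parameters": {"kataContainers": "true"},  # assumed parameter
    }


if __name__ == "__main__":
    print(json.dumps(kata_storage_class(), indent=2))
```

With this mode, a node selector such as katacontainers.io/kata-runtime: "true" on the pod is taken into account before the volume gets provisioned.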

docs/design.md Outdated
space available in the volume.
- That partition is bound to a `/dev/loop` device and the formatted
with the requested filesystem type for the volume.
- When an applications needs access to the volume, PMEM-CSI mounts
Contributor

typo: an applications -> an application

@avalluri
Contributor

@pohly looks good to me. Just go ahead and merge after fixing the storage classes and the documentation nit.

pohly added 6 commits May 13, 2020 18:01
A persistent or ephemeral volume can either be prepared for usage by
DAX-enabled applications that don't run under Kata Containers (the
default) or for DAX-enabled applications that run under Kata
Containers.

In both cases the volume can be used with and without Kata Containers,
it's just that DAX only works either inside or outside of Kata
Containers.

The Kata Container runtime must be able to access the image file while
it is still mounted, therefore we cannot use something inside the
target dir as mount point, because then the image file is shadowed by
the mounted filesystem.

We already have a local state dir for .json files. Putting something
else inside it might confuse the state code, so instead we create a
second directory with ".mount" appended to the directory name and use
that for mount points.

We also have to enable bi-directional mount propagation for it because
otherwise the mounted fs with the image file is still only visible
inside the container.
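The directory naming described above can be sketched like this. Only the ".mount" suffix comes from the commit message; the helper names and the example paths are hypothetical.

```python
import os


def mount_dir_for(state_dir: str) -> str:
    """Return the sibling directory used for mount points.

    The state directory keeps holding only .json files; mount points go
    into a second directory with ".mount" appended to its name.
    """
    return state_dir.rstrip("/") + ".mount"


def mount_point_for(state_dir: str, volume_id: str) -> str:
    """Return a per-volume mount point below the ".mount" directory."""
    return os.path.join(mount_dir_for(state_dir), volume_id)
```

For a state directory `/var/lib/example-state`, this yields `/var/lib/example-state.mount/<volume id>`, so the image file in the target directory is not shadowed by the filesystem mounted for it.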
It can happen that Kubernetes comes up, but something else (like Kata
Containers) doesn't. In that case "kubectl get all" may provide some
hint.

Two minutes was enough locally, but not for the CI.

"make start" in an empty _work failed with:

tar zxf _work/govm_0.9-alpha_Linux_amd64.tar.gz -C _work/bin/
tar: _work/bin: Cannot open: No such file or directory
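The error above occurs because `tar -C` does not create the target directory. The actual fix in the Makefile is not shown in this conversation; as a sketch of the general remedy, the destination can be created before extracting. `extract_into` is a hypothetical helper.

```python
import os
import tarfile


def extract_into(archive: str, dest_dir: str) -> None:
    """Extract a .tar.gz archive, creating dest_dir first if necessary."""
    # Equivalent to "mkdir -p dest_dir": no error if it already exists.
    os.makedirs(dest_dir, exist_ok=True)
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(dest_dir)
```

Called as `extract_into("_work/govm_0.9-alpha_Linux_amd64.tar.gz", "_work/bin")`, this also works when `_work` starts out empty.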
Due to a race condition (?), kata-deploy fails in the CI because
/etc/crio/crio.conf didn't exist at the time that it ran:

$ kubectl logs -n kube-system kata-deploy-2dh2f
copying kata artifacts onto host
Add Kata Containers as a supported runtime for CRIO:
cp: cannot stat '/etc/crio/crio.conf': No such file or directory

Somehow it worked locally.

Even with the "-vmx" override in the Jenkinsfile removed, nested
virtualization with three levels (Azure HyperV -> QEMU (govm) ->
QEMU (Kata Containers)) was not working well enough for Kata
Containers: because the inner VM runs very slowly, there are timeouts
in the communication between kubelet and Kata Containers ("container
create failed: Failed to check if grpc server is working: rpc error:
code = Unavailable desc = transport is closing").

That means that testing with Kata Containers has to be limited to bare
metal. To achieve that, it's turned off by default (and thus in the CI,
which only runs on Azure) and has to be enabled with
TEST_KATA_CONTAINERS_VERSION=1.11.0-rc0 or by invoking
test/setup-kata-containers.sh manually.
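The opt-in behavior can be illustrated with a small check of the environment variable. Only the variable name `TEST_KATA_CONTAINERS_VERSION` comes from the commit message; how the test setup actually reads it is not shown here, and `kata_version_to_test` is a hypothetical helper.

```python
import os
from typing import Mapping, Optional


def kata_version_to_test(env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return the Kata Containers version to test against, or None.

    None means the variable is unset or empty, so the Kata Containers
    tests stay disabled (the default, and thus the behavior in the CI).
    """
    if env is None:
        env = os.environ
    version = env.get("TEST_KATA_CONTAINERS_VERSION", "").strip()
    return version or None
```

A test runner could skip the Kata Containers tests whenever this returns `None` and install the returned version otherwise.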
@pohly pohly force-pushed the kata-containers branch from a6e0943 to 4dcfcf5 Compare May 13, 2020 16:01
@pohly pohly merged commit 60e44b1 into intel:devel May 14, 2020
pohly added a commit to pohly/pmem-CSI that referenced this pull request May 15, 2020
The Kata Containers PR (intel#500)
and tightening docsite
validation (intel#640) were merged
independently without retesting, which broke "devel" because some of
the changes for Kata Containers caused warnings which are now errors.
pohly added a commit to pohly/pmem-CSI that referenced this pull request May 16, 2020
The Kata Containers PR (intel#500)
and tightening docsite
validation (intel#640) were merged
independently without retesting, which broke "devel" because some of
the changes for Kata Containers caused warnings which are now errors.
5 participants