This repository has been archived by the owner on May 12, 2021. It is now read-only.

[RFC] Support multiple isolation mechanisms using Kata #1082

Closed
mcastelino opened this issue Jan 2, 2019 · 21 comments

Comments

@mcastelino
Contributor

Description of problem

Kata recently added support for the Firecracker VMM, which means we can now support different hypervisors in Kata.

However, it is more than just that. Firecracker is a VMM, but in reality it is different enough that it can be considered a different isolation mechanism, even though the underlying hardware framework is VT-x.
For example, the resource profile/model of a Firecracker POD will be significantly different. This is in addition to any security differences (i.e. sandbox capabilities) and limitations.

We have multiple ways of exposing these isolation mechanisms to the end user of kubernetes:

  1. Single runtime with different annotations within the POD (which becomes more kata specific).
    This may not be ideal, just like the non-standard annotations that came before.

  2. Expose it directly as different runtime classes.
    This means that there may have to be two different kata binaries with different configurations, or the same kata binary that chooses a different configuration based on argv[0] (like busybox).
    So kata-runtime-firecracker would choose configuration-firecracker.toml, as kata already has support for:

GLOBAL OPTIONS:
   --kata-config value               Kata Containers config file path

Note: Both 1 and 2 are very kata specific.
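The argv[0] approach in option 2 can be sketched as follows. This is only an illustration, not how kata-runtime is implemented: the function name `configForArgv0` and the directory constant are hypothetical, and the naming rule (a `-<suffix>` on the binary name maps to `configuration-<suffix>.toml`) is inferred from the single example given above.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// configDir is an assumed location for the configuration files.
const configDir = "/usr/share/defaults/kata-containers"

// configForArgv0 maps the name the binary was invoked as to a config file:
//   kata-runtime             -> configuration.toml
//   kata-runtime-firecracker -> configuration-firecracker.toml
// Any other name falls back to the default configuration.toml.
func configForArgv0(argv0 string) string {
	name := filepath.Base(argv0)
	suffix := ""
	if strings.HasPrefix(name, "kata-runtime-") {
		// Keep the leading dash, e.g. "-firecracker".
		suffix = strings.TrimPrefix(name, "kata-runtime")
	}
	return filepath.Join(configDir, "configuration"+suffix+".toml")
}

func main() {
	// With a symlink kata-runtime-firecracker -> kata-runtime, os.Args[0]
	// carries the symlink name, which selects the matching config.
	fmt.Println(configForArgv0(os.Args[0]))
}
```

The cost of this scheme is one symlink (or copy) per isolation mechanism on every node, which is part of why it stays kata specific.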

  3. Expose at the runtime class level by allowing the runtime class spec to have more properties. This would allow us to use the same runtime binary but expose it at the top level as multiple runtime classes, each with different properties and limitations.
type RuntimeClassSpec struct {
    // RuntimeHandler specifies the underlying runtime the CRI calls to handle pod and/or container
    // creation. The possible values are specific to a given configuration & CRI implementation.
    // The empty string is equivalent to the default behavior.
    // +optional
    RuntimeHandler    string

    // Properties to be passed to the runtime on each invocation
    RuntimeProperties map[string]string
}

Note: 3 requires changes to both kubernetes and crio/containerd.

  4. Enhance crio and containerd to support specifying configuration along with the runtime binary. This is the equivalent of 2, but a cleaner and more natural abstraction for other runtimes to reuse. This also allows us to use runtime class the way it is defined and implemented in kubernetes today.

Hence the crio.conf would look like:

[crio.runtime.runtimes.kata]
runtime_path = "/usr/bin/kata-runtime"

[crio.runtime.runtimes.katafirecracker]
runtime_path = "/usr/bin/kata-runtime"
runtime_params = "--kata-config /etc/kata-runtime/firecracker.toml"

And the RuntimeHandler would need to be enhanced along the lines of:

https://github.com/kubernetes-sigs/cri-o/blob/master/oci/oci.go#L125

   // config table.
   type RuntimeHandler struct {
       RuntimePath   string `toml:"runtime_path"`
       RuntimeParams string `toml:"runtime_params"`
   }

Note: 4 would not need any changes to kubernetes. It is also generic and not kata specific.
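To make option 4 concrete, here is a minimal sketch of how a CRI implementation might turn such a handler entry into a command line. It is an assumption about one possible implementation, not CRI-O code: `buildArgv` is a hypothetical helper, and splitting `runtime_params` on whitespace is a simplification that does not handle quoted arguments.

```go
package main

import (
	"fmt"
	"strings"
)

// RuntimeHandler mirrors the proposed config table entry.
type RuntimeHandler struct {
	RuntimePath   string
	RuntimeParams string
}

// buildArgv (hypothetical) places the handler's extra parameters between
// the runtime binary and the per-invocation arguments.
func buildArgv(h RuntimeHandler, args ...string) []string {
	argv := []string{h.RuntimePath}
	// Whitespace splitting only; quoted values are not supported here.
	argv = append(argv, strings.Fields(h.RuntimeParams)...)
	return append(argv, args...)
}

func main() {
	h := RuntimeHandler{
		RuntimePath:   "/usr/bin/kata-runtime",
		RuntimeParams: "--kata-config /etc/kata-runtime/firecracker.toml",
	}
	fmt.Println(strings.Join(buildArgv(h, "create", "ctr1"), " "))
}
```

Because the CRI implementation never interprets the parameters, the same mechanism would work for any runtime that accepts global options.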

@mcastelino
Contributor Author

/cc @egernst @sameo @sboeuf @gnawux

@mcastelino
Contributor Author

/cc @tallclair

@mcastelino
Contributor Author

@sboeuf @egernst when we move to containerd-shim-v2 based implementations, where CRIO talks to kata via gRPC, we need this toml information to come through.

@mcastelino
Contributor Author

In the case of containerd-shim-v2 today, the options have a binary name but no configuration.

It may be worthwhile to explore supporting the passing of additional arguments for each plugin.

https://github.com/containerd/containerd/blob/9e372ff01d7a81c325ff1c15ed84efa95e7fb5a6/runtime/v2/runc/options/oci.pb.go#L39

type Options struct {
	// disable pivot root when creating a container
	NoPivotRoot bool `protobuf:"varint,1,opt,name=no_pivot_root,json=noPivotRoot,proto3" json:"no_pivot_root,omitempty"`
	// create a new keyring for the container
	NoNewKeyring bool `protobuf:"varint,2,opt,name=no_new_keyring,json=noNewKeyring,proto3" json:"no_new_keyring,omitempty"`
	// place the shim in a cgroup
	ShimCgroup string `protobuf:"bytes,3,opt,name=shim_cgroup,json=shimCgroup,proto3" json:"shim_cgroup,omitempty"`
	// set the I/O's pipes uid
	IoUid uint32 `protobuf:"varint,4,opt,name=io_uid,json=ioUid,proto3" json:"io_uid,omitempty"`
	// set the I/O's pipes gid
	IoGid uint32 `protobuf:"varint,5,opt,name=io_gid,json=ioGid,proto3" json:"io_gid,omitempty"`
	// binary name of the runc binary
	BinaryName string `protobuf:"bytes,6,opt,name=binary_name,json=binaryName,proto3" json:"binary_name,omitempty"`
	// runc root directory
	Root string `protobuf:"bytes,7,opt,name=root,proto3" json:"root,omitempty"`
	// criu binary path
	CriuPath string `protobuf:"bytes,8,opt,name=criu_path,json=criuPath,proto3" json:"criu_path,omitempty"`
	// enable systemd cgroups
	SystemdCgroup bool `protobuf:"varint,9,opt,name=systemd_cgroup,json=systemdCgroup,proto3" json:"systemd_cgroup,omitempty"`
}

@Random-Liu

Random-Liu commented Jan 2, 2019

@mcastelino The Options type is completely opaque to containerd and mostly opaque to the cri plugin.
It is defined by the shim implementation.

Kata runtime can define its own options, e.g. KataOptions, and add whatever is useful there. And the cri plugin can configure those options based on the RuntimeClass or daemon level config. To add a new options type support, just add several lines here: https://github.com/containerd/cri/blob/master/pkg/server/helpers.go#L479. As for containerd, it will just blindly pass the options from the cri plugin to the shim process.

If everyone thinks that an opaque string list works better for Kubernetes than a well-defined struct, we can always add a general CRIRuntimeOption which just contains a string list and let everyone support it in the shim implementation.

@sboeuf

sboeuf commented Jan 3, 2019

@mcastelino

I don't see the point in adding RuntimeProperties map[string]string to the RuntimeClass since k8s only cares about asking for a specific type of runtime. The list of parameters that needs to be passed down to the runtime is directly defined through CRI-containerd or CRI-O config.

As mentioned by @Random-Liu, there is no need for containerd to know about the options format, but instead we can simply assume the shim implementation will know what to do with it. That being said, we need to define a common behavior about what to do from CRI-containerd and CRI-O with those extra parameters, and how to handle them.

@mcastelino
Contributor Author

I don't see the point in adding RuntimeProperties map[string]string to the RuntimeClass since k8s only cares about asking for a specific type of runtime. The list of parameters that needs to be passed down to the runtime is directly defined through CRI-containerd or CRI-O config.

@sboeuf as indicated in the original issue, the preferred choice is option 4, not option 3. So I think you are agreeing?

And yes, we need to unify the behavior across CRI-containerd and CRI-O for option 4.

@mcastelino
Contributor Author

To add a new options type support, just add several lines here https://github.com/containerd/cri/blob/master/pkg/server/helpers.go#L479 As for containerd, it will just blindly pass the options from the cri plugin to the shim process.

@Random-Liu that makes sense. The only small issue I see here is that this helper needs to be modified each time a new runtime comes along that may want its own custom options.

@raravena80
Member

I would argue that for flexibility some k8s users will want to run pods with different isolation mechanisms (qemu, firecracker, etc). On the other hand, I think just having the config option either in containerd or CRIO is simpler.

What's the argument behind not letting this be handled in a common /etc/kata-runtime/configuration.toml file as an always default isolation mechanism? (Support a single isolation mechanism at a time)

@mcastelino
Contributor Author

What's the argument behind not letting this be handled in a common /etc/kata-runtime/configuration.toml file as an always default isolation mechanism? (Support a single isolation mechanism at a time)

@raravena80 not sure I understand what you mean by "Support a single isolation mechanism at a time". Are you suggesting a given node should only use one type of isolation mechanism? That is possible even with this proposal. But then you need to start tagging nodes at the kubernetes layer.

All this proposal is trying to do is allow the flexibility for kata to expose multiple isolation mechanisms on the same node without duplicating binaries. This, paired with an admission control policy, would allow us to choose the most restrictive isolation mechanism that kata can provide while satisfying the needs of a given POD.

Addressing your other concern, the reason for multiple configuration files, one per isolation mechanism, is that each isolation mechanism has different limitations and profiles. And the toml file today contains tuneable sections for other kata components besides the hypervisor. For example, you would typically run kata with macvtap networking and firecracker with tcfilter for optimal performance.

@mcastelino
Contributor Author

And yes, we need to unify the behavior across CRI-containerd and CRI-O for option 4.

Tracking the current CRI-O proposal. We need to unify what we do across both.
cri-o/cri-o#1991

@raravena80
Member

What's the argument behind not letting this be handled in a common /etc/kata-runtime/configuration.toml file as an always default isolation mechanism? (Support a single isolation mechanism at a time)

Chatted with @mcastelino. The answer here is that given the example of firecracker and qemu, we can allow flexibility for the user or cluster operator to run on nodes with firecracker and qemu support, and also run on nodes with only firecracker or only qemu support. Possibly, having RuntimeClass: kata-firecracker and RuntimeClass: kata-qemu.

@bergwolf
Member

bergwolf commented Jan 4, 2019

To add a new options type support, just add several lines here https://github.com/containerd/cri/blob/master/pkg/server/helpers.go#L479 As for containerd, it will just blindly pass the options from the cri plugin to the shim process.

@Random-Liu It seems runhcs is broken when trying to decode its own options?
https://github.com/containerd/containerd/blob/master/runtime/v2/runhcs/service.go#L421

@sboeuf

sboeuf commented Jan 4, 2019

@mcastelino

@sboeuf as indicated in original issue the preferred choice is option 4. Not option 3. So I think you are agreeing?

Yes, sorry, I missed the fact that 3 and 4 were two different approaches... And yes, I agree 4 is the best approach here.

@lifupan
Member

lifupan commented Jan 7, 2019

@kata-containers/runtime
I have sent a PR, containerd/containerd#2916, to containerd to add kata specific options. So far I have only added a "ConfFile" option,
which can pass a kata configuration file path from containerd to the kata shimv2.

@sboeuf

sboeuf commented Jan 7, 2019

@lifupan maybe I'm misunderstanding here, but I thought we would introduce in CRI-containerd and CRI-O the proper implementation to know what to do with a new field, runtime_params, as part of a runtime handler description, without any special handling based on the type of runtime.
I've just taken a quick look at your patch, but it introduces things specific to Kata. I thought the way runtime_params would be handled would be runtime agnostic.

@sameo

sameo commented Jan 7, 2019

@mcastelino Option 4 is the cleanest one imho. One thing I'd like us to improve is the ability to have per hypervisor sections in our configuration.toml, and avoid having to maintain different config files with a lot of duplication between them.
This would imply adding a new cli option, e.g. --kata-hypervisor, obviously.

@jodh-intel
Contributor

jodh-intel commented Jan 7, 2019

+1 for 4 and +1 for finding a way to reduce duplication: now might be a good time to start re-assessing our current configuration handling.

We could explore the possibility of supporting re-assembling config fragments into a complete file. This isn't fully fleshed out, but we could do something like:

  • Switch config to YAML (it's superior to TOML and fragments of YAML would make more sense than fragments of TOML imho).

  • Provide config files for each hypervisor:

    /usr/share/defaults/kata-containers/common.yaml
    /usr/share/defaults/kata-containers/hypervisor/firecracker.yaml
    /usr/share/defaults/kata-containers/hypervisor/nemu.yaml
    /usr/share/defaults/kata-containers/hypervisor/qemu.yaml
    /usr/share/defaults/kata-containers/hypervisor/qemu-lite.yaml
    

The runtime would read common.yaml first, followed by hypervisor/${hypervisor}.yaml. The hypervisor could be overridden with @sameo's --kata-hypervisor= but could default to a value specified in common.yaml. Crucially, the hypervisor config files would just specify the minimum set of options for that hypervisor (maybe just runtime path, kernel params and network options?). Everything else would be defined in common.yaml.

  • Running kata-runtime kata-env --format yaml would display the fully-assembled config as a single YAML document. It would of course also continue to validate that assembled configuration so could be used as a simple "config validator".

  • If users/admins want to modify settings, they could do one of two things:

    • Create /etc/kata-containers/merge/ to override one or more config options (but keep all other config options as their defaults).
    • Create /etc/kata-containers/override/... to completely replace the defaults with the user-specific options. Any missing required options here would be considered an error.
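The layering described above (common.yaml read first, then a per-hypervisor file whose values win) can be sketched as a simple key-by-key merge. This is a rough illustration under stated assumptions: `mergeConfig` is a hypothetical helper, the configuration is flattened to string keys and values instead of being parsed from YAML, and the key names and values are examples only.

```go
package main

import "fmt"

// mergeConfig (hypothetical) returns a new map holding every key of base,
// with any key also present in override taking the override's value.
// Neither input map is modified.
func mergeConfig(base, override map[string]string) map[string]string {
	out := make(map[string]string, len(base)+len(override))
	for k, v := range base {
		out[k] = v
	}
	for k, v := range override {
		out[k] = v
	}
	return out
}

func main() {
	// Stand-ins for the parsed contents of common.yaml and
	// hypervisor/firecracker.yaml; keys and values are illustrative.
	common := map[string]string{
		"internetworking_model": "macvtap",
		"enable_debug":          "false",
	}
	firecracker := map[string]string{
		"internetworking_model": "tcfilter",
	}

	cfg := mergeConfig(common, firecracker)
	fmt.Println(cfg["internetworking_model"], cfg["enable_debug"])
}
```

A merge directory could be handled as a third layer applied in the same way, while an override directory would skip the defaults entirely and validate that no required key is missing.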

Alternatively, we could look at creating a more declarative configuration language where users would not explicitly specify the hypervisor by name, they'd somehow specify the behaviours / constraints they want and Kata would DTRT (tm) and determine an appropriate configuration. We'd clearly need to handle the scenario where >1 possible legal config could be generated though (which option wins? Or do we just error?) So rather than specifying options in 4. as:

runtime_params = "--kata-config /etc/kata-runtime/firecracker.toml"

... users could say something like:

runtime_features = [ "use_hypervisor_with_coolest_logo", "debug=full" ]

The runtime would then resolve these values (via a new --kata-config-features ...) into an actual configuration. Note that the simple debug=full would enable debug for all components and modify the kernel params to enable agent debug for example (but we could of course also support things like debug=agent,runtime). If we did this, we would of course log the resolved config options and kata-env would as usual display the config the runtime would use.
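As a sketch of how such a feature string might be resolved, the following handles only the debug case. Everything here is hypothetical: the function name `debugComponents`, the component list, and the decision to reject unknown names are assumptions, since the comment above does not specify them.

```go
package main

import (
	"fmt"
	"strings"
)

// allComponents is an assumed list of components that accept a debug
// setting; the real set is not defined in this discussion.
var allComponents = []string{"agent", "runtime", "hypervisor"}

// debugComponents (hypothetical) resolves a "debug=..." feature into the
// components to enable debug for:
//   debug=full          -> every known component
//   debug=agent,runtime -> only the listed components
func debugComponents(feature string) ([]string, error) {
	if !strings.HasPrefix(feature, "debug=") {
		return nil, fmt.Errorf("not a debug feature: %q", feature)
	}
	value := strings.TrimPrefix(feature, "debug=")
	if value == "full" {
		// Return a copy so callers cannot alter the shared list.
		return append([]string(nil), allComponents...), nil
	}
	var out []string
	for _, name := range strings.Split(value, ",") {
		if !isKnown(name) {
			return nil, fmt.Errorf("unknown component %q", name)
		}
		out = append(out, name)
	}
	return out, nil
}

// isKnown reports whether name is one of allComponents.
func isKnown(name string) bool {
	for _, c := range allComponents {
		if c == name {
			return true
		}
	}
	return false
}

func main() {
	for _, f := range []string{"debug=full", "debug=agent,runtime"} {
		components, err := debugComponents(f)
		fmt.Println(f, components, err)
	}
}
```

Logging the resolved list, as suggested above, would keep the expansion from feature to concrete settings visible to the user.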

@lifupan
Member

lifupan commented Jan 7, 2019

@lifupan maybe I'm misunderstanding here, but I thought we would introduce to CRI-containerd and CRI-O the proper implementation to know what to do with a new field runtime_params part of a runtime handler description, without having to make any special handling based on the type of runtime.
I've just taken a quick look at your patch, but it introduces things specific to Kata. I thought the way runtime_params would be handled would be runtime agnostic.

@sboeuf Yes, the Options type is specific to kata, but it's opaque to containerd/cri: containerd will pass it to the kata shimv2 blindly, and we can parse it on the kata shimv2 side. This way, we can pass not only the ConfFile but also other options if needed, such as the "hypervisor type" if we want to support multiple hypervisors in a single configuration file, as @sameo said above.

The containerd's config file will be configured as below:

[plugins.cri.containerd.runtimes.kata]
runtime_type = "io.containerd.kata.v2"
[plugins.cri.containerd.runtimes.kata.options]
ConfFile = "/usr/share/defaults/kata-containers/configuration.toml"
Hypervisor = "qemu"
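A minimal sketch of the selection this enables: each runtime handler name maps to its own set of options, which are then handed to the shim untouched. This does not use the containerd or cri APIs. `KataOptions` as written here and `optionsFor` are illustrative assumptions, and the handler names and the firecracker path are borrowed from the crio.conf example earlier in this issue.

```go
package main

import "fmt"

// KataOptions mirrors the two keys shown in the containerd config above.
// The struct itself is an illustration, not the type from the PR.
type KataOptions struct {
	ConfFile   string
	Hypervisor string
}

// handlerOptions stands in for the per-handler "options" tables that the
// cri plugin would read from its configuration.
var handlerOptions = map[string]KataOptions{
	"kata": {
		ConfFile:   "/usr/share/defaults/kata-containers/configuration.toml",
		Hypervisor: "qemu",
	},
	"katafirecracker": {
		ConfFile:   "/etc/kata-runtime/firecracker.toml",
		Hypervisor: "firecracker",
	},
}

// optionsFor (hypothetical) returns the options for a runtime handler,
// or an error if no handler with that name is configured.
func optionsFor(handler string) (KataOptions, error) {
	opts, ok := handlerOptions[handler]
	if !ok {
		return KataOptions{}, fmt.Errorf("no runtime handler named %q", handler)
	}
	return opts, nil
}

func main() {
	opts, err := optionsFor("katafirecracker")
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Printf("%s via %s\n", opts.Hypervisor, opts.ConfFile)
}
```

With a RuntimeClass per handler, the pod spec only names the class, and all kata specific detail stays in the node's containerd configuration.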

@mcastelino
Contributor Author

This would imply adding a new cli option e.g. --kata-hypervisor , obviously.

@sameo @jodh-intel the issue with this approach is that all hypervisors would be forced to use the same setup for all other components. This could end up being a limiting factor, as there may be hypervisors that need mutually incompatible options. Say tcmirror vs macvtap or something else.

Hence having a fully defined environment per hypervisor would be better to make this future proof.

@jodh-intel
Contributor

@mcastelino - I'm suggesting that if the runtime is invoked with, say, --kata-hypervisor firecracker, the following would be read:

/usr/share/defaults/kata-containers/common.yaml
/usr/share/defaults/kata-containers/hypervisor/firecracker.yaml

Where,

  • common.yaml can set the common defaults (if any).
  • firecracker.yaml can override any values set by common.yaml
    (such as runtime path, network setup, ...)

The common file could set internetworking_model="macvtap" but that could be overridden by firecracker.yaml to internetworking_model="foo" or whatever.

Isn't that what you want?

lifupan added a commit to lifupan/kata-runtime that referenced this issue Jan 25, 2019
containerd/cri's different runtime handlers can pass different
config files to shimv2 by a generic runtime options, by this kata
can launch the pods using different VMM for different runtime handlers.

Fixes:kata-containers#1082

Signed-off-by: Fupan Li <lifupan@gmail.com>
lifupan added a commit to lifupan/kata-runtime that referenced this issue Jan 28, 2019
containerd/cri's different runtime handlers can pass different
config files to shimv2 by a generic runtime options, by this kata
can launch the pods using different VMM for different runtime handlers.

Fixes:kata-containers#1082

Signed-off-by: Fupan Li <lifupan@gmail.com>
8 participants