Support alternate runtime for host privileged operations #367
Conversation
server/container_create.go (outdated)
nsPath := fmt.Sprintf("/proc/%d/ns/%s", podInfraState.Pid, nsFile)
if err := specgen.AddOrReplaceLinuxNamespace((string)(nsType), nsPath); err != nil {
	return nil, err
}
I don't think this is correct. Different containers in a Pod do not share mount, pid or cgroup namespaces. And I'm fairly sure they don't share uts or user namespaces either.
What is the reason for this change?
That nsPath looks like it should use filepath.Join instead? Also, string(nsType) instead of (string)(nsType)? (I didn't realize the latter is even valid Go code.)
(string) -> string is why it works.
But that's not the main point I'm making. I'm saying that the logic behind the change doesn't make sense to me (different containers in a sandbox do not share all of their namespaces).
Looks like the direction is to have containers in a pod share the network, PID and IPC namespaces by default.
Caught me by surprise the other day as well. But it's switchable. Gives pods a higher-level purpose, I suppose.
I'm a bit confused why this logic has to live in
You're correct, the default k8s behavior is to share the PID, IPC and networking namespaces. I'll fix that patch.
Force-pushed from 94408d0 to 09415b9 (compare)
Good points. Solution 1 is something I thought about. However, running privileged containers with any hypervisor-based runtime is not possible (you can't see or change the host network namespace, for example), so this would basically mean calling another runtime (e.g. …)
+1 for the direction.
I think we should only switch to the privileged runtime in these two conditions:
Ok, then we assume that the container security context will always be the same as the pod one. As you commented on k8s issue 41848, kuberuntime currently sets all container security contexts to be the same as their pod's. Right now this PR checks both the container and the pod security contexts; I'll fix the code to follow the assumption that they will always be identical.
After further private discussions, this is incorrect. The security context will not always be the same, only the … And as you described, we should run a privileged runtime for all containers if … With that understanding, I'll fix my PR.
Force-pushed from f1583a9 to 404de17 (compare)
@feiskyer PR updated according to the latest comments.
Force-pushed from 404de17 to 6143f04 (compare)
server/server.go (outdated)
@@ -50,6 +52,64 @@ type Server struct {
	appArmorProfile string
}

func ociPrivileged(spec *rspec.Spec) bool {
Can we add a comment here? Also, shouldn't we also check for uid = 0?
Isn't --privileged different from just running as uid 0?
@mrunalp I added some comments to the routine, hopefully they make sense.
server/container_create.go (outdated)
if err := specgen.AddOrReplaceLinuxNamespace("ipc", ipcNsPath); err != nil {
	return nil, err
for nsType, nsFile := range map[rspec.NamespaceType]string{
	rspec.PIDNamespace: "pid",
We can take this out for now, as k8s is still discussing viability and it may be pushed to the next release.
How do we handle the case where sandbox privileged is not triggered but a container is privileged? Isn't it too late?
@mrunalp Kubelet ensures sandbox privileged is set when at least one of the containers belonging to it is privileged.
Force-pushed from 42bc143 to 5dce9d6 (compare)
@@ -34,24 +34,26 @@ const (
)

// New creates a new Runtime with options provided
-func New(runtimePath string, conmonPath string, conmonEnv []string, cgroupManager string) (*Runtime, error) {
+func New(runtimePath string, runtimeHostPrivilegedPath string, conmonPath string, conmonEnv []string, cgroupManager string) (*Runtime, error) {
Design question: would it make sense to pass in a struct with the configuration instead? Every time a new parameter is added (as today), the method and all its callers have to be adjusted, and it's annoying for optional parameters (because all the call sites would need to be changed to pass in zero values like "").
Have we considered passing the configuration in as a dedicated struct type, like:
type RuntimeConfig struct {
Name string
Path string
...
}
func NewRuntime(config *RuntimeConfig) (*Runtime, error) {
r := &Runtime{
config: config,
}
return r, nil
}
(Side note: what's the point of error as a second return type, if errors can never occur?)
I think so. We should only check sandbox config to decide the privileged runtime.
Not all runtimes are able to handle some of the kubelet security context options, in particular the ones granting host privileges to containers. By adding a host privileged runtime path configuration, we allow ocid to use a different runtime for host privileged operations, e.g. host namespace access. Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
We add a privileged flag to the container and sandbox structures and can now select the appropriate runtime path for any container operation depending on that flag. Here again, the default runtime will be used for non-privileged containers, and for privileged ones in case there is no privileged runtime defined. Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
The sandbox privileged flag is set to true only if either the pod configuration privileged flag is set to true or any of the pod namespaces are the host ones. A container inherits its privileged flag from its sandbox, and will be run by the privileged runtime only if that flag is set to true. In other words, the privileged runtime (when defined) will be used when one of the conditions below is true:
- The sandbox will be asked to run at least one privileged container.
- The sandbox requires access to either the host IPC or networking namespaces.
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
By factoring out the bind mounts generation code. Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
Force-pushed from 5dce9d6 to f7eee71 (compare)
@mrunalp Simplified the code through a pod annotation, as discussed during the ocid meeting.
LGTM
1 similar comment
LGTM
Some CRI-O runtimes (like cc-oci-runtime) do not support many host privileged operations like e.g. giving access to the host namespaces or running fully privileged containers (with access to all host devices). Those runtimes usually provide a higher level of container security by e.g. running container workloads within virtual machines and therefore running host privileged containers with them makes little sense.
This pull request tries to overcome that problem by allowing ocid users to define a host privileged capable runtime path in addition to the default runtime path. When ocid gets a request to create a container or a sandbox with either the privileged flag set or access to at least one of the host namespaces (PID, IPC or networking), it will check whether a host privileged runtime is defined and use it if it is. It will obviously use the default runtime if the host privileged one is not defined.
This PR only checks the pod security context, and each container within a given pod will inherit its pod's privileged flags. In other words, we will run the privileged runtime in either one of these two cases:
- The pod is asked to run at least one privileged container.
- The pod requires access to the host IPC or networking namespaces.
Note: I realize this might be fixed with CRI multiple runtimes support. If/when that happens, we should be able to remove part of this PR's code.
cc @feiskyer @mcastelino