Utilize pod process namespace sharing instead of docker executor #970

Closed
jessesuen opened this issue Aug 25, 2018 · 15 comments

@jessesuen
Member

jessesuen commented Aug 25, 2018

Is this a BUG REPORT or FEATURE REQUEST?: FEATURE REQUEST

What happened:

Currently, artifact saving is performed through the docker executor, which copies artifacts out of the main container via a docker cp command. The problem with this approach is that it requires mounting the host's docker.sock, which is insecure and unacceptable in some secure environments.

In K8s, there is an (alpha) feature to share the process namespace and the filesystem between containers in a pod:
https://kubernetes.io/docs/tasks/configure-pod-container/share-process-namespace/

Instead of utilizing the docker executor, we could simply create the pod spec with shareProcessNamespace: true and access the filesystem of the main container to copy files directly. Similarly, the actual waiting for and killing of the process would only need to be done via a normal kill command from the argoexec sidecar, as opposed to a docker kill.

Using process namespace sharing provides an ideal solution that addresses a lot of the security and scalability concerns with our current docker, kubelet, and k8s API server approaches.

NOTE that this is an alpha feature and requires a feature gate (--feature-gates=PodShareProcessNamespace=true) to be configured for it to be enabled.
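
As a rough illustration of the "normal kill" point above (a sketch only, assuming the main container's pid has already been discovered by scanning the shared /proc; the grace-period handling is hypothetical):

// Sketch: with shareProcessNamespace enabled, the main container's process is
// visible in the sidecar's pid namespace, so it can be terminated with a
// plain signal instead of `docker kill`.
package main

import (
	"syscall"
	"time"
)

// killMainProcess sends SIGTERM to the main container's process and escalates
// to SIGKILL if it has not exited within the grace period.
func killMainProcess(mainPID int, grace time.Duration) error {
	if err := syscall.Kill(mainPID, syscall.SIGTERM); err != nil {
		return err
	}
	deadline := time.Now().Add(grace)
	for time.Now().Before(deadline) {
		// Signal 0 performs error checking only; ESRCH means the process is gone.
		if err := syscall.Kill(mainPID, 0); err == syscall.ESRCH {
			return nil
		}
		time.Sleep(500 * time.Millisecond)
	}
	return syscall.Kill(mainPID, syscall.SIGKILL)
}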

@jessesuen
Member Author

jessesuen commented Aug 25, 2018

/cc @edlee2121 @JulienBalestra @gaganapplatix

I don't think we need to spend any effort on a pure K8s API server executor implementation, since I think process namespace sharing will be the future of how we do artifact management, process waiting, and process killing. Longer term, I think the docker/kubelet executors will eventually go away if the argoexec sidecar can already access the main container's filesystem easily through process namespace sharing.

@edlee2121
Contributor

For copying artifacts, we would share process namespaces and access files via /proc/pid/root?

@jessesuen
Member Author

That’s right
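
For illustration only (a sketch; the pid discovery and artifact path are assumed), file access from the sidecar would look something like:

// Sketch: with a shared process namespace, the main container's root
// filesystem is reachable from the sidecar at /proc/<pid>/root while the
// main process is alive.
package main

import (
	"fmt"
	"os"
)

// readMainContainerFile reads a file from the main container's filesystem,
// e.g. readMainContainerFile(1234, "/tmp/output.txt").
func readMainContainerFile(mainPID int, path string) ([]byte, error) {
	return os.ReadFile(fmt.Sprintf("/proc/%d/root%s", mainPID, path))
}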

@jessesuen jessesuen self-assigned this Aug 29, 2018
@jessesuen jessesuen added this to the V2.3 milestone Aug 29, 2018
@srikumar-b

I have quickly tested process namespace sharing (now a beta feature, enabled by default) on k8s v1.12 and it is working as expected for accessing the other container's filesystem.

@jessesuen
Member Author

jessesuen commented Oct 30, 2018

Update on this issue.

After some deeper investigation, process namespace sharing does not solve the problem by itself. The crux of the issue is that, when the main process (container) exits, the filesystem of the main container goes away. In other words, the filesystem of the main container is only accessible at /proc/<mainpid>/root for the life of the process. Because the wait container needs to wait until after the main container completes before it starts copying artifacts, it is already too late to perform the copy. There are several techniques we could use to allow the wait sidecar continued access to the main container's filesystem after the process exits, with various security and usage implications. The following are the approaches that have been investigated, or are currently under consideration, along with their tradeoffs:

1. setns with CAP_SYS_ADMIN capability

This approach uses shareProcessNamespace: true along with the CAP_SYS_ADMIN privilege for the sidecar, and performs a setns call to obtain a handle on the main container's filesystem and mounts. The pod spec would be constructed as follows:

apiVersion: v1
kind: Pod
metadata:
  name: some-workflow-1231231231
spec:
  shareProcessNamespace: true
  containers:
  - name: main
    image: docker/whalesay
  - name: wait
    image: argoproj/argoexec:latest
    securityContext:
      capabilities:
        add:
        - SYS_ADMIN

With this technique, as soon as the wait sidecar starts, it immediately infers the main container's pid and performs a setns syscall (http://man7.org/linux/man-pages/man2/setns.2.html) against that pid, along with a recursive mount of the main container's mounts. For those familiar with nsenter, it would be similar to the command:

nsenter --target <mainpid> --mount

The SYS_ADMIN capability is required since it is only possible to perform the setns and bind mounts with that level of privilege.
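
For reference, a minimal sketch of what the sidecar would do (not the actual argoexec implementation; pid discovery is assumed to have happened already):

// Sketch: join the main container's mount namespace from the wait sidecar.
// Requires CAP_SYS_ADMIN and shareProcessNamespace: true; equivalent in
// spirit to `nsenter --target <mainpid> --mount`.
package main

import (
	"fmt"
	"os"
	"runtime"

	"golang.org/x/sys/unix"
)

func enterMountNamespace(mainPID int) error {
	// setns affects only the calling thread, so pin the goroutine to it.
	runtime.LockOSThread()

	f, err := os.Open(fmt.Sprintf("/proc/%d/ns/mnt", mainPID))
	if err != nil {
		return err
	}
	defer f.Close()

	return unix.Setns(int(f.Fd()), unix.CLONE_NEWNS)
}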

After the main process exits, the wait process still has a file handle on the main container's filesystem, and can proceed to upload files. The obvious disadvantage of this approach is that SYS_ADMIN essentially makes the sidecar a privileged container. For this reason, this approach would not be acceptable in secure environments. A second disadvantage is timing related: it is possible for the main container to run to completion before its pid is even available to the wait sidecar. In this situation, the wait sidecar would not be able to copy out the artifacts.

2. Overriding entrypoint of the main container with argoexec

The following technique would inject the argoexec binary into the main container and would additionally modify the entrypoint of the main container to be argoexec. The pod spec would be constructed similar to the following:

apiVersion: v1
kind: Pod
metadata:
  name: some-workflow-1231231231
spec:
  shareProcessNamespace: true
  initContainers:
  - name: argoexec-init
    image: argoproj/argoexec:latest
    command: [cp, /bin/argoexec, /argomnt/argoexec]
    volumeMounts:
    - mountPath: /argomnt
      name: argomnt
  containers:
  - name: main
    image: docker/whalesay
    command: [/argomnt/argoexec]
    volumeMounts:
    - mountPath: /argomnt
      name: argomnt
  - name: wait
    image: argoproj/argoexec:latest
  volumes:
  - name: argomnt
    emptyDir: {}

With this approach, the entrypoint of the main container is modified to be that of the argoexec static binary. The purpose of modifying the entrypoint is so that the executor can fork the user's desired command/args and, when the command completes, artificially extend the life of the main container while the sidecar performs the copy out.
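
A hypothetical sketch of what such an injected entrypoint wrapper could look like (the marker-file handshake on the shared emptyDir is an assumption for illustration, not the proposed protocol):

// Sketch: run the user's original command, then keep the main container alive
// until the wait sidecar signals (via an assumed marker file on the shared
// volume) that artifacts have been copied out.
package main

import (
	"os"
	"os/exec"
	"time"
)

func main() {
	// Run the user's desired command/args, inheriting stdio.
	cmd := exec.Command(os.Args[1], os.Args[2:]...)
	cmd.Stdin, cmd.Stdout, cmd.Stderr = os.Stdin, os.Stdout, os.Stderr

	exitCode := 0
	if err := cmd.Run(); err != nil {
		if ee, ok := err.(*exec.ExitError); ok {
			exitCode = ee.ExitCode()
		} else {
			exitCode = 1
		}
	}

	// Artificially extend the life of the main container until the sidecar
	// drops a marker file (path is illustrative).
	for {
		if _, err := os.Stat("/argomnt/artifacts-copied"); err == nil {
			break
		}
		time.Sleep(time.Second)
	}

	os.Exit(exitCode)
}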

The disadvantages of this approach are that (1) it is the most intrusive to the main container; (2) it would only work in the case where the container spec is explicit about its command: field. In other words, if command/args are omitted from the container spec, it would not be easy to infer what the default entrypoint of the docker image is; (3) the logic of how the argoexec binary in the main container communicates with the wait sidecar would be complicated (e.g. argoexec in the main container would need to indicate to the argoexec sidecar that the process completed and that artifacts are ready to be copied).

3. chroot /proc/<mainpid>/root

This is similar to suggestion 1, in that the wait sidecar immediately obtains a handle on the main container's filesystem for the purposes of artifact collection when the main container completes. The benefit this has over approach 1 is that chroot can be performed without any extra privileges beyond what was granted with shareProcessNamespace: true. However, the reason this approach does not work is that chroot does not propagate the volume mounts of the main container. Meaning, while the base layer of the container image is accessible using chroot, none of the layered volume mounts performed on top of it (e.g. emptyDir, PVCs) would be accessible in the chroot jail.

4. Combination of chroot with volume mounts mirrored to wait sidecar

Finally we come to the most promising approach. This technique uses two mechanisms for copying out artifacts, depending on the path. First, the workflow pod spec would be constructed such that all volume mounts of the main container are also mounted in the wait sidecar. When the wait sidecar initially starts, it performs a chroot to /proc/<mainpid>/root to secure a file handle on the main container's root filesystem. When the main container completes, the wait process still has access to the filesystem and can copy out artifacts from the base image layer. If the path to the artifact is from a volumeMount, the wait process will perform the copy from the mounted volume that was mirrored into the wait sidecar.

The only disadvantage of this technique is that it is still subject to the same timing issue described in approach 1, where it is possible for the main container to run to completion before the wait sidecar even starts and has the chance to secure the file handle. That said, this only affects a corner case where the artifact being collected is located on the base image layer, and not a volumeMount on the main container. If the artifact resides on a volume mount (including an emptyDir), then the sidecar still has the opportunity to copy the file.
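
A rough sketch of the sidecar side of this approach (under the assumption that the main pid has already been discovered; not the actual implementation):

// Sketch of approach 4: chroot into the main container's root filesystem as
// soon as its pid is known. The handle survives the main process exiting, so
// artifacts on the base image layer can still be copied afterwards; artifacts
// under a volumeMount are copied from the same volume mirrored into the
// sidecar instead.
package main

import (
	"fmt"
	"os"

	"golang.org/x/sys/unix"
)

func secureRootFileHandle(mainPID int) error {
	// CAP_SYS_CHROOT is part of the default container capability set, so no
	// privileges beyond shareProcessNamespace are needed.
	if err := unix.Chroot(fmt.Sprintf("/proc/%d/root", mainPID)); err != nil {
		return err
	}
	return os.Chdir("/")
}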

@srikumar-b

Could you let us know of any further developments on finalizing an approach from the above list?

@jessesuen
Member Author

@srikumar-b yes, I'm close to finishing approach #4.

@xianlubird
Member

@jessesuen Any update on this? Is there a PR for this issue?

@animeshsingh

Coming to this from the Kubeflow Pipelines community.
kubeflow/pipelines#678

We have phased out Docker support in IBM Kubernetes Service and are now relying on containerd. This is one blocking issue currently impacting adoption for us. Any further updates on this?

@zak-hassan

zak-hassan commented Mar 2, 2019

Seems like this might be the problem:

https://github.com/zmhassan/argo/blob/master/workflow/controller/workflowpod.go#L52-L59

Users shouldn't be required to mount hostpath.

@zak-hassan

zak-hassan commented Mar 2, 2019

Perhaps you could pass an option into your Custom Resource to disable this, as some users might be using Argo just for running workflows and might not require copying files. One option could be to have the code within the container running the job connect to S3 or some other cloud storage to store these files, or Minio could even be a drop-in option.

@llegolas

llegolas commented Mar 7, 2019

> Perhaps you could pass an option into your Custom Resource to disable this, as some users might be using Argo just for running workflows and might not require copying files. One option could be to have the code within the container running the job connect to S3 or some other cloud storage to store these files, or Minio could even be a drop-in option.

Do you know how this option can be passed, or does it exist at all?

@jessesuen
Member Author

Fixed.

The K8s, Kubelet, and PNS executors will no longer mount docker.sock. The docker executor still needs docker.sock access even when not copying out artifacts, in order to perform a docker wait to wait for the container to complete.

@animeshsingh

Thanks @jessesuen - can we now use either the K8s or PNS executor and get the same functional behaviour from Argo, if not scalability?

@jessesuen
Member Author

> can we now use either the K8s or PNS executor and get the same functional behaviour from Argo, if not scalability?

The main difference is that the K8s executor puts some additional polling load on the k8s API server while waiting for container completion, whereas PNS polls the local Linux kernel (procfs) for container completion. Also, PNS requires K8s 1.12 for process namespace sharing support (without the feature gate).
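
To illustrate the difference (a simplified sketch, not Argo's actual code), PNS-style completion detection only needs procfs:

// Sketch: with shareProcessNamespace enabled, the main process is visible in
// the sidecar's /proc, so waiting for completion is just polling for the pid
// to disappear -- no docker daemon, kubelet, or k8s API calls involved.
package main

import (
	"os"
	"strconv"
	"time"
)

func waitForMainExit(mainPID int) {
	for {
		// /proc/<pid> vanishes once the process has exited and been reaped
		// by the pod's pause container (pid 1 in the shared namespace).
		if _, err := os.Stat("/proc/" + strconv.Itoa(mainPID)); os.IsNotExist(err) {
			return
		}
		time.Sleep(time.Second)
	}
}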

I'm copying my pro/con list from the other issue.

1. Docker
   + supports all workflow examples
   + most reliable and well tested
   + very scalable; communicates with the docker daemon for heavy lifting
   - least secure; requires docker.sock of the host to be mounted (often rejected by OPA)
2. Kubelet
   + secure; cannot escape the privileges of the pod's service account
   + medium scalability; log retrieval and container polling are done against the kubelet
   - additional kubelet configuration may be required
   - can only save params/artifacts in volumes (e.g. emptyDir), and not the base image layer (e.g. /tmp)
3. K8s API
   + secure; cannot escape the privileges of the pod's service account
   + no extra configuration
   - least scalable; log retrieval and container polling are done against the k8s API server
   - can only save params/artifacts in volumes (e.g. emptyDir), and not the base image layer (e.g. /tmp)
4. PNS
   + secure; cannot escape the privileges of the pod's service account
   + artifacts can be collected from the base image layer
   + scalable; process polling is done over procfs rather than the kubelet/k8s API
   - processes will no longer run with pid 1
   - artifact collection from the base image layer may fail for containers which complete too quickly
   - cannot capture artifact directories from the base image layer which have a volume mounted under them
   - immature
