Utilize pod process namespace sharing instead of docker executor #970

Closed
jessesuen opened this issue Aug 25, 2018 · 15 comments

@jessesuen
Member

jessesuen commented Aug 25, 2018

Is this a BUG REPORT or FEATURE REQUEST?: FEATURE REQUEST

What happened:

Currently, artifact saving is performed through the docker executor, which copies artifacts out of the main container via a docker cp command. The problem with this approach is that it requires mounting the host's docker.sock, which is insecure and unacceptable in some secure environments.

In K8s, there is an (alpha) feature to share the process namespace and the filesystem between containers in a pod:
https://kubernetes.io/docs/tasks/configure-pod-container/share-process-namespace/

Instead of utilizing the docker executor, we could simply create the pod spec with shareProcessNamespace: true and access the filesystem of the main container to copy files directly. Similarly, the actual waiting for and killing of the process would only need to be done via a normal kill command from the argoexec sidecar, as opposed to a docker kill.

Using process namespace sharing provides an ideal solution that addresses a lot of the security and scalability concerns with our current docker, kubelet, and k8s API server approaches.

NOTE that this is an alpha feature and requires a feature gate (--feature-gates=PodShareProcessNamespace=true) to be configured for it to be enabled.
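
As a rough illustration of the "normal kill" point above (a sketch only, assuming the main container's pid has already been discovered by scanning the shared /proc; the grace-period handling is hypothetical):

// Sketch: with shareProcessNamespace enabled, the main container's process is
// visible in the sidecar's pid namespace, so it can be terminated with a
// plain signal instead of `docker kill`.
package main

import (
	"syscall"
	"time"
)

// killMainProcess sends SIGTERM to the main container's process and escalates
// to SIGKILL if it has not exited within the grace period.
func killMainProcess(mainPID int, grace time.Duration) error {
	if err := syscall.Kill(mainPID, syscall.SIGTERM); err != nil {
		return err
	}
	deadline := time.Now().Add(grace)
	for time.Now().Before(deadline) {
		// Signal 0 performs error checking only; ESRCH means the process is gone.
		if err := syscall.Kill(mainPID, 0); err == syscall.ESRCH {
			return nil
		}
		time.Sleep(500 * time.Millisecond)
	}
	return syscall.Kill(mainPID, syscall.SIGKILL)
}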

@jessesuen
Member Author

jessesuen commented Aug 25, 2018

/cc @edlee2121 @JulienBalestra @gaganapplatix

I don't think we need to spend any effort on a pure K8s API server executor implementation, since I think process namespace sharing will be the future of how we do artifact management, process waiting, and process killing. Longer term, I think the docker/kubelet executors will eventually go away if the argoexec sidecar can already access the main container's filesystem easily through process namespace sharing.

@edlee2121
Contributor

For copying artifacts, we would share process namespaces and access files via /proc/pid/root?

@jessesuen
Member Author

That’s right
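
For illustration only (a sketch; the pid discovery and artifact path are assumed), file access from the sidecar would look something like:

// Sketch: with a shared process namespace, the main container's root
// filesystem is reachable from the sidecar at /proc/<pid>/root while the
// main process is alive.
package main

import (
	"fmt"
	"os"
)

// readMainContainerFile reads a file from the main container's filesystem,
// e.g. readMainContainerFile(1234, "/tmp/output.txt").
func readMainContainerFile(mainPID int, path string) ([]byte, error) {
	return os.ReadFile(fmt.Sprintf("/proc/%d/root%s", mainPID, path))
}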

@jessesuen jessesuen self-assigned this Aug 29, 2018
@jessesuen jessesuen added this to the V2.3 milestone Aug 29, 2018
@srikumar-b

I have quickly tested process namespace sharing (now a beta feature, enabled by default) on k8s v1.12 and it is working as expected for accessing the other container's filesystem.

@jessesuen
Member Author

jessesuen commented Oct 30, 2018

Update on this issue.

After some deeper investigation, process namespace sharing does not solve the problem by itself. The crux of the issue is that, when the main process (container) exits, the filesystem of the main container goes away. In other words, the filesystem of the main container is only accessible at /proc/<mainpid>/root for the life of the process. Because the wait container needs to wait until after the main container completes before it starts copying artifacts, it is already too late to perform the copy. There are several techniques we could use to allow the wait sidecar continued access to the main container's filesystem after the process exits, with various security and usage implications. The following are the approaches that have been investigated, or are currently under consideration, along with their tradeoffs:

1. setns with CAP_SYS_ADMIN capability

This approach uses shareProcessNamespace: true along with the CAP_SYS_ADMIN privilege for the sidecar, and performs a setns call to obtain a handle on the main container's filesystem and mounts. The pod spec would be constructed as follows:

apiVersion: v1
kind: Pod
metadata:
  name: some-workflow-1231231231
spec:
  shareProcessNamespace: true
  containers:
  - name: main
    image: docker/whalesay
  - name: wait
    image: argoproj/argoexec:latest
    securityContext:
      capabilities:
        add:
        - SYS_ADMIN

With this technique, as soon as the wait sidecar starts, it immediately infers the main container's pid and performs a setns syscall (http://man7.org/linux/man-pages/man2/setns.2.html) against that pid, along with a recursive mount of the main container's mounts. For those familiar with nsenter, it would be similar to the command:

nsenter --target <mainpid> --mount

The SYS_ADMIN capability is required since it is only possible to perform the setns and bind mounts with that level of privilege.
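
For reference, a minimal sketch of what the sidecar would do (not the actual argoexec implementation; pid discovery is assumed to have happened already):

// Sketch: join the main container's mount namespace from the wait sidecar.
// Requires CAP_SYS_ADMIN and shareProcessNamespace: true; equivalent in
// spirit to `nsenter --target <mainpid> --mount`.
package main

import (
	"fmt"
	"os"
	"runtime"

	"golang.org/x/sys/unix"
)

func enterMountNamespace(mainPID int) error {
	// setns affects only the calling thread, so pin the goroutine to it.
	runtime.LockOSThread()

	f, err := os.Open(fmt.Sprintf("/proc/%d/ns/mnt", mainPID))
	if err != nil {
		return err
	}
	defer f.Close()

	return unix.Setns(int(f.Fd()), unix.CLONE_NEWNS)
}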

After the main process exits, the wait process still has a file handle on the main container's filesystem, and can proceed to upload files. The obvious disadvantage of this approach is that SYS_ADMIN essentially makes the sidecar a privileged container. For this reason, this approach would not be acceptable in secure environments. A second disadvantage is timing related: it is possible for the main container to run to completion before its pid is even available to the wait sidecar. In this situation, the wait sidecar would not be able to copy out the artifacts.

2. Overriding entrypoint of the main container with argoexec

The following technique would inject the argoexec binary into the main container and would additionally modify the entrypoint of the main container to be argoexec. The pod spec would be constructed similar to the following:

apiVersion: v1
kind: Pod
metadata:
  name: some-workflow-1231231231
spec:
  shareProcessNamespace: true
  initContainers:
  - name: argoexec-init
    image: argoproj/argoexec:latest
    command: [cp, /bin/argoexec, /argomnt/argoexec]
    volumeMounts:
    - mountPath: /argomnt
      name: argomnt
  containers:
  - name: main
    image: docker/whalesay
    command: [/argomnt/argoexec]
    volumeMounts:
    - mountPath: /argomnt
      name: argomnt
  - name: wait
    image: argoproj/argoexec:latest
  volumes:
  - name: argomnt
    emptyDir: {}

With this approach, the entrypoint of the main container is modified to be that of the argoexec static binary. The purpose of modifying the entrypoint is so that the executor can fork the user's desired command/args and, when the command completes, artificially extend the life of the main container while the sidecar performs the copy out.
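
A hypothetical sketch of what such an injected entrypoint wrapper could look like (the marker-file handshake on the shared emptyDir is an assumption for illustration, not the proposed protocol):

// Sketch: run the user's original command, then keep the main container alive
// until the wait sidecar signals (via an assumed marker file on the shared
// volume) that artifacts have been copied out.
package main

import (
	"os"
	"os/exec"
	"time"
)

func main() {
	// Run the user's desired command/args, inheriting stdio.
	cmd := exec.Command(os.Args[1], os.Args[2:]...)
	cmd.Stdin, cmd.Stdout, cmd.Stderr = os.Stdin, os.Stdout, os.Stderr

	exitCode := 0
	if err := cmd.Run(); err != nil {
		if ee, ok := err.(*exec.ExitError); ok {
			exitCode = ee.ExitCode()
		} else {
			exitCode = 1
		}
	}

	// Artificially extend the life of the main container until the sidecar
	// drops a marker file (path is illustrative).
	for {
		if _, err := os.Stat("/argomnt/artifacts-copied"); err == nil {
			break
		}
		time.Sleep(time.Second)
	}

	os.Exit(exitCode)
}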

The disadvantages of this approach are that (1) it is the most intrusive to the main container; (2) it would only work in the case where the container spec is explicit about its command: field. In other words, if command/args are omitted from the container spec, it would not be easy to infer what the default entrypoint of the docker image is; (3) the logic of how the argoexec binary in the main container communicates with the wait sidecar would be complicated (e.g. argoexec in the main container would need to indicate to the argoexec sidecar that the process completed and that artifacts are ready to be copied).

3. chroot /proc/<mainpid>/root

This is similar to suggestion 1, in that the wait sidecar immediately obtains a handle on the main container's filesystem for the purposes of artifact collection when the main container completes. The benefit this has over approach 1 is that chroot can be performed without any extra privileges beyond what was granted with shareProcessNamespace: true. However, the reason this approach does not work is that chroot does not propagate the volume mounts of the main container. Meaning, while the base layer of the container image is accessible using chroot, none of the layered volume mounts performed on top of it (e.g. emptyDir, PVCs) would be accessible in the chroot jail.

4. Combination of chroot with volume mounts mirrored to wait sidecar

Finally we come to the most promising approach. This technique uses two mechanisms for copying out artifacts, depending on the path. First, the workflow pod spec would be constructed such that all volume mounts of the main container are also mounted in the wait sidecar. When the wait sidecar initially starts, it performs a chroot to /proc/<mainpid>/root to secure a file handle on the main container's root filesystem. When the main container completes, the wait process still has access to the filesystem and can copy out artifacts from the base image layer. If the path to the artifact is from a volumeMount, the wait process will perform the copy from the mounted volume that was mirrored into the wait sidecar.

The only disadvantage of this technique is that it is still subject to the same timing issue described in approach 1, where it is possible for the main container to run to completion before the wait sidecar even starts and has the chance to secure the file handle. That said, this only affects a corner case where the artifact being collected is located on the base image layer, and not a volumeMount on the main container. If the artifact resides on a volume mount (including an emptyDir), then the sidecar still has the opportunity to copy the file.
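
A rough sketch of the sidecar side of this approach (under the assumption that the main pid has already been discovered; not the actual implementation):

// Sketch of approach 4: chroot into the main container's root filesystem as
// soon as its pid is known. The handle survives the main process exiting, so
// artifacts on the base image layer can still be copied afterwards; artifacts
// under a volumeMount are copied from the same volume mirrored into the
// sidecar instead.
package main

import (
	"fmt"
	"os"

	"golang.org/x/sys/unix"
)

func secureRootFileHandle(mainPID int) error {
	// CAP_SYS_CHROOT is part of the default container capability set, so no
	// privileges beyond shareProcessNamespace are needed.
	if err := unix.Chroot(fmt.Sprintf("/proc/%d/root", mainPID)); err != nil {
		return err
	}
	return os.Chdir("/")
}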

@srikumar-b

Could you let us know of any further developments on finalizing an approach from the above list?

@jessesuen
Member Author

@srikumar-b yes, I'm close to finishing approach #4.

@xianlubird
Member

@jessesuen Any update on this? Is there a PR for this issue?

@animeshsingh

Coming to this from the Kubeflow Pipelines community.
kubeflow/pipelines#678

We have phased out Docker support in IBM Kubernetes Service and are now relying on containerd. This is one blocking issue currently impacting adoption for us. Any further updates on this?

@zak-hassan

zak-hassan commented Mar 2, 2019

Seems like this might be the problem:

https://github.com/zmhassan/argo/blob/master/workflow/controller/workflowpod.go#L52-L59

Users shouldn't be required to mount hostpath.

@zak-hassan

zak-hassan commented Mar 2, 2019

Perhaps you could pass an option into your Custom Resource to disable this, as some users might be using Argo just for running workflows and might not require copying files. One option could be to have the code within the container running the job connect to S3 or some other cloud storage to store these files, or Minio could even be a drop-in option.

@llegolas

llegolas commented Mar 7, 2019

> Perhaps you could pass an option into your Custom Resource to disable this, as some users might be using Argo just for running workflows and might not require copying files. One option could be to have the code within the container running the job connect to S3 or some other cloud storage to store these files, or Minio could even be a drop-in option.

Do you know how this option can be passed, or does it exist at all?

@jessesuen
Member Author

Fixed.

The K8s, Kubelet, and PNS executors will no longer mount docker.sock. The docker executor still needs docker.sock access even when not copying out artifacts, in order to perform a docker wait to wait for the container to complete.

@animeshsingh

Thanks @jessesuen - can we now use either the K8s or PNS executor and get the same functional behaviour from Argo, if not scalability?

@jessesuen
Member Author

> can we now use either the K8s or PNS executor and get the same functional behaviour from Argo, if not scalability?

The main difference is that the K8s executor puts some additional polling load on the k8s API server while waiting for container completion, whereas PNS polls the local Linux kernel (procfs) for container completion. Also, PNS requires K8s 1.12 for process namespace sharing support (without the feature gate).
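
To illustrate the difference (a simplified sketch, not Argo's actual code), PNS-style completion detection only needs procfs:

// Sketch: with shareProcessNamespace enabled, the main process is visible in
// the sidecar's /proc, so waiting for completion is just polling for the pid
// to disappear -- no docker daemon, kubelet, or k8s API calls involved.
package main

import (
	"os"
	"strconv"
	"time"
)

func waitForMainExit(mainPID int) {
	for {
		// /proc/<pid> vanishes once the process has exited and been reaped
		// by the pod's pause container (pid 1 in the shared namespace).
		if _, err := os.Stat("/proc/" + strconv.Itoa(mainPID)); os.IsNotExist(err) {
			return
		}
		time.Sleep(time.Second)
	}
}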

I'm copying my pro/con list from the other issue.

1. Docker
   + supports all workflow examples
   + most reliable and well tested
   + very scalable; communicates with the docker daemon for heavy lifting
   - least secure; requires docker.sock of the host to be mounted (often rejected by OPA)
2. Kubelet
   + secure; cannot escape the privileges of the pod's service account
   + medium scalability; log retrieval and container polling are done against the kubelet
   - additional kubelet configuration may be required
   - can only save params/artifacts in volumes (e.g. emptyDir), and not the base image layer (e.g. /tmp)
3. K8s API
   + secure; cannot escape the privileges of the pod's service account
   + no extra configuration
   - least scalable; log retrieval and container polling are done against the k8s API server
   - can only save params/artifacts in volumes (e.g. emptyDir), and not the base image layer (e.g. /tmp)
4. PNS
   + secure; cannot escape the privileges of the pod's service account
   + artifacts can be collected from the base image layer
   + scalable; process polling is done over procfs rather than the kubelet/k8s API
   - processes will no longer run with pid 1
   - artifact collection from the base image layer may fail for containers which complete too quickly
   - cannot capture artifact directories from the base image layer which have a volume mounted under them
   - immature
