
Add Node Affinity for TaskRuns that share PVC workspace #2630

Commits on May 22, 2020

  1. Add Node Affinity for TaskRuns that share PVC workspace

    TaskRuns within a PipelineRun may share files using a workspace volume.
    The typical case is files from a git-clone operation. Tasks in a CI-pipeline often
    perform operations on the filesystem, e.g. generate files or analyze files,
    so the workspace abstraction is very useful.
    
    The Kubernetes way of using file volumes is through [PersistentVolumeClaims](https://kubernetes.io/docs/concepts/storage/persistent-volumes/#persistentvolumeclaims).
    PersistentVolumeClaims use PersistentVolumes with different [access modes](https://kubernetes.io/docs/concepts/storage/persistent-volumes/#access-modes).
    The most commonly available PV access mode is ReadWriteOnce; volumes with this
    access mode can only be mounted on one Node at a time.
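
    For reference, a minimal PVC requesting the widely available ReadWriteOnce
    access mode might look like the sketch below (the name and size are illustrative):

    ```yaml
    # Illustrative PVC using the ReadWriteOnce access mode; the resulting
    # volume can only be mounted on one Node at a time.
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: shared-workspace   # hypothetical name
    spec:
      accessModes:
        - ReadWriteOnce
      resources:
        requests:
          storage: 1Gi
    ```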
    
    When using parallel Tasks in a Pipeline, the pods for the TaskRuns are
    scheduled to any Node, most likely not to the same Node in a cluster.
    Since volumes with the commonly available ReadWriteOnce access mode cannot
    be used by multiple Nodes at a time, these "parallel" pods are forced to
    execute sequentially, because the volume is only available on one Node at a time.
    This may cause your TaskRuns to time out.
    
    Clusters are often _regional_, e.g. they are deployed across 3 Availability
    Zones, but Persistent Volumes are often _zonal_, i.e. they are only available
    to the Nodes within a single zone. Some cloud providers offer regional PVs,
    but sometimes regional PVs are only replicated to one additional zone, i.e. not to
    all 3 zones within a region. This works fine for most typical stateful applications,
    but Tekton uses storage in a different way - it is designed so that multiple pods
    access the same volume, in sequence or in parallel.
    
    This makes it difficult to design a Pipeline that starts with parallel tasks, each
    using its own PVC, followed by a common task that mounts the volumes from the earlier
    tasks - because if those tasks were scheduled to different zones, the common task
    cannot mount the PVCs that are now located in different zones, and
    the PipelineRun is deadlocked.
    
    There are a few technical solutions that offer parallel executions of Tasks
    even when sharing PVC workspace:
    
    - Using PVC access mode ReadWriteMany. But this access mode is not widely available,
      and is typically backed by an NFS server or another not so "cloud native" solution.
    
    - An alternative is to use storage that is tied to a specific node, e.g. a local volume,
      and then configure pods to be scheduled to that node, but this is not commonly
      available and it has drawbacks, e.g. the pod may need to consume and mount a whole
      disk of several hundred GB.
    
    Consequently, it would be good to find a way so that TaskRun pods that share a
    workspace are scheduled to the same Node - thereby making it easy to use parallel
    tasks with a workspace - while executing concurrently - on widely available Kubernetes
    cluster and storage configurations.
    
    A few alternative solutions have been considered, as documented in tektoncd#2586.
    However, they all have major drawbacks, e.g. major API and contract changes.
    
    This commit introduces an "Affinity Assistant" - a minimal placeholder-pod,
    so that it is possible to use [Kubernetes inter-pod affinity](https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#inter-pod-affinity-and-anti-affinity) for TaskRun pods that need to be scheduled to the same Node.
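
    As a sketch of the idea (the label keys and values below are assumptions, not
    necessarily the exact ones used by the implementation), a TaskRun pod can carry
    podAffinity towards the placeholder pod for its workspace:

    ```yaml
    # Sketch: require this TaskRun pod to land on the same Node as the
    # Affinity Assistant pod for its workspace (labels are assumed).
    affinity:
      podAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app.kubernetes.io/component: affinity-assistant   # assumed label
                tekton.dev/workspace: shared-workspace            # hypothetical label
            topologyKey: kubernetes.io/hostname
    ```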
    
    This solution has several benefits: it does not introduce any API changes,
    it does not break or change any existing Tekton concepts, and it is
    implemented with very few changes. Additionally, it can be disabled with a feature-flag.
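
    The flag lives in Tekton's feature-flags ConfigMap; the exact key name below is
    an assumption based on the feature's name, not confirmed by this commit message:

    ```yaml
    # Assumed flag name; setting it to "true" would turn the Affinity Assistant off.
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: feature-flags
      namespace: tekton-pipelines
    data:
      disable-affinity-assistant: "false"
    ```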
    
    **How it works:** When a PipelineRun is initiated, an "Affinity Assistant" is
    created for each PVC workspace volume. TaskRun pods that share a workspace
    volume are configured with podAffinity to the "Affinity Assistant" pod that
    was created for the volume. The "Affinity Assistant" lives until the
    PipelineRun is completed, or deleted. "Affinity Assistant" pods are
    configured with podAntiAffinity to repel other "Affinity Assistants" -
    in a Best Effort fashion.
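
    Best Effort repelling corresponds to *preferred* (soft) rather than *required*
    anti-affinity, roughly as in this sketch (the label is assumed):

    ```yaml
    # Sketch: spread Affinity Assistants across Nodes on a best-effort basis
    # by using preferred (soft) podAntiAffinity against other assistants.
    affinity:
      podAntiAffinity:
        preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchLabels:
                  app.kubernetes.io/component: affinity-assistant   # assumed label
              topologyKey: kubernetes.io/hostname
    ```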
    
    The Affinity Assistant is a _singleton_ workload, since it acts as a
    placeholder pod and TaskRun pods with affinity to it must be scheduled to the
    same Node. It is implemented with [QoS class Guaranteed](https://kubernetes.io/docs/tasks/configure-pod-container/quality-service-pod/#create-a-pod-that-gets-assigned-a-qos-class-of-guaranteed) but with minimal resource requests -
    since it does not do any work other than being a placeholder.
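
    Guaranteed QoS follows from setting requests equal to limits for every container
    in the pod; minimal values could look like this (the numbers are illustrative):

    ```yaml
    # requests == limits for all resources gives the pod QoS class Guaranteed;
    # the values are kept small since the pod only acts as a placeholder.
    resources:
      requests:
        cpu: 50m
        memory: 100Mi
      limits:
        cpu: 50m
        memory: 100Mi
    ```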
    
    Singleton workloads can be implemented in multiple ways, and they differ
    in behavior when the Node becomes unreachable:
    
    - as a Pod - the Pod is not managed, so it will not be recreated.
    - as a Deployment - the Pod will be recreated, putting Availability before
      the singleton property
    - as a StatefulSet - the Pod will be recreated, putting the singleton
      property before Availability
    
    Therefore the Affinity Assistant is implemented as a StatefulSet.
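
    A minimal sketch of the placeholder as a single-replica StatefulSet (names,
    labels and image are assumptions):

    ```yaml
    # Sketch of the Affinity Assistant as a singleton StatefulSet. A StatefulSet
    # does not start a replacement Pod until the old one is confirmed gone,
    # which preserves the "at most one placeholder" property.
    apiVersion: apps/v1
    kind: StatefulSet
    metadata:
      name: affinity-assistant-shared-workspace   # hypothetical name
    spec:
      replicas: 1
      serviceName: affinity-assistant
      selector:
        matchLabels:
          app.kubernetes.io/component: affinity-assistant
      template:
        metadata:
          labels:
            app.kubernetes.io/component: affinity-assistant
        spec:
          containers:
            - name: affinity-assistant
              image: nginx   # any small placeholder image would do
              resources:
                requests:
                  cpu: 50m
                  memory: 100Mi
                limits:
                  cpu: 50m
                  memory: 100Mi
    ```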
    
    Essentially, this commit provides an effortless way to use functional
    task parallelism with any Kubernetes cluster that has any PVC-based
    storage.
    
    Solves tektoncd#2586
    /kind feature
    jlpettersson committed May 22, 2020
    77db014