
Kubernetes ThirdPartyResource for tracking Spark Jobs #3

@foxish

Description


Issues:

  • We need a way for users to find their Spark jobs, monitor their status, and kill them.
  • Other cluster managers provide an interface to view all of a user's Spark jobs in a cluster, along with their state (QUEUED, SUBMITTED, RUNNING, FINISHED, FAILED, KILLED).

Shortcomings of the current method:

  • It is not possible to get sufficient detail about a Spark job just by looking at its pods and their states.
  • Because the Spark framework currently performs the entire function of a Kubernetes controller, there is no way to find the state of running Spark jobs other than to implement spark-submit --status and spark-submit --kill, which would need special logic to locate the drivers in a particular namespace and query their status. This is a poor fit for cluster administrators, who deal with Kubernetes abstractions rather than application specifics.
  • Accessing arbitrary data from the driver pod, such as job progress or the names of its associated executors, is difficult.

Proposed Solution:

A ThirdPartyResource (TPR) that keeps track of the execution state (pending/running/failed) of each SparkJob, failure reasons if any, the identities of the associated driver and executor pods, and the configuration metadata for the job (number of executors, memory per executor, etc.).

apiVersion: extensions/v1beta1
kind: ThirdPartyResource
metadata:
  name: spark-job.kubernetes.io
  labels:
    resource: spark-job
    object: spark
description: "A resource that manages a spark job"
versions:
  - name: v1
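As an illustrative sketch of registering this resource (alongside the obvious option of kubectl create -f), the manifest above can be POSTed to the apiserver's ThirdPartyResource endpoint. The unauthenticated localhost apiserver address is an assumption made only for the sketch:

import requests

# Hypothetical, unauthenticated apiserver address; in practice this would go
# through kubectl or an authenticated client.
API_SERVER = "http://localhost:8080"

tpr = {
    "apiVersion": "extensions/v1beta1",
    "kind": "ThirdPartyResource",
    "metadata": {
        "name": "spark-job.kubernetes.io",
        "labels": {"resource": "spark-job", "object": "spark"},
    },
    "description": "A resource that manages a spark job",
    "versions": [{"name": "v1"}],
}

# Registering the TPR makes a sparkjobs endpoint available for SparkJob objects.
resp = requests.post(API_SERVER + "/apis/extensions/v1beta1/thirdpartyresources", json=tpr)
resp.raise_for_status()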

The cluster administrator is responsible for creating this resource, which makes a new API endpoint available in Kubernetes. The TPR allows us to create objects of kind SparkJob and store JSON within them. Each such object is associated with a single Spark job and stores all of the status and metadata associated with it. The driver pod is responsible for the lifecycle of its SparkJob object, from creation to deletion.

A sample object of the above kind looks like the following:

{
    "apiVersion": "kubernetes.io/v1",
    "image": "driver-image",
    "kind": "SparkJob",
    "metadata": {
        "name": "spark-driver-1924",
        "namespace": "default",
        "selfLink": "/apis/kubernetes.io/v1/namespaces/default/sparkjobs/spark-driver-1924",
        "uid": "91022bc2-a71d-11e6-a4be-42010af00002",
        "resourceVersion": "765519",
        "creationTimestamp": "2016-11-10T08:13:31Z"
    },
    "num-executors": 10,
    "state": "completed",
    "driver-pod": "driver-2ds9f"
    ...
    ...
}
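
As a sketch of how such an object could be created, the submission client or driver could POST a SparkJob like the sample above to the collection endpoint implied by its selfLink. The field values and the apiserver address here are illustrative assumptions:

import requests

API_SERVER = "http://localhost:8080"   # hypothetical, as in the sketch above
NAMESPACE = "default"

spark_job = {
    "apiVersion": "kubernetes.io/v1",
    "kind": "SparkJob",
    "metadata": {"name": "spark-driver-1924", "namespace": NAMESPACE},
    "image": "driver-image",
    "num-executors": 10,
    "state": "submitted",
    "driver-pod": "driver-2ds9f",
}

# POST to the collection endpoint implied by the selfLink in the sample above.
url = API_SERVER + "/apis/kubernetes.io/v1/namespaces/" + NAMESPACE + "/sparkjobs"
resp = requests.post(url, json=spark_job)
resp.raise_for_status()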

The driver pod has complete visibility into the progress of the job and can set the status of its SparkJob object. The driver can also watch the resource for configuration changes triggered by the user or cluster administrator. A Spark job can be killed by deleting the associated SparkJob object, which causes the driver pod to terminate its executors and clean up gracefully.

  • It makes the state of each SparkJob available outside of Spark itself and gives the cluster administrator visibility into the Spark jobs running in the system.
  • The SparkJob object can be consumed by spark-submit --status, or by a dashboard, to display details about the Spark jobs in the system.
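
A rough sketch of the driver-side interaction described above, under the same illustrative apiserver assumption: the driver updates the state field with a read-modify-write (GET then PUT, so the sketch does not assume patch support for TPR objects), and watches the sparkjobs collection so that a DELETED event for its own object triggers executor teardown:

import json
import requests

API_SERVER = "http://localhost:8080"   # hypothetical, as in the sketches above
BASE = API_SERVER + "/apis/kubernetes.io/v1/namespaces/default/sparkjobs"
JOB_NAME = "spark-driver-1924"

# Update the job's state with a read-modify-write (GET, edit, PUT back).
job = requests.get(BASE + "/" + JOB_NAME).json()
job["state"] = "completed"
requests.put(BASE + "/" + JOB_NAME, json=job).raise_for_status()

# Watch the sparkjobs collection; a DELETED event for this driver's own object
# is the signal to terminate executors and clean up.
watch = requests.get(BASE, params={"watch": "true"}, stream=True)
for line in watch.iter_lines():
    if not line:
        continue
    event = json.loads(line)
    obj = event.get("object", {})
    if event.get("type") == "DELETED" and obj.get("metadata", {}).get("name") == JOB_NAME:
        # terminate executors here, then exit gracefully
        break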

Further thought:

  • What if the driver pods do not exit cleanly (or are force-killed)? Who is responsible for cleaning up the SparkJob object?
