
Kubernetes ThirdPartyResource for tracking Spark Jobs #3

@foxish

Description


Issues:

  • We need a way for users to find their Spark jobs, monitor their status, and kill them.
  • Other cluster managers provide an interface to view all of a user's Spark jobs in a cluster, along with their state (QUEUED, SUBMITTED, RUNNING, FINISHED, FAILED, KILLED).

Shortcomings of the current method:

  • It is not possible to get sufficient detail about a Spark job just by looking at its pods and their states.
  • Because the Spark framework currently performs the entire function of a Kubernetes controller, there is no way to find the state of running Spark jobs other than to implement spark-submit --status and spark-submit --kill, which would need special logic to locate the drivers in a particular namespace and query their status. This is a poor fit for cluster administrators, who deal with Kubernetes abstractions rather than application specifics.
  • Accessing arbitrary data from the driver pod, such as job progress or the names of its associated executors, is difficult.

Proposed Solution:

A ThirdPartyResource (TPR) that keeps track of the execution state (pending/running/failed) of each SparkJob, failure reasons if any, the identities of the associated driver and executor pods, and the configuration metadata for the job (number of executors, memory per executor, etc.).

apiVersion: extensions/v1beta1
kind: ThirdPartyResource
metadata:
  name: spark-job.kubernetes.io
  labels:
    resource: spark-job
    object: spark
description: "A resource that manages a spark job"
versions:
  - name: v1
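As an illustrative sketch of registering this resource (alongside the obvious option of kubectl create -f), the manifest above can be POSTed to the apiserver's ThirdPartyResource endpoint. The unauthenticated localhost apiserver address is an assumption made only for the sketch:

import requests

# Hypothetical, unauthenticated apiserver address; in practice this would go
# through kubectl or an authenticated client.
API_SERVER = "http://localhost:8080"

tpr = {
    "apiVersion": "extensions/v1beta1",
    "kind": "ThirdPartyResource",
    "metadata": {
        "name": "spark-job.kubernetes.io",
        "labels": {"resource": "spark-job", "object": "spark"},
    },
    "description": "A resource that manages a spark job",
    "versions": [{"name": "v1"}],
}

# Registering the TPR makes a sparkjobs endpoint available for SparkJob objects.
resp = requests.post(API_SERVER + "/apis/extensions/v1beta1/thirdpartyresources", json=tpr)
resp.raise_for_status()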

The cluster administrator is responsible for creating this resource, which makes a new API endpoint available in Kubernetes. The TPR allows us to create objects of kind SparkJob and store JSON within them. Each such object is associated with a single Spark job and stores all of the status and metadata associated with it. The driver pod is responsible for the lifecycle of its SparkJob object, from creation to deletion.

A sample object of the above kind looks like the following:

{
    "apiVersion": "kubernetes.io/v1",
    "image": "driver-image",
    "kind": "SparkJob",
    "metadata": {
        "name": "spark-driver-1924",
        "namespace": "default",
        "selfLink": "/apis/kubernetes.io/v1/namespaces/default/sparkjobs/spark-driver-1924",
        "uid": "91022bc2-a71d-11e6-a4be-42010af00002",
        "resourceVersion": "765519",
        "creationTimestamp": "2016-11-10T08:13:31Z"
    },
    "num-executors": 10,
    "state": "completed",
    "driver-pod": "driver-2ds9f"
    ...
    ...
}
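
As a sketch of how such an object could be created, the submission client or driver could POST a SparkJob like the sample above to the collection endpoint implied by its selfLink. The field values and the apiserver address here are illustrative assumptions:

import requests

API_SERVER = "http://localhost:8080"   # hypothetical, as in the sketch above
NAMESPACE = "default"

spark_job = {
    "apiVersion": "kubernetes.io/v1",
    "kind": "SparkJob",
    "metadata": {"name": "spark-driver-1924", "namespace": NAMESPACE},
    "image": "driver-image",
    "num-executors": 10,
    "state": "submitted",
    "driver-pod": "driver-2ds9f",
}

# POST to the collection endpoint implied by the selfLink in the sample above.
url = API_SERVER + "/apis/kubernetes.io/v1/namespaces/" + NAMESPACE + "/sparkjobs"
resp = requests.post(url, json=spark_job)
resp.raise_for_status()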

The driver pod has complete visibility into the progress of the job and can set the status of its SparkJob object. The driver can also watch the resource for configuration changes triggered by the user or cluster administrator. A Spark job can be killed by deleting the associated SparkJob object, which causes the driver pod to terminate its executors and clean up gracefully.

  • It makes the state of each SparkJob available outside of Spark itself and gives the cluster administrator visibility into the Spark jobs running in the system.
  • The SparkJob object can be consumed by spark-submit --status, or by a dashboard, to display details about the Spark jobs in the system.
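
A rough sketch of the driver-side interaction described above, under the same illustrative apiserver assumption: the driver updates the state field with a read-modify-write (GET then PUT, so the sketch does not assume patch support for TPR objects), and watches the sparkjobs collection so that a DELETED event for its own object triggers executor teardown:

import json
import requests

API_SERVER = "http://localhost:8080"   # hypothetical, as in the sketches above
BASE = API_SERVER + "/apis/kubernetes.io/v1/namespaces/default/sparkjobs"
JOB_NAME = "spark-driver-1924"

# Update the job's state with a read-modify-write (GET, edit, PUT back).
job = requests.get(BASE + "/" + JOB_NAME).json()
job["state"] = "completed"
requests.put(BASE + "/" + JOB_NAME, json=job).raise_for_status()

# Watch the sparkjobs collection; a DELETED event for this driver's own object
# is the signal to terminate executors and clean up.
watch = requests.get(BASE, params={"watch": "true"}, stream=True)
for line in watch.iter_lines():
    if not line:
        continue
    event = json.loads(line)
    obj = event.get("object", {})
    if event.get("type") == "DELETED" and obj.get("metadata", {}).get("name") == JOB_NAME:
        # terminate executors here, then exit gracefully
        break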

Further thought:

  • What if the driver pods do not exit cleanly (or are force-killed)? Who is responsible for cleaning up the SparkJob object?
