
Conversation


fhan688 commented Dec 26, 2018

What changes were proposed in this pull request?

Spark on Kubernetes currently creates the driver as a bare Pod by default, so the entire job fails when the host machine crashes; a driver running as a plain Pod cannot fail over in that situation. This change adds a configuration for the driver resource kind, supporting Pod, Deployment, and Job. For example, in streaming jobs, running the driver under a Deployment keeps the driver service highly available even when a host machine crashes; in batch jobs, a Job offers a configurable backoffLimit for retries.
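
For illustration only (not code from this patch), a minimal sketch of how such a setting might be supplied through SparkConf. The key name `spark.kubernetes.driver.resource.kind` is an assumption for this sketch; the real key and allowed values would be defined by the patch, and a submission-time property like this would normally be passed to spark-submit via `--conf` rather than set in application code.

```scala
import org.apache.spark.SparkConf

// Sketch only: "spark.kubernetes.driver.resource.kind" is a hypothetical key
// name for the setting proposed in this PR, not an upstream Spark config.
val conf = new SparkConf()
  .setAppName("streaming-driver-ha-example")
  // Ask the submission client to create the driver as a Deployment so that
  // Kubernetes reschedules it if the node hosting the driver pod crashes.
  .set("spark.kubernetes.driver.resource.kind", "Deployment")
```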

How was this patch tested?

Tested in a production environment. Starting the driver as a Deployment or Job keeps it highly available when the host machine crashes.

AmplabJenkins commented

Can one of the admins verify this patch?

liyinan926 (Contributor) commented

Using a Job to run the driver was discussed before in #21067, and we decided not to adopt that approach for the reasons given there. A Deployment has the same problem as a Job, i.e., lack of exactly-once semantics. If you need high availability and automatic restart/retry support for the driver, the K8s Spark Operator is worth taking a look at.
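
As a rough editorial illustration of the at-least-once behavior such restart-based controllers give (not Spark or Kubernetes code): if the driver crashes after performing an external side effect but before completing, the controller's retry reruns that side effect.

```scala
import scala.collection.mutable.ArrayBuffer

// Records external, non-idempotent side effects (e.g., writes to an output sink).
val sideEffects = ArrayBuffer.empty[String]

// Stand-in for a driver run; crashes after writing when asked to.
def runDriver(crashAfterWrite: Boolean): Unit = {
  sideEffects += "write output partition"
  if (crashAfterWrite) throw new RuntimeException("node crashed")
  // Completion would be recorded here on success.
}

// A Job/Deployment-style controller: restart until success (bounded here, like backoffLimit).
var attempt = 0
var done = false
while (!done && attempt < 3) {
  attempt += 1
  try {
    runDriver(crashAfterWrite = attempt == 1) // first attempt dies mid-run
    done = true
  } catch {
    case _: RuntimeException => // the controller simply reschedules the driver
  }
}

assert(sideEffects.size == 2) // the external write ran twice: at-least-once, not exactly-once
```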

vanzin (Contributor) commented Jan 2, 2019

@liyinan926 looks like you're saying this PR should be closed and the bug marked "won't fix" (or maybe duplicate)?

liyinan926 (Contributor) commented

@vanzin Yes.

vanzin (Contributor) commented Jan 2, 2019

Alright then, closing on your suggestion.

vanzin closed this Jan 2, 2019