[SPARK-4498][SPARK-2424] [WIP] Add driver -> master heartbeat to detect exited applications and fix executor failure detection logic #3548

JoshRosen · 2014-12-02T05:00:39Z

This is a WIP fix for SPARK-4498. It isn't the final fix that I want to merge in, but I'm submitting it now to get early feedback from Jenkins and reviewers. The main idea here is to add a periodic driver -> master heartbeat that both signals driver liveness and carries information on whether the driver has received executors, which allows us to implement proper "don't kill an application due to failed executors as long as it has some running executors" logic in the master.
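A minimal sketch of the master-side bookkeeping that such a heartbeat makes possible. The message shape follows the `AppClientHeartbeat` case class reported by the build bot further down the thread (with the field name used after the review rename). `HeartbeatTracker`, `timeoutMs`, and the method names are hypothetical and are not taken from the PR's code.

```scala
// Message shape as reported in this thread's build output.
case class AppClientHeartbeat(appId: String, hasRegisteredExecutors: Boolean)

// Hypothetical master-side tracker; an assumption for illustration only.
final class HeartbeatTracker(timeoutMs: Long) {
  private case class Entry(lastSeenMs: Long, hasRegisteredExecutors: Boolean)
  private val entries = scala.collection.mutable.Map.empty[String, Entry]

  /** Remember when the app last reported in and what it said about executors. */
  def record(hb: AppClientHeartbeat, nowMs: Long): Unit =
    entries(hb.appId) = Entry(nowMs, hb.hasRegisteredExecutors)

  /** Latest executor status reported by the app; false if it never reported. */
  def hasRegisteredExecutors(appId: String): Boolean =
    entries.get(appId).exists(_.hasRegisteredExecutors)

  /** Apps whose last heartbeat is older than the timeout are presumed exited. */
  def expiredApps(nowMs: Long): Set[String] =
    entries.iterator.collect {
      case (id, e) if nowMs - e.lastSeenMs > timeoutMs => id
    }.toSet
}
```

Passing the clock value in explicitly (`nowMs`) keeps the sketch deterministic under test; a real implementation would presumably read the system clock.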

See discussion at https://issues.apache.org/jira/browse/SPARK-4498 for context.

Before merging, this needs more comments and tests. Specifically, I need tests to check that the heartbeat's information actually corresponds to the right notion of application progress / liveness. There are also open questions about heartbeat interval configuration and failure thresholds. I'll edit this description to accurately reflect the PR before I remove the [WIP] tag.

/cc @markhamstra @aarondav @andrewor14 @pwendell @airhorns

JoshRosen · 2014-12-02T05:01:34Z

I'll be back later tonight to continue working on this.

SparkQA · 2014-12-02T05:05:15Z

Test build #24024 has started for PR 3548 at commit 418af7e.

This patch merges cleanly.

SparkQA · 2014-12-02T06:25:16Z

Test build #24024 has finished for PR 3548 at commit 418af7e.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class AppClientHeartbeat(appId: String, hasExecutors: Boolean)

AmplabJenkins · 2014-12-02T06:25:20Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24024/
Test PASSed.

andrewor14 · 2014-12-02T06:30:31Z

core/src/main/scala/org/apache/spark/deploy/DeployMessage.scala

Maybe hasRegisteredExecutors, to be more specific? We don't actually care whether they're running any tasks.

JoshRosen · 2014-12-02T07:12:13Z

I'm working on pulling in SPARK-2424 as well, but I've run into one minor naming snag: what do I call the new threshold? I thought of maxConsecutiveExecutorFailures but that name is kind of misleading since it implies that the application will always be terminated if more than that many failures occur, which isn't always the case (an application which reports that it has at least one registered executor will never be terminated by this mechanism). If we want to be really specific, I suppose that minConsecutiveExecutorFailuresBeforeAppFailure is better, since it conveys that at least that many failures must occur before we will consider killing the app. That's too long, though. Anyone have a quick suggestion for a better name?

Update: I'm now considering consecutiveExecutorFailuresThreshold.
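The semantics being named here can be sketched as a small policy class: the application only becomes a candidate for termination once at least the threshold number of consecutive executor failures has occurred and it reports no registered executors. `ExecutorFailurePolicy` and its methods are hypothetical names, and the assumption that a successful executor start resets the streak is mine rather than something stated in the thread.

```scala
// Hypothetical sketch; not the PR's actual failure detector.
final class ExecutorFailurePolicy(consecutiveExecutorFailuresThreshold: Int) {
  require(consecutiveExecutorFailuresThreshold > 0, "threshold must be positive")

  private var consecutiveFailures = 0

  /** Assumption: a successfully started executor breaks the failure streak. */
  def onExecutorStarted(): Unit = {
    consecutiveFailures = 0
  }

  /** An executor exited abnormally; extend the current failure streak. */
  def onExecutorFailed(): Unit = {
    consecutiveFailures += 1
  }

  /**
   * True only when the streak has reached the threshold AND the app reports
   * no registered executors. An app with a registered executor is never
   * failed by this check, however many failures it has seen.
   */
  def shouldFailApplication(hasRegisteredExecutors: Boolean): Boolean =
    !hasRegisteredExecutors &&
      consecutiveFailures >= consecutiveExecutorFailuresThreshold
}
```

Keeping the decision in a class with no actor or network dependencies is one way to make the policy easy to unit test.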

JoshRosen · 2014-12-02T07:14:22Z

core/src/main/scala/org/apache/spark/deploy/client/AppClient.scala

Do you think it would be cleaner to have the timer trigger a self-message which in turn triggers the sending of the heartbeat to the master? This adds more indirection but lets me remove the synchronization / thread-safety stuff.

Akka experts: is that a more idiomatic way of doing this?

Checked the docs and can confirm that we should be sending a self-message: http://doc.akka.io/docs/akka/snapshot/scala/scheduler.html#Some_examples. I'll fix this up now.
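The reasoning behind the self-message approach can be illustrated without Akka. In the sketch below, which is a library-free analogue and not Akka code, the timer thread only enqueues a message, and a single owner thread handles it, so the mutable state needs no locks. `MailboxClient`, `ClientMessage`, `SendHeartbeat`, `onTimerTick`, and `drain` are hypothetical names introduced for illustration.

```scala
import java.util.concurrent.LinkedBlockingQueue

// Hypothetical message types, for illustration only.
sealed trait ClientMessage
case object SendHeartbeat extends ClientMessage

final class MailboxClient(send: String => Unit) {
  // The only structure shared between the timer thread and the owner thread.
  private val mailbox = new LinkedBlockingQueue[ClientMessage]()

  // Touched only by the thread that calls drain(), so it needs no locking.
  private var heartbeatsSent = 0

  /** Called from the timer thread: enqueue a self-message and nothing else. */
  def onTimerTick(): Unit = mailbox.put(SendHeartbeat)

  /** Called from the owner thread: handle queued messages in arrival order. */
  def drain(): Int = {
    var handled = 0
    var msg = mailbox.poll()
    while (msg != null) {
      msg match {
        case SendHeartbeat =>
          heartbeatsSent += 1
          send(s"heartbeat #$heartbeatsSent")
      }
      handled += 1
      msg = mailbox.poll()
    }
    handled
  }
}
```

An actor's mailbox plays the role of the queue here: scheduling a message to `self` moves the work onto the actor's own message-processing path, which is why the explicit synchronization can be dropped.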


SparkQA · 2014-12-02T07:30:09Z

Test build #24031 has started for PR 3548 at commit 4132892.

This patch merges cleanly.

markhamstra · 2014-12-02T07:45:31Z

Seems needlessly complicated to me. I'm still doing tests, but it seems to me that all that is required is #3550

SparkQA · 2014-12-02T07:52:52Z

Test build #24034 has started for PR 3548 at commit 6cc7cad.

This patch merges cleanly.

SparkQA · 2014-12-02T08:05:10Z

Test build #24035 has started for PR 3548 at commit 9dce0d4.

This patch merges cleanly.

SparkQA · 2014-12-02T08:49:12Z

Test build #24031 has finished for PR 3548 at commit 4132892.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class AppClientHeartbeat(appId: String, hasRegisteredExecutors: Boolean)

AmplabJenkins · 2014-12-02T08:49:15Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24031/
Test PASSed.

SparkQA · 2014-12-02T09:11:28Z

Test build #24034 has finished for PR 3548 at commit 6cc7cad.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class AppClientHeartbeat(appId: String, hasRegisteredExecutors: Boolean)

AmplabJenkins · 2014-12-02T09:11:32Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24034/
Test PASSed.

SparkQA · 2014-12-02T09:24:35Z

Test build #24035 has finished for PR 3548 at commit 9dce0d4.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class AppClientHeartbeat(appId: String, hasRegisteredExecutors: Boolean)

AmplabJenkins · 2014-12-02T09:24:39Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24035/
Test PASSed.

JoshRosen · 2014-12-02T19:07:08Z

I'm going to close this for now in favor of Mark's patch (#3550). There are a couple of ideas here that might be useful for future improvements to this code (including factoring out the policy into a separate file for easier testing, which would be important if we added features like timeout-based host blacklisting), but I agree that this PR is more complex than we need for a narrow fix for this bug.

markhamstra · 2014-12-02T21:09:38Z

#3550 doesn't address SPARK-2424, so if we want to handle that issue in 1.2, then we still need a PR for it.

JoshRosen added 3 commits December 1, 2014 17:08

Factor application failure detector logic into own class; add tests.

87d7960

[SPARK-4498] [WIP] Add driver -> master heartbeat

08746eb

Revert debugging comment

418af7e

andrewor14 reviewed Dec 2, 2014

JoshRosen added 3 commits December 1, 2014 22:58

hasRunningExecutors -> hasRegisteredExecutors

14607d8

Add ": Unit = {"; fix comment typo.

d005daf

Increase spark.app.heartbeatInterval to 60 seconds

fe26761

JoshRosen reviewed Dec 2, 2014

[SPARK-2424] Make min consecutive executor failure app failure threshold configurable.

4132892

JoshRosen changed the title ~~[SPARK-4498] [WIP] Add driver -> master heartbeat to detect exited applications and fix executor failure detection logic~~ [SPARK-4498][SPARK-2424] [WIP] Add driver -> master heartbeat to detect exited applications and fix executor failure detection logic Dec 2, 2014

Send self-message instead of using synchronization from timer task

6cc7cad

Remove unnecessary synchronization

9dce0d4

JoshRosen closed this Dec 2, 2014
