Skip to content

Conversation

@JoshRosen
Copy link
Contributor

This is a WIP fix for SPARK-4498; this isn't the final fix that I want to merge in, but I'm submitting this now to get early feedback from Jenkins and reviewers. The main idea here is to add a periodic driver -> master heartbeat that both signals driver liveness and carries information on whether it the driver has received executors, which allows us to implement proper "don't kill an application due to failed executors as long as it has some running executors" logic in the master.

See discussion at https://issues.apache.org/jira/browse/SPARK-4498 for context.

Before merging, this needs more comments and tests. Specifically, I need tests to check that the heartbeat's information actually corresponds to the right notion of application progress / liveness. There's also open questions about heartbeat interval configuration and failure thresholds. I'll edit this description to accurately reflect the PR before I remove the [WIP] tag.

/cc @markhamstra @aarondav @andrewor14 @pwendell @airhorns

@JoshRosen
Copy link
Contributor Author

I'll be back later tonight to continue working on this.

@SparkQA
Copy link

SparkQA commented Dec 2, 2014

Test build #24024 has started for PR 3548 at commit 418af7e.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Dec 2, 2014

Test build #24024 has finished for PR 3548 at commit 418af7e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class AppClientHeartbeat(appId: String, hasExecutors: Boolean)

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24024/
Test PASSed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe hasRegisteredExecutors to be more specific. We don't actually care if they're running any tasks

@JoshRosen
Copy link
Contributor Author

I'm working on pulling in SPARK-2424 as well, but I've run into one minor naming snag: what do I call the new threshold? I thought of maxConsecutiveExecutorFailures but that name is kind of misleading since it implies that the application will always be terminated if more than that many failures occur, which isn't always the case (an application which reports that it has at least one registered executor will never be terminated by this mechanism). If we want to be really specific, I suppose that minConsecutiveExecutorFailuresBeforeAppFailure is better, since it conveys that at least that many failures must occur before we will consider killing the app. That's too long, though. Anyone have a quick suggestion for a better name?

Update: I'm now considering consecutiveExecutorFailuresThreshold.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think it would be cleaner to have the timer trigger a self-message which triggers the sending of the heartbeat to the driver? This adds more indirection but lets me remove the synchronization / thread-safety stuff.

Akka experts: is that a more idiomatic way of doing this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Checked the docs and can confirm that we should be sending a self-message: http://doc.akka.io/docs/akka/snapshot/scala/scheduler.html#Some_examples. I'll fix this up now.

@JoshRosen JoshRosen changed the title [SPARK-4498] [WIP] Add driver -> master heartbeat to detect exited applications and fix executor failure detection logic [SPARK-4498][SPARK-2424] [WIP] Add driver -> master heartbeat to detect exited applications and fix executor failure detection logic Dec 2, 2014
@SparkQA
Copy link

SparkQA commented Dec 2, 2014

Test build #24031 has started for PR 3548 at commit 4132892.

  • This patch merges cleanly.

@markhamstra
Copy link
Contributor

Seems needlessly complicated to me. I'm still doing tests, but it seems to me that all that is required is #3550

@SparkQA
Copy link

SparkQA commented Dec 2, 2014

Test build #24034 has started for PR 3548 at commit 6cc7cad.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Dec 2, 2014

Test build #24035 has started for PR 3548 at commit 9dce0d4.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Dec 2, 2014

Test build #24031 has finished for PR 3548 at commit 4132892.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class AppClientHeartbeat(appId: String, hasRegisteredExecutors: Boolean)

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24031/
Test PASSed.

@SparkQA
Copy link

SparkQA commented Dec 2, 2014

Test build #24034 has finished for PR 3548 at commit 6cc7cad.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class AppClientHeartbeat(appId: String, hasRegisteredExecutors: Boolean)

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24034/
Test PASSed.

@SparkQA
Copy link

SparkQA commented Dec 2, 2014

Test build #24035 has finished for PR 3548 at commit 9dce0d4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class AppClientHeartbeat(appId: String, hasRegisteredExecutors: Boolean)

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24035/
Test PASSed.

@JoshRosen
Copy link
Contributor Author

I'm going to close this for now in favor of Mark's patch (#3550). There are a couple of ideas here that might be useful for future improvements to this code (including factoring out the policy into a separate file for easier testing, which would be important if we added features like timeout-based host blacklisting), but I agree that this PR is more complex than we need for a narrow fix for this bug.

@JoshRosen JoshRosen closed this Dec 2, 2014
@markhamstra
Copy link
Contributor

#3550 doesn't address SPARK-2424; so if we want to handle that issue in 1.2, then we still need a PR for it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants