[SPARK-4498][SPARK-2424] [WIP] Add driver -> master heartbeat to detect exited applications and fix executor failure detection logic #3548
I'll be back later tonight to continue working on this.
Test build #24024 has started for PR 3548 at commit
Test build #24024 has finished for PR 3548 at commit
Test PASSed.
Maybe hasRegisteredExecutors, to be more specific? We don't actually care whether they're running any tasks.
I'm working on pulling in SPARK-2424 as well, but I've run into one minor naming snag: what do I call the new threshold? I thought of Update: I'm now considering
Do you think it would be cleaner to have the timer trigger a self-message which triggers the sending of the heartbeat to the driver? This adds more indirection but lets me remove the synchronization / thread-safety stuff.
Akka experts: is that a more idiomatic way of doing this?
Checked the docs and can confirm that we should be sending a self-message: http://doc.akka.io/docs/akka/snapshot/scala/scheduler.html#Some_examples. I'll fix this up now.
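To illustrate the self-message idea from this thread, here is a minimal sketch written in plain Python rather than Akka/Scala. It is not the code from this PR: the class `HeartbeatActor`, the message names `SEND_HEARTBEAT` and `STOP`, and the `send` callback are all hypothetical. The point it tries to show is that the timer thread only enqueues a message, so the actor's state is touched by a single thread and needs no locking.

```python
import queue
import threading

SEND_HEARTBEAT = "SendHeartbeat"  # hypothetical self-message name
STOP = "Stop"                     # hypothetical shutdown message


class HeartbeatActor:
    """Toy single-threaded actor: all state is mutated only inside run()."""

    def __init__(self, interval_s, send):
        self.mailbox = queue.Queue()
        self.interval_s = interval_s
        self.send = send  # callback standing in for the real heartbeat send
        self.sent = 0     # touched only by the thread running run(), so no lock

    def _schedule_tick(self):
        # The timer thread never touches actor state; it only enqueues a message.
        timer = threading.Timer(
            self.interval_s, self.mailbox.put, args=(SEND_HEARTBEAT,)
        )
        timer.daemon = True
        timer.start()

    def run(self):
        self._schedule_tick()
        while True:
            msg = self.mailbox.get()
            if msg == STOP:
                return
            if msg == SEND_HEARTBEAT:
                self.sent += 1
                self.send(self.sent)
                self._schedule_tick()  # re-arm the timer for the next heartbeat
```

The trade-off is the one described above: one extra hop through the mailbox in exchange for dropping the synchronization around shared state.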
…old configurable.
Test build #24031 has started for PR 3548 at commit
Seems needlessly complicated to me. I'm still doing tests, but it seems to me that all that is required is #3550.
Test build #24034 has started for PR 3548 at commit
Test build #24035 has started for PR 3548 at commit
Test build #24031 has finished for PR 3548 at commit
Test PASSed.
Test build #24034 has finished for PR 3548 at commit
Test PASSed.
Test build #24035 has finished for PR 3548 at commit
Test PASSed.
I'm going to close this for now in favor of Mark's patch (#3550). There are a couple of ideas here that might be useful for future improvements to this code (including factoring out the policy into a separate file for easier testing, which would be important if we added features like timeout-based host blacklisting), but I agree that this PR is more complex than we need for a narrow fix for this bug.
#3550 doesn't address SPARK-2424, so if we want to handle that issue in 1.2, we still need a PR for it.
This is a WIP fix for SPARK-4498; this isn't the final fix that I want to merge in, but I'm submitting this now to get early feedback from Jenkins and reviewers. The main idea here is to add a periodic driver -> master heartbeat that both signals driver liveness and carries information on whether the driver has received executors, which allows us to implement proper "don't kill an application due to failed executors as long as it has some running executors" logic in the master.
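As a rough illustration of the policy described above, the following Python sketch models the master-side bookkeeping. It is not Spark's implementation: `MasterPolicy`, `AppInfo`, and the parameters `heartbeat_timeout_s` and `max_executor_failures` are hypothetical names and thresholds chosen for the example. It assumes each heartbeat carries a single boolean saying whether the driver currently has executors.

```python
from dataclasses import dataclass


@dataclass
class AppInfo:
    last_heartbeat_s: float
    has_executors: bool = False
    executor_failures: int = 0


class MasterPolicy:
    """Hypothetical master-side policy; names and thresholds are assumptions."""

    def __init__(self, heartbeat_timeout_s, max_executor_failures):
        self.heartbeat_timeout_s = heartbeat_timeout_s
        self.max_executor_failures = max_executor_failures
        self.apps = {}

    def register(self, app_id, now_s):
        self.apps[app_id] = AppInfo(last_heartbeat_s=now_s)

    def on_heartbeat(self, app_id, has_executors, now_s):
        # The heartbeat both proves liveness and reports executor status.
        app = self.apps[app_id]
        app.last_heartbeat_s = now_s
        app.has_executors = has_executors

    def on_executor_failed(self, app_id):
        """Return True if the application should be removed."""
        app = self.apps[app_id]
        app.executor_failures += 1
        # Do not kill an application that still reports some executors.
        return (
            app.executor_failures >= self.max_executor_failures
            and not app.has_executors
        )

    def timed_out_apps(self, now_s):
        # Applications whose driver has gone silent are treated as exited.
        return [
            app_id
            for app_id, app in self.apps.items()
            if now_s - app.last_heartbeat_s > self.heartbeat_timeout_s
        ]
```

Keeping the decision in a small class with explicit timestamps, instead of reading the clock inside the policy, is one way to make this kind of logic testable in isolation.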
See discussion at https://issues.apache.org/jira/browse/SPARK-4498 for context.
Before merging, this needs more comments and tests. Specifically, I need tests to check that the heartbeat's information actually corresponds to the right notion of application progress / liveness. There are also open questions about heartbeat interval configuration and failure thresholds. I'll edit this description to accurately reflect the PR before I remove the [WIP] tag.
/cc @markhamstra @aarondav @andrewor14 @pwendell @airhorns