[SPARK-7831][Mesos] Added flag to shutdown driver when mesos dispatch… #10701
Instead of killing the driver without failover, we could also start it without failover.
In the start method, use:
val driver = createSchedulerDriver(
master,
MesosClusterScheduler.this,
Utils.getCurrentUserName(),
appName,
conf,
Some(frameworkUrl),
Some(driverFailOver), // <-- with or without checkpoint data
Some(if (driverFailOver) Double.MaxValue else 0.0), // <-- timeout for failover recovery
fwId)
Great find @skyluc
I will make the change
|
jenkins please test |
|
Besides @skyluc's comments and mine, I think this patch LGTM. Have you tested this, btw? |
|
Yes. I have tested this and it seems to work. I will make the necessary changes |
|
ok to test |
|
I confirm that the framework deregisters from Mesos. However, I don't see the old behavior anymore, where the framework stays even after stopping it. The new flag seems to have no effect. The framework is gone. |
|
Test build #50096 has finished for PR 10701 at commit
|
|
@dragos you mean the framework no longer shows up in the UI? the console output doesn't seem to suggest it's gone. |
|
|
I've tested this myself, and it is indeed now doing the correct thing when the flag is not added. I'll need to dig more. @nraychaudhuri, have you tried this as well? |
|
I just found out that this is actually a bug in Mesos, which cannot store a duration larger than what fits in an int64_t. I filed a Mesos JIRA for this (https://issues.apache.org/jira/browse/MESOS-4862). |
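To illustrate why `Double.MaxValue` is a problematic timeout, here is a minimal sketch. It assumes the failover timeout (given in seconds) ends up as a nanosecond count in a signed 64-bit integer; that storage detail is an assumption made for illustration, and `FailoverTimeoutSketch`, `fitsInInt64Nanos`, and `clampTimeout` are hypothetical names that are not part of Spark or Mesos.

```scala
object FailoverTimeoutSketch {
  // Assumption: the timeout in seconds is converted to nanoseconds
  // and stored in a signed 64-bit integer.
  private val NanosPerSecond: Long = 1000000000L

  // Largest whole number of seconds whose nanosecond count fits in a Long.
  // Integer division keeps the bound exact and conservative.
  val MaxRepresentableSeconds: Double = (Long.MaxValue / NanosPerSecond).toDouble

  // True if the timeout can be represented under the assumption above.
  def fitsInInt64Nanos(seconds: Double): Boolean =
    seconds >= 0.0 && seconds <= MaxRepresentableSeconds

  // Hypothetical guard: clamp a requested timeout into the representable range.
  def clampTimeout(seconds: Double): Double =
    math.min(math.max(seconds, 0.0), MaxRepresentableSeconds)
}
```

Under this assumption, `Double.MaxValue` falls far outside the representable range, while a value such as `Int.MaxValue.toDouble` (roughly 68 years in seconds) fits comfortably.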
|
OK, let's not add a flag if it's a bug in Mesos. In the meantime, before they fix it downstream, we can use the workaround @tnachen suggested. |
|
Sounds good. Who can close this PR? |
|
@nraychaudhuri can you close this PR? |
|
@andrewor14 @nraychaudhuri @dragos Sorry, I'm not suggesting we close this PR. We still need the flag, since we want to be able to choose whether or not to fail over automatically. We only need to revert the one line where the PR changed the timeout to Double.MaxValue. |
|
@nraychaudhuri can you update or close this PR then? |
|
Seems like @nraychaudhuri is busy, so I'll take over this PR and update it myself. We definitely need this to be merged, as it's quite useful for testing.
Closing the following PRs due to requests or unresponsive users. Closes apache#13923 Closes apache#14462 Closes apache#13123 Closes apache#14423 (requested by srowen) Closes apache#14424 (requested by srowen) Closes apache#14101 (requested by jkbradley) Closes apache#10676 (requested by srowen) Closes apache#10943 (requested by yhuai) Closes apache#9936 Closes apache#10701
Fix for SPARK-7831