[SPARK-7831][Mesos] Added flag to shutdown driver when mesos dispatch… #10701

nraychaudhuri · 2016-01-11T19:42:27Z

…er is stopped

nraychaudhuri · 2016-01-13T14:16:57Z

@dragos @tnachen @skyluc Could you please take a look at this one?

skyluc · 2016-01-14T14:41:13Z

core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosClusterScheduler.scala

Instead of killing without failover, we could also start it without failover.

In the start method, we could use:

val driver = createSchedulerDriver(
  master,
  MesosClusterScheduler.this,
  Utils.getCurrentUserName(),
  appName,
  conf,
  Some(frameworkUrl),
  Some(driverFailOver), // <-- with or without checkpoint data
  Some(if (driverFailOver) Double.MaxValue else 0.0), // <-- timeout for failover recovery
  fwId)

nraychaudhuri · 2016-01-14

Great find, @skyluc. I will make the change.
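
For readers following the suggestion above, here is a minimal, self-contained sketch of how a single boolean flag could drive both the checkpoint setting and the failover timeout. The object `DriverFailoverSettings`, the case class `Settings`, and the method `fromFlag` are hypothetical names used only for illustration; they are not part of Spark or Mesos. The values mirror the snippet as suggested, including `Double.MaxValue`, which the thread later reports as triggering a Mesos bug.

```scala
object DriverFailoverSettings {
  // Hypothetical holder for the two values the suggestion derives from the flag.
  final case class Settings(
      checkpoint: Option[Boolean],      // with or without checkpoint data
      failoverTimeout: Option[Double])  // timeout for failover recovery, in seconds

  // Mirrors the two flag-dependent arguments in the suggested call:
  // failover on  -> checkpoint enabled, very large timeout
  // failover off -> checkpoint disabled, zero timeout
  def fromFlag(driverFailOver: Boolean): Settings =
    Settings(
      checkpoint = Some(driverFailOver),
      failoverTimeout = Some(if (driverFailOver) Double.MaxValue else 0.0))
}
```

With the flag off, the timeout is zero, so the framework would presumably be removed as soon as the dispatcher disconnects, rather than being killed explicitly.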

tnachen · 2016-01-14T17:58:49Z

jenkins please test

tnachen · 2016-01-14T17:59:15Z

Besides @skyluc's comments and mine, I think this patch LGTM. Have you tested this, btw?

nraychaudhuri · 2016-01-14T18:09:57Z

Yes. I have tested this and it seems to work. I will make the necessary changes

dragos · 2016-01-26T10:15:56Z

ok to test

dragos · 2016-01-26T10:23:29Z

I confirm that the framework deregisters from Mesos. However, I don't see the old behavior anymore, where the framework stays even after stopping it. The new flag seems to have no effect.

$ sbin/start-mesos-dispatcher.sh  --master mesos://lausanne1.local:5050
starting org.apache.spark.deploy.mesos.MesosClusterDispatcher, logging to /Users/dragos/workspace/Spark/dev/spark/logs/spark-dragos-org.apache.spark.deploy.mesos.MesosClusterDispatcher-1-sagitarius.local.out
$ sbin/stop-mesos-dispatcher.sh 
stopping org.apache.spark.deploy.mesos.MesosClusterDispatcher

The framework is gone.

SparkQA · 2016-01-26T12:11:56Z

Test build #50096 has finished for PR 10701 at commit 9002258.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

tnachen · 2016-02-14T09:00:34Z

@dragos you mean the framework no longer shows up in the UI? the console output doesn't seem to suggest it's gone.

dragos · 2016-02-14T17:09:46Z

Yes, that's what I mean.

tnachen · 2016-03-02T01:52:17Z

I've tested this myself, and it is indeed not doing the correct behavior when the flag is not added. I'll need to dig more. @nraychaudhuri, have you tried this as well?

tnachen · 2016-03-04T01:01:31Z

I just found out that this is actually a bug in Mesos, where we cannot store a duration larger than what fits in an int64_t. I filed a Mesos JIRA for this (https://issues.apache.org/jira/browse/MESOS-4862).
As a workaround, please use Integer.MAX_VALUE instead of Double.MAX_VALUE, which is what I did before; I had forgotten about hitting this in the past. We should also leave a comment to make sure we don't change this until it's fixed.
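
To make the arithmetic behind the workaround concrete, here is a small sketch. It assumes, as an illustration and not something confirmed in this thread, that the duration is held internally as a signed 64-bit count of nanoseconds. The object `FailoverTimeoutCheck` and the method `fitsInInt64Nanos` are hypothetical names.

```scala
object FailoverTimeoutCheck {
  // Assumption: the timeout (given in seconds) is converted to nanoseconds
  // and stored in a signed 64-bit integer.
  val NanosPerSecond: BigDecimal = BigDecimal(1000000000L)
  val MaxNanos: BigDecimal = BigDecimal(Long.MaxValue)

  // True if the timeout, expressed in nanoseconds, does not exceed Long.MaxValue.
  // Only finite, non-negative inputs are meaningful here.
  def fitsInInt64Nanos(timeoutSeconds: Double): Boolean =
    BigDecimal(timeoutSeconds) * NanosPerSecond <= MaxNanos
}
```

Under that assumption, `Int.MaxValue` seconds is roughly 2.1e18 nanoseconds, which is below `Long.MaxValue` (roughly 9.2e18), while `Double.MaxValue` seconds is far above it.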

andrewor14 · 2016-03-29T00:23:55Z

OK, let's not add a flag if it's a bug in Mesos. In the meantime, before they fix it downstream, we can use the workaround @tnachen suggested.

dragos · 2016-03-30T11:48:43Z

Sounds good. Who can close this PR?

andrewor14 · 2016-03-30T22:01:50Z

@nraychaudhuri can you close this PR?

tnachen · 2016-04-19T23:47:02Z

@andrewor14 @nraychaudhuri @dragos Sorry, I'm not suggesting we close this PR. We still need the flag, since we want to be able to choose whether to fail over automatically or not. We only need to revert the one line where the PR changed the timeout to Double.MAX_VALUE.

srowen · 2016-05-06T17:27:27Z

@nraychaudhuri can you update or close this PR then?

tnachen · 2016-05-10T19:25:41Z

Seems like @nraychaudhuri is busy, I'll take this PR and update it myself. We definitely need this to be merged as it's quite useful for testing.

Closing the following PRs due to requests or unresponsive users. Closes apache#13923 Closes apache#14462 Closes apache#13123 Closes apache#14423 (requested by srowen) Closes apache#14424 (requested by srowen) Closes apache#14101 (requested by jkbradley) Closes apache#10676 (requested by srowen) Closes apache#10943 (requested by yhuai) Closes apache#9936 Closes apache#10701

[SPARK-7831][Mesos] Added flag to shutdown driver when mesos dispatch…

2cc0022

…er is stopped

nraychaudhuri mentioned this pull request Jan 13, 2016

Added flag to shutdown driver when dispatcher is stopped lightbend/spark#23

Closed

skyluc reviewed Jan 14, 2016

Setting the driver failover timeout

9002258

vanzin mentioned this pull request Aug 4, 2016

MAINTENANCE. Cleaning up stale PRs. #14495

Closed

asfgit closed this in 53e766c Aug 4, 2016
