Conversation

@WangTaoTheTonic
Contributor

https://issues.apache.org/jira/browse/SPARK-3591

The output after this patch:

doggie153:/opt/oss/spark-1.3.0-bin-hadoop2.4/bin # ./spark-submit --class org.apache.spark.examples.SparkPi --master yarn-cluster ../lib/spark-examples*.jar
15/03/31 21:15:25 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/03/31 21:15:25 INFO RMProxy: Connecting to ResourceManager at doggie153/10.177.112.153:8032
15/03/31 21:15:25 INFO Client: Requesting a new application from cluster with 4 NodeManagers
15/03/31 21:15:25 INFO Client: Verifying our application has not requested more than the maximum memory capability of the cluster (8192 MB per container)
15/03/31 21:15:25 INFO Client: Will allocate AM container, with 896 MB memory including 384 MB overhead
15/03/31 21:15:25 INFO Client: Setting up container launch context for our AM
15/03/31 21:15:25 INFO Client: Preparing resources for our AM container
15/03/31 21:15:26 INFO Client: Uploading resource file:/opt/oss/spark-1.3.0-bin-hadoop2.4/lib/spark-assembly-1.4.0-SNAPSHOT-hadoop2.4.1.jar -> hdfs://doggie153:9000/user/root/.sparkStaging/application_1427257505534_0016/spark-assembly-1.4.0-SNAPSHOT-hadoop2.4.1.jar
15/03/31 21:15:27 INFO Client: Uploading resource file:/opt/oss/spark-1.3.0-bin-hadoop2.4/lib/spark-examples-1.3.0-hadoop2.4.0.jar -> hdfs://doggie153:9000/user/root/.sparkStaging/application_1427257505534_0016/spark-examples-1.3.0-hadoop2.4.0.jar
15/03/31 21:15:28 INFO Client: Setting up the launch environment for our AM container
15/03/31 21:15:28 INFO SecurityManager: Changing view acls to: root
15/03/31 21:15:28 INFO SecurityManager: Changing modify acls to: root
15/03/31 21:15:28 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); users with modify permissions: Set(root)
15/03/31 21:15:28 INFO Client: Submitting application 16 to ResourceManager
15/03/31 21:15:28 INFO YarnClientImpl: Submitted application application_1427257505534_0016
15/03/31 21:15:28 INFO Client: ... waiting before polling ResourceManager for application state
15/03/31 21:15:33 INFO Client: ... polling ResourceManager for application state
15/03/31 21:15:33 INFO Client: Application report for application_1427257505534_0016 (state: RUNNING)
15/03/31 21:15:33 INFO Client:
client token: N/A
diagnostics: N/A
ApplicationMaster host: doggie157
ApplicationMaster RPC port: 0
queue: default
start time: 1427807728307
final status: UNDEFINED
tracking URL: http://doggie153:8088/proxy/application_1427257505534_0016/
user: root

/cc @andrewor14

@SparkQA

SparkQA commented Mar 31, 2015

Test build #29482 has started for PR 5297 at commit 0cbdce8.

@WangTaoTheTonic WangTaoTheTonic changed the title [SAPRK-3591][YARN]fire and forget for YARN cluster mode [SPARK-3591][YARN]fire and forget for YARN cluster mode Mar 31, 2015
@SparkQA

SparkQA commented Mar 31, 2015

Test build #29482 has finished for PR 5297 at commit 0cbdce8.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class CreateStruct(children: Seq[NamedExpression]) extends Expression
  • This patch does not change any dependencies.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29482/
Test FAILed.

@WangTaoTheTonic
Contributor Author

I will look into the test cases and fix them.

@sryza
Contributor

sryza commented Mar 31, 2015

If somebody has an existing script that runs spark-submit and then follows it with an action that expects the app to have completed, this would break it, right?

I think we might need to expose this as a config option instead of changing the default behavior.
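
For illustration only, a config-gated version might look roughly like this sketch in Scala. It is not the PR's actual diff: the key name is one of the candidates discussed later in this thread, and the surrounding Client members (submitApplication, monitorApplication, getApplicationReport, logInfo) are assumptions here.

// Minimal sketch, not the PR's code: gate the blocking monitor loop behind
// a boolean config so existing scripts keep today's behavior by default.
def run(): Unit = {
  val appId = submitApplication()
  // Hypothetical key from the naming discussion below; default true keeps
  // the old blocking behavior.
  val waitForCompletion = sparkConf.getBoolean("spark.yarn.waitAppCompletion", true)
  if (waitForCompletion) {
    monitorApplication(appId)  // block and poll until the app terminates
  } else {
    val report = getApplicationReport(appId)  // log one report, then return
    logInfo(s"Application $appId submitted, state: ${report.getYarnApplicationState}")
  }
}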

@srowen
Member

srowen commented Mar 31, 2015

Is this that different from just backgrounding the process, which can still be monitored, produces logs, etc.?

@WangTaoTheTonic
Contributor Author

@sryza Yeah, I'd like to keep it compatible too. In standalone cluster mode there seems to be no config option like this; maybe we should add it there too.

@srowen Backgrounding means the client process still exists. "Fire and forget" means the client just submits the application and then exits.

@SparkQA

SparkQA commented Apr 1, 2015

Test build #29519 has started for PR 5297 at commit 19706c0.

@SparkQA

SparkQA commented Apr 1, 2015

Test build #29519 has finished for PR 5297 at commit 19706c0.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29519/
Test FAILed.

@WangTaoTheTonic
Contributor Author

Jenkins, test this please.

@SparkQA

SparkQA commented Apr 1, 2015

Test build #29520 has started for PR 5297 at commit 19706c0.

@SparkQA

SparkQA commented Apr 1, 2015

Test build #29520 has finished for PR 5297 at commit 19706c0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29520/
Test PASSed.

@tgravescs
Contributor

I would rather see this combined with monitorApplication. monitorApplication already supports a returnOnRunning option that I think we can use. There is a waitForApplication in the YarnClientSchedulerBackend.
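
For reference, reusing it could look roughly like this (a sketch; the returnOnRunning parameter and the returned state pair are assumed from this discussion, not quoted from the code):

// Sketch: return as soon as the app reaches RUNNING instead of blocking
// until completion, then fail fast if it already died.
val (yarnState, _) = monitorApplication(appId, returnOnRunning = true)
if (yarnState == YarnApplicationState.FAILED ||
    yarnState == YarnApplicationState.KILLED) {
  throw new SparkException(s"Application $appId finished with status: $yarnState")
}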

@SparkQA

SparkQA commented Apr 1, 2015

Test build #29547 has started for PR 5297 at commit 591b752.

@WangTaoTheTonic
Contributor Author

@tgravescs Thanks for comments.

Now I reuse monitorApplication, and the client process will exit after the application begins running, which is a little different from fire-and-forget, though I think the difference is small enough to be worth a second thought.

I'd rather leave the spark-submit option to be added by whoever is eager to do so.

Since the comments by @sryza (about naming) might be overlooked among the code changes, I paste them here:

sryza:

What do you think of spark.yarn.client.waitForCompletion? Not a strong preference, but spark.yarn.waitForCompletion is a little ambiguous about who is waiting for whom.

WangTaoTheTonic:

Wouldn't spark.yarn.client.waitForCompletion get the client process mixed up with client mode? Perhaps spark.yarn.waitAppCompletion is better?

@tgravescs
Contributor

That is a good point: using monitorApplication actually waits for RUNNING, and that isn't necessarily what we want here. For instance, the application could be blocked waiting on resources before it ever reaches RUNNING. So I apologize; what you had before is better.

So a few comments on the old version.

  • We don't need to do the sleep. We use YarnClientImpl.submitApplication, which handles that and even has a config for the poll interval.
  • We should check the ApplicationReport we get to make sure the application status isn't FAILED or KILLED. If it is, we should throw so we exit with a proper exit code (see the sketch below).
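
A sketch of the second point, with names assumed (the thrown message matches the stack trace pasted later in this thread):

import org.apache.hadoop.yarn.api.records.YarnApplicationState
import org.apache.spark.SparkException

// Sketch only: fail fast when the report says the app is already dead,
// so spark-submit exits with a non-zero code. `report` is the
// ApplicationReport returned after submission.
val state = report.getYarnApplicationState
if (state == YarnApplicationState.FAILED || state == YarnApplicationState.KILLED) {
  throw new SparkException(
    s"Application ${report.getApplicationId} finished with status: $state")
}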

@SparkQA

SparkQA commented Apr 1, 2015

Test build #29547 has finished for PR 5297 at commit 591b752.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29547/
Test PASSed.

@SparkQA

SparkQA commented Apr 2, 2015

Test build #29580 has started for PR 5297 at commit ba9b22b.

@SparkQA

SparkQA commented Apr 2, 2015

Test build #29580 has finished for PR 5297 at commit ba9b22b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29580/
Test PASSed.

@SparkQA

SparkQA commented Apr 2, 2015

Test build #29611 has finished for PR 5297 at commit 9106da8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29611/
Test PASSed.

@tgravescs
Contributor

I tried this out in a few cases and noticed that if we fail to submit, there is no message telling the user why. We need to print the report in the FAILED/KILLED case too, and make sure the diagnostics message is there.

@WangTaoTheTonic
Contributor Author

@tgravescs IIUC the output diagnostics message should be:

15/04/03 11:18:47 INFO YarnClientImpl: Submitted application application_1427257505534_0029
15/04/03 11:18:57 INFO Client:
client token: N/A
diagnostics: Application killed by user.
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: default
start time: 1428031127858
final status: KILLED
tracking URL: http://doggie153:8088/cluster/app/application_1427257505534_0029
user: root
Exception in thread "main" org.apache.spark.SparkException: Application application_1427257505534_0029 finished with status: KILLED
at org.apache.spark.deploy.yarn.Client.run(Client.scala:632)
at org.apache.spark.deploy.yarn.Client$.main(Client.scala:666)
at org.apache.spark.deploy.yarn.Client.main(Client.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:601)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:569)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

Note: in order to produce the killed case, I added Thread.sleep(10000) in the test code. I hope the time gap in the log does not cause any confusion.

@SparkQA

SparkQA commented Apr 3, 2015

Test build #29642 has started for PR 5297 at commit fea390d.

@SparkQA

SparkQA commented Apr 3, 2015

Test build #29642 has finished for PR 5297 at commit fea390d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29642/
Test PASSed.

@andrewor14
Contributor

@tgravescs @WangTaoTheTonic I wonder whether we should just use monitorApplication(appId, returnOnRunning = true). It is fire-and-forget to the extent that once we know that an application is running or failed we exit immediately. This is also what we do in standalone cluster mode. Otherwise, I think just printing the first application report is not particularly useful, since we don't know whether the application is actually running.

An argument against doing returnOnRunning, however, is that if the cluster doesn't have enough resources we could theoretically wait an infinite amount of time before exiting. Though even before this patch YARN cluster mode already behaves like that, so doing this is still a strictly better addition. I believe standalone cluster mode has the same behavior. Any thoughts?

@tgravescs
Contributor

I prefer the current way (not waiting for RUNNING). The YarnClient we are using makes sure the application gets submitted before returning. Beyond that, we shouldn't really have to wait for it to actually run; if your cluster is busy, that could be an indefinite amount of time. In many cases I just want to start an application, and I don't really care whether it starts running immediately or in 10 minutes. The first application report is useful if the submission fails, for instance if I submit to a queue that doesn't exist or have some bad config.

Not that this matters too much, but MapReduce also has a submit-and-no-wait option, and it doesn't wait for running, just submitted. I only mention it because people coming from that might have expectations about the behavior.

@andrewor14
Contributor

I see, then this LGTM pending @tgravescs' other comment.

@tgravescs
Contributor

This is kind of a nit, but we could just move lines 633-634 up before this if statement, and then we wouldn't need this second logInfo(formatReportDetails(report)).
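
In code, the nit amounts to roughly the following (the surrounding diff is not quoted in this thread, so the shape here is assumed):

// Sketch: hoist the single logInfo above the if statement so the report
// is logged exactly once, instead of repeating it in the branch.
logInfo(formatReportDetails(report))  // moved up (formerly lines 633-634)
if (state == YarnApplicationState.FAILED || state == YarnApplicationState.KILLED) {
  throw new SparkException(s"Application $appId finished with status: $state")
}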

@sryza
Contributor

sryza commented Apr 3, 2015

LGTM as well

@tgravescs
Contributor

Since my only other comment was a nit, we can check this in without that change. I'll leave it until tomorrow and commit then, unless @andrewor14 beats me to it.

@SparkQA

SparkQA commented Apr 7, 2015

Test build #29775 has started for PR 5297 at commit 16c90a8.

@SparkQA

SparkQA commented Apr 7, 2015

Test build #29777 has started for PR 5297 at commit c76d232.

@SparkQA

SparkQA commented Apr 7, 2015

Test build #29775 has finished for PR 5297 at commit 16c90a8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29775/
Test PASSed.

@SparkQA

SparkQA commented Apr 7, 2015

Test build #29777 has finished for PR 5297 at commit c76d232.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29777/
Test PASSed.

@asfgit asfgit closed this in b65bad6 Apr 7, 2015