Conversation

@WangTaoTheTonic
Contributor

https://issues.apache.org/jira/browse/SPARK-3591

The output after this patch:

doggie153:/opt/oss/spark-1.3.0-bin-hadoop2.4/bin # ./spark-submit --class org.apache.spark.examples.SparkPi --master yarn-cluster ../lib/spark-examples*.jar
15/03/31 21:15:25 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/03/31 21:15:25 INFO RMProxy: Connecting to ResourceManager at doggie153/10.177.112.153:8032
15/03/31 21:15:25 INFO Client: Requesting a new application from cluster with 4 NodeManagers
15/03/31 21:15:25 INFO Client: Verifying our application has not requested more than the maximum memory capability of the cluster (8192 MB per container)
15/03/31 21:15:25 INFO Client: Will allocate AM container, with 896 MB memory including 384 MB overhead
15/03/31 21:15:25 INFO Client: Setting up container launch context for our AM
15/03/31 21:15:25 INFO Client: Preparing resources for our AM container
15/03/31 21:15:26 INFO Client: Uploading resource file:/opt/oss/spark-1.3.0-bin-hadoop2.4/lib/spark-assembly-1.4.0-SNAPSHOT-hadoop2.4.1.jar -> hdfs://doggie153:9000/user/root/.sparkStaging/application_1427257505534_0016/spark-assembly-1.4.0-SNAPSHOT-hadoop2.4.1.jar
15/03/31 21:15:27 INFO Client: Uploading resource file:/opt/oss/spark-1.3.0-bin-hadoop2.4/lib/spark-examples-1.3.0-hadoop2.4.0.jar -> hdfs://doggie153:9000/user/root/.sparkStaging/application_1427257505534_0016/spark-examples-1.3.0-hadoop2.4.0.jar
15/03/31 21:15:28 INFO Client: Setting up the launch environment for our AM container
15/03/31 21:15:28 INFO SecurityManager: Changing view acls to: root
15/03/31 21:15:28 INFO SecurityManager: Changing modify acls to: root
15/03/31 21:15:28 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); users with modify permissions: Set(root)
15/03/31 21:15:28 INFO Client: Submitting application 16 to ResourceManager
15/03/31 21:15:28 INFO YarnClientImpl: Submitted application application_1427257505534_0016
15/03/31 21:15:28 INFO Client: ... waiting before polling ResourceManager for application state
15/03/31 21:15:33 INFO Client: ... polling ResourceManager for application state
15/03/31 21:15:33 INFO Client: Application report for application_1427257505534_0016 (state: RUNNING)
15/03/31 21:15:33 INFO Client:
client token: N/A
diagnostics: N/A
ApplicationMaster host: doggie157
ApplicationMaster RPC port: 0
queue: default
start time: 1427807728307
final status: UNDEFINED
tracking URL: http://doggie153:8088/proxy/application_1427257505534_0016/
user: root

/cc @andrewor14

@SparkQA

SparkQA commented Mar 31, 2015

Test build #29482 has started for PR 5297 at commit 0cbdce8.

@WangTaoTheTonic WangTaoTheTonic changed the title [SAPRK-3591][YARN]fire and forget for YARN cluster mode [SPARK-3591][YARN]fire and forget for YARN cluster mode Mar 31, 2015
@SparkQA

SparkQA commented Mar 31, 2015

Test build #29482 has finished for PR 5297 at commit 0cbdce8.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class CreateStruct(children: Seq[NamedExpression]) extends Expression
  • This patch does not change any dependencies.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29482/
Test FAILed.

@WangTaoTheTonic
Contributor Author

I will look into the test cases and fix them.

@sryza
Contributor

sryza commented Mar 31, 2015

If somebody has an existing script that runs spark-submit and then follows it with an action that expects the app to have completed, this would break it, right?

I think we might need to expose this as a config option instead of changing the default behavior.
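
For illustration only, a config-gated version might look roughly like this sketch in Scala. It is not the PR's actual diff: the key name is one of the candidates discussed later in this thread, and the surrounding Client members (submitApplication, monitorApplication, getApplicationReport, logInfo) are assumptions here.

// Minimal sketch, not the PR's code: gate the blocking monitor loop behind
// a boolean config so existing scripts keep today's behavior by default.
def run(): Unit = {
  val appId = submitApplication()
  // Hypothetical key from the naming discussion below; default true keeps
  // the old blocking behavior.
  val waitForCompletion = sparkConf.getBoolean("spark.yarn.waitAppCompletion", true)
  if (waitForCompletion) {
    monitorApplication(appId)  // block and poll until the app terminates
  } else {
    val report = getApplicationReport(appId)  // log one report, then return
    logInfo(s"Application $appId submitted, state: ${report.getYarnApplicationState}")
  }
}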

@srowen
Member

srowen commented Mar 31, 2015

Is this that different from just backgrounding the process, which can still be monitored, produces logs, etc.?

@WangTaoTheTonic
Contributor Author

@sryza Yeah, I'd like to keep it compatible too. In standalone cluster mode there seems to be no config option like this; maybe we should add it there too.

@srowen Backgrounding means the client process still exists. "Fire and forget" means the client just submits the application and then exits.

@SparkQA

SparkQA commented Apr 1, 2015

Test build #29519 has started for PR 5297 at commit 19706c0.

@SparkQA

SparkQA commented Apr 1, 2015

Test build #29519 has finished for PR 5297 at commit 19706c0.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29519/
Test FAILed.

@WangTaoTheTonic
Contributor Author

Jenkins, test this please.

@SparkQA

SparkQA commented Apr 1, 2015

Test build #29520 has started for PR 5297 at commit 19706c0.

@SparkQA

SparkQA commented Apr 1, 2015

Test build #29520 has finished for PR 5297 at commit 19706c0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29520/
Test PASSed.

@tgravescs
Contributor

I would rather see this combined with monitorApplication. monitorApplication already supports a returnOnRunning option that I think we can use. There is a waitForApplication in the YarnClientSchedulerBackend.
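
For reference, reusing it could look roughly like this (a sketch; the returnOnRunning parameter and the returned state pair are assumed from this discussion, not quoted from the code):

// Sketch: return as soon as the app reaches RUNNING instead of blocking
// until completion, then fail fast if it already died.
val (yarnState, _) = monitorApplication(appId, returnOnRunning = true)
if (yarnState == YarnApplicationState.FAILED ||
    yarnState == YarnApplicationState.KILLED) {
  throw new SparkException(s"Application $appId finished with status: $yarnState")
}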

@SparkQA

SparkQA commented Apr 1, 2015

Test build #29547 has started for PR 5297 at commit 591b752.

@WangTaoTheTonic
Contributor Author

@tgravescs Thanks for comments.

Now I reuse monitorApplication, and the client process will exit after the application begins running, which is a little different from fire-and-forget, though I think the difference is small enough to be worth a second thought.

I'd rather leave the spark-submit option to be added by whoever is eager to do so.

Since the comments by @sryza (about naming) might be overlooked among the code changes, I paste them here:

sryza:

What do you think of spark.yarn.client.waitForCompletion? Not a strong preference, but spark.yarn.waitForCompletion is a little ambiguous about who is waiting for whom.

WangTaoTheTonic:

Wouldn't spark.yarn.client.waitForCompletion get the client process mixed up with client mode? Perhaps spark.yarn.waitAppCompletion is better?

@tgravescs
Contributor

That is a good point: using monitorApplication actually waits for RUNNING, and that isn't necessarily what we want here. For instance, the application could be blocked waiting on resources before it ever reaches RUNNING. So I apologize; what you had before is better.

So a few comments on the old version.

  • We don't need to do the sleep. We use YarnClientImpl.submitApplication, which handles that and even has a config for the poll interval.
  • We should check the ApplicationReport we get to make sure the application status isn't FAILED or KILLED. If it is, we should throw so we exit with a proper exit code (see the sketch below).
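
A sketch of the second point, with names assumed (the thrown message matches the stack trace pasted later in this thread):

import org.apache.hadoop.yarn.api.records.YarnApplicationState
import org.apache.spark.SparkException

// Sketch only: fail fast when the report says the app is already dead,
// so spark-submit exits with a non-zero code. `report` is the
// ApplicationReport returned after submission.
val state = report.getYarnApplicationState
if (state == YarnApplicationState.FAILED || state == YarnApplicationState.KILLED) {
  throw new SparkException(
    s"Application ${report.getApplicationId} finished with status: $state")
}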

@SparkQA

SparkQA commented Apr 1, 2015

Test build #29547 has finished for PR 5297 at commit 591b752.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29547/
Test PASSed.

@SparkQA

SparkQA commented Apr 2, 2015

Test build #29580 has started for PR 5297 at commit ba9b22b.

@SparkQA

SparkQA commented Apr 2, 2015

Test build #29580 has finished for PR 5297 at commit ba9b22b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29580/
Test PASSed.

@SparkQA

SparkQA commented Apr 2, 2015

Test build #29611 has finished for PR 5297 at commit 9106da8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29611/
Test PASSed.

@tgravescs
Contributor

I tried this out in a few cases and noticed that if we fail to submit, there is no message telling the user why. We need to print the report in the FAILED/KILLED case too, and make sure the diagnostics message is there.

@WangTaoTheTonic
Contributor Author

@tgravescs IIUC the output diagnostics message should be:

15/04/03 11:18:47 INFO YarnClientImpl: Submitted application application_1427257505534_0029
15/04/03 11:18:57 INFO Client:
client token: N/A
diagnostics: Application killed by user.
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: default
start time: 1428031127858
final status: KILLED
tracking URL: http://doggie153:8088/cluster/app/application_1427257505534_0029
user: root
Exception in thread "main" org.apache.spark.SparkException: Application application_1427257505534_0029 finished with status: KILLED
at org.apache.spark.deploy.yarn.Client.run(Client.scala:632)
at org.apache.spark.deploy.yarn.Client$.main(Client.scala:666)
at org.apache.spark.deploy.yarn.Client.main(Client.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:601)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:569)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

Note: in order to produce the killed case, I added Thread.sleep(10000) in the test code. I hope the time gap in the log does not cause any confusion.

@SparkQA

SparkQA commented Apr 3, 2015

Test build #29642 has started for PR 5297 at commit fea390d.

@SparkQA

SparkQA commented Apr 3, 2015

Test build #29642 has finished for PR 5297 at commit fea390d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29642/
Test PASSed.

@andrewor14
Contributor

@tgravescs @WangTaoTheTonic I wonder whether we should just use monitorApplication(appId, returnOnRunning = true). It is fire-and-forget to the extent that once we know that an application is running or failed we exit immediately. This is also what we do in standalone cluster mode. Otherwise, I think just printing the first application report is not particularly useful, since we don't know whether the application is actually running.

An argument against doing returnOnRunning, however, is that if the cluster doesn't have enough resources we could theoretically wait an infinite amount of time before exiting. Though even before this patch YARN cluster mode already behaves like that, so doing this is still a strictly better addition. I believe standalone cluster mode has the same behavior. Any thoughts?

@tgravescs
Contributor

I prefer the current way (not waiting for RUNNING). The YarnClient we are using makes sure the application gets submitted before returning. Beyond that, we shouldn't really have to wait for it to actually run; if your cluster is busy, that could be an indefinite amount of time. In many cases I just want to start an application, and I don't really care whether it starts running immediately or in 10 minutes. The first application report is useful if the submission fails, for instance if I submit to a queue that doesn't exist or have some bad config.

Not that this matters too much, but MapReduce also has a submit-and-no-wait option, and it doesn't wait for running, just submitted. I only mention it because people coming from that might have expectations about the behavior.

@andrewor14
Contributor

I see, then this LGTM pending @tgravescs' other comment.

@tgravescs
Contributor

This is kind of a nit, but we could just move lines 633-634 up before this if statement, and then we wouldn't need this second logInfo(formatReportDetails(report)).
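
In code, the nit amounts to roughly the following (the surrounding diff is not quoted in this thread, so the shape here is assumed):

// Sketch: hoist the single logInfo above the if statement so the report
// is logged exactly once, instead of repeating it in the branch.
logInfo(formatReportDetails(report))  // moved up (formerly lines 633-634)
if (state == YarnApplicationState.FAILED || state == YarnApplicationState.KILLED) {
  throw new SparkException(s"Application $appId finished with status: $state")
}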

@sryza
Contributor

sryza commented Apr 3, 2015

LGTM as well

@tgravescs
Contributor

Since my only other comment was a nit, we can check this in without that change. I'll leave it until tomorrow and commit then, unless @andrewor14 beats me to it.

@SparkQA

SparkQA commented Apr 7, 2015

Test build #29775 has started for PR 5297 at commit 16c90a8.

@SparkQA

SparkQA commented Apr 7, 2015

Test build #29777 has started for PR 5297 at commit c76d232.

@SparkQA

SparkQA commented Apr 7, 2015

Test build #29775 has finished for PR 5297 at commit 16c90a8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29775/
Test PASSed.

@SparkQA

SparkQA commented Apr 7, 2015

Test build #29777 has finished for PR 5297 at commit c76d232.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29777/
Test PASSed.

@asfgit asfgit closed this in b65bad6 Apr 7, 2015