[SPARK-24182][yarn] Improve error message when client AM fails. #21243

vanzin · 2018-05-04T23:11:11Z

Instead of always throwing a generic exception when the AM fails,
print a generic error and throw the exception with the YARN
diagnostics containing the reason for the failure.

There was an issue with YARN sometimes providing a generic diagnostic
message, even though the AM provides a failure reason when
unregistering. That was happening because the AM was registering
too late, and if errors happened before the registration, YARN would
just create a generic "ExitCodeException" which wasn't very helpful.

Since most errors in this path are a result of not being able to
connect to the driver, this change modifies the AM registration
a bit so that the AM is registered before the connection to the
driver is established. That way, errors are properly propagated
through YARN back to the driver.
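
The ordering described above can be sketched roughly as follows. This is a minimal illustration, not the code in this patch: `RmClient`, `RecordingRmClient`, and `ClientAmSketch` are hypothetical names made up for the example, and the status string is a placeholder for YARN's final application status.

```scala
import scala.collection.mutable.ArrayBuffer
import scala.util.{Failure, Success, Try}

// Hypothetical stand-in for the RM client; this is NOT Spark's YarnRMClient API.
trait RmClient {
  def register(): Unit
  def unregister(status: String, diagnostics: String): Unit
}

// Test double that records calls so the ordering can be inspected.
final class RecordingRmClient extends RmClient {
  val calls: ArrayBuffer[String] = ArrayBuffer.empty[String]
  def register(): Unit = calls += "register"
  def unregister(status: String, diagnostics: String): Unit =
    calls += s"unregister($status): $diagnostics"
}

object ClientAmSketch {
  // Register with the RM first, then make a single connection attempt.
  // If the attempt fails, the reason is handed over when unregistering,
  // so it can travel back through the application report.
  def run(rm: RmClient)(connectToDriver: () => Unit): Boolean = {
    rm.register()
    Try(connectToDriver()) match {
      case Success(_) => true
      case Failure(e) =>
        rm.unregister("FAILED", s"Uncaught exception: $e")
        false
    }
  }
}
```

Because registration happens before the connection attempt, the failure reason has somewhere to go; with the old ordering the AM would exit unregistered and YARN would report only the generic exit-code error.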

As part of that, I also removed the code that retried connections
to the driver from the client AM. At that point, the driver should
already be up and waiting for connections, so it's unlikely that
retrying would help - and in case it does, that means a flaky
network, which would mean problems would probably show up again.
The effect of that is that connection-related errors are reported
back to the driver much faster now (through the YARN report).

One thing to note is that there seems to be a race on the YARN
side that causes a report to be sent to the client without the
corresponding diagnostics string from the AM; the diagnostics are
available later from the RM web page. For that reason, the generic
error messages are kept in the Spark scheduler code, to help
guide users to a way of debugging their failure.
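
A rough sketch of that client-side behavior, assuming a simplified report type: `AppReport` and `ReportSketch` are hypothetical names for illustration (the real YARN application report carries many more fields), and the message wording is copied from the log output in the testing notes.

```scala
// Hypothetical, simplified report; not the YARN ApplicationReport class.
final case class AppReport(state: String, diagnostics: Option[String])

object ReportSketch {
  // The generic hint is always emitted. The diagnostics line is added only
  // when the report actually carries a non-empty diagnostics string, which
  // may not be the case if the report arrives before the AM's message.
  def errorLines(report: AppReport): Seq[String] = {
    val generic =
      s"YARN application has exited unexpectedly with state ${report.state}! " +
        "Check the YARN application logs for more details."
    val diag = report.diagnostics
      .map(_.trim)
      .filter(_.nonEmpty)
      .map(d => s"Diagnostics message: $d")
    generic +: diag.toSeq
  }
}
```

Keeping the generic line unconditional means users still get a pointer to the YARN logs when the diagnostics are missing from the report.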

Also of note is that if YARN's max attempts configuration is lower
than Spark's, Spark will not unregister the AM with a proper
diagnostics message. Unfortunately there seems to be no way to
unregister the AM and still allow further re-attempts to happen.

Testing:

- existing unit tests
- some of our integration tests
- hardcoded an invalid driver address in the code and verified
  the error in the shell. e.g.

```
scala> 18/05/04 15:09:34 ERROR cluster.YarnClientSchedulerBackend: YARN application has exited unexpectedly with state FAILED! Check the YARN application logs for more details.
18/05/04 15:09:34 ERROR cluster.YarnClientSchedulerBackend: Diagnostics message: Uncaught exception: org.apache.spark.SparkException: Exception thrown in awaitResult:
  <AM stack trace>
Caused by: java.io.IOException: Failed to connect to localhost/127.0.0.1:1234
  <More stack trace>
```


vanzin · 2018-05-04T23:11:38Z

@tgravescs @jerryshao

SparkQA · 2018-05-04T23:33:17Z

Test build #90230 has finished for PR 21243 at commit a8c223d.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

jerryshao · 2018-05-08T06:42:00Z

What kind of exceptions will client AM meet usually? I think the logic for the client AM is quite simple, so I'm just wondering what kinds of issues it would run into.

jerryshao · 2018-05-08T07:03:42Z

resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala

+    registered = true
+  }
+
+  private def createAllocator(driverRef: RpcEndpointRef, _sparkConf: SparkConf): Unit = {


What is the purpose of splitting this into two methods? Sorry, I don't get the point.

Explained in the PR description. YARN will create a non-helpful error message if an error happens before the AM is registered. This moves registration of the AM to an earlier spot.

jerryshao · 2018-05-08T07:05:07Z

resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala

-        state == YarnApplicationState.FAILED ||
-        state == YarnApplicationState.KILLED) {
+          state == YarnApplicationState.FAILED ||
+          state == YarnApplicationState.KILLED) {


Nit: should this be a 4-space or a 2-space indent?

Continuation lines of conditions are generally double-indented (to clearly separate them from the rest of the code).

jerryshao · 2018-05-08T07:33:00Z

resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala

      if (!finished) {
        val inShutdown = ShutdownHookManager.inShutdown()
-        if (registered) {
+        if (registered || !isClusterMode) {


Why would we need to add non-cluster mode check here?

Because otherwise the client mode AM will exit with "EXIT_SC_NOT_INITED" in certain cases, which doesn't really make a lot of sense.

I see, thanks for the explanation.

vanzin · 2018-05-08T16:54:10Z

What kind of exceptions will client AM meet usually?

We see a non-trivial amount of people running into connection issues between the AM and the driver. It's typically a firewall issue or something of the sort, but because the error message is completely non-helpful, they end up calling support.

jerryshao

Just did another round of review. LGTM.

jerryshao · 2018-05-11T09:10:50Z

Jenkins, retest this please.

SparkQA · 2018-05-11T09:32:45Z

Test build #90501 has finished for PR 21243 at commit a8c223d.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

jerryshao · 2018-05-11T09:38:05Z

Merging to master branch.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes apache#21243 from vanzin/SPARK-24182.

advancedxy · 2019-08-13T08:40:22Z

resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnRMClient.scala

-      driverUrl: String,
-      driverRef: RpcEndpointRef,
+      driverHost: String,
+      driverPort: Int,


Hi @vanzin, during our internal porting we found these parameter names misleading.

They should be amHost and amRpcPort to be more accurate.
When running in client mode, the values passed here are the ApplicationMaster's rather than the driver's.
Do you think it's worth another Jira to resolve this issue?

jerryshao reviewed May 8, 2018


jerryshao approved these changes May 9, 2018


asfgit closed this in 5403268 May 11, 2018

vanzin deleted the SPARK-24182 branch May 18, 2018 21:28

wangyum mentioned this pull request Jul 16, 2018

[SPARK-24873][YARN] Turn off spark-shell noisy log output #21784

Closed

advancedxy reviewed Aug 13, 2019

