[SPARK-29283][SQL] Error message is hidden when query from JDBC, especially enabled adaptive execution #25960

LantaoJin · 2019-09-29T06:45:27Z

What changes were proposed in this pull request?

When adaptive execution is enabled, Spark users connected through JDBC always get the generic adaptive execution error, whatever the underlying root cause is. This is very confusing: we have to check the driver log to find out what actually went wrong.

0: jdbc:hive2://localhost:10000> SELECT * FROM testData join testData2 ON key = v;
SELECT * FROM testData join testData2 ON key = v;
Error: Error running query: org.apache.spark.SparkException: Adaptive execution failed due to stage materialization failures. (state=,code=0)
0: jdbc:hive2://localhost:10000>

For example, if a job queried from JDBC fails because of a missing HDFS block, the user still gets the error message Adaptive execution failed due to stage materialization failures.

The easiest way to reproduce this is to change the code of AdaptiveSparkPlanExec so that it throws an exception when it encounters StageSuccess.

  case class AdaptiveSparkPlanExec(
      events.drainTo(rem)
      (Seq(nextMsg) ++ rem.asScala).foreach {
        case StageSuccess(stage, res) =>
          // stage.resultOption = Some(res)
          // Test-only change: record a failure even though the stage succeeded.
          val ex = new SparkException("Wrapper Exception",
            new IllegalArgumentException("Root cause is IllegalArgumentException for Test"))
          errors.append(
            new SparkException(s"Failed to materialize query stage: ${stage.treeString}", ex))
        case StageFailure(stage, ex) =>
          errors.append(
            new SparkException(s"Failed to materialize query stage: ${stage.treeString}", ex))

Why are the changes needed?

To make the error message more user-friendly and more useful for queries from JDBC.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Manually tested with the following queries:

0: jdbc:hive2://localhost:10000> CREATE TEMPORARY VIEW testData (key, value) AS SELECT explode(array(1, 2, 3, 4)), cast(substring(rand(), 3, 4) as string);
CREATE TEMPORARY VIEW testData (key, value) AS SELECT explode(array(1, 2, 3, 4)), cast(substring(rand(), 3, 4) as string);
+---------+--+
| Result  |
+---------+--+
+---------+--+
No rows selected (0.225 seconds)
0: jdbc:hive2://localhost:10000> CREATE TEMPORARY VIEW testData2 (k, v) AS SELECT explode(array(1, 1, 2, 2)), cast(substring(rand(), 3, 4) as int);
CREATE TEMPORARY VIEW testData2 (k, v) AS SELECT explode(array(1, 1, 2, 2)), cast(substring(rand(), 3, 4) as int);
+---------+--+
| Result  |
+---------+--+
+---------+--+
No rows selected (0.043 seconds)

Before:

0: jdbc:hive2://localhost:10000> SELECT * FROM testData join testData2 ON key = v;
SELECT * FROM testData join testData2 ON key = v;
Error: Error running query: org.apache.spark.SparkException: Adaptive execution failed due to stage materialization failures. (state=,code=0)
0: jdbc:hive2://localhost:10000>

After:

0: jdbc:hive2://localhost:10000> SELECT * FROM testData join testData2 ON key = v;
SELECT * FROM testData join testData2 ON key = v;
Error: Error running query: java.lang.IllegalArgumentException: Root cause is IllegalArgumentException for Test (state=,code=0)
0: jdbc:hive2://localhost:10000>

SparkQA · 2019-09-29T07:05:02Z

Test build #111555 has finished for PR 25960 at commit ffed088.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

LantaoJin · 2019-09-29T07:22:35Z

Retest this please.

SparkQA · 2019-09-29T09:53:50Z

Test build #111558 has finished for PR 25960 at commit ffed088.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

wangyum · 2019-09-30T02:05:15Z

...r/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkExecuteStatementOperation.scala

          } else {
-            throw new HiveSQLException("Error running query: " + e.toString, e)
+            throw new HiveSQLException("Error running query: " +
+              SparkUtils.findFirstCause(e).toString, e)


Could SparkUtils.findFirstCause(e) be replaced with org.apache.commons.lang3.exception.ExceptionUtils.getRootCause(e)?

https://github.com/apache/commons-lang/blob/LANG_3_8_1/src/main/java/org/apache/commons/lang3/exception/ExceptionUtils.java#L167-L187

cc @juliuszsompolski @srowen
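To illustrate what unwrapping to the root cause does to the message a JDBC client sees, here is a minimal sketch in plain Scala. It does not use commons-lang3; `RootCauseSketch` and `rootCause` are hypothetical stand-ins for the library call, and `RuntimeException` stands in for `SparkException` so the snippet runs without Spark on the classpath.

```scala
import scala.annotation.tailrec

object RootCauseSketch {
  /** Follows getCause to the deepest throwable; stops if the chain loops back on itself. */
  def rootCause(t: Throwable): Throwable = {
    @tailrec
    def loop(current: Throwable, seen: Set[Throwable]): Throwable = {
      val cause = current.getCause
      if (cause == null || seen.contains(cause)) current
      else loop(cause, seen + cause)
    }
    loop(t, Set(t))
  }

  def main(args: Array[String]): Unit = {
    // RuntimeException is a stand-in for SparkException (assumption for this sketch).
    val root = new IllegalArgumentException("Root cause is IllegalArgumentException for Test")
    val wrapped = new RuntimeException(
      "Adaptive execution failed due to stage materialization failures.",
      new RuntimeException("Wrapper Exception", root))

    // Before: only the outermost wrapper is reported.
    println("Error running query: " + wrapped.toString)
    // After: the innermost cause is reported instead.
    println("Error running query: " + rootCause(wrapped).toString)
  }
}
```

The second line printed carries the `IllegalArgumentException` text rather than the generic wrapper message, which is the behaviour change shown in the Before/After output in the PR description.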

SparkQA · 2019-09-30T09:08:13Z

Test build #111603 has finished for PR 25960 at commit 7d55615.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

juliuszsompolski · 2019-10-01T08:16:09Z

For consistency, should we do that in all Spark*Operation?
I.e. replace the current

      case e: HiveSQLException =>
        setState(OperationState.ERROR)
        HiveThriftServer2.listener.onStatementError(
          statementId, e.getMessage, SparkUtils.exceptionString(e))
        throw e

with

  case e: Throwable =>
    logError(s"Error executing operation with $statementId, currentState $currentState, ", e)
    setState(OperationState.ERROR)
    HiveThriftServer2.listener.onStatementError(
      statementId, e.getMessage, SparkUtils.exceptionString(e))
    if (e.isInstanceOf[HiveSQLException]) {
      throw e.asInstanceOf[HiveSQLException]
    } else {
      val root = ExceptionUtils.getRootCause(e)
      throw new HiveSQLException("Error running query: " +
        (if (root == null) e.toString else root.toString), e)
    }

in all of them?

LantaoJin · 2019-10-02T02:19:21Z

@juliuszsompolski fixed.

maropu · 2019-10-02T02:39:58Z

...server/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkGetCatalogsOperation.scala

      setState(OperationState.FINISHED)
    } catch {
-      case e: HiveSQLException =>
+      case e: Throwable =>


Hm. I think we may want to catch a Throwable.
E.g. InterruptedException is not caught by NonFatal, and we want to inform the HiveThriftServer2.listener about the error after an interrupt. This definitely can happen in SparkExecuteStatementOperation, which is async and can be cancelled. After a ThreadDeath or OutOfMemoryError I think we also want to inform the HiveThriftServer2.listener so that the query does not hang in the UI, as I think the server would continue to run (I think it won't bring the whole JVM down?).

If so, should we list InterruptedException here, too? IIUC the main reason we use NonFatal in this case is to avoid catching NonLocalReturnControl. But, yeah, this is not my area, so I think @wangyum could suggest more about this.

+1 for Throwable.

Extractor of non-fatal Throwables. Will not match fatal errors like VirtualMachineError (for example, OutOfMemoryError and StackOverflowError, subclasses of VirtualMachineError), ThreadDeath, LinkageError, InterruptedException, ControlThrowable.

https://github.com/scala/scala/blob/v2.12.10/src/library/scala/util/control/NonFatal.scala#L17-L19
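The NonFatal behaviour quoted above can be checked with a small standalone snippet. This is only an illustrative sketch; `NonFatalDemo` and `caughtByNonFatal` are names made up for this example, and the errors are constructed directly rather than triggered.

```scala
import scala.util.control.NonFatal

object NonFatalDemo {
  /** Returns true if a `case NonFatal(e)` handler would match `t`. */
  def caughtByNonFatal(t: Throwable): Boolean = t match {
    case NonFatal(_) => true
    case _ => false
  }

  def main(args: Array[String]): Unit = {
    // An ordinary exception is matched by NonFatal.
    println(caughtByNonFatal(new IllegalArgumentException("bad input")))  // true
    // An interrupt (e.g. a cancelled async statement) is not matched.
    println(caughtByNonFatal(new InterruptedException("cancelled")))      // false
    // Neither is a VirtualMachineError subclass such as OutOfMemoryError.
    println(caughtByNonFatal(new OutOfMemoryError("simulated")))          // false
  }
}
```

A handler written as `case NonFatal(e) =>` would therefore skip the listener notification on cancellation, which is the argument for `case e: Throwable =>` made in this thread.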

SparkQA · 2019-10-02T02:42:27Z

Test build #111664 has finished for PR 25960 at commit 2acb51a.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

wangyum · 2019-10-08T08:52:50Z

...r/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkExecuteStatementOperation.scala

            throw e.asInstanceOf[HiveSQLException]
          } else {
-            throw new HiveSQLException("Error running query: " + e.toString, e)
+            val root = ExceptionUtils.getRootCause(e)


Could we change it to?

          setState(OperationState.ERROR)
          e match {
            case hiveException: HiveSQLException =>
              logError(s"Error executing query with $statementId, currentState $currentState, ", e)
              HiveThriftServer2.listener.onStatementError(
                statementId, hiveException.getMessage, SparkUtils.exceptionString(hiveException))
              throw hiveException
            case _ =>
              val rootCause = Option(ExceptionUtils.getRootCause(e)).getOrElse(e)
              logError(
                s"Error executing query with $statementId, currentState $currentState, ", rootCause)
              HiveThriftServer2.listener.onStatementError(
                statementId, rootCause.getMessage, SparkUtils.exceptionString(rootCause))
              throw new HiveSQLException("Error running query: " + rootCause.toString, rootCause)
          }

val rootCause = Option(ExceptionUtils.getRootCause(e)).getOrElse(e)

getRootCause returns null only if the input e is null. Do we still need to add this Option?

Apart from the null check, I've changed the code to the style above. @wangyum

LantaoJin · 2019-10-09T07:23:34Z

Retest this please.

LantaoJin · 2019-10-09T12:23:43Z

Retest this please.

LantaoJin · 2019-10-12T03:58:48Z

Retest this please.

SparkQA · 2019-10-12T04:25:56Z

Test build #111949 has finished for PR 25960 at commit 4aa2dd2.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

LantaoJin · 2019-10-12T08:12:56Z

The unit tests should pass once #26028 is merged.

wangyum · 2019-10-13T05:19:41Z

retest this please

SparkQA · 2019-10-13T05:43:26Z

Test build #111991 has finished for PR 25960 at commit 4aa2dd2.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

LantaoJin · 2019-10-14T01:48:26Z

Retest this please.

SparkQA · 2019-10-14T02:13:39Z

Test build #112002 has finished for PR 25960 at commit 4aa2dd2.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

wangyum

LGTM cc @juliuszsompolski @srowen @maropu

srowen

I guess it's hard to refactor this error handling vs copying it? seems OK.

juliuszsompolski · 2019-10-16T17:16:19Z

LGTM.
@srowen yeah, some common logic could be refactored out into a mixin trait (because direct inheritance goes from different Hive Operation implementations), but I think that is a bigger refactoring than the scope of this PR.
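For readers wondering what such a mixin might look like, here is a rough, self-contained sketch. It is not code from this PR or from Spark: `OperationErrorHandling`, `withErrorHandling`, `onError` and `MixinDemo` are hypothetical names, and the `HiveSQLException` class below is a simplified stand-in for the real Hive class so the snippet compiles on its own.

```scala
import scala.annotation.tailrec
import scala.collection.mutable

// Simplified stand-in for org.apache.hive.service.cli.HiveSQLException (assumption).
class HiveSQLException(message: String, cause: Throwable) extends Exception(message, cause)

/** Hypothetical mixin that centralizes the error translation copied into each operation. */
trait OperationErrorHandling {
  /** Hook for state and listener updates; each operation would supply its own. */
  protected def onError(message: String, e: Throwable): Unit

  /** Runs `body`, reports any failure, and rethrows it as a HiveSQLException. */
  protected def withErrorHandling[T](body: => T): T =
    try body catch {
      case e: HiveSQLException =>
        onError(e.getMessage, e)
        throw e
      case e: Throwable =>
        val root = rootCause(e)
        onError(root.getMessage, root)
        throw new HiveSQLException("Error running query: " + root.toString, root)
    }

  // Depth-bounded walk of the cause chain, so a cyclic chain cannot loop forever.
  @tailrec
  private def rootCause(t: Throwable, depth: Int = 0): Throwable = {
    val cause = t.getCause
    if (cause == null || depth >= 100) t else rootCause(cause, depth + 1)
  }
}

/** Toy operation that records reported messages instead of calling a real listener. */
object MixinDemo extends OperationErrorHandling {
  val reported: mutable.Buffer[String] = mutable.Buffer.empty[String]
  protected def onError(message: String, e: Throwable): Unit = reported += message
  def run[T](body: => T): T = withErrorHandling(body)
}
```

Calling `MixinDemo.run(throw new RuntimeException("wrapper", new IllegalArgumentException("root")))` would rethrow a `HiveSQLException` whose message names the `IllegalArgumentException`. Whether a trait like this fits the real Hive operation hierarchy is exactly the open question raised above.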

Closes #25960 from LantaoJin/SPARK-29283.

Authored-by: lajin <lajin@ebay.com>
Signed-off-by: Yuming Wang <wgyumg@gmail.com>
(cherry picked from commit fda4070)
Signed-off-by: Yuming Wang <wgyumg@gmail.com>

wangyum · 2019-10-17T02:55:09Z

Merged to master and branch-3.0-preview.

[SPARK-29283][SQL] Error message is hidden when query from JDBC, espe…

ffed088

…cially enabled adaptive execution

dongjoon-hyun added the SQL label Sep 29, 2019

wangyum reviewed Sep 30, 2019

address comment

7d55615

srowen approved these changes Sep 30, 2019

maropu approved these changes Oct 1, 2019

replace all Spark*Operation

2acb51a

LantaoJin requested a review from wangyum October 2, 2019 02:19

maropu reviewed Oct 2, 2019

wangyum reviewed Oct 8, 2019

address comment

4aa2dd2

wangyum approved these changes Oct 16, 2019

srowen reviewed Oct 16, 2019

wangyum closed this in fda4070 Oct 17, 2019

juliuszsompolski mentioned this pull request Jul 1, 2020

[SPARK-32145][SQL][test-hive1.2][test-hadoop2.7] ThriftCLIService.GetOperationStatus should include exception's stack trace to the error message #28963

Closed
