[SPARK-13947][SQL] The error message from using an invalid column reference is not clear #17100

rberenguel · 2017-02-28T13:46:07Z

What changes were proposed in this pull request?

Rewrote the error message for clarity. Added extra information for the case of an attribute name collision, hinting that the user should double-check whether two different tables are being referenced.

How was this patch tested?

No functional changes, only final message has changed. It has been tested manually against the situation proposed in the JIRA ticket. Automated tests in repository pass.

This PR is original work from me and I license this work to the Spark project

holdenk · 2017-02-28T18:59:29Z

Jenkins, ok to test.

holdenk · 2017-02-28T19:00:00Z

@rberenguel: how about adding the "[SQL]" tag to this? While the feature request comes out of PySpark, it's changing the SQL code.

SparkQA · 2017-02-28T20:27:46Z

Test build #73604 has finished for PR 17100 at commit 981224b.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

rberenguel · 2017-03-01T11:20:58Z

@holdenk added! And I suspect I have some issue locally with coursier when running the tests... They supposedly work (as in, no outputted errors/failures?) 1 time out of 3; the other 2 cause coursier cache misses. But I think the run with no errors is actually a case of "I give up running the test suite". I'll take a look at this (since it's pretty bad that I can't rely on the local test suite!) and fix the failing test, which is pretty straightforward.

SparkQA · 2017-03-01T19:34:51Z

Test build #73700 has finished for PR 17100 at commit ea07688.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-03-02T00:29:03Z

Test build #73714 has finished for PR 17100 at commit 65b9596.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

holdenk

Thanks for working on this. Maybe we can ask @marmbrus to take a look, since it's mostly a SQL change.

holdenk · 2017-03-06T16:24:01Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala

            failAnalysis(
-              s"resolved attribute(s) $missingAttributes missing from $input " +
-                s"in operator ${operator.simpleString}")
+              s"|Some resolved attribute(s) are not present among available attributes " +


This is really long; would """ + stripMargin maybe be easier to write/read?
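For reference, a minimal sketch of the two string styles being compared in this comment. The object name `MessageStyles` and the placeholder values `missing` and `available` are made up for illustration and are not taken from the patch.

```scala
object MessageStyles {
  // Placeholder values, chosen only for illustration.
  val missing = "a#1L"
  val available = "a#2L"

  // Style 1: single-line interpolated strings joined with `+`.
  val concatenated: String =
    s"resolved attribute(s) $missing missing from $available " +
      "in the current operator"

  // Style 2: one triple-quoted string. stripMargin drops the leading
  // whitespace of each line up to and including the '|' character.
  val withMargin: String =
    s"""resolved attribute(s) $missing missing from $available
       |in the current operator""".stripMargin
}
```

Note that the `stripMargin` form keeps the line break inside the resulting string, whereas the concatenated form yields a single line.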

holdenk · 2017-03-06T16:24:19Z

sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/AnalysisErrorSuite.scala


-    assertAnalysisError(plan, "resolved attribute(s) a#1L missing from a#2L" :: Nil)
+    assertAnalysisError(plan,
+                        "Some resolved attribute(s) are not present among available " +


This is a weird mixing of string formats.

SparkQA · 2017-03-22T01:30:53Z

Test build #75013 has finished for PR 17100 at commit 7bb6f35.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2017-03-22T03:10:30Z

Can you remove [PYTHON] from the PR title? Thanks!

gatorsmile · 2017-03-22T03:21:44Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala

+
+            val repeatedNameHint = if (commonNames.size > 0) {
+              s"\n|Observe that attribute(s) ${commonNames.mkString(",")} appear in your " +
+                "query with at least two different hashes, but same name."


Users do not understand what hashes mean.

Good point!

gatorsmile · 2017-03-22T03:44:54Z

To detect duplicate names, you need to pass SQLConf conf in checkAnalysis for supporting the case sensitivity in name resolution. For each missing attribute, you can try using conf.resolver to find whether there exists a column from o.inputSet whose name is the same. After detecting the duplicate names, you can output the message to say the missing attributes are from a Dataset different from the input set.
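A minimal sketch of the idea described in this comment, assuming a resolver is simply a function that compares two names. `ResolverSketch`, `Attr`, and `sameNameAsInput` are hypothetical names for illustration only; the sketch does not use Spark's own classes.

```scala
object ResolverSketch {
  // A resolver decides whether two column names count as the same name.
  type Resolver = (String, String) => Boolean

  val caseSensitive: Resolver = (a, b) => a == b
  val caseInsensitive: Resolver = (a, b) => a.equalsIgnoreCase(b)

  // Hypothetical stand-in for a resolved attribute: a name plus a unique id.
  final case class Attr(name: String, id: Long)

  // Missing attributes whose name also occurs among the operator's inputs.
  def sameNameAsInput(
      missing: Seq[Attr],
      input: Seq[Attr],
      resolver: Resolver): Seq[Attr] =
    missing.filter(m => input.exists(i => resolver(m.name, i.name)))
}
```

With `missing = Seq(Attr("A", 1L), Attr("c", 3L))` and `input = Seq(Attr("a", 2L))`, the case-insensitive resolver flags `A` while the case-sensitive one flags nothing, which is why the case-sensitivity setting has to reach the check.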

rberenguel · 2017-05-02T23:06:34Z

@gatorsmile Thanks for the pointers, finally found some time to come back to this. I'm not sure if my approach to get the SQLConf into checkAnalysis is the correct one in my current local changes (since it seems to change a possible API endpoint). I changed the current implementation in the trait to be named instead def checkAnalysisWithConf(plan: LogicalPlan, conf: SQLConf): Unit and added an abstract method def checkAnalysis(plan: LogicalPlan): Unit that is then implemented in Analyzer (where we have a conf we can pass around). I haven't fixed all the rest yet, was puzzled enough with the correctness of this for now ;) Thanks

SparkQA · 2017-05-03T02:03:49Z

Test build #76402 has finished for PR 17100 at commit 766a033.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-05-03T03:54:55Z

Test build #76401 has finished for PR 17100 at commit 4ac8143.

This patch passes all tests.
This patch does not merge cleanly.
This patch adds no public classes.

SparkQA · 2017-05-04T01:57:30Z

Test build #76434 has finished for PR 17100 at commit c2dfe11.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

rberenguel · 2017-05-14T19:25:01Z

Can you give a look to the changes @gatorsmile when you have a few spare moments? Also not sure how the conflict has appeared: it was merging cleanly when the CI test suite ran (will obviously fix, but first I want to confirm it's the right way of doing this now)

SparkQA · 2017-05-29T12:38:50Z

Test build #77502 has finished for PR 17100 at commit 0cb9825.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

holdenk · 2017-10-10T20:07:21Z

re-ping @gatorsmile?

wzhfy · 2017-10-23T01:27:24Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala

 import org.apache.spark.sql.catalyst.optimizer.BooleanSimplification
 import org.apache.spark.sql.catalyst.plans._
 import org.apache.spark.sql.catalyst.plans.logical._
+import org.apache.spark.sql.internal.SQLConf


unused import

wzhfy · 2017-10-23T01:39:41Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala

          case o if o.children.nonEmpty && o.missingInput.nonEmpty =>
+            val resolver = plan.conf.resolver
+            val attrsWithSameName = o.missingInput.filter(x =>
+              o.inputSet.exists(y => resolver(x.name, y.name)))


nit:

    val attrsWithSameName = o.missingInput.filter { missing =>
      o.inputSet.exists(input => resolver(missing.name, input.name))
    }

wzhfy · 2017-10-23T01:41:29Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala

            val missingAttributes = o.missingInput.mkString(",")
-            val input = o.inputSet.mkString(",")
+            val availableAttributes = o.inputSet.mkString(",")
+            val repeatedNameHint = if (attrsWithSameName.size > 0) {


repeatedNameHint doesn't need missingAttributes or availableAttributes, could you move this definition above right after attrsWithSameName?

attrsWithSameName.size > 0 -> attrsWithSameName.nonEmpty

I don't understand where I should move missing or available, isn't it already just right after attrsWithSameName? But I moved it after repeatedNameHint, since it's needed only for the failAnalysis argument

Yea, that's what I meant :)

wzhfy · 2017-10-23T01:49:45Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala

+            val repeatedNameHint = if (attrsWithSameName.size > 0) {
+              val commonNames = attrsWithSameName.map(_.name).mkString(",")
+              s"""\n|Attribute(s) `$commonNames` seem to appear in two
+                  |different datasets, with the same name."""


The datasets concept is a little strange here, would inputs or input operators be better?

\n|Please check attribute(s) $commonNames, they seem to appear in two...

I have changed it. Agree datasets may not be the best word to define it

SparkQA · 2017-10-23T09:05:50Z

Test build #82970 has finished for PR 17100 at commit 6e8ab42.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

wzhfy · 2017-10-23T09:30:14Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala

+                |$missingAttributes is not in $availableAttributes.
+                |$repeatedNameHint
+                |The failed query was for operator
+                |${operator.simpleString}""".stripMargin)


Actually I agree with @viirya, the original error message is OK for me. If you really want to improve it, I think just adding repeatedNameHint at the end would be good enough:

s"Resolved attribute(s) $missingAttributes missing from $input in operator ${operator.simpleString}. $repeatedNameHint"

What do you think?

This is good for me, having the repeatedNameHint is the whole point, at the end.

the final message could be:

    if (repeatedNameHint.nonEmpty) msg + " " + repeatedNameHint else msg

Hmmm, seems reasonable indeed, and it avoids that horrible trailing newline. Thanks a lot :)

wzhfy · 2017-10-23T10:54:04Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala

          case o if o.children.nonEmpty && o.missingInput.nonEmpty =>
+            val resolver = plan.conf.resolver
+            val attrsWithSameName = o.missingInput.filter(missing =>
+              o.inputSet.exists(input => resolver(missing.name, input.name)))


could you also change the format?

    o.missingInput.filter { missing =>
      o.inputSet.exists(input => resolver(missing.name, input.name))
    }

Done. Do you know if there is any style rule for when to use parentheses vs braces in these kinds of lambda functions?

Usually parentheses are used for one-line lambdas, while curly braces are used for multi-line ones. For the complete Scala style guide, please check this: https://github.com/databricks/scala-style-guide

Thanks @wzhfy
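A small illustration of the convention mentioned in the reply above, using plain collections. The names `LambdaStyle`, `missing`, and `input` are placeholders for illustration only.

```scala
object LambdaStyle {
  val missing = Seq("a", "c")
  val input = Seq("a", "b")

  // Lambda that fits on one line: parentheses.
  val onOneLine: Seq[String] = missing.filter(m => input.contains(m))

  // Lambda whose body is written across several lines: curly braces.
  val multiLine: Seq[String] = missing.filter { m =>
    input.contains(m)
  }
}
```

Both forms compile to the same call; the choice is purely a readability convention.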

wzhfy · 2017-10-23T10:55:24Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala

+
            val missingAttributes = o.missingInput.mkString(",")
-            val input = o.inputSet.mkString(",")
+            val availableAttributes = o.inputSet.mkString(",")


could you revert this change? original input is ok for me.

wzhfy · 2017-10-23T12:59:48Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala

+            val repeatedNameHint = if (attrsWithSameName.nonEmpty) {
+              val commonNames = attrsWithSameName.map(_.name).mkString(",")
+              s"""|Please check attribute(s) `$commonNames`, they seem to appear in two
+                  |different input operators, with the same name.""".stripMargin


The phrase "two different" could be wrong (as in the test case). How about changing the hint message:

Attribute(s) with the same name are found in the input: `$sameNames`. Please check if the right attribute(s) are used.

I have changed it; it now reads well no matter how many appear.

wzhfy · 2017-10-23T13:00:30Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala

+              o.inputSet.exists(input => resolver(missing.name, input.name))
+            }
+            val repeatedNameHint = if (attrsWithSameName.nonEmpty) {
+              val commonNames = attrsWithSameName.map(_.name).mkString(",")


commonNames -> sameNames?

wzhfy · 2017-10-23T13:05:15Z

sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/AnalysisErrorSuite.scala

+    val attrA = AttributeReference("a", LongType)(exprId = ExprId(1))
+    val otherA = AttributeReference("a", LongType)(exprId = ExprId(2))
+    val bAlias = Alias(sum(attrA), "b")() :: Nil
+    val plan = Aggregate(


We can make the test case stronger by adding another attribute c, which is missing from the input but doesn't share a name with anything in the input.

Added another attribute

wzhfy · 2017-10-23T13:05:56Z

sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/AnalysisErrorSuite.scala

+                     |Please check attribute(s) `a`, they seem to appear in two
+                     |different input operators, with the same name.""".stripMargin
+
+


only one new line is enough here

SparkQA · 2017-10-23T14:34:42Z

Test build #82981 has finished for PR 17100 at commit e3cc455.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-10-23T14:41:12Z

Test build #82979 has finished for PR 17100 at commit 72b72cf.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-10-23T14:42:26Z

Test build #82980 has finished for PR 17100 at commit 33a8a71.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-10-23T23:59:32Z

Test build #82994 has finished for PR 17100 at commit 34753b5.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

wzhfy · 2017-10-24T12:30:37Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala

+            val repeatedNameHint = if (attrsWithSameName.nonEmpty) {
+              val sameNames = attrsWithSameName.map(_.name).mkString(",")
+              s"""Attribute(s) with the same name appear in the operation: `$sameNames`.
+                  |Please check if the right attribute(s) are used.""".stripMargin


Personally I prefer a one-line message for repeatedNameHint and msg, because they are parameters of AnalysisException, but that's not a strong point.

I don't have a strong case for having two lines here, and I see the point of seeing it as one line. I guess the best way to have it as one line is splitting it into

    val attributeRepetitionMsg = s"Attribute(s) with the same name appear in the operation: `$sameNames`"
    val checkAttributesMsg = s"Please check if the right attribute(s) are used."

to avoid hitting the < 100 chars linting rule?

wzhfy · 2017-10-24T12:31:19Z

LGTM except one minor comment. ping @gatorsmile

gatorsmile · 2017-10-24T16:50:20Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala

+            }
+            val repeatedNameHint = if (attrsWithSameName.nonEmpty) {
+              val sameNames = attrsWithSameName.map(_.name).mkString(",")
+              s"""Attribute(s) with the same name appear in the operation: `$sameNames`.


Normally, we do not quote multiple column names in the same quote.

You mean wrapping each column name with the ticks?

Let us follow what we did for missingAttributes. No need to do it.

gatorsmile · 2017-10-24T17:03:15Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala

+            val msg = s"""Resolved attribute(s) $missingAttributes missing from $input
+                          |in operator ${operator.simpleString}.""".stripMargin
+
+            failAnalysis(if (repeatedNameHint.nonEmpty) msg + "\n" + repeatedNameHint else msg)


How about

    val missingAttributes = o.missingInput.mkString(",")
    val input = o.inputSet.mkString(",")
    val msgForMissingAttributes = s"Resolved attribute(s) $missingAttributes missing " +
      s"from $input in operator ${operator.simpleString}."

    val resolver = plan.conf.resolver
    val attrsWithSameName = o.missingInput.filter { missing =>
      o.inputSet.exists(input => resolver(missing.name, input.name))
    }

    val msg = if (attrsWithSameName.nonEmpty) {
      val sameNames = attrsWithSameName.map(_.name).mkString(",")
      s"$msgForMissingAttributes. Attribute(s) with the same name appear in the " +
        s"operation: $sameNames. Please check if the right attribute(s) are used."
    } else {
      msgForMissingAttributes
    }

    failAnalysis(msg)

Yup, thanks, it is more orderly this way.
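A self-contained sketch of the message assembly suggested above, written against plain collections so that it runs without Spark. `MissingAttributeMessage`, `Attr`, and `build` are hypothetical names, and the operator is passed as a plain string. One deliberate difference from the quoted suggestion: the base message already ends with a period, so this sketch joins the hint with a single space rather than ". ", which would otherwise produce a double period.

```scala
object MissingAttributeMessage {
  // Hypothetical stand-in for a resolved attribute; not Spark's class.
  final case class Attr(name: String, id: Long) {
    override def toString: String = s"$name#$id"
  }

  def build(
      missing: Seq[Attr],
      input: Seq[Attr],
      operator: String,
      resolver: (String, String) => Boolean): String = {
    val base = s"Resolved attribute(s) ${missing.mkString(",")} missing " +
      s"from ${input.mkString(",")} in operator $operator."

    // Missing attributes that share a name with some input attribute.
    val sameName = missing.filter(m => input.exists(i => resolver(m.name, i.name)))

    if (sameName.nonEmpty) {
      val names = sameName.map(_.name).mkString(",")
      // The hint is appended only when a name collision is detected.
      s"$base Attribute(s) with the same name appear in the " +
        s"operation: $names. Please check if the right attribute(s) are used."
    } else {
      base
    }
  }
}
```

Called with missing attributes `a#1` and `c#3` and an input of `a#2`, the hint names only `a`; when nothing in the input shares a name with a missing attribute, the base message is returned unchanged.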

SparkQA · 2017-10-24T22:13:18Z

Test build #83020 has finished for PR 17100 at commit b8fbdea.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2017-10-25T06:02:28Z

Thanks! Merged to master.

rberenguel · 2017-10-25T09:34:59Z

Thanks @gatorsmile @holdenk @viirya @wzhfy for all the help

holdenk · 2017-10-25T10:28:31Z

Congratulations on your change @rberenguel :D :)

gatorsmile · 2017-10-25T19:26:37Z

@rberenguel Thank you for your contribution!

SPARK-13947 Rewritten error message for clarity. Added extra informat…

981224b

…ion in case of attribute name collision

rberenguel changed the title ~~[SPARK-13947][PYTHON] PySpark DataFrames: The error message from using an invalid table reference is not clear~~ [SPARK-13947][PYTHON][SQL] PySpark DataFrames: The error message from using an invalid table reference is not clear Mar 1, 2017

SPARK-13947 Regenerated the golden file with the updated error messag…

ea07688

…e. Added a mental note to disable coursier when running the full Spark test suite

SPARK-13947 Regenerated golden file after formatting fix. Fixed the A…

65b9596

…nalysisErrorSuite test that was failing. Just in case, removed the hardcoding of the names and hashes from this test

holdenk reviewed Mar 6, 2017

Ruben Berenguel Montoro added 2 commits March 22, 2017 00:19

SPARK-13947 Code review improvements

c99229c

Merge branch 'master' into SPARK-13947-error-message

7bb6f35

# Conflicts: # sql/core/src/test/resources/sql-tests/results/subquery/negative-cases/invalid-correlation.sql.out

gatorsmile reviewed Mar 22, 2017

rberenguel changed the title ~~[SPARK-13947][PYTHON][SQL] PySpark DataFrames: The error message from using an invalid table reference is not clear~~ [SPARK-13947][SQL] PySpark DataFrames: The error message from using an invalid table reference is not clear Mar 22, 2017

Ruben Berenguel Montoro and others added 2 commits May 3, 2017 01:26

SPARK-13947 Not sure about these changes, but passes the local suite …

4ac8143

…just fine. Let’s see!

Merge branch 'master' into SPARK-13947-error-message

766a033

SPARK-13947 Never quick-edit a non-clean merge after 1 AM

c2dfe11

rberenguel added 2 commits May 29, 2017 11:19

Merge branch 'master' into SPARK-13947-error-message

dcf0b2c

Merge branch 'master' into SPARK-13947-error-message

0cb9825

wzhfy reviewed Oct 23, 2017

SPARK-13947 Another batch of code review modifications

6e8ab42

wzhfy reviewed Oct 23, 2017

rberenguel added 3 commits October 23, 2017 12:04

SPARK-13947 Another batch of code review modifications

72b72cf

SPARK-13947 Format change for a lambda

33a8a71

SPARK-13947 Improve the final formatting

e3cc455

wzhfy reviewed Oct 23, 2017

SPARK-13947 Test and message improvements

34753b5

wzhfy reviewed Oct 24, 2017

gatorsmile reviewed Oct 24, 2017

SPARK-13947 More code review changes

b8fbdea

asfgit closed this in 427359f Oct 25, 2017

rberenguel deleted the SPARK-13947-error-message branch October 25, 2017 09:34
