@anchovYu anchovYu commented Apr 18, 2022

What changes were proposed in this pull request?

Improve the error messages for cast failures in ANSI mode.
As mentioned in https://issues.apache.org/jira/browse/SPARK-38929, this PR targets two cast-to types: numeric types and date types.

  • For numeric types (int, smallint, double, float, decimal, ...), it embeds the cast-to type in the error message. For example,
    Invalid input value for type INT: '1.0'. To return NULL instead, use 'try_cast'. If necessary set %s to false to bypass this error.
    
    It uses toSQLType and toSQLValue to format the corresponding types and literals.
  • For date types, the message follows the same pattern. For example,
    Invalid input value for type TIMESTAMP: 'a'. To return NULL instead, use 'try_cast'. If necessary set spark.sql.ansi.enabled to false to bypass this error.
    

Why are the changes needed?

To improve the error message in general.

Does this PR introduce any user-facing change?

It changes the error messages.

How was this patch tested?

The related unit tests are updated.

@anchovYu
Contributor Author

  } catch {
    case e: NumberFormatException =>
-     throw QueryExecutionErrors.invalidInputSyntaxForNumericError(e, errorContext)
+     throw QueryExecutionErrors.invalidInputSyntaxForNumericError(to, s, errorContext)
Contributor Author

Here it no longer uses the message from the exception e thrown by toLongExact. It uses the message defined in error-classes.json instead, so that error messages are better organized and grouped.

Contributor Author

The side effect is that it now throws org.apache.spark.SparkNumberFormatException instead of java.lang.NumberFormatException. The former is a subclass of the latter, so code that catches the latter will still work with the updated version. Could this be a problem?
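
To illustrate the compatibility point, here is a minimal sketch. It assumes a hypothetical stand-in class, SketchNumberFormatException, in place of the real org.apache.spark.SparkNumberFormatException (which carries more than a message), and the helper names parseIntStrict and parseOrDefault are made up for the example.

```scala
// Hypothetical stand-in for org.apache.spark.SparkNumberFormatException.
// The only property relied on here is that it extends NumberFormatException.
class SketchNumberFormatException(message: String)
  extends NumberFormatException(message)

object CatchCompat {
  // Mimics the new behaviour: a parse failure surfaces as the subclass.
  def parseIntStrict(s: String): Int =
    try s.trim.toInt
    catch {
      case _: NumberFormatException =>
        throw new SketchNumberFormatException(
          s"Invalid input syntax for type INT: '$s'")
    }

  // Existing caller code that only knows about java.lang.NumberFormatException.
  def parseOrDefault(s: String, default: Int): Int =
    try parseIntStrict(s)
    catch { case _: NumberFormatException => default }

  def main(args: Array[String]): Unit = {
    println(parseOrDefault("42", -1))  // parses normally
    println(parseOrDefault("1.0", -1)) // subclass is caught by the old handler
  }
}
```

A handler written against the parent type keeps matching. Only code that compares the exact exception class or the exact message text would notice the change.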

Contributor

This is fine

}

def invalidInputSyntaxForNumericError(
to: AbstractDataType,
Contributor Author

I don't know whether there is a better way than this AbstractDataType and corresponding tosimpleString method, considering the DecimalType. Since the cast fails, we can't get a concrete DecimalType class with scale and precision, so I just can't use the DataType here.

@anchovYu
Contributor Author

Hi @cloud-fan and @MaxGekk , could you take a look at this one? Thank you!

"message" : [ "Input schema %s can only contain StringType as a key type for a MapType." ]
},
"INVALID_LITERAL_FORMAT_FOR_CAST" : {
"message" : [ "Invalid %s literal: %s. To return NULL instead, use 'try_cast'. If necessary set %s to false to bypass this error.%s" ],
Contributor

@cloud-fan cloud-fan Apr 18, 2022

Is the cast input always a literal? I think "Invalid input value for type %s: %s. To return NULL ..." is better.

  } catch {
    case _: NumberFormatException =>
-     throw QueryExecutionErrors.invalidInputSyntaxForNumericError(str, errorContext)
+     throw QueryExecutionErrors.invalidInputSyntaxForNumericError(DecimalType, str, errorContext)
Contributor

Where do we call fromStringANSI? There should be a concrete DecimalType available

Contributor Author

I didn't do it initially because I think of fromStringANSI and changePrecision as two separate steps, each with its own exception. The former fails when the string cannot be converted to a BigDecimal, and at that step the target DecimalType is not involved. The latter fails when the precision cannot be changed, which is where the target type first comes into play.

But even though these are two steps, they are always called together, one after the other. So passing the target type into the first step seems fine. It also saves a lot of trouble with AbstractDataType etc. Will update the PR.
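
A rough sketch of that two-step shape, assuming simplified stand-ins: TargetDecimal is a hypothetical replacement for a concrete DecimalType, and fromString, changePrecision and cast are illustrative helpers built on java.math.BigDecimal, not Spark's Decimal implementation.

```scala
import java.math.{BigDecimal => JBigDecimal, RoundingMode}

object DecimalCastSketch {
  // Hypothetical, simplified stand-in for a concrete DecimalType(precision, scale).
  final case class TargetDecimal(precision: Int, scale: Int) {
    def sql: String = s"DECIMAL($precision,$scale)"
  }

  // Step 1: parse the string. The target type is passed in only so that
  // a parse failure can name it in the error message.
  def fromString(str: String, to: TargetDecimal): JBigDecimal =
    try new JBigDecimal(str.trim)
    catch {
      case _: NumberFormatException =>
        throw new NumberFormatException(
          s"Invalid input syntax for type ${to.sql}: '$str'")
    }

  // Step 2: fit the parsed value into the target precision and scale.
  def changePrecision(d: JBigDecimal, to: TargetDecimal): JBigDecimal = {
    val scaled = d.setScale(to.scale, RoundingMode.HALF_UP)
    if (scaled.precision() > to.precision)
      throw new ArithmeticException(s"$d cannot be represented as ${to.sql}")
    scaled
  }

  // The two steps are always run back to back, so both can see the target.
  def cast(str: String, to: TargetDecimal): JBigDecimal =
    changePrecision(fromString(str, to), to)

  def main(args: Array[String]): Unit =
    println(cast("12.345", TargetDecimal(4, 2)))
}
```

Because the caller already holds the concrete target when it starts the cast, handing it to the parsing step costs nothing and lets the first error message mention the full type.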

case StringType =>
val doubleStr = ctx.freshVariable("doubleStr", StringType)
(c, evPrim, evNull) =>
val dt = ctx.addReferenceObj("doubleType", DoubleType, DoubleType.getClass.getName)
Member

This is overkill. Let's simply hardcode the data type name in the method invalidInputSyntaxForNumericError like #36244

Contributor Author

Sorry, I forgot to remove this line. The dt is not used.

Contributor Author

BTW, according to the comment from Maxim below, I believe we still need to accept a DataType or AbstractDataType, rather than a plain string, as a parameter to invalidInputSyntaxForNumericError, so that we can use toSQLType to format the type name.
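
As a rough illustration of why a type object is more convenient than a plain string here, the sketch below fills the message template from this PR using simplified stand-ins. SketchType, IntT and TimestampT are hypothetical, and the toSQLType and toSQLValue functions below are simplified imitations of the helpers named in the thread, not Spark's implementations.

```scala
object CastErrorMessageSketch {
  // Hypothetical miniature of a data type that knows its SQL name.
  sealed trait SketchType { def sql: String }
  case object IntT extends SketchType { val sql = "INT" }
  case object TimestampT extends SketchType { val sql = "TIMESTAMP" }

  // Simplified imitations of the helpers mentioned in the review.
  def toSQLType(t: SketchType): String = t.sql
  def toSQLValue(v: String): String = "'" + v + "'"

  // Template with the wording agreed on later in this thread.
  val template: String =
    "Invalid input syntax for type %s: %s. To return NULL instead, use 'try_cast'. " +
      "If necessary set %s to false to bypass this error.%s"

  // Taking the type object (not a string) keeps formatting in one place.
  def message(to: SketchType, value: String, context: String = ""): String =
    template.format(toSQLType(to), toSQLValue(value), "spark.sql.ansi.enabled", context)

  def main(args: Array[String]): Unit = {
    println(message(IntT, "1.0"))
    println(message(TimestampT, "a"))
  }
}
```

With this shape, every call site passes the type itself and the rendering rule lives in a single function, so a later change to how types are printed does not require touching each caller.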

@gengliangwang
Member

@anchovYu I am sorry that I didn't notice you had started this one. I only realized it after I created #36244.
This PR should not be limited to literals. I suggest changing the scope to casting strings to datetime types and following #36244 in the implementation.
I have also added you as co-author in #36244.

It would be great if you ping me in ANSI-related PRs, thanks!

    formatter.parse("x123")
  }.getMessage
- assert(errMsg.contains("Cannot cast x123 to DateType"))
+ assert(errMsg.contains("Invalid `date` literal: 'x123'"))
Member

Please, use toSQLType to output types, see #36233

Contributor Author

Will do, thanks for the source!

@AmplabJenkins

Can one of the admins verify this patch?

@anchovYu
Contributor Author

"sqlState" : "42000"
},
"INVALID_FORMAT_FOR_CAST" : {
"message" : [ "Invalid input value for type %s: %s. To return NULL instead, use 'try_cast'. If necessary set %s to false to bypass this error.%s" ],
Member

Shall we keep the previous wording?

Invalid input syntax for type ...

It is from PostgreSQL.

Contributor Author

OK

"message" : [ "Field name %s is invalid: %s is not a struct." ],
"sqlState" : "42000"
},
"INVALID_FORMAT_FOR_CAST" : {
Member

How about INVALID_SYNTAX_FOR_CAST?

@anchovYu anchovYu requested a review from gengliangwang April 19, 2022 05:36
@anchovYu
Contributor Author

- new DateTimeException(s"Cannot cast $value to $to. To return NULL instead, use 'try_cast'. " +
-   s"If necessary set ${SQLConf.ANSI_ENABLED.key} to false to bypass this error." + errorContext)
+ new DateTimeException(s"Invalid input syntax for type ${toSQLType(to)}: " +
+   s"${if (value.isInstanceOf[UTF8String]) {
Member

nit: let's save the string value to a val before line 1016

}

- def fromStringANSI(str: UTF8String, errorContext: String = ""): Decimal = {
+ def fromStringANSI(str: UTF8String, to: DecimalType = DecimalType.USER_DEFAULT,
Member

Suggested change
- def fromStringANSI(str: UTF8String, to: DecimalType = DecimalType.USER_DEFAULT,
+ def fromStringANSI(
+     str: UTF8String,
+     to: DecimalType = DecimalType.USER_DEFAULT,

Member

@gengliangwang gengliangwang left a comment

LGTM pending tests

@anchovYu
Contributor Author

@@ -1,5 +1,5 @@
  -- Automatically generated by SQLQueryTestSuite
- -- Number of queries: 142
+ -- Number of queries: 143
Member

Hmm, only the number increased?

Contributor Author

These are the generated results, and I checked that 143 is the right number.

@MaxGekk
Member

MaxGekk commented Apr 19, 2022

+1, LGTM. Merging to master.
Thank you, @anchovYu.

@MaxGekk MaxGekk closed this in f76b3e7 Apr 19, 2022
@MaxGekk
Member

MaxGekk commented Apr 19, 2022

@anchovYu Could you backport the changes to branch-3.3, please?

anchovYu added a commit to anchovYu/spark that referenced this pull request Apr 20, 2022
Closes apache#36241 from anchovYu/ansi-error-improve.

Authored-by: Xinyi Yu <xinyi.yu@databricks.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
(cherry picked from commit f76b3e7)
@anchovYu
Contributor Author

The cherry-pick PR for branch-3.3: #36275
