@@ -138,6 +138,11 @@ object TypeCoercion {
case (DateType, TimestampType)
=> if (conf.compareDateTimestampInTimestamp) Some(TimestampType) else Some(StringType)

// to support a popular use case of tables using Decimal(X, 0) for long IDs instead of strings
// see SPARK-26070 for more details
case (n: DecimalType, s: StringType) if n.scale == 0 => Some(DecimalType(n.precision, n.scale))
Contributor:
what if the decimal is (1, 0) and the string is something like 1111.1111?

The string can be anything: a very big integer, a fraction with many digits after the dot, etc. I don't think there is a perfect solution; casting to double is the best we can do here.

I'd suggest end users manually do the cast that fits their data best.
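To illustrate the point above, here is a minimal plain-Scala sketch (not Spark internals): a string operand such as "1111.1111" cannot be represented by any zero-scale DecimalType, so falling back to double keeps the comparison well-defined, at the cost of possible precision loss for very large values. The object name is hypothetical.

```scala
// Plain-Scala sketch of why no exact DecimalType fits an arbitrary string.
object DecimalStringFallback {
  def main(args: Array[String]): Unit = {
    val decimalSide = BigDecimal("1")   // a value that fits Decimal(1, 0)
    val stringSide  = "1111.1111"       // fractional string operand

    // The fallback rule effectively compares both sides as doubles.
    val result = decimalSide.toDouble < stringSide.toDouble
    println(result)                     // prints "true"
  }
}
```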

Contributor (Author):

Yes, I see what you mean, and I agree. However, this wrong implicit type coercion is a huge source of potential bugs (evidently we've found it in a few places) and causes wrong results.
What do you say that, along the lines of SPARK-21646, we add another flag, "typeCoercion.mode", which would have a "safe mode"? It would just throw an AnalysisException when the user tries to compare unsafe types.
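A rough sketch of what such a safe mode could look like; the names (TypeCoercionMode, SafeModeCheck) and the string type tags are illustrative only, not Spark's actual configuration keys or type hierarchy.

```scala
// Hypothetical sketch of a "typeCoercion.mode" safe mode.
object TypeCoercionMode {
  val Safe   = "safe"
  val Legacy = "legacy"
}

object SafeModeCheck {
  // Pairs considered unsafe to compare without an explicit cast.
  private val unsafePairs = Set(Set("string", "decimal"), Set("string", "int"))

  def checkComparable(left: String, right: String, mode: String): Unit =
    if (mode == TypeCoercionMode.Safe && unsafePairs.contains(Set(left, right)))
      throw new IllegalArgumentException(
        s"cannot compare $left with $right without an explicit cast")
}
```

In this sketch, `SafeModeCheck.checkComparable("string", "decimal", TypeCoercionMode.Safe)` throws, while legacy mode keeps the current permissive behavior.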

Contributor:

CC @gatorsmile @mgaido91 I think it's time to look at the SQL standard and other mainstream databases, and see how we should update the type coercion rules with a safe mode. What do you think?

Contributor:

@cloud-fan I think we have seen many issues on this. I don't think there is a standard for them, every RDBMS has different rules. The worst thing about the current rules IMHO is that they are not even coherent in Spark (see #19635 for instance).

The option I'd prefer is to follow Postgres behavior, i.e. no implicit casts at all. When there is a type mismatch, the user has to choose how to cast things. It is a bit more effort on the user's side, but it is the safest option IMHO.

What do you think?

Contributor:

> no implicit cast at all

Is that too strict? I feel it's OK to compare an int with long. Maybe we should come up with a list of "definitely safe" type coercions, and allow them only.

Contributor:

I agree with you when you say that it is "too strict", as this is an extreme approach. But that's how Postgres works, and I do believe it has some benefits over other behaviors. I'd argue just a couple of things about what you are suggesting:

  • If you are comparing an int and a long, you most likely have to better define your schemas (not always true, of course). For instance, in Oracle, if you are not careful you pretty easily end up with a variable like YEAR being stored as an integer in some tables, as a string in others, and so on, because Oracle does implicit casting and nobody realizes the bad design. With Postgres you do realize it if you have a messy schema design like that. And if your schemas are fine and you do need a cast, you can always add it explicitly;
  • I think it is very hard to determine which type coercions are "definitely safe". For instance, is casting a DOUBLE to a DECIMAL definitely safe? And an INT to STRING? We may say that casting an INT to STRING is safe, because there is no error at all in the conversion, but if we have something like 2014 = '2014 ' should we return true or false? Most likely the user wants a true there, and most likely we would return false, which the user may (or may not) realize only at the end of the job (which may mean several hours).
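The trailing-space pitfall described above can be sketched in plain Scala (the object name is hypothetical; this only models the two comparison semantics, not Spark's actual coercion code):

```scala
// Plain-Scala sketch of the 2014 = '2014 ' pitfall.
object TrailingSpaceExample {
  def main(args: Array[String]): Unit = {
    // If INT is coerced to STRING, the comparison is plain string equality,
    // so a trailing space makes the operands unequal.
    val asStrings = "2014" == "2014 "
    // If the STRING were parsed as a number instead, the comparison succeeds.
    val asNumbers = 2014 == "2014 ".trim.toInt
    println(s"as strings: $asStrings, as numbers: $asNumbers")
    // prints "as strings: false, as numbers: true"
  }
}
```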

Contributor (Author):

I personally agree with @cloud-fan that there are a few types that are "definitely safe", and since the user is not always responsible for his input tables, I believe convenience is more important than strict schema definitions. Also, since even count() returns a bigint, you would have to write the filter as 'count(*) > 100L', which would be a huge regression.
I believe that the "definitely safe" list is very short and we should use it. @mgaido91, in your examples I do agree that Double to Decimal is not safe, and so is String to almost anything.
The trivially safe pairs are something like (Long, Int), (Int, Double), (Decimal, Decimal) (where both can be widened to a common precision and scale), and maybe (Date, Timestamp).
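A minimal sketch of what such a "definitely safe" allow-list might look like, using simplified type tags rather than Spark's actual DataType classes (all names here are hypothetical):

```scala
// Simplified type tags standing in for Spark's DataType hierarchy.
sealed trait SimpleType
case object IntT extends SimpleType
case object LongT extends SimpleType
case object DoubleT extends SimpleType
case class DecimalT(precision: Int, scale: Int) extends SimpleType
case object DateT extends SimpleType
case object TimestampT extends SimpleType

object SafeCoercion {
  // Returns the widened type only for pairs the allow-list deems lossless;
  // everything else requires an explicit cast from the user.
  def widen(a: SimpleType, b: SimpleType): Option[SimpleType] = (a, b) match {
    case (IntT, LongT) | (LongT, IntT)             => Some(LongT)
    case (IntT, DoubleT) | (DoubleT, IntT)         => Some(DoubleT)
    case (DecimalT(p1, s1), DecimalT(p2, s2))      =>
      // Widen to a decimal that can hold both operands exactly.
      val scale     = math.max(s1, s2)
      val intDigits = math.max(p1 - s1, p2 - s2)
      Some(DecimalT(intDigits + scale, scale))
    case (DateT, TimestampT) | (TimestampT, DateT) => Some(TimestampT)
    case _                                         => None
  }
}
```

For example, `widen(IntT, LongT)` yields `Some(LongT)`, while `widen(DoubleT, DecimalT(10, 2))` yields `None`, forcing an explicit cast.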

case (s: StringType, n: DecimalType) if n.scale == 0 => Some(DecimalType(n.precision, n.scale))

// There is no proper decimal type we can pick,
// using double type is the best we can do.
// See SPARK-22469 for details.
@@ -1490,6 +1490,10 @@ class TypeCoercionSuite extends AnalysisTest {
GreaterThan(Literal("1.5"), Literal(BigDecimal("0.5"))),
GreaterThan(Cast(Literal("1.5"), DoubleType), Cast(Literal(BigDecimal("0.5")),
DoubleType)))
ruleTest(rule,
GreaterThan(Literal("22222222222222222224"), Literal(Decimal("22222222222222222223"))),
GreaterThan(Cast(Literal("22222222222222222224"), DecimalType(20, 0)),
Literal(Decimal("22222222222222222223"))))
Seq(true, false).foreach { convertToTS =>
withSQLConf(
"spark.sql.legacy.compareDateTimestampInTimestamp" -> convertToTS.toString) {