[SPARK-26070][SQL] add rule for implicit type coercion for decimal(x,0) #23042

uzadude · 2018-11-15T08:18:20Z

What changes were proposed in this pull request?

Adding another excpetion rule for a popular case where ID columns are saved as decimal(x,0) and are compared with same ID columns in other tables which are saved as strings:

select '22222222222222222224' = 22222222222222222223BD -- returns true

this causes a major bug when we join to tables if one is defined as decimal and the other as string. in current situation we get a mix of good and bad results. we would prefer not to get results at all.

How was this patch tested?

added a unit tests and made some manual tests

…e saved as decimal(x, 0)

uzadude · 2018-11-15T08:20:44Z

@cloud-fan - could you please take a look?

cloud-fan · 2018-11-15T12:05:59Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala


+    // to support a popular use case of tables using Decimal(X, 0) for long IDs instead of strings
+    // see SPARK-26070 for more details
+    case (n: DecimalType, s: StringType) if n.scale == 0 => Some(DecimalType(n.precision, n.scale))


what if the decimal is (1, 0) and the string is something like 1111.1111?

The string can be anything: a very big integer, a fraction with many digits after the dot, etc. I don't think there is a perfect solution, casting to double is the best we can do here.

I'd suggest end users to manually do the cast which fits their data best.

yes.. I see what you mean. I agree. However, this wrong implicit type coercion is a huge bug potential (evidently we've found it in a few places) that causes wrong results.
what do you say that along the lines of SPARK-21646, we'll add another flag of "typeCoercion.mode" which will be a "safe mode". Just throw an AnalysisExcpetion when the user tries to compare unsafe types?

CC @gatorsmile @mgaido91 I think it's time to look at the SQL standard and other mainstream databases, and see how shall we update the type coercions rules with safe mode. What do you think?

@cloud-fan I think we have seen many issues on this. I don't think there is a standard for them, every RDBMS has different rules. The worst thing about the current rules IMHO is that they are not even coherent in Spark (see #19635 for instance).

The option I'd prefer is to follow Postgres behavior, ie. no implicit cast at all. When there is a type mismatch the user has to choose how to cast the things. It is a bit more effort on user side, but it is the safest option IMHO.

What do you think?

no implicit cast at all

Is that too strict? I feel it's OK to compare an int with long. Maybe we should come up with a list of "definitely safe" type coercions, and allow them only.

I agree with you when you say that it is "too strict", as this is an extreme approach. But that's how Postgres works and I do believe it has some benefits over other behaviors. I'd argue just a couple of things about what you are suggesting:

if you are comparing an int and a long, you most likely have to better define your schemas (not always true, of course). For instance in Oracle if you are not careful you end up pretty easily having a variable like YEAR being stored as an integer in some tables, as a string in some others and so on. Because Oracle does implicit casting so nobody realizes this bad design. With Postgres you do realize if you have a messy schema design like that. If it is not the case and your schemas are fine and you do need a casting, you can always add it by the way;

I think it is very hard to determine which are the "definitely safe" type coercions. For instance is casting a DOUBLE to a DECIMAL definitely safe? And an INT to STRING? We may say that casting an INT to STRING is safe, because there is no error at all in the conversion, but if we have something like 2014 = '2014 ' should we return true or false? Most likely the users wants a true there, and most likely we would return false, which the user may (or may even not) realize only at the end of the job (which may mean several hours).

I personally agree with @cloud-fan that there are a few types that are "definitely safe", and as the user is not always responsible to his input tables, I believe convinience is more important than schema definitions. Also, even count() returns a bigint then you'll have to filter 'count(*)>100L' which means huge regression.
I believe that the "definitely safe" list is very short and we should use it. @mgaido91, in your examples I do agree that Double to Decimal is not safe and so is String to almost anything.
the trivial safes are something like (Long, Int), (Int, Double), (Decimal, Decimal) - that could be expanded to the same precision and scale, maybe (Data, TimeStamp)..

AmplabJenkins · 2019-09-16T18:17:38Z

Can one of the admins verify this patch?

github-actions · 2020-01-04T00:06:34Z

We're closing this PR because it hasn't been updated in a while.
This isn't a judgement on the merit of the PR in any way. It's just
a way of keeping the PR queue manageable.

If you'd like to revive this PR, please reopen it!

add another exception rule for a wide use case in which ID columns ar…

5bb9749

…e saved as decimal(x, 0)

added a unit test

76b82cd

uzadude changed the title ~~[SPARK-26070] add rule for implicit type coercion for decimal(x,0)~~ [SPARK-26070][SQL] add rule for implicit type coercion for decimal(x,0) Nov 15, 2018

cloud-fan reviewed Nov 15, 2018

View reviewed changes

dongjoon-hyun added the SQL label Jun 14, 2019

github-actions bot added the Stale label Jan 4, 2020

github-actions bot closed this Jan 5, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[SPARK-26070][SQL] add rule for implicit type coercion for decimal(x,0) #23042

[SPARK-26070][SQL] add rule for implicit type coercion for decimal(x,0) #23042

Uh oh!

uzadude commented Nov 15, 2018 •

edited

Loading

Uh oh!

uzadude commented Nov 15, 2018

Uh oh!

cloud-fan Nov 15, 2018

Uh oh!

uzadude Nov 15, 2018

Uh oh!

cloud-fan Nov 16, 2018

Uh oh!

mgaido91 Nov 16, 2018

Uh oh!

cloud-fan Nov 17, 2018

Uh oh!

mgaido91 Nov 17, 2018

Uh oh!

uzadude Nov 18, 2018

Uh oh!

AmplabJenkins commented Sep 16, 2019

Uh oh!

github-actions bot commented Jan 4, 2020

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

[SPARK-26070][SQL] add rule for implicit type coercion for decimal(x,0) #23042

[SPARK-26070][SQL] add rule for implicit type coercion for decimal(x,0) #23042

Uh oh!

Conversation

uzadude commented Nov 15, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What changes were proposed in this pull request?

How was this patch tested?

Uh oh!

uzadude commented Nov 15, 2018

Uh oh!

cloud-fan Nov 15, 2018

Choose a reason for hiding this comment

Uh oh!

uzadude Nov 15, 2018

Choose a reason for hiding this comment

Uh oh!

cloud-fan Nov 16, 2018

Choose a reason for hiding this comment

Uh oh!

mgaido91 Nov 16, 2018

Choose a reason for hiding this comment

Uh oh!

cloud-fan Nov 17, 2018

Choose a reason for hiding this comment

Uh oh!

mgaido91 Nov 17, 2018

Choose a reason for hiding this comment

Uh oh!

uzadude Nov 18, 2018

Choose a reason for hiding this comment

Uh oh!

AmplabJenkins commented Sep 16, 2019

Uh oh!

github-actions bot commented Jan 4, 2020

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

uzadude commented Nov 15, 2018 •

edited

Loading