Skip to content

Conversation

@uzadude
Copy link
Contributor

@uzadude uzadude commented Nov 15, 2018

What changes were proposed in this pull request?

Adding another excpetion rule for a popular case where ID columns are saved as decimal(x,0) and are compared with same ID columns in other tables which are saved as strings:

select '22222222222222222224' = 22222222222222222223BD -- returns true

this causes a major bug when we join to tables if one is defined as decimal and the other as string. in current situation we get a mix of good and bad results. we would prefer not to get results at all.

How was this patch tested?

added a unit tests and made some manual tests

@uzadude
Copy link
Contributor Author

uzadude commented Nov 15, 2018

@cloud-fan - could you please take a look?

@uzadude uzadude changed the title [SPARK-26070] add rule for implicit type coercion for decimal(x,0) [SPARK-26070][SQL] add rule for implicit type coercion for decimal(x,0) Nov 15, 2018

// to support a popular use case of tables using Decimal(X, 0) for long IDs instead of strings
// see SPARK-26070 for more details
case (n: DecimalType, s: StringType) if n.scale == 0 => Some(DecimalType(n.precision, n.scale))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what if the decimal is (1, 0) and the string is something like 1111.1111?

The string can be anything: a very big integer, a fraction with many digits after the dot, etc. I don't think there is a perfect solution, casting to double is the best we can do here.

I'd suggest end users to manually do the cast which fits their data best.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes.. I see what you mean. I agree. However, this wrong implicit type coercion is a huge bug potential (evidently we've found it in a few places) that causes wrong results.
what do you say that along the lines of SPARK-21646, we'll add another flag of "typeCoercion.mode" which will be a "safe mode". Just throw an AnalysisExcpetion when the user tries to compare unsafe types?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CC @gatorsmile @mgaido91 I think it's time to look at the SQL standard and other mainstream databases, and see how shall we update the type coercions rules with safe mode. What do you think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cloud-fan I think we have seen many issues on this. I don't think there is a standard for them, every RDBMS has different rules. The worst thing about the current rules IMHO is that they are not even coherent in Spark (see #19635 for instance).

The option I'd prefer is to follow Postgres behavior, ie. no implicit cast at all. When there is a type mismatch the user has to choose how to cast the things. It is a bit more effort on user side, but it is the safest option IMHO.

What do you think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no implicit cast at all

Is that too strict? I feel it's OK to compare an int with long. Maybe we should come up with a list of "definitely safe" type coercions, and allow them only.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with you when you say that it is "too strict", as this is an extreme approach. But that's how Postgres works and I do believe it has some benefits over other behaviors. I'd argue just a couple of things about what you are suggesting:

  • if you are comparing an int and a long, you most likely have to better define your schemas (not always true, of course). For instance in Oracle if you are not careful you end up pretty easily having a variable like YEAR being stored as an integer in some tables, as a string in some others and so on. Because Oracle does implicit casting so nobody realizes this bad design. With Postgres you do realize if you have a messy schema design like that. If it is not the case and your schemas are fine and you do need a casting, you can always add it by the way;
  • I think it is very hard to determine which are the "definitely safe" type coercions. For instance is casting a DOUBLE to a DECIMAL definitely safe? And an INT to STRING? We may say that casting an INT to STRING is safe, because there is no error at all in the conversion, but if we have something like 2014 = '2014 ' should we return true or false? Most likely the users wants a true there, and most likely we would return false, which the user may (or may even not) realize only at the end of the job (which may mean several hours).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I personally agree with @cloud-fan that there are a few types that are "definitely safe", and as the user is not always responsible to his input tables, I believe convinience is more important than schema definitions. Also, even count() returns a bigint then you'll have to filter 'count(*)>100L' which means huge regression.
I believe that the "definitely safe" list is very short and we should use it. @mgaido91, in your examples I do agree that Double to Decimal is not safe and so is String to almost anything.
the trivial safes are something like (Long, Int), (Int, Double), (Decimal, Decimal) - that could be expanded to the same precision and scale, maybe (Data, TimeStamp)..

@AmplabJenkins
Copy link

Can one of the admins verify this patch?

@github-actions
Copy link

github-actions bot commented Jan 4, 2020

We're closing this PR because it hasn't been updated in a while.
This isn't a judgement on the merit of the PR in any way. It's just
a way of keeping the PR queue manageable.

If you'd like to revive this PR, please reopen it!

@github-actions github-actions bot added the Stale label Jan 4, 2020
@github-actions github-actions bot closed this Jan 5, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants