[SPARK-26070][SQL] add rule for implicit type coercion for decimal(x,0) #23042
Closed
What if the decimal is (1, 0) and the string is something like 1111.1111? The string can be anything: a very big integer, a fraction with many digits after the dot, etc. I don't think there is a perfect solution; casting to double is the best we can do here.
I'd suggest that end users manually apply the cast that best fits their data.
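To make the trade-off concrete, here is a minimal sketch in plain Python using the standard decimal module, not Spark itself. The helper names equal_as_double and equal_as_decimal are hypothetical and only illustrate the two casting choices; the values are picked so that the double cast loses the last digit.

```python
from decimal import Decimal


def equal_as_double(dec_value: Decimal, text: str) -> bool:
    """Compare by casting both sides to a 64-bit float (lossy for large values)."""
    return float(dec_value) == float(text)


def equal_as_decimal(dec_value: Decimal, text: str) -> bool:
    """Compare by parsing the string as an exact decimal (an explicit, user-chosen cast)."""
    return dec_value == Decimal(text)


stored = Decimal("9007199254740992")  # 2**53, would fit a decimal(16, 0) column
text = "9007199254740993"             # 2**53 + 1, not representable as a double

print(equal_as_double(stored, text))   # True: both sides round to the same double
print(equal_as_decimal(stored, text))  # False: the exact values differ
```

The double comparison reports equality for two different numbers, while the exact comparison requires knowing a precision and scale that fit the string, which only the user can know for their data.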
Yes, I see what you mean, and I agree. However, this wrong implicit type coercion has huge bug potential (evidently we've found it in a few places) and causes wrong results.
What do you say we add, along the lines of SPARK-21646, another flag, "typeCoercion.mode", with a "safe mode" that simply throws an AnalysisException when the user tries to compare unsafe types?
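A rough sketch of what such a safe mode could look like, written in plain Python rather than against Spark's analyzer. Everything here is an assumption for illustration: AnalysisError stands in for Spark's AnalysisException, the mode values "legacy" and "safe" are made up, and falling back to double in legacy mode is a simplification of the real coercion rules.

```python
class AnalysisError(Exception):
    """Hypothetical stand-in for Spark's AnalysisException."""


def comparison_type(left: str, right: str, mode: str = "legacy") -> str:
    """Return the type both sides are compared as, or fail in safe mode.

    The type names are plain strings; the rules are illustrative only.
    """
    if left == right:
        return left  # identical types never need coercion
    if mode == "safe":
        # Safe mode: refuse to guess and ask the user for an explicit cast.
        raise AnalysisError(
            f"cannot compare {left} with {right} in safe mode; add an explicit cast"
        )
    # Legacy mode (simplified assumption): fall back to a lossy double comparison.
    return "double"


print(comparison_type("decimal(1,0)", "string"))  # double
try:
    comparison_type("decimal(1,0)", "string", mode="safe")
except AnalysisError as err:
    print(err)
```

Failing at analysis time means the user sees the problem before the job runs, instead of getting silently wrong results at the end.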
CC @gatorsmile @mgaido91 I think it's time to look at the SQL standard and other mainstream databases, and see how we should update the type coercion rules with a safe mode. What do you think?
@cloud-fan I think we have seen many issues on this. I don't think there is a standard for them; every RDBMS has different rules. The worst thing about the current rules, IMHO, is that they are not even coherent within Spark (see #19635 for instance).
The option I'd prefer is to follow Postgres behavior, i.e. no implicit cast at all. When there is a type mismatch, the user has to choose how to cast things. It is a bit more effort on the user's side, but it is the safest option IMHO.
What do you think?
Is that too strict? I feel it's OK to compare an int with a long. Maybe we should come up with a list of "definitely safe" type coercions, and allow only those.
I agree with you when you say that it is "too strict", as this is an extreme approach. But that's how Postgres works and I do believe it has some benefits over other behaviors. I'd argue just a couple of things about what you are suggesting:
2014 = '2014 ': should we return true or false? Most likely the user wants true there, and most likely we would return false, which the user may realize (or may not even realize) only at the end of the job, which may mean several hours later.
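The ambiguity can be shown with a small plain-Python sketch. This uses Python's own conversion semantics, not Spark's, so it only illustrates that the direction of the implicit cast decides the answer.

```python
number = 2014
text = "2014 "  # note the trailing space

# Option 1: cast the number to a string and compare text.
equal_as_strings = str(number) == text

# Option 2: cast the string to a number; Python's int() ignores
# surrounding whitespace, which other engines may or may not do.
equal_as_numbers = number == int(text)

print(equal_as_strings)  # False
print(equal_as_numbers)  # True
```

Both answers are defensible, which is the argument for making the user pick the cast explicitly.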
I personally agree with @cloud-fan that a few type pairs are "definitely safe". Since users are not always responsible for their input tables, I believe convenience is more important than schema definitions. Also, count() returns a bigint, so with no implicit casts at all you would have to write the filter as 'count(*) > 100L', which is a huge regression.
I believe the "definitely safe" list is very short and we should use it. @mgaido91, regarding your examples, I do agree that Double to Decimal is not safe, and neither is String to almost anything.
The trivially safe pairs are something like (Long, Int), (Int, Double), (Decimal, Decimal), where both sides can be expanded to the same precision and scale, and maybe (Date, Timestamp).
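As a sketch of how short such an allow-list could be, here is a plain-Python version. The names SAFE_PAIRS, safe_common_type and widen_decimals are hypothetical, the type names are plain strings, and the decimal widening ignores any maximum-precision cap a real engine would enforce.

```python
from typing import Optional, Tuple

# Unordered pairs considered "definitely safe", mapped to the type both sides widen to.
SAFE_PAIRS = {
    frozenset({"int", "long"}): "long",
    frozenset({"int", "double"}): "double",
    frozenset({"date", "timestamp"}): "timestamp",
}


def safe_common_type(left: str, right: str) -> Optional[str]:
    """Return the widened type for an allow-listed pair, or None if an explicit cast is needed."""
    if left == right:
        return left
    return SAFE_PAIRS.get(frozenset({left, right}))


def widen_decimals(p1: int, s1: int, p2: int, s2: int) -> Tuple[int, int]:
    """Smallest (precision, scale) that holds both decimal(p1, s1) and decimal(p2, s2)."""
    scale = max(s1, s2)                  # keep every fractional digit
    int_digits = max(p1 - s1, p2 - s2)   # keep every integer digit
    return int_digits + scale, scale


print(safe_common_type("long", "int"))    # long
print(safe_common_type("string", "int"))  # None
print(widen_decimals(5, 2, 7, 0))         # (9, 2)
```

Anything not on the list returns None, which a caller could turn into an analysis error asking for an explicit cast.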