[SPARK-16714][SQL] Refactor type widening for consistency #14389
Conversation
|
I was looking at the code, and I think this is a more general problem with decimal widening. The same problem exists for least and other functions. |
|
@dongjoon-hyun You only had one test case, didn't you? I don't think that test case is useful, since it was specifically testing checkInputDataTypes, which was not the right thing to test. Type coercion should be handled by the analyzer, not by the expression's type checking. |
 */
class SQLTypeCoercionSuite extends QueryTest with SharedSQLContext {

  test("SPARK-16714 decimal in map and struct") {
I had a mistake here about the naming. Will fix it later.
|
Yea, for least and greatest, I opened #14294 for this. Actually, I am worried whether allowing loss of precision and fractions is okay. I first thought we should only allow widening within a range where no values are lost, but it seems some think the values should just be truncated, and Hive does this by always falling back to double. Please refer to https://issues.apache.org/jira/browse/SPARK-16646. FYI, the other functions look okay; it seems there are no remaining cases similar to this one. |
|
@petermaxlee. Yep. I deleted my request, but you had better have a test case with real columns on table data. :) |
|
cc @cloud-fan and @liancheng |
  case a @ CreateArray(children) if !haveSameType(children) =>
    val types = children.map(_.dataType)
-   findTightestCommonTypeAndPromoteToString(types) match {
+   findWiderCommonType(types) match {
Does Hive allow precision loss in this case?
In the current master, yes, it seems so. I fixed the example; the precision seems to be truncated:
hive> SELECT array(10000000000000000000.5BD, 1.00000000000000005123BD);
OK
[10000000000000000000.5,1.000000000000000051]
Time taken: 0.06 seconds, Fetched: 1 row(s)
and it seems
hive> SELECT array(10000000000000000000, 1.0000000000000005123BD);
OK
[1.0E19,1.0000000000000004]
Time taken: 0.061 seconds, Fetched: 1 row(s)
It becomes double when the types are different. I will look into the code more deeply and update you if you want. |
|
One problem with the decimal type in Spark SQL is that the wider type of two decimal types may be illegal (it can exceed the system limit), and then we have to truncate and suffer precision loss. This forces us to decide which functions can accept precision loss and which cannot. Unfortunately, this is not a common problem (e.g. MySQL and Postgres don't have it), so we don't have many similar systems to compare with and follow. MySQL's decimal type's max scale is half of the max precision, so the wider type of two decimal types in MySQL will never exceed the system limit. I think MySQL's design is a good one to follow. cc @rxin @marmbrus @yhuai what do you think? |
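To make the limit concrete, here is a minimal sketch of the widening arithmetic being discussed (plain Scala; the object and method names are made up for illustration, and only the 38-digit cap is taken from Spark's DecimalType):

object DecimalWideningSketch {
  // Spark caps decimal precision at 38 digits.
  val MaxPrecision = 38

  // The widened type keeps the larger integral part and the larger scale.
  def widen(p1: Int, s1: Int, p2: Int, s2: Int): (Int, Int) = {
    val scale = math.max(s1, s2)
    val integralDigits = math.max(p1 - s1, p2 - s2)
    (integralDigits + scale, scale)
  }

  def main(args: Array[String]): Unit = {
    // Widening decimal(21, 1) with decimal(21, 20) needs 20 + 20 = 40 digits,
    // which exceeds the 38-digit cap, so the result must be truncated and
    // precision is lost.
    val (precision, scale) = widen(21, 1, 21, 20)
    println(s"needs decimal($precision, $scale), cap is $MaxPrecision")
  }
}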
|
If we just want to fix these bugs, I think we should come up with a list of which functions (the ones that need arguments of the same type) can accept precision loss and which cannot. |
|
FYI, I had a look before. It's map, array, greatest and least. |
|
Test build #62958 has finished for PR 14389 at commit
|
|
I examined the usage of the various type coercion rules, and here is how they are used: |
|
I've pushed a new change to make it consistent for all the instances that I could find. I believe the new version has more consistent behavior across functions and is simpler to understand. |
|
@petermaxlee At least for |
  }

  /** Similar to [[findTightestCommonType]], but can promote all the way to StringType. */
  def findTightestCommonTypeToString(left: DataType, right: DataType): Option[DataType] = {
I inlined this into findWiderTypeForTwo
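For readers following along, here is a simplified model of what folding string promotion into a wider-type helper looks like (a self-contained sketch over a toy type hierarchy, not the code in this PR):

object WiderTypeSketch {
  sealed trait SqlType
  case object IntType extends SqlType
  case object DoubleType extends SqlType
  case class DecimalType(precision: Int, scale: Int) extends SqlType
  case object StringType extends SqlType

  // "Tightest" widening: only lossless numeric promotions.
  def findTightestCommonType(t1: SqlType, t2: SqlType): Option[SqlType] = (t1, t2) match {
    case (a, b) if a == b => Some(a)
    case (IntType, DoubleType) | (DoubleType, IntType) => Some(DoubleType)
    case _ => None
  }

  // "Wider" widening: additionally merges decimals of different precision/scale
  // (possibly losing precision) and, as a last resort, promotes to string.
  def findWiderTypeForTwo(t1: SqlType, t2: SqlType): Option[SqlType] = (t1, t2) match {
    case (DecimalType(p1, s1), DecimalType(p2, s2)) =>
      val scale = math.max(s1, s2)
      Some(DecimalType(math.max(p1 - s1, p2 - s2) + scale, scale))
    case (StringType, _) | (_, StringType) => Some(StringType)
    case _ => findTightestCommonType(t1, t2)
  }
}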
|
@HyukjinKwon FWIW I don't think it makes sense to make everything consistent except greatest/least. It also does not make sense to automatically cast from decimal to double type for these two functions. The reason I would want to use decimal is to make sure there is no loss of precision, and casting to double violates that. The decimal truncation problem described in SPARK-16646 seems orthogonal to this. It would be better if we didn't need to worry about truncation at all (e.g. with unlimited decimals, or with the max precision always double the max scale), but I don't think that should affect what type greatest/least use. |
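To make the precision argument concrete, here is a small demonstration of what is lost when a wide decimal value is forced into a double (plain Scala, independent of Spark, reusing the value from the Hive example above):

object DecimalToDoubleLoss {
  def main(args: Array[String]): Unit = {
    val d = BigDecimal("10000000000000000000.5")
    // A Double carries only ~15-17 significant decimal digits, so both the
    // fractional part and the exact integral value are lost in the cast.
    println(d)          // prints 10000000000000000000.5
    println(d.toDouble) // prints 1.0E19
  }
}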
|
Have you read this comment?
|
|
Test build #62999 has finished for PR 14389 at commit
|
|
Test build #63000 has finished for PR 14389 at commit
|
|
Test build #63001 has finished for PR 14389 at commit
|
  /**
   * Case 2 type widening (see the classdoc comment above for TypeCoercion).
   *
   * i.e. the main difference with [[findTightestCommonTypeOfTwo]] is that here we allow some
Also mention the string promotion here?
Done.
|
@petermaxlee, after your patch, |
|
and can you make sure that it's safe to use |
|
I don't believe the remaining use cases of findTightestCommonTypeOfTwo are necessary. That said, getting rid of those use cases would require more refactoring. I'm assuming we want to fix this issue in 2.0.1, so perhaps it is best to do the refactoring separately, only for the master branch. |
|
Test build #63007 has finished for PR 14389 at commit
|
|
And it's consistent to handle Nvl and Nvl2 the same way we handle other functions of similar kind (e.g. coalesce). |
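As an illustration of that consistency (a hypothetical session; the exact result types depend on the coercion rules in the final patch, so treat this as a sketch rather than documented behavior):

import org.apache.spark.sql.SparkSession

object NvlCoercionExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("nvl-coercion").getOrCreate()
    // With consistent widening, nvl, nvl2 and coalesce should all resolve an
    // int argument and a decimal argument to the same common wider type
    // instead of failing analysis.
    spark.sql("SELECT nvl(1, 2.5), nvl2(null, 1, 2.5), coalesce(1, 2.5)").printSchema()
    spark.stop()
  }
}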
|
I ran the Python tests locally and they passed. It looks like the failure was caused by the Jenkins shutdown. |
|
Test build #3196 has finished for PR 14389 at commit
|
|
Test build #63034 has finished for PR 14389 at commit
|
|
Currently we have 3 different semantics for finding a wider type:
It makes sense to have 2 different semantics for string promotion; however, I don't think we need 2 different semantics for decimal types. There is no function that needs its arguments to be of the same type but cannot accept precision loss for decimal types. I think it's better to run the query and log a warning message about the truncation rather than fail the query (sketched below). We only need 2 semantics:
We also need to add some checks before applying the type widening rules, to avoid conflicts. @petermaxlee, in your PR the string promotion semantic is hidden in |
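A minimal sketch of the "truncate and warn" idea mentioned above (the helper and cap value are illustrative, not Spark code):

import java.util.logging.Logger

object TruncateAndWarn {
  private val log = Logger.getLogger("TypeWidening")
  val MaxPrecision = 38

  // Widen two decimal types; if the exact result exceeds the system limit,
  // truncate to the cap and log a warning instead of failing the query.
  def widenDecimals(p1: Int, s1: Int, p2: Int, s2: Int): (Int, Int) = {
    val scale = math.max(s1, s2)
    val precision = math.max(p1 - s1, p2 - s2) + scale
    if (precision <= MaxPrecision) {
      (precision, scale)
    } else {
      val truncatedScale = math.max(0, scale - (precision - MaxPrecision))
      log.warning(s"decimal($precision, $scale) exceeds the limit; " +
        s"truncating to decimal($MaxPrecision, $truncatedScale)")
      (MaxPrecision, truncatedScale)
    }
  }
}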
|
This depends on what we want for greatest/least. These two expressions do support string type as input. |
|
for |
|
OK that makes sense! |
|
Hi @petermaxlee, are you going to update this? |
|
Which part do you want me to update? I thought you'd already committed the changes needed. Let me know and I will update this. |
|
To refactor the type widening rules, i.e. we only need 2 rules: |
|
Sorry that it has taken this long. I have submitted a work-in-progress pull request at #14696. Going to close this one and continue the work there, since it is a fairly different pull request.
What changes were proposed in this pull request?
This patch refactors type widening and makes its usage in expressions more consistent.
Before this patch, we have the following 6 functions for type widening (and their usage):
After this patch, we have only 3 functions for type widening (and their usage):
As a result, this patch changes the type coercion rule for the aforementioned functions so they can accept decimals of different precision/scale. This is not a regression from Spark 1.x, but it is a much bigger problem in Spark 2.0 because floating point literals are parsed as decimals. Queries below would fail in Spark 2.0:
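As an illustration (not necessarily the exact queries from the description), the affected class of query mixes decimal literals of different precision/scale inside array/map; a hedged sketch in Scala:

import org.apache.spark.sql.SparkSession

object DecimalWideningQueryExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("decimal-widening").getOrCreate()
    // 0.1 parses as decimal(1,1) and 0.01 as decimal(2,2); without widening
    // across different precision/scale, array() (and map()/struct()) cannot
    // find a common element type and the query fails analysis.
    spark.sql("SELECT array(0.1, 0.01), map(0.1, 1, 0.001, 2)").show()
    spark.stop()
  }
}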
How was this patch tested?
Created a new end-to-end test suite, SQLTypeCoercionSuite. In the future we can move all other type checking tests here. I first tried adding a test in SQLQuerySuite, but that suite was clearly already too large.