
Conversation

@petermaxlee (Contributor) commented Jul 28, 2016

What changes were proposed in this pull request?

This patch refactors type widening and makes its usage in expressions more consistent.

Before this patch, we have the following 6 functions for type widening (and their usage):

findTightestCommonTypeOfTwo (binary version)
- BinaryOperator
- IfNull
- NullIf
- Nvl2
- JSON schema inference

findTightestCommonTypeToString (binary version)
- Nvl

findTightestCommonTypeAndPromoteToString (n-ary version)
- CreateArray
- CreateMap

findTightestCommonType (n-ary version)
- Greatest
- Least

findWiderTypeForTwo (binary version)
- IfCoercion

findWiderCommonType (n-ary version)
- WidenSetOperationTypes
- InConversion
- Coalesce
- CaseWhenCoercion

After this patch, we have only 3 functions for type widening (and their usage):

findTightestCommonTypeOfTwo (binary version)
- BinaryOperator
- JSON schema inference

findWiderTypeForTwo (binary version)
- IfCoercion
- Nvl
- IfNull
- NullIf
- Nvl2

findWiderCommonType (n-ary version)
- WidenSetOperationTypes
- InConversion
- Coalesce
- CaseWhenCoercion
- Greatest
- Least
- CreateArray
- CreateMap

As a result, this patch changes the type coercion rules for the aforementioned functions so they can accept decimals with different precision/scale. This is not a regression from Spark 1.x, but it is a much bigger problem in Spark 2.0 because floating point literals are parsed as decimals. For example, the following query fails in Spark 2.0:

scala> sql("select array(0.001, 0.02)")
org.apache.spark.sql.AnalysisException: cannot resolve `array(CAST(0.001 AS DECIMAL(3,3)), CAST(0.02 AS DECIMAL(2,2)))` due to data type mismatch: input to function array should all be the same type, but it's [decimal(3,3), decimal(2,2)]; line 1 pos 7
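For reference, a minimal sketch of the decimal widening this relies on, assuming the usual rule of keeping the larger integral part and the larger scale; this is illustrative Scala only, not the actual DecimalPrecision code, and the names Dec and widerDecimal are hypothetical:

case class Dec(precision: Int, scale: Int)

def widerDecimal(a: Dec, b: Dec): Dec = {
  // keep the larger fractional part and enough integral digits for both sides
  val scale = math.max(a.scale, b.scale)
  val integral = math.max(a.precision - a.scale, b.precision - b.scale)
  Dec(integral + scale, scale)
}

// widerDecimal(Dec(3, 3), Dec(2, 2)) == Dec(3, 3): both 0.001 and 0.02 fit in
// decimal(3,3), so array(0.001, 0.02) can be resolved without losing values.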

How was this patch tested?

Created a new end-to-end test suite, SQLTypeCoercionSuite. In the future we can move all other type checking tests there. I first tried adding a test to SQLQuerySuite, but that suite was clearly already too large.

@petermaxlee (author)

This should resolve the following two pull requests as well:

#14353
#14374

@petermaxlee petermaxlee changed the title [SPARK-16714][SQL] map, struct function should accept decimals with different precision/scale [SPARK-16714][SQL] map, array function should accept decimals with different precision/scale Jul 28, 2016
@petermaxlee (author)

I was looking at the code, and I think this is a more general problem with decimal widening. The same problem exists for least and other functions.

scala> sql("select least(0.1, 0.01)").collect()
org.apache.spark.sql.AnalysisException: cannot resolve 'least(CAST(0.1 AS DECIMAL(1,1)), CAST(0.01 AS DECIMAL(2,2)))' due to data type mismatch: The expressions should all have the same type, got LEAST (ArrayBuffer(DecimalType(1,1), DecimalType(2,2))).; line 1 pos 7

@petermaxlee (author)

@dongjoon-hyun You only had one test case, didn't you? I don't think that test case is useful, since it was specifically testing checkInputDataTypes, which was not the right thing to test. Type coercion should be handled by the analyzer, not by the expression's type checking.

*/
class SQLTypeCoercionSuite extends QueryTest with SharedSQLContext {

test("SPARK-16714 decimal in map and struct") {
@petermaxlee (author) commented on the diff above:

I made a mistake with the naming here; I will fix it later.

@HyukjinKwon (Member) commented Jul 28, 2016

Yea, for least and greatest, I opened this here: #14294. Actually, I am worried whether allowing loss of precision and fractions is okay.

I first thought this should only allow widening within a range that does not lose any values, but it seems some think the values should just be truncated, and Hive does this by always falling back to double.

Please refer https://issues.apache.org/jira/browse/SPARK-16646.

FYI, the other functions look okay. There seem to be no more cases similar to this one.

@dongjoon-hyun (Member)

@petermaxlee Yep, I deleted my request, but you had better add a test case with real columns on table data. :)

@HyukjinKwon (Member)

cc @cloud-fan and @liancheng

  case a @ CreateArray(children) if !haveSameType(children) =>
    val types = children.map(_.dataType)
-     findTightestCommonTypeAndPromoteToString(types) match {
+     findWiderCommonType(types) match {
@cloud-fan (Contributor) commented Jul 28, 2016

Does Hive allow precision loss in this case?

@HyukjinKwon (Member) commented Jul 28, 2016

In the current master, yes, it seems so. I fixed the example; it seems the precision is being truncated:

hive> SELECT array(10000000000000000000.5BD, 1.00000000000000005123BD);
OK
[10000000000000000000.5,1.000000000000000051]
Time taken: 0.06 seconds, Fetched: 1 row(s)

and it seems

hive> SELECT array(10000000000000000000, 1.0000000000000005123BD);
OK
[1.0E19,1.0000000000000004]
Time taken: 0.061 seconds, Fetched: 1 row(s)

it becomes double when the types are different. I will look into the code more deeply and update you if you want.

@cloud-fan (Contributor)

One problem with the decimal type in Spark SQL is that the wider type of two decimal types may be illegal (it can exceed the system limitation), in which case we have to truncate and suffer precision loss. This forces us to decide which functions can accept precision loss and which cannot.

Unfortunately, this is not a common problem (e.g. MySQL and Postgres don't have it), so we don't have many similar systems to compare against and follow.

  • MySQL's decimal type has a max scale that is half of its max precision, so the wider type of two decimal types in MySQL will never exceed the system limitation.
  • Postgres has a kind of unlimited decimal type, so it doesn't have this problem at all.

I think MySQL's design is a good one to follow, cc @rxin @marmbrus @yhuai what do you think?

@cloud-fan (Contributor)

If we want to just fix these bugs, I think we should come up with a list of which functions (those that need arguments of the same type) can accept precision loss and which cannot.

@HyukjinKwon (Member)

FYI, I had a look before. It's map, array, greatest, and least.
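For concreteness, minimal spark-shell reproductions for those four functions; only the array and least queries appear verbatim earlier in this thread, so the map and greatest variants are assumed analogues:

// hypothetical reproductions for the four functions mentioned above
sql("select array(0.001, 0.02)")
sql("select map(0.001, 'a', 0.02, 'b')")
sql("select greatest(0.1, 0.01)")
sql("select least(0.1, 0.01)")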

@SparkQA commented Jul 28, 2016

Test build #62958 has finished for PR 14389 at commit 9774605.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@petermaxlee (author)

I examined the usage of the various type coercion rules, and here is how they are used:


findTightestCommonTypeOfTwo

- BinaryOperator
- IfNull
- NullIf
- Nvl2
- JSON schema inference

findTightestCommonTypeToString

- Nvl

findTightestCommonTypeAndPromoteToString

- CreateArray
- CreateMap

findTightestCommonType

- Greatest
- Least

findWiderTypeForTwo

- IfCoercion

findWiderCommonType

- WidenSetOperationTypes
- InConversion
- Coalesce
- CaseWhenCoercion

@petermaxlee petermaxlee changed the title [SPARK-16714][SQL] map, array function should accept decimals with different precision/scale [SPARK-16714][SQL] Make function type coercion more consistent Jul 29, 2016
@petermaxlee (author)

I've pushed a new change to make this consistent for all the instances I could find. I believe the new behavior is more consistent across functions and simpler to understand.

@HyukjinKwon (Member) commented Jul 29, 2016

@petermaxlee At least for least and greatest, please refer to https://issues.apache.org/jira/browse/SPARK-16646. It seems we should hold it for now.

@petermaxlee petermaxlee changed the title [SPARK-16714][SQL] Make function type coercion more consistent [SPARK-16714][SQL] Make function type widening more consistent Jul 29, 2016
@petermaxlee petermaxlee changed the title [SPARK-16714][SQL] Make function type widening more consistent [SPARK-16714][SQL] Refactor function type widening to make them more consistent Jul 29, 2016
@petermaxlee petermaxlee changed the title [SPARK-16714][SQL] Refactor function type widening to make them more consistent [SPARK-16714][SQL] Refactor type widening for consistency Jul 29, 2016
}

/** Similar to [[findTightestCommonType]], but can promote all the way to StringType. */
def findTightestCommonTypeToString(left: DataType, right: DataType): Option[DataType] = {
@petermaxlee (author) commented on the diff above:

I inlined this into findWiderTypeForTwo.

@petermaxlee (author) commented Jul 29, 2016

@HyukjinKwon FWIW, I don't think it makes sense to make everything consistent except greatest/least. It also does not make sense to automatically cast from decimal to double for these two functions: the reason I would want to use decimal is to make sure there is no loss of precision, and casting to double violates that.

The decimal truncation problem described in SPARK-16646 seems orthogonal to this. It would be better if we didn't need to worry about truncation (e.g. with an unlimited decimal type, or with precision always double the size of scale), but I don't think that should affect what type greatest/least use.

@HyukjinKwon (Member)

Have you read this comment?

We are discussing this internally, can you hold it for a while? We may decide to increase the max precision to 76 and keep max scale as 38, then we don't have this problem.
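For what it's worth, a quick hypothetical sanity check of that idea in Scala, assuming the usual widening rule precision = max(p1 - s1, p2 - s2) + max(s1, s2):

// if every input decimal satisfies precision <= 38 and scale <= precision,
// the widened precision is bounded by 38 integral digits + 38 scale digits = 76
val widened = for {
  p1 <- 1 to 38; s1 <- 0 to p1
  p2 <- 1 to 38; s2 <- 0 to p2
} yield math.max(p1 - s1, p2 - s2) + math.max(s1, s2)
assert(widened.max == 76)  // never exceeds the proposed new max precision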

@SparkQA commented Jul 29, 2016

Test build #62999 has finished for PR 14389 at commit afca003.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jul 29, 2016

Test build #63000 has finished for PR 14389 at commit 929c39f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jul 29, 2016

Test build #63001 has finished for PR 14389 at commit 071b01d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

/**
 * Case 2 type widening (see the classdoc comment above for TypeCoercion).
 *
 * i.e. the main difference with [[findTightestCommonTypeOfTwo]] is that here we allow some
A reviewer (Contributor) commented on the diff above:

also mention the string promotion here?

@petermaxlee (author)

done

@cloud-fan (Contributor) commented Jul 29, 2016

@petermaxlee, after your patch, findTightestCommonTypeOfTwo is only used by BinaryOperator and JSON schema inference. I've checked all the BinaryOperator implementations: except for ones like And and BitwiseOr that don't accept decimal types, all other implementations are handled in DecimalPrecision. This means no expression needs the semantics of findTightestCommonTypeOfTwo, i.e. finding the tightest common type of two types without precision loss. Can you check JSON schema inference? If it doesn't need these semantics either, I think we can safely remove it and only use findWiderTypeForTwo.

@cloud-fan (Contributor)

and can you make sure that it's safe to use findWiderTypeForTwo for Nvl and Nvl2?

@petermaxlee (author)

I don't believe the remaining use cases of findTightestCommonTypeOfTwo are necessary. That said, getting rid of them would require more refactoring. I'm assuming we want to fix this issue in 2.0.1, so perhaps it is best to do that refactoring separately, only on the master branch.

@SparkQA commented Jul 29, 2016

Test build #63007 has finished for PR 14389 at commit b9f94fe.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@petermaxlee (author)

And it's consistent to handle Nvl and Nvl2 the same way we handle other functions of a similar kind (e.g. coalesce).

@petermaxlee (author)

I ran the Python tests locally and they passed. It looks like the failure was caused by the Jenkins shutdown.

@SparkQA commented Jul 29, 2016

Test build #3196 has finished for PR 14389 at commit b9f94fe.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jul 30, 2016

Test build #63034 has finished for PR 14389 at commit ffd1734.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor)

Currently we have 3 different semantics for finding a wider type:

  1. findTightestCommonType: try to find a wider type, but skip decimal types
  2. findTightestCommonTypeToString: similar to rule 1, but if one side is a string, promote the other side to string
  3. findWiderCommonType: similar to rule 2, but handles decimal types and truncates them if necessary

It makes sense to have two different semantics for string promotion; however, I don't think we need two different semantics for decimal types. There is no function that needs its arguments to be the same type but cannot accept precision loss for decimals. I think it's better to run the query and log a warning message about the truncation rather than fail the query.

We only need 2 semantics:

  1. findWiderType: try to find the wider type, including decimal type, and truncate if necessary
  2. findWiderTypeAndPromoteToString: similar to rule 1, but handles string promotion.

We also need to add some checks before applying the type widening rules, to avoid conflicting with DecimalPrecision, which defines some special rules for binary arithmetic on decimal types.

@petermaxlee in your PR, the string promotion semantics are hidden in findWiderType, which makes greatest and least accept string promotion; that is not expected. What do you think?
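To make the proposed split concrete, here is a minimal, self-contained Scala sketch of the two semantics described above; the shape is assumed, and the type names (SimpleType, DecimalT, WidenSketch, etc.) are hypothetical rather than Spark's actual classes:

sealed trait SimpleType
case object IntT extends SimpleType
case object DoubleT extends SimpleType
case object StringT extends SimpleType
case class DecimalT(precision: Int, scale: Int) extends SimpleType

object WidenSketch {
  // lossless widening only (no decimals, no string promotion)
  private def tightest(a: SimpleType, b: SimpleType): Option[SimpleType] = (a, b) match {
    case (x, y) if x == y                  => Some(x)
    case (IntT, DoubleT) | (DoubleT, IntT) => Some(DoubleT)
    case _                                 => None
  }

  // semantic 1: widen, including decimals (capping/truncation omitted here),
  // but never promote to string -- e.g. what greatest/least would use
  def findWiderType(a: SimpleType, b: SimpleType): Option[SimpleType] =
    tightest(a, b).orElse((a, b) match {
      case (DecimalT(p1, s1), DecimalT(p2, s2)) =>
        val s = math.max(s1, s2)
        Some(DecimalT(math.max(p1 - s1, p2 - s2) + s, s))
      case _ => None
    })

  // semantic 2: same as above, plus string promotion as a last resort --
  // e.g. what array/map/coalesce-style functions would use
  def findWiderTypeAndPromoteToString(a: SimpleType, b: SimpleType): Option[SimpleType] =
    findWiderType(a, b).orElse((a, b) match {
      case (StringT, _) | (_, StringT) => Some(StringT)
      case _                           => None
    })
}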

@petermaxlee (author)

This depends on what we want for greatest/least. These two expressions do support string type as input.

@cloud-fan (Contributor)

For greatest(1, '2'), Hive, MySQL, and Postgres will turn the string into a double instead of promoting the integer to a string. This is the same behaviour as most of the arithmetic expressions (e.g. Add, Minus, Divide). I think it makes sense, and Spark SQL should follow it.

@petermaxlee (author)

OK that makes sense!

@cloud-fan (Contributor)

Hi @petermaxlee, are you going to update this?

@petermaxlee (author)

Which part do you want me to update? I thought you'd already committed the changes needed. Let me know and I will update this.

@cloud-fan (Contributor)

To refactor the type widening rules, i.e. we only need two rules: findWiderType and findWiderTypeAndPromoteToString.

@petermaxlee (author)

Sorry that it has taken this long. I have submitted a work-in-progress pull request at #14696.

Going to close this one and continue the work there, since it is a fairly different pull request.
