
Conversation

@petermaxlee (Contributor) commented Jul 28, 2016

What changes were proposed in this pull request?

This patch refactors type widening and makes its usage in expressions more consistent.

Before this patch, we have the following 6 functions for type widening (and their usage):

findTightestCommonTypeOfTwo (binary version)
- BinaryOperator
- IfNull
- NullIf
- Nvl2
- JSON schema inference

findTightestCommonTypeToString (binary version)
- Nvl

findTightestCommonTypeAndPromoteToString (n-ary version)
- CreateArray
- CreateMap

findTightestCommonType (n-ary version)
- Greatest
- Least

findWiderTypeForTwo (binary version)
- IfCoercion

findWiderCommonType (n-ary version)
- WidenSetOperationTypes
- InConversion
- Coalesce
- CaseWhenCoercion

After this patch, we have only 3 functions for type widening (and their usage):

findTightestCommonTypeOfTwo (binary version)
- BinaryOperator
- JSON schema inference

findWiderTypeForTwo (binary version)
- IfCoercion
- Nvl
- IfNull
- NullIf
- Nvl2

findWiderCommonType (n-ary version)
- WidenSetOperationTypes
- InConversion
- Coalesce
- CaseWhenCoercion
- Greatest
- Least
- CreateArray
- CreateMap

As a result, this patch changes the type coercion rules for the aforementioned functions so they can accept decimals with different precision/scale. This is not a regression from Spark 1.x, but it is a much bigger problem in Spark 2.0 because floating point literals are parsed as decimals. For example, the following query fails in Spark 2.0:

scala> sql("select array(0.001, 0.02)")
org.apache.spark.sql.AnalysisException: cannot resolve `array(CAST(0.001 AS DECIMAL(3,3)), CAST(0.02 AS DECIMAL(2,2)))` due to data type mismatch: input to function array should all be the same type, but it's [decimal(3,3), decimal(2,2)]; line 1 pos 7
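For reference, a minimal sketch of the decimal widening this relies on, assuming the usual rule of keeping the larger integral part and the larger scale; this is illustrative Scala only, not the actual DecimalPrecision code, and the names Dec and widerDecimal are hypothetical:

case class Dec(precision: Int, scale: Int)

def widerDecimal(a: Dec, b: Dec): Dec = {
  // keep the larger fractional part and enough integral digits for both sides
  val scale = math.max(a.scale, b.scale)
  val integral = math.max(a.precision - a.scale, b.precision - b.scale)
  Dec(integral + scale, scale)
}

// widerDecimal(Dec(3, 3), Dec(2, 2)) == Dec(3, 3): both 0.001 and 0.02 fit in
// decimal(3,3), so array(0.001, 0.02) can be resolved without losing values.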

How was this patch tested?

Created a new end-to-end test suite, SQLTypeCoercionSuite. In the future we can move all other type checking tests there. I first tried adding a test to SQLQuerySuite, but that suite was clearly already too large.

@petermaxlee (author)

This should resolve the following two pull requests as well:

#14353
#14374

@petermaxlee petermaxlee changed the title [SPARK-16714][SQL] map, struct function should accept decimals with different precision/scale [SPARK-16714][SQL] map, array function should accept decimals with different precision/scale Jul 28, 2016
@petermaxlee (author)

I was looking at the code, and I think this is a more general problem with decimal widening. The same problem exists for least and other functions.

scala> sql("select least(0.1, 0.01)").collect()
org.apache.spark.sql.AnalysisException: cannot resolve 'least(CAST(0.1 AS DECIMAL(1,1)), CAST(0.01 AS DECIMAL(2,2)))' due to data type mismatch: The expressions should all have the same type, got LEAST (ArrayBuffer(DecimalType(1,1), DecimalType(2,2))).; line 1 pos 7

@petermaxlee (author)

@dongjoon-hyun You only had one test case, didn't you? I don't think that test case is useful, since it was specifically testing checkInputDataTypes, which was not the right thing to test. Type coercion should be handled by the analyzer, not by the expression's type checking.

*/
class SQLTypeCoercionSuite extends QueryTest with SharedSQLContext {

test("SPARK-16714 decimal in map and struct") {
@petermaxlee (author) commented on the diff above:

I made a mistake with the naming here; I will fix it later.

@HyukjinKwon (Member) commented Jul 28, 2016

Yea, for least and greatest, I opened this here: #14294. Actually, I am worried whether allowing loss of precision and fractions is okay.

I first thought this should only allow widening within a range that does not lose any values, but it seems some think the values should just be truncated, and Hive does this by always falling back to double.

Please refer https://issues.apache.org/jira/browse/SPARK-16646.

FYI, the other functions look okay. There seem to be no more cases similar to this one.

@dongjoon-hyun (Member)

@petermaxlee Yep, I deleted my request, but you had better add a test case with real columns on table data. :)

@HyukjinKwon (Member)

cc @cloud-fan and @liancheng

  case a @ CreateArray(children) if !haveSameType(children) =>
    val types = children.map(_.dataType)
-     findTightestCommonTypeAndPromoteToString(types) match {
+     findWiderCommonType(types) match {
@cloud-fan (Contributor) commented Jul 28, 2016

Does Hive allow precision loss in this case?

@HyukjinKwon (Member) commented Jul 28, 2016

In the current master, yes, it seems so. I fixed the example; it seems the precision is being truncated:

hive> SELECT array(10000000000000000000.5BD, 1.00000000000000005123BD);
OK
[10000000000000000000.5,1.000000000000000051]
Time taken: 0.06 seconds, Fetched: 1 row(s)

and it seems

hive> SELECT array(10000000000000000000, 1.0000000000000005123BD);
OK
[1.0E19,1.0000000000000004]
Time taken: 0.061 seconds, Fetched: 1 row(s)

it becomes double when the types are different. I will look into the code more deeply and update you if you want.

@cloud-fan (Contributor)

One problem with the decimal type in Spark SQL is that the wider type of two decimal types may be illegal (it can exceed the system limitation), in which case we have to truncate and suffer precision loss. This forces us to decide which functions can accept precision loss and which cannot.

Unfortunately, this is not a common problem (e.g. MySQL and Postgres don't have it), so we don't have many similar systems to compare against and follow.

  • MySQL's decimal type has a max scale that is half of its max precision, so the wider type of two decimal types in MySQL will never exceed the system limitation.
  • Postgres has a kind of unlimited decimal type, so it doesn't have this problem at all.

I think MySQL's design is a good one to follow, cc @rxin @marmbrus @yhuai what do you think?

@cloud-fan (Contributor)

If we want to just fix these bugs, I think we should come up with a list of which functions (those that need arguments of the same type) can accept precision loss and which cannot.

@HyukjinKwon (Member)

FYI, I had a look before. It's map, array, greatest, and least.
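For concreteness, minimal spark-shell reproductions for those four functions; only the array and least queries appear verbatim earlier in this thread, so the map and greatest variants are assumed analogues:

// hypothetical reproductions for the four functions mentioned above
sql("select array(0.001, 0.02)")
sql("select map(0.001, 'a', 0.02, 'b')")
sql("select greatest(0.1, 0.01)")
sql("select least(0.1, 0.01)")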

@SparkQA commented Jul 28, 2016

Test build #62958 has finished for PR 14389 at commit 9774605.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@petermaxlee (author)

I examined the usage of the various type coercion rules, and here is how they are used:


findTightestCommonTypeOfTwo

- BinaryOperator
- IfNull
- NullIf
- Nvl2
- JSON schema inference

findTightestCommonTypeToString

- Nvl

findTightestCommonTypeAndPromoteToString

- CreateArray
- CreateMap

findTightestCommonType

- Greatest
- Least

findWiderTypeForTwo

- IfCoercion

findWiderCommonType

- WidenSetOperationTypes
- InConversion
- Coalesce
- CaseWhenCoercion

@petermaxlee petermaxlee changed the title [SPARK-16714][SQL] map, array function should accept decimals with different precision/scale [SPARK-16714][SQL] Make function type coercion more consistent Jul 29, 2016
@petermaxlee (author)

I've pushed a new change to make this consistent for all the instances I could find. I believe the new behavior is more consistent across functions and simpler to understand.

@HyukjinKwon (Member) commented Jul 29, 2016

@petermaxlee At least for least and greatest, please refer to https://issues.apache.org/jira/browse/SPARK-16646. It seems we should hold it for now.

@petermaxlee petermaxlee changed the title [SPARK-16714][SQL] Make function type coercion more consistent [SPARK-16714][SQL] Make function type widening more consistent Jul 29, 2016
@petermaxlee petermaxlee changed the title [SPARK-16714][SQL] Make function type widening more consistent [SPARK-16714][SQL] Refactor function type widening to make them more consistent Jul 29, 2016
@petermaxlee petermaxlee changed the title [SPARK-16714][SQL] Refactor function type widening to make them more consistent [SPARK-16714][SQL] Refactor type widening for consistency Jul 29, 2016
}

/** Similar to [[findTightestCommonType]], but can promote all the way to StringType. */
def findTightestCommonTypeToString(left: DataType, right: DataType): Option[DataType] = {
@petermaxlee (author) commented on the diff above:

I inlined this into findWiderTypeForTwo.

@petermaxlee (author) commented Jul 29, 2016

@HyukjinKwon FWIW, I don't think it makes sense to make everything consistent except greatest/least. It also does not make sense to automatically cast from decimal to double for these two functions: the reason I would want to use decimal is to make sure there is no loss of precision, and casting to double violates that.

The decimal truncation problem described in SPARK-16646 seems orthogonal to this. It would be better if we didn't need to worry about truncation (e.g. with an unlimited decimal type, or with precision always double the size of scale), but I don't think that should affect what type greatest/least use.

@HyukjinKwon (Member)

Have you read this comment?

We are discussing this internally, can you hold it for a while? We may decide to increase the max precision to 76 and keep max scale as 38, then we don't have this problem.
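For what it's worth, a quick hypothetical sanity check of that idea in Scala, assuming the usual widening rule precision = max(p1 - s1, p2 - s2) + max(s1, s2):

// if every input decimal satisfies precision <= 38 and scale <= precision,
// the widened precision is bounded by 38 integral digits + 38 scale digits = 76
val widened = for {
  p1 <- 1 to 38; s1 <- 0 to p1
  p2 <- 1 to 38; s2 <- 0 to p2
} yield math.max(p1 - s1, p2 - s2) + math.max(s1, s2)
assert(widened.max == 76)  // never exceeds the proposed new max precision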

@SparkQA commented Jul 29, 2016

Test build #62999 has finished for PR 14389 at commit afca003.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jul 29, 2016

Test build #63000 has finished for PR 14389 at commit 929c39f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jul 29, 2016

Test build #63001 has finished for PR 14389 at commit 071b01d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

/**
 * Case 2 type widening (see the classdoc comment above for TypeCoercion).
 *
 * i.e. the main difference with [[findTightestCommonTypeOfTwo]] is that here we allow some
A reviewer (Contributor) commented on the diff above:

also mention the string promotion here?

@petermaxlee (author)

done

@cloud-fan (Contributor) commented Jul 29, 2016

@petermaxlee, after your patch, findTightestCommonTypeOfTwo is only used by BinaryOperator and JSON schema inference. I've checked all the BinaryOperator implementations: except for ones like And and BitwiseOr that don't accept decimal types, all other implementations are handled in DecimalPrecision. This means no expression needs the semantics of findTightestCommonTypeOfTwo, i.e. finding the tightest common type of two types without precision loss. Can you check JSON schema inference? If it doesn't need these semantics either, I think we can safely remove it and only use findWiderTypeForTwo.

@cloud-fan (Contributor)

and can you make sure that it's safe to use findWiderTypeForTwo for Nvl and Nvl2?

@petermaxlee (author)

I don't believe the remaining use cases of findTightestCommonTypeOfTwo are necessary. That said, getting rid of them would require more refactoring. I'm assuming we want to fix this issue in 2.0.1, so perhaps it is best to do that refactoring separately, only on the master branch.

@SparkQA commented Jul 29, 2016

Test build #63007 has finished for PR 14389 at commit b9f94fe.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@petermaxlee (author)

And it's consistent to handle Nvl and Nvl2 the same way we handle other functions of a similar kind (e.g. coalesce).

@petermaxlee (author)

I ran the Python tests locally and they passed. It looks like the failure was caused by the Jenkins shutdown.

@SparkQA commented Jul 29, 2016

Test build #3196 has finished for PR 14389 at commit b9f94fe.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jul 30, 2016

Test build #63034 has finished for PR 14389 at commit ffd1734.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor)

Currently we have 3 different semantics for finding a wider type:

  1. findTightestCommonType: try to find a wider type, but skip decimal types
  2. findTightestCommonTypeToString: similar to rule 1, but if one side is a string, promote the other side to string
  3. findWiderCommonType: similar to rule 2, but handles decimal types and truncates them if necessary

It makes sense to have two different semantics for string promotion; however, I don't think we need two different semantics for decimal types. There is no function that needs its arguments to be the same type but cannot accept precision loss for decimals. I think it's better to run the query and log a warning message about the truncation rather than fail the query.

We only need 2 semantics:

  1. findWiderType: try to find the wider type, including decimal type, and truncate if necessary
  2. findWiderTypeAndPromoteToString: similar to rule 1, but handles string promotion.

We also need to add some checks before applying the type widening rules, to avoid conflicting with DecimalPrecision, which defines some special rules for binary arithmetic on decimal types.

@petermaxlee in your PR, the string promotion semantics are hidden in findWiderType, which makes greatest and least accept string promotion; that is not expected. What do you think?
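To make the proposed split concrete, here is a minimal, self-contained Scala sketch of the two semantics described above; the shape is assumed, and the type names (SimpleType, DecimalT, WidenSketch, etc.) are hypothetical rather than Spark's actual classes:

sealed trait SimpleType
case object IntT extends SimpleType
case object DoubleT extends SimpleType
case object StringT extends SimpleType
case class DecimalT(precision: Int, scale: Int) extends SimpleType

object WidenSketch {
  // lossless widening only (no decimals, no string promotion)
  private def tightest(a: SimpleType, b: SimpleType): Option[SimpleType] = (a, b) match {
    case (x, y) if x == y                  => Some(x)
    case (IntT, DoubleT) | (DoubleT, IntT) => Some(DoubleT)
    case _                                 => None
  }

  // semantic 1: widen, including decimals (capping/truncation omitted here),
  // but never promote to string -- e.g. what greatest/least would use
  def findWiderType(a: SimpleType, b: SimpleType): Option[SimpleType] =
    tightest(a, b).orElse((a, b) match {
      case (DecimalT(p1, s1), DecimalT(p2, s2)) =>
        val s = math.max(s1, s2)
        Some(DecimalT(math.max(p1 - s1, p2 - s2) + s, s))
      case _ => None
    })

  // semantic 2: same as above, plus string promotion as a last resort --
  // e.g. what array/map/coalesce-style functions would use
  def findWiderTypeAndPromoteToString(a: SimpleType, b: SimpleType): Option[SimpleType] =
    findWiderType(a, b).orElse((a, b) match {
      case (StringT, _) | (_, StringT) => Some(StringT)
      case _                           => None
    })
}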

@petermaxlee (author)

This depends on what we want for greatest/least. These two expressions do support string type as input.

@cloud-fan (Contributor)

For greatest(1, '2'), Hive, MySQL, and Postgres will turn the string into a double instead of promoting the integer to a string. This is the same behaviour as most of the arithmetic expressions (e.g. Add, Minus, Divide). I think it makes sense, and Spark SQL should follow it.

@petermaxlee (author)

OK that makes sense!

@cloud-fan (Contributor)

Hi @petermaxlee, are you going to update this?

@petermaxlee (author)

Which part do you want me to update? I thought you'd already committed the changes needed. Let me know and I will update this.

@cloud-fan (Contributor)

To refactor the type widening rules, i.e. we only need two rules: findWiderType and findWiderTypeAndPromoteToString.

@petermaxlee (author)

Sorry that it has taken this long. I have submitted a work-in-progress pull request at #14696.

Going to close this one and continue the work there, since it is a fairly different pull request.
