[SPARK-32877][SQL] Fix Hive UDF not support decimal type in complex type #29749
Conversation
children.map(_.dataType.toTypeInfo).asJava, methods.toSeq.asJava)
}
// if there exists many method, we choose the first
methods.head
Generally, a UDF has just one evaluate method.
Test build #128642 has finished for PR 29749 at commit
Test build #128674 has finished for PR 29749 at commit
Test build #128696 has finished for PR 29749 at commit
override def inputTypes: Seq[AbstractDataType] = {
  val inTypes = children.map(_.dataType)
  if (!inTypes.exists(_.existsRecursively(_.isInstanceOf[DecimalType]))) {
Why do we need special handling for decimal types?
It's a compatibility issue. Normally the data type is converted by the Hive ObjectInspector at runtime, but Hive does not accept a decimal input when the method requires a double. Unfortunately, the default type of the literal 1.1 differs between Spark and Hive (decimal vs. double), which causes this issue.
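For illustration, a minimal sketch (not part of this PR), assuming a Spark 3.x spark-shell session where spark is available:

// Not part of this PR: Spark parses the literal 1.1 as decimal(2,1), while
// Hive 1.2.x treats it as double, so a UDF method that requires double cannot
// accept Spark's decimal input at runtime.
spark.sql("SELECT typeof(1.1)").show()   // expected to show decimal(2,1)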
Ah, I see. Could you describe it in the PR description?
Yeah.
Sorry, I didn't get it. ImplicitCastInputTypes can implicitly cast decimal to double, can't it?
In the first commit I did this for all types, but the tests did not pass. The reason is that a UDF required an Object type.
We convert the Java Object type to NullType in HiveInspectors.javaTypeToDataType (it seems this could be changed?):
// Hive seems to return this for struct types?
case c: Class[_] if c == classOf[java.lang.Object] => NullType
So we can't reflect the UDF method using a NullType; the error message is:
in query: cannot resolve 'example_format('%o', 93)' due to data type mismatch: argument 2 requires array<null> type, however, '93' is of int type.; line 1 pos 7;
If a Hive UDF requires Object, I think it means AnyDataType. We should only special-case that.
How about this?
val expectTypes = method.getGenericParameterTypes.map(javaTypeToDataType)
if (expectTypes.exists(_.existsRecursively(_.isInstanceOf[NullType]))) {
  children.map(_.dataType)
} else {
  expectTypes
}
method.getGenericParameterTypes.map(javaTypeToDataType).map { dt =>
  if (dt.existsRecursively(_.isInstanceOf[NullType])) AnyDataType else dt
}
Seems we should check the data types and replace NullType with AnyDataType one by one, to avoid cases such as a Map<Double, Object> input.
I will also check whether we can change HiveInspectors.javaTypeToDataType to use AnyDataType directly.
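A rough sketch of that one-by-one check (an assumption, not the final change), reusing method, javaTypeToDataType, and children from the snippets above:

// Rough sketch, assumed rather than final: replace only a bare java.lang.Object
// parameter (mapped to NullType) with AnyDataType; for nested cases such as a
// Map<Double, Object> parameter, fall back to the corresponding child's type.
val expectTypes = method.getGenericParameterTypes.map(javaTypeToDataType)
expectTypes.zip(children.map(_.dataType)).map {
  case (NullType, _) => AnyDataType
  case (dt, childType) if dt.existsRecursively(_.isInstanceOf[NullType]) => childType
  case (dt, _) => dt
}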
also cc @cloud-fan @dongjoon-hyun, a similar issue to #13930
  with Logging
- with UserDefinedExpression {
+ with UserDefinedExpression
+ with ImplicitCastInputTypes {
ScalaUDF doesn't extend ImplicitCastInputTypes either. Does it have the same problem?
No, ImplicitTypeCasts already handles ScalaUDF via case udf: ScalaUDF if udf.inputTypes.nonEmpty =>
override def inputTypes: Seq[AbstractDataType] = {
  val inTypes = children.map(_.dataType)
I'm confused. The expected types should be defined by the function signature, not by the actual function inputs. What are we doing here?
As an example, ScalaUDF.inputTypes is derived from the function signature (captured by encoders).
ScalaUDF also implicitly converts input types to the expected types in ImplicitTypeCasts.
Let's say we have a UDF Array[Double] => Double = { data => data.sum } and we run spark.sql("select udf(array(1.0, 1.1, 1.2))"). Then ImplicitTypeCasts will cast array<decimal> to array<double>.
But a Hive UDF doesn't get that; currently we only use the Hive ObjectInspector to convert data types at runtime. That's the difference.
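For reference, a minimal sketch of the ScalaUDF case above (assuming a spark-shell session where spark is available):

// Minimal sketch, assuming a SparkSession named `spark`: the UDF's inputTypes
// come from the encoder, so ImplicitTypeCasts casts array<decimal(2,1)> to
// array<double> at analysis time before the UDF runs.
spark.udf.register("sum_array", (data: Seq[Double]) => data.sum)
spark.sql("SELECT sum_array(array(1.0, 1.1, 1.2))").show()   // sums to roughly 3.3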
Ah, I misread the code. We get the children's data types only to skip the decimal case.
Test build #128879 has finished for PR 29749 at commit
My last concern is to check
Seems there is something wrong,
After checking the Hive code, it's a Hive version issue. In Hive 2.3.x this problem does not happen, since HIVE-13380 was merged; what I said was based on Hive 1.2.x. Another thing is that we cannot implicitly cast the input types, because we need to use the children's data types to reflect the UDF method first, and we get an exception if that fails. So we have no chance to go in. Sorry for the mistake; it seems there is no need to fix the issue.
Thanks for the investigation! You can probably open a new PR to add comments for
thanks @cloud-fan @maropu
Hi, @ulysses-you and @cloud-fan and @maropu?
}
}

test("SPARK-32877: Fix Hive UDF not support decimal type in complex type") {
If this is not covered by any other test case, it looks worth having.
I'm wondering if this test still fails with '-Phive1.2'.
@dongjoon-hyun I will open a new PR that only adds the test and some comments.
What changes were proposed in this pull request?
This PR aims to support Hive UDFs when the input type is not the expected type. We make HiveSimpleUDF extend ImplicitCastInputTypes, then use the Analyzer to implicitly cast the input data types.
Why are the changes needed?
It's a compatibility issue. Normally the data type is converted by the Hive ObjectInspector at runtime, but Hive does not accept a decimal input when the method requires a double. Unfortunately, the default type of the literal 1.1 differs between Spark and Hive (decimal vs. double), which causes this issue.
Before this PR, we failed on code like the following.
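The original failing snippet is not shown here; below is a hypothetical reproduction (the UDF class and names are assumptions, not the actual test UDF):

// Hypothetical reproduction (class name is an assumption): a Hive UDF whose
// evaluate method takes List<Double> fails before this PR when Spark passes
// array<decimal(2,1)>, because the Hive 1.2.x ObjectInspector cannot convert
// the decimal elements inside the complex type to double at runtime.
spark.sql("CREATE TEMPORARY FUNCTION array_sum AS 'com.example.hive.udf.ArraySumUDF'")
spark.sql("SELECT array_sum(array(1.0, 1.1, 1.2))").show()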
Does this PR introduce any user-facing change?
Yes, users can now get the correct result.
How was this patch tested?
Added a test.