[SPARK-16228][SQL] HiveSessionCatalog should return double-param functions for decimal param lookups
#13930
Conversation
Shouldn't this be solved in the ExternalCatalog? This seems way too implementation-specific to put in the Analyzer.
That sounds better. Thank you, @hvanhovell .
I'll move that.
You mean HiveSessionCatalog, right?
I am not sure. That seems like a good place though.
I am also curious why we use the function arguments (and not just the names) to resolve the function; this kinda robs us of the opportunity to cast the arguments.
Hi, @hvanhovell .
(changed the PR title from "…double-type parameter only" to "…double-param functions for decimal param lookups")
Test build #61323 has finished for PR 13930 at commit
Test build #61327 has finished for PR 13930 at commit
Hi, @hvanhovell .
subLookupFunction doesn't seem like a good name. Is there a suitable naming convention in Spark?
lookupFunction0
ping @hvanhovell
Shouldn't we just implement ExpectsInputTypes for hive udf/udaf/udtf? Then the analyzer will insert the appropriate casts.
Thank you for the advice, @hvanhovell .
Do you mean adding ExpectsInputTypes to HiveSimpleUDF, HiveGenericUDF, and HiveUDAFFunction?
We only have four expressions to handle all generic Hive functions, so currently makeFunctionBuilder seems to do its type checking by calling udf.dataType on the fly.
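For context, the idea behind `ExpectsInputTypes` can be sketched with a small self-contained model (all types and names below are illustrative stand-ins, not Spark's actual classes): an expression declares the input types it expects, and an analyzer-style rule wraps mismatched arguments in casts.

```scala
// Illustrative toy model of the ExpectsInputTypes idea; these are NOT
// Spark's actual classes, just stand-ins showing how declared input
// types let an analyzer rule insert casts.
sealed trait DataType
case object DoubleType extends DataType
case class DecimalType(precision: Int, scale: Int) extends DataType

sealed trait Expr { def dataType: DataType }
case class Literal(value: Any, dataType: DataType) extends Expr
case class Cast(child: Expr, dataType: DataType) extends Expr

// A function expression that declares which argument types it accepts.
case class ToyPercentile(children: Seq[Expr]) extends Expr {
  val dataType: DataType = DoubleType
  val inputTypes: Seq[DataType] = Seq(DoubleType, DoubleType)
}

// Analyzer-style rule: wrap any mismatched child in a Cast.
def insertCasts(f: ToyPercentile): ToyPercentile =
  ToyPercentile(f.children.zip(f.inputTypes).map {
    case (child, expected) if child.dataType == expected => child
    case (child, expected) => Cast(child, expected)
  })

val resolved = insertCasts(ToyPercentile(Seq(
  Literal(1, DecimalType(38, 18)), Literal(0.5, DoubleType))))
// resolved.children.head is now Cast(Literal(1, DecimalType(38, 18)), DoubleType)
```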
@dongjoon-hyun the current fix is quite brittle; this will fail again as soon as we pass in an argument with a slightly different value. The Analyzer will create casts to the proper type if we implement ExpectsInputTypes, so this seems like the best course of action. It might not be the easiest fix, or even entirely possible, but I'd prefer to try this first.
Hi, @hvanhovell .
I tried again, but, as you saw in my first commit, this happens while resolving UnresolvedFunction.
IMHO, we cannot do this with ExpectsInputTypes.
yea i think the problem is that we don't register the hive function?
@rxin . Actually, we do createTempFunction for the hive function on the fly, but with a different signature (Decimal).
makeFunctionBuilder indeed uses children implicitly. That's why I renamed lookupFunction to subLookupFunction and repeat the same process with different children.
```scala
val builder = makeFunctionBuilder(functionName, className)
// Put this Hive built-in function to our function registry.
val info = new ExpressionInfo(className, functionName)
createTempFunction(functionName, info, builder, ignoreIfExists = false)
```
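The retry-based lookup discussed in this thread can be sketched in a self-contained way. Everything below is a simplified stand-in for the real `HiveSessionCatalog` code; only the `lookupFunction0` name mirrors the PR.

```scala
// Simplified, self-contained sketch of the decimal-to-double fallback;
// these types are stand-ins, not Spark's actual classes.
import scala.util.control.NonFatal

sealed trait DataType
case object DoubleType extends DataType
case class DecimalType(precision: Int, scale: Int) extends DataType
case class Arg(dataType: DataType)

// Stand-in for the underlying catalog lookup, which throws when no
// Hive signature matches the (decimal) argument types.
def lookupFunction0(name: String, children: Seq[Arg]): String =
  if (children.exists(_.dataType.isInstanceOf[DecimalType]))
    throw new IllegalArgumentException(s"No matching method for $name")
  else s"$name(${children.map(_.dataType).mkString(", ")})"

def lookupFunction(name: String, children: Seq[Arg]): String =
  try lookupFunction0(name, children) catch {
    case NonFatal(_) =>
      // Fallback: retry with DecimalType arguments widened to DoubleType.
      val newChildren = children.map {
        case Arg(_: DecimalType) => Arg(DoubleType)
        case other => other
      }
      lookupFunction0(name, newChildren)
  }

// lookupFunction("percentile", Seq(Arg(DoubleType), Arg(DecimalType(38, 18))))
// succeeds via the fallback and returns "percentile(DoubleType, DoubleType)".
```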
I mean we need to call createTempFunction with double children instead of decimal children.
Oh, @rxin . I misunderstood your question. Yes, we didn't register the hive function before.
@dongjoon-hyun Never mind. We use the datatypes of the arguments passed to the Hive UDF/UDAF/UDTF to determine which object inspectors to use for conversion. So there is no way we can fix this using ExpectsInputTypes; sorry about the confusion...
We have only changed the default datatype for decimal conversion, so I guess your fix is ok.
Thank you, @hvanhovell .
…ith `double`-type parameter only.
Rebased to master for #13939 .
Test build #61446 has finished for PR 13930 at commit
```scala
try {
  subLookupFunction(name, children)
} catch {
  case _: Exception =>
```
catch `NonFatal` here, i.e.
```scala
case NonFatal(_) =>
```
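For reference, `scala.util.control.NonFatal` is the standard Scala extractor for catching only recoverable throwables; fatal errors such as `OutOfMemoryError` still propagate. A minimal usage sketch:

```scala
import scala.util.control.NonFatal

// Run `attempt`; on any non-fatal throwable fall back to `fallback`.
// Fatal errors (VirtualMachineError, InterruptedException, etc.) propagate.
def withFallback[A](attempt: => A)(fallback: => A): A =
  try attempt catch { case NonFatal(_) => fallback }

// withFallback(sys.error("boom"))(42) evaluates to 42.
```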
Thank you, @rxin . The code is updated according to your advice.
Test build #61496 has finished for PR 13930 at commit
Thanks - merging in master/2.0.
…nctions for decimal param lookups
## What changes were proposed in this pull request?
This PR supports a fallback lookup by casting `DecimalType` into `DoubleType` for the external functions with `double`-type parameter.
**Reported Error Scenarios**
```scala
scala> sql("select percentile(value, 0.5) from values 1,2,3 T(value)")
org.apache.spark.sql.AnalysisException: ... No matching method for class org.apache.hadoop.hive.ql.udf.UDAFPercentile with (int, decimal(38,18)). Possible choices: _FUNC_(bigint, array<double>) _FUNC_(bigint, double) ; line 1 pos 7
scala> sql("select percentile_approx(value, 0.5) from values 1.0,2.0,3.0 T(value)")
org.apache.spark.sql.AnalysisException: ... Only a float/double or float/double array argument is accepted as parameter 2, but decimal(38,18) was passed instead.; line 1 pos 7
```
## How was this patch tested?
Pass the Jenkins tests (including a new test case).
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes #13930 from dongjoon-hyun/SPARK-16228.
(cherry picked from commit 2eaabfa)
Signed-off-by: Reynold Xin <rxin@databricks.com>
Thank you, @rxin !
```scala
    }
    lookupFunction0(name, newChildren)
  }
}
```
@dongjoon-hyun What is the reason that we need to catch an exception instead of letting the analyzer do the job?
Hi. Which analyzer do you mean? The exception comes from the Hive analyzer, which considers DecimalType different from double.
oh, how does hive's analyzer get involved here? I am thinking that when we create the hive function's expression, we will know the expected input type of the function. Then, spark's analyzer will add the cast.
Yep. Right.
That was the story of this PR at the time.
Hmm. Sorry, more specifically, it's not about the Hive analyzer. The exception is raised during the Hive catalog lookup.
Do you still have the stack trace? It seems weird that it fails during the lookup.
It was three months ago, so I don't have it. You can reproduce it by checking out the commit before this one in the master branch. I think that is the most reliable way to reproduce the status.
ok. If you get a chance, can you take a look? The solution here looks pretty weird. I am wondering if we can move away from relying on try/catch. Thanks!
@yhuai . Very sorry, I don't have enough time for this. Why don't you create a JIRA for the investigation?
```scala
sql("select percentile(value, cast(0.5 as double)) from values 1,2,3 T(value)")
sql("select percentile_approx(value, cast(0.5 as double)) from values 1.0,2.0,3.0 T(value)")
sql("select percentile(value, 0.5) from values 1,2,3 T(value)")
sql("select percentile_approx(value, 0.5) from values 1.0,2.0,3.0 T(value)")
```
This doesn't cover all interfaces of percentile. E.g. the case
```scala
sql("select percentile(value, array(0.5,0.99)) from values 1,2,3 T(value)")
```
still throws the error outlined in https://issues.apache.org/jira/browse/SPARK-16228?focusedCommentId=15673869&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15673869
Should I report this as a bug? Thanks.