
Conversation

@dongjoon-hyun (Member) commented Jun 27, 2016

What changes were proposed in this pull request?

This PR supports a fallback lookup by casting DecimalType into DoubleType for external functions with double-type parameters.

Reported Error Scenarios

scala> sql("select percentile(value, 0.5) from values 1,2,3 T(value)")
org.apache.spark.sql.AnalysisException: ... No matching method for class org.apache.hadoop.hive.ql.udf.UDAFPercentile with (int, decimal(38,18)). Possible choices: _FUNC_(bigint, array<double>)  _FUNC_(bigint, double)  ; line 1 pos 7

scala> sql("select percentile_approx(value, 0.5) from values 1.0,2.0,3.0 T(value)")
org.apache.spark.sql.AnalysisException: ... Only a float/double or float/double array argument is accepted as parameter 2, but decimal(38,18) was passed instead.; line 1 pos 7

How was this patch tested?

Pass the Jenkins tests (including a new testcase).
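
A condensed sketch of the fallback, assuming the Spark 2.x Catalyst API; lookupWithDoubleFallback is an invented name, and the real change lives in HiveSessionCatalog (see the diff excerpts quoted later in this thread). The PR initially caught case _: Exception; the NonFatal pattern shown here is what review feedback below converged on.

```scala
import scala.util.control.NonFatal

import org.apache.spark.sql.catalyst.expressions.{Cast, Expression}
import org.apache.spark.sql.types.{DecimalType, DoubleType}

// Try the normal function lookup first; if it fails with a non-fatal
// exception, cast every DecimalType argument to DoubleType and retry,
// because Hive registers these functions with double parameters only.
def lookupWithDoubleFallback(
    lookup: Seq[Expression] => Expression)(children: Seq[Expression]): Expression = {
  try {
    lookup(children)
  } catch {
    case NonFatal(_) =>
      val newChildren = children.map { child =>
        if (child.dataType.isInstanceOf[DecimalType]) Cast(child, DoubleType) else child
      }
      lookup(newChildren)
  }
}
```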

Contributor

Shouldn't this be solved in the ExternalCatalog? This seems way too implementation-specific to put in the Analyzer.

Member Author

That sounds better. Thank you, @hvanhovell.
I'll move that.

Member Author

You mean HiveSessionCatalog, right?

Contributor

I am not sure. That seems like a good place though.

I am also curious why we use the function arguments (and not just the names) to resolve the function; this kinda robs us of the opportunity to cast the arguments.

@dongjoon-hyun (Member Author)

Hi, @hvanhovell.
I updated this PR according to your comments.
Indeed, this issue was only about HiveSessionCatalog.
Thank you!

@dongjoon-hyun changed the title from "[SPARK-16228][SQL] Support a fallback lookup for external functions with double-type parameter only" to "[SPARK-16228][SQL] HiveSessionCatalog should return double-param functions for decimal param lookups" on Jun 27, 2016
@SparkQA commented Jun 27, 2016

Test build #61323 has finished for PR 13930 at commit ef594f9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jun 27, 2016

Test build #61327 has finished for PR 13930 at commit 8d6ce3b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member Author)

Hi, @hvanhovell.
Could you review this PR again?

Member Author

subLookupFunction doesn't seem like a good name. Is there a suitable naming convention in Spark?

Contributor

lookupFunction0

@rxin (Contributor) commented Jun 28, 2016

ping @hvanhovell

Contributor

Shouldn't we just implement ExpectsInputTypes for Hive UDF/UDAF/UDTF? Then the analyzer will insert the appropriate casts.
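
For background, a Catalyst expression opts into analyzer-inserted casts by mixing in ExpectsInputTypes and declaring inputTypes. A minimal hypothetical sketch, assuming the Spark 2.x expression API (the WantsDouble expression is invented for illustration, not Spark code):

```scala
import org.apache.spark.sql.catalyst.expressions.{ExpectsInputTypes, Expression, UnaryExpression}
import org.apache.spark.sql.catalyst.expressions.codegen.CodegenFallback
import org.apache.spark.sql.types.{AbstractDataType, DataType, DoubleType}

// Declares a double input; during analysis, the implicit type-cast rule
// reads inputTypes and wraps the child in Cast(child, DoubleType) when the
// actual argument is, say, decimal(38,18).
case class WantsDouble(child: Expression)
  extends UnaryExpression with ExpectsInputTypes with CodegenFallback {
  override def inputTypes: Seq[AbstractDataType] = Seq(DoubleType)
  override def dataType: DataType = DoubleType
  override protected def nullSafeEval(input: Any): Any = input // identity, for illustration
}
```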

Member Author

Thank you for the advice, @hvanhovell.
Do you mean adding ExpectsInputTypes to HiveSimpleUDF, HiveGenericUDF, and HiveUDAFFunction?
We only have 4 expressions to handle all generic Hive functions, so currently makeFunctionBuilder seems to do type checking by calling udf.dataType on the fly.
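
A rough paraphrase of that on-the-fly check (simplified, not the verbatim makeFunctionBuilder; hiveBuilder and makeUdf are invented names):

```scala
import org.apache.spark.sql.catalyst.expressions.Expression

// The builder constructs the Hive UDF expression for the given children and
// immediately touches udf.dataType, which triggers Hive's function/method
// resolution against the children's data types. A decimal(38,18) argument
// therefore fails right here, at lookup time.
def hiveBuilder(makeUdf: Seq[Expression] => Expression): Seq[Expression] => Expression =
  (children: Seq[Expression]) => {
    val udf = makeUdf(children)
    udf.dataType // force input type checking on the fly
    udf
  }
```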

Contributor

@dongjoon-hyun the current fix is quite brittle; it will fail again as soon as we pass in an argument with a slightly different type. The Analyzer will create casts to the proper type if we implement ExpectsInputTypes, so that seems like the best course of action. It might not be the easiest fix, or entirely possible, but I'd prefer to try this first.

Member Author

Hi, @hvanhovell.
I tried again but, as you saw in my first commit, this happens while resolving UnresolvedFunction.

https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L884

IMHO, we cannot do this in ExpectsInputTypes.

Contributor

yea, I think the problem is that we don't register the hive function?

Member Author

@rxin, actually we do createTempFunction for the Hive function on the fly, but with a different signature (Decimal).
makeFunctionBuilder indeed uses children implicitly. That's why I renamed lookupFunction to subLookupFunction and repeat the same process with different children.

```scala
val builder = makeFunctionBuilder(functionName, className)
// Put this Hive built-in function to our function registry.
val info = new ExpressionInfo(className, functionName)
createTempFunction(functionName, info, builder, ignoreIfExists = false)
```

Member Author

I mean we need to call createTempFunction with double children instead of decimal children.

Member Author

Oh, @rxin, I misunderstood your question. Yes, we don't register the hive function beforehand.

Contributor

@dongjoon-hyun Never mind. We use the datatypes of the arguments passed to the Hive UDF/UDAF/UDTF to determine which object inspectors to use for conversion. So there is no way we can fix this using ExpectsInputTypes; sorry about the confusion...

We have only changed the default datatype for decimal conversion, so I guess your fix is ok.

Member Author

Thank you, @hvanhovell.

@dongjoon-hyun (Member Author)

Rebased onto master for #13939.

@SparkQA commented Jun 29, 2016

Test build #61446 has finished for PR 13930 at commit b8df028.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

```scala
try {
  subLookupFunction(name, children)
} catch {
  case _: Exception =>
```
Contributor

Catch NonFatal here, i.e.

```scala
case NonFatal(_) =>
```
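
For reference, NonFatal here is scala.util.control.NonFatal: it matches any Throwable except JVM-fatal ones (VirtualMachineError, ThreadDeath, InterruptedException, LinkageError, ControlThrowable), so the retry cannot swallow, e.g., an OutOfMemoryError. A minimal illustration with an invented helper name:

```scala
import scala.util.control.NonFatal

// Run `first`; on any non-fatal throwable, fall back to `second`.
def retryOnNonFatal[T](first: => T)(second: => T): T =
  try first catch { case NonFatal(_) => second }
```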

@dongjoon-hyun (Member Author)

Thank you, @rxin. I updated the following according to your advice:

  • Renamed subLookupFunction to lookupFunction0.
  • Replaced case _: Exception with case NonFatal(_).
  • Renamed x to child and used braces for map.

@SparkQA commented Jun 29, 2016

Test build #61496 has finished for PR 13930 at commit 80cea2e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin (Contributor) commented Jun 29, 2016

Thanks - merging in master/2.0.

@asfgit closed this in 2eaabfa Jun 29, 2016
asfgit pushed a commit that referenced this pull request Jun 29, 2016
[SPARK-16228][SQL] HiveSessionCatalog should return double-param functions for decimal param lookups

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #13930 from dongjoon-hyun/SPARK-16228.

(cherry picked from commit 2eaabfa)
Signed-off-by: Reynold Xin <rxin@databricks.com>
@dongjoon-hyun (Member Author)

Thank you, @rxin!

@dongjoon-hyun deleted the SPARK-16228 branch July 20, 2016 07:41
```scala
    }
    lookupFunction0(name, newChildren)
  }
}
```
Contributor

@dongjoon-hyun What is the reason that we need to catch an exception instead of letting the analyzer do the job?

@dongjoon-hyun (Member Author)

Hi. Which analyzer do you mean? The exception comes from the Hive Analyzer, which considers DecimalType different from double.

@yhuai (Contributor) commented Sep 30, 2016

oh, how did hive's analyzer get involved here? I am thinking that when we create the hive function's expression, we will know the expected input type of the function. Then spark's analyzer will add the cast.

@dongjoon-hyun (Member Author)

Yep, right. Here is the situation:

  1. There was a Hive function, f(double).
  2. Spark's Analyzer makes a Hive function expression whose parameter type is DecimalType.
  3. The Hive Analyzer raises an exception because there is no such function, f(decimal).
  4. With this PR, Spark makes a Hive function expression whose parameter type is double and retries.

That was the story of this PR at the time.

@dongjoon-hyun (Member Author)

Hmm, sorry. More specifically, it's not about the Hive Analyzer; the exception is raised during the Hive catalog lookup.

@yhuai (Contributor) commented Oct 1, 2016

Do you still have the stack trace? Seems weird that it fails during the lookup.

@dongjoon-hyun (Member Author)

That was three months ago; I don't have it anymore. You can reproduce it by checking out the commit just before this one in the master branch. I think that is the most reliable way to reproduce the status.

@yhuai (Contributor) commented Oct 1, 2016

ok. If you get a chance, can you take a look? The solution here looks pretty weird. I am wondering if we can move away from relying on try/catch. Thanks!

@dongjoon-hyun (Member Author)

@yhuai, I am very sorry, but I don't have enough time for this. Why don't you create a JIRA for the investigation?

sql("select percentile(value, cast(0.5 as double)) from values 1,2,3 T(value)")
sql("select percentile_approx(value, cast(0.5 as double)) from values 1.0,2.0,3.0 T(value)")
sql("select percentile(value, 0.5) from values 1,2,3 T(value)")
sql("select percentile_approx(value, 0.5) from values 1.0,2.0,3.0 T(value)")

This doesn't cover all the interfaces of percentile; e.g., it misses the case:

sql("select percentile(value, array(0.5,0.99)) from values 1,2,3 T(value)")

which still throws the same error, as outlined in https://issues.apache.org/jira/browse/SPARK-16228?focusedCommentId=15673869&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15673869

Should I report this as a bug? Thanks.
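
Extending the fallback to that case would mean rewriting array<decimal> arguments as well, since percentile also accepts array<double>. A hypothetical sketch, not part of the merged change:

```scala
import org.apache.spark.sql.catalyst.expressions.{Cast, Expression}
import org.apache.spark.sql.types.{ArrayType, DecimalType, DoubleType}

// Rewrite both decimal and array<decimal> arguments to their double
// counterparts before retrying the lookup.
def castDecimalsToDouble(e: Expression): Expression = e.dataType match {
  case _: DecimalType => Cast(e, DoubleType)
  case ArrayType(_: DecimalType, containsNull) => Cast(e, ArrayType(DoubleType, containsNull))
  case _ => e
}
```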
