
Conversation

@dongjoon-hyun (Member) commented Jun 27, 2016

What changes were proposed in this pull request?

This PR supports a fallback lookup by casting DecimalType into DoubleType for external functions with double-type parameters.

Reported Error Scenarios

scala> sql("select percentile(value, 0.5) from values 1,2,3 T(value)")
org.apache.spark.sql.AnalysisException: ... No matching method for class org.apache.hadoop.hive.ql.udf.UDAFPercentile with (int, decimal(38,18)). Possible choices: _FUNC_(bigint, array<double>)  _FUNC_(bigint, double)  ; line 1 pos 7

scala> sql("select percentile_approx(value, 0.5) from values 1.0,2.0,3.0 T(value)")
org.apache.spark.sql.AnalysisException: ... Only a float/double or float/double array argument is accepted as parameter 2, but decimal(38,18) was passed instead.; line 1 pos 7

How was this patch tested?

Pass the Jenkins tests (including a new testcase).
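
A condensed sketch of the fallback, assuming the Spark 2.x Catalyst API; lookupWithDoubleFallback is an invented name, and the real change lives in HiveSessionCatalog (see the diff excerpts quoted later in this thread). The PR initially caught case _: Exception; the NonFatal pattern shown here is what review feedback below converged on.

```scala
import scala.util.control.NonFatal

import org.apache.spark.sql.catalyst.expressions.{Cast, Expression}
import org.apache.spark.sql.types.{DecimalType, DoubleType}

// Try the normal function lookup first; if it fails with a non-fatal
// exception, cast every DecimalType argument to DoubleType and retry,
// because Hive registers these functions with double parameters only.
def lookupWithDoubleFallback(
    lookup: Seq[Expression] => Expression)(children: Seq[Expression]): Expression = {
  try {
    lookup(children)
  } catch {
    case NonFatal(_) =>
      val newChildren = children.map { child =>
        if (child.dataType.isInstanceOf[DecimalType]) Cast(child, DoubleType) else child
      }
      lookup(newChildren)
  }
}
```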

Contributor

Shouldn't this be solved in the ExternalCatalog? This seems way too implementation-specific to put in the Analyzer.

Member Author

That sounds better. Thank you, @hvanhovell.
I'll move that.

Member Author

You mean HiveSessionCatalog, right?

Contributor

I am not sure. That seems like a good place though.

I am also curious why we use the function arguments (and not just the names) to resolve the function; this kinda robs us of the opportunity to cast the arguments.

@dongjoon-hyun (Member Author)

Hi, @hvanhovell.
I updated this PR according to your comments.
Indeed, this issue was only about HiveSessionCatalog.
Thank you!

@dongjoon-hyun changed the title from "[SPARK-16228][SQL] Support a fallback lookup for external functions with double-type parameter only" to "[SPARK-16228][SQL] HiveSessionCatalog should return double-param functions for decimal param lookups" on Jun 27, 2016
@SparkQA commented Jun 27, 2016

Test build #61323 has finished for PR 13930 at commit ef594f9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jun 27, 2016

Test build #61327 has finished for PR 13930 at commit 8d6ce3b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member Author)

Hi, @hvanhovell.
Could you review this PR again?

Member Author

subLookupFunction doesn't seem like a good name. Is there a suitable naming convention in Spark?

Contributor

lookupFunction0

@rxin (Contributor) commented Jun 28, 2016

ping @hvanhovell

Contributor

Shouldn't we just implement ExpectsInputTypes for Hive UDF/UDAF/UDTF? Then the analyzer will insert the appropriate casts.
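
For background, a Catalyst expression opts into analyzer-inserted casts by mixing in ExpectsInputTypes and declaring inputTypes. A minimal hypothetical sketch, assuming the Spark 2.x expression API (the WantsDouble expression is invented for illustration, not Spark code):

```scala
import org.apache.spark.sql.catalyst.expressions.{ExpectsInputTypes, Expression, UnaryExpression}
import org.apache.spark.sql.catalyst.expressions.codegen.CodegenFallback
import org.apache.spark.sql.types.{AbstractDataType, DataType, DoubleType}

// Declares a double input; during analysis, the implicit type-cast rule
// reads inputTypes and wraps the child in Cast(child, DoubleType) when the
// actual argument is, say, decimal(38,18).
case class WantsDouble(child: Expression)
  extends UnaryExpression with ExpectsInputTypes with CodegenFallback {
  override def inputTypes: Seq[AbstractDataType] = Seq(DoubleType)
  override def dataType: DataType = DoubleType
  override protected def nullSafeEval(input: Any): Any = input // identity, for illustration
}
```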

Member Author

Thank you for the advice, @hvanhovell.
Do you mean adding ExpectsInputTypes to HiveSimpleUDF, HiveGenericUDF, and HiveUDAFFunction?
We only have 4 expressions to handle all generic Hive functions, so currently makeFunctionBuilder seems to do type checking by calling udf.dataType on the fly.
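
A rough paraphrase of that on-the-fly check (simplified, not the verbatim makeFunctionBuilder; hiveBuilder and makeUdf are invented names):

```scala
import org.apache.spark.sql.catalyst.expressions.Expression

// The builder constructs the Hive UDF expression for the given children and
// immediately touches udf.dataType, which triggers Hive's function/method
// resolution against the children's data types. A decimal(38,18) argument
// therefore fails right here, at lookup time.
def hiveBuilder(makeUdf: Seq[Expression] => Expression): Seq[Expression] => Expression =
  (children: Seq[Expression]) => {
    val udf = makeUdf(children)
    udf.dataType // force input type checking on the fly
    udf
  }
```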

Contributor

@dongjoon-hyun the current fix is quite brittle; it will fail again as soon as we pass in an argument with a slightly different type. The Analyzer will create casts to the proper type if we implement ExpectsInputTypes, so that seems like the best course of action. It might not be the easiest fix, or entirely possible, but I'd prefer to try this first.

Member Author

Hi, @hvanhovell.
I tried again but, as you saw in my first commit, this happens while resolving UnresolvedFunction.

https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L884

IMHO, we cannot do this in ExpectsInputTypes.

Contributor

yea, I think the problem is that we don't register the hive function?

Member Author

@rxin, actually we do createTempFunction for the Hive function on the fly, but with a different signature (Decimal).
makeFunctionBuilder indeed uses children implicitly. That's why I renamed lookupFunction to subLookupFunction and repeat the same process with different children.

```scala
val builder = makeFunctionBuilder(functionName, className)
// Put this Hive built-in function to our function registry.
val info = new ExpressionInfo(className, functionName)
createTempFunction(functionName, info, builder, ignoreIfExists = false)
```

Member Author

I mean we need to call createTempFunction with double children instead of decimal children.

Member Author

Oh, @rxin, I misunderstood your question. Yes, we don't register the hive function beforehand.

Contributor

@dongjoon-hyun Never mind. We use the datatypes of the arguments passed to the Hive UDF/UDAF/UDTF to determine which object inspectors to use for conversion. So there is no way we can fix this using ExpectsInputTypes; sorry about the confusion...

We have only changed the default datatype for decimal conversion, so I guess your fix is ok.

Member Author

Thank you, @hvanhovell.

@dongjoon-hyun (Member Author)

Rebased onto master for #13939.

@SparkQA commented Jun 29, 2016

Test build #61446 has finished for PR 13930 at commit b8df028.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

```scala
try {
  subLookupFunction(name, children)
} catch {
  case _: Exception =>
```
Contributor

Catch NonFatal here, i.e.

```scala
case NonFatal(_) =>
```
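
For reference, NonFatal here is scala.util.control.NonFatal: it matches any Throwable except JVM-fatal ones (VirtualMachineError, ThreadDeath, InterruptedException, LinkageError, ControlThrowable), so the retry cannot swallow, e.g., an OutOfMemoryError. A minimal illustration with an invented helper name:

```scala
import scala.util.control.NonFatal

// Run `first`; on any non-fatal throwable, fall back to `second`.
def retryOnNonFatal[T](first: => T)(second: => T): T =
  try first catch { case NonFatal(_) => second }
```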

@dongjoon-hyun (Member Author)

Thank you, @rxin. I updated the following according to your advice:

  • Renamed subLookupFunction to lookupFunction0.
  • Replaced case _: Exception with case NonFatal(_).
  • Renamed x to child and used braces for map.

@SparkQA commented Jun 29, 2016

Test build #61496 has finished for PR 13930 at commit 80cea2e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin (Contributor) commented Jun 29, 2016

Thanks - merging in master/2.0.

@asfgit closed this in 2eaabfa Jun 29, 2016
asfgit pushed a commit that referenced this pull request Jun 29, 2016
[SPARK-16228][SQL] HiveSessionCatalog should return double-param functions for decimal param lookups

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #13930 from dongjoon-hyun/SPARK-16228.

(cherry picked from commit 2eaabfa)
Signed-off-by: Reynold Xin <rxin@databricks.com>
@dongjoon-hyun (Member Author)

Thank you, @rxin!

@dongjoon-hyun deleted the SPARK-16228 branch July 20, 2016 07:41
```scala
    }
    lookupFunction0(name, newChildren)
  }
}
```
Contributor

@dongjoon-hyun What is the reason that we need to catch an exception instead of letting the analyzer do the job?

@dongjoon-hyun (Member Author)

Hi. Which analyzer do you mean? The exception comes from the Hive Analyzer, which considers DecimalType different from double.

@yhuai (Contributor) commented Sep 30, 2016

oh, how did hive's analyzer get involved here? I am thinking that when we create the hive function's expression, we will know the expected input type of the function. Then spark's analyzer will add the cast.

@dongjoon-hyun (Member Author)

Yep, right. Here is the situation:

  1. There was a Hive function, f(double).
  2. Spark's Analyzer makes a Hive function expression whose parameter type is DecimalType.
  3. The Hive Analyzer raises an exception because there is no such function, f(decimal).
  4. With this PR, Spark makes a Hive function expression whose parameter type is double and retries.

That was the story of this PR at the time.

@dongjoon-hyun (Member Author)

Hmm, sorry. More specifically, it's not about the Hive Analyzer; the exception is raised during the Hive catalog lookup.

@yhuai (Contributor) commented Oct 1, 2016

Do you still have the stack trace? Seems weird that it fails during the lookup.

@dongjoon-hyun (Member Author)

That was three months ago; I don't have it anymore. You can reproduce it by checking out the commit just before this one in the master branch. I think that is the most reliable way to reproduce the status.

@yhuai (Contributor) commented Oct 1, 2016

ok. If you get a chance, can you take a look? The solution here looks pretty weird. I am wondering if we can move away from relying on try/catch. Thanks!

@dongjoon-hyun (Member Author)

@yhuai, I am very sorry, but I don't have enough time for this. Why don't you create a JIRA for the investigation?

sql("select percentile(value, cast(0.5 as double)) from values 1,2,3 T(value)")
sql("select percentile_approx(value, cast(0.5 as double)) from values 1.0,2.0,3.0 T(value)")
sql("select percentile(value, 0.5) from values 1,2,3 T(value)")
sql("select percentile_approx(value, 0.5) from values 1.0,2.0,3.0 T(value)")

This doesn't cover all the interfaces of percentile; e.g., it misses the case:

sql("select percentile(value, array(0.5,0.99)) from values 1,2,3 T(value)")

which still throws the same error, as outlined in https://issues.apache.org/jira/browse/SPARK-16228?focusedCommentId=15673869&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15673869

Should I report this as a bug? Thanks.
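
Extending the fallback to that case would mean rewriting array<decimal> arguments as well, since percentile also accepts array<double>. A hypothetical sketch, not part of the merged change:

```scala
import org.apache.spark.sql.catalyst.expressions.{Cast, Expression}
import org.apache.spark.sql.types.{ArrayType, DecimalType, DoubleType}

// Rewrite both decimal and array<decimal> arguments to their double
// counterparts before retrying the lookup.
def castDecimalsToDouble(e: Expression): Expression = e.dataType match {
  case _: DecimalType => Cast(e, DoubleType)
  case ArrayType(_: DecimalType, containsNull) => Cast(e, ArrayType(DoubleType, containsNull))
  case _ => e
}
```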
