Commit 4b9b246
[SPARK-51466][SQL][HIVE] Eliminate Hive built-in UDFs initialization on Hive UDF evaluation
### What changes were proposed in this pull request?
Fork a few methods from Hive to eliminate calls to `org.apache.hadoop.hive.ql.exec.FunctionRegistry`, which avoids initializing the Hive built-in UDFs.
### Why are the changes needed?
Currently, when a user runs a query that contains a Hive UDF, it triggers the initialization of `org.apache.hadoop.hive.ql.exec.FunctionRegistry`, which in turn initializes the [Hive built-in UDFs, UDAFs and UDTFs](https://github.com/apache/hive/blob/rel/release-2.3.10/ql/src/java/org/apache/hadoop/hive/ql/exec/FunctionRegistry.java#L500).
Since [SPARK-51029](https://issues.apache.org/jira/browse/SPARK-51029) (#49725) removed hive-llap-common from the Spark binary distributions, that initialization now fails with `NoClassDefFoundError`:
```
org.apache.spark.sql.execution.QueryExecutionException: java.lang.NoClassDefFoundError: org/apache/hadoop/hive/llap/security/LlapSigner$Signable
at java.base/java.lang.Class.getDeclaredConstructors0(Native Method)
at java.base/java.lang.Class.privateGetDeclaredConstructors(Class.java:3373)
at java.base/java.lang.Class.getConstructor0(Class.java:3578)
at java.base/java.lang.Class.getDeclaredConstructor(Class.java:2754)
at org.apache.hive.common.util.ReflectionUtil.newInstance(ReflectionUtil.java:79)
at org.apache.hadoop.hive.ql.exec.Registry.registerGenericUDTF(Registry.java:208)
at org.apache.hadoop.hive.ql.exec.Registry.registerGenericUDTF(Registry.java:201)
at org.apache.hadoop.hive.ql.exec.FunctionRegistry.<clinit>(FunctionRegistry.java:500)
at org.apache.hadoop.hive.ql.udf.generic.GenericUDF.initializeAndFoldConstants(GenericUDF.java:160)
at org.apache.spark.sql.hive.HiveGenericUDFEvaluator.returnInspector$lzycompute(hiveUDFEvaluators.scala:118)
at org.apache.spark.sql.hive.HiveGenericUDFEvaluator.returnInspector(hiveUDFEvaluators.scala:117)
at org.apache.spark.sql.hive.HiveGenericUDF.dataType$lzycompute(hiveUDFs.scala:132)
at org.apache.spark.sql.hive.HiveGenericUDF.dataType(hiveUDFs.scala:132)
at org.apache.spark.sql.hive.HiveUDFExpressionBuilder$.makeHiveFunctionExpression(HiveSessionStateBuilder.scala:197)
at org.apache.spark.sql.hive.HiveUDFExpressionBuilder$.$anonfun$makeExpression$1(HiveSessionStateBuilder.scala:177)
at org.apache.spark.util.Utils$.withContextClassLoader(Utils.scala:187)
at org.apache.spark.sql.hive.HiveUDFExpressionBuilder$.makeExpression(HiveSessionStateBuilder.scala:171)
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.$anonfun$makeFunctionBuilder$1(SessionCatalog.scala:1689)
...
```
Spark does not actually use those Hive built-in functions, but it still has to pull in their transitive dependencies just so that Hive's static initialization succeeds. By eliminating the initialization of the Hive built-in UDFs, Spark can drop those transitive dependencies and gains a small performance improvement on the first call to a Hive UDF.
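The mechanism behind this can be illustrated with a minimal, self-contained Java sketch. It is not the Hive or Spark code: `HeavyRegistry`, `ForkedUtils`, `InitTracker`, and `normalizeName` are hypothetical names chosen for illustration. It only shows the general JVM behavior that calling any static method on a class first runs that class's static initializer, and that forking the method into a separate class sidesteps it.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical demo; none of these classes exist in Hive or Spark.
class StaticInitDemo {
    public static void main(String[] args) {
        // Forked helper: same logic, and HeavyRegistry stays uninitialized.
        System.out.println(ForkedUtils.normalizeName("  MY_UDF "));
        System.out.println("registry initialized? " + InitTracker.registryInitialized);

        // Original helper: the first static call runs HeavyRegistry's static block.
        System.out.println(HeavyRegistry.normalizeName("  MY_UDF "));
        System.out.println("registry initialized? " + InitTracker.registryInitialized);
    }
}

// Records whether the registry's static initializer has run.
class InitTracker {
    static volatile boolean registryInitialized = false;
}

// Stand-in for a registry that eagerly registers built-ins when initialized.
class HeavyRegistry {
    private static final Map<String, Object> BUILT_INS = new HashMap<>();

    static {
        // In a real registry, this is where a missing optional dependency
        // could surface as NoClassDefFoundError.
        InitTracker.registryInitialized = true;
        for (int i = 0; i < 3; i++) {
            BUILT_INS.put("builtin_" + i, new Object());
        }
    }

    // A small utility that does not need BUILT_INS at all.
    static String normalizeName(String name) {
        return name.trim().toLowerCase(Locale.ROOT);
    }
}

// Forked copy of the utility, living in a class with no static initializer.
class ForkedUtils {
    static String normalizeName(String name) {
        return name.trim().toLowerCase(Locale.ROOT);
    }
}
```

Running `StaticInitDemo` prints `my_udf`, then `registry initialized? false`, then `my_udf`, then `registry initialized? true`. The trade-off of forking is that the copied logic has to be kept in sync with upstream by hand.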
### Does this PR introduce _any_ user-facing change?
No, except for a small performance improvement on the first call to a Hive UDF.
### How was this patch tested?
Pass GHA to ensure the ported code is correct.
Manually tested that calling a Hive UDF, UDAF, or UDTF does not trigger `org.apache.hadoop.hive.ql.exec.FunctionRegistry.<clinit>`:
```
$ bin/spark-sql
// UDF
spark-sql (default)> create temporary function hive_uuid as 'org.apache.hadoop.hive.ql.udf.UDFUUID';
Time taken: 0.878 seconds
spark-sql (default)> select hive_uuid();
840356e5-ce2a-4d6c-9383-294d620ec32b
Time taken: 2.264 seconds, Fetched 1 row(s)
// GenericUDF
spark-sql (default)> create temporary function hive_sha2 as 'org.apache.hadoop.hive.ql.udf.generic.GenericUDFSha2';
Time taken: 0.023 seconds
spark-sql (default)> select hive_sha2('ABC', 256);
b5d4045c3f466fa91fe2cc6abe79232a1a57cdf104f7a26e716e0a1e2789df78
Time taken: 0.157 seconds, Fetched 1 row(s)
// UDAF
spark-sql (default)> create temporary function hive_percentile as 'org.apache.hadoop.hive.ql.udf.UDAFPercentile';
Time taken: 0.032 seconds
spark-sql (default)> select hive_percentile(id, 0.5) from range(100);
49.5
Time taken: 0.474 seconds, Fetched 1 row(s)
// GenericUDAF
spark-sql (default)> create temporary function hive_sum as 'org.apache.hadoop.hive.ql.udf.generic.GenericUDAFSum';
Time taken: 0.017 seconds
spark-sql (default)> select hive_sum(*) from range(100);
4950
Time taken: 1.25 seconds, Fetched 1 row(s)
// GenericUDTF
spark-sql (default)> create temporary function hive_replicate_rows as 'org.apache.hadoop.hive.ql.udf.generic.GenericUDTFReplicateRows';
Time taken: 0.012 seconds
spark-sql (default)> select hive_replicate_rows(3L, id) from range(3);
3 0
3 0
3 0
3 1
3 1
3 1
3 2
3 2
3 2
Time taken: 0.19 seconds, Fetched 9 row(s)
```
### Was this patch authored or co-authored using generative AI tooling?
No.
Closes #50232 from pan3793/eliminate-hive-udf-init.
Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
1 parent d965217 commit 4b9b246
File tree
10 files changed, +669 -20 lines changed
- project
- sql
  - hive-thriftserver
    - src
      - main/java/org/apache/hive/service/cli/session
      - test/resources
  - hive/src/main
    - java/org/apache/hadoop/hive/ql
      - exec
      - udf/generic
    - scala/org/apache/spark/sql/hive