[SPARK-47018][BUILD][SQL] Bump built-in Hive to 2.3.10 #46468
Conversation
|
@sunchao please ping me after you finish the jar release :) |
|
@dongjoon-hyun @pan3793 I'm going to release it today. |
|
Great! Thank you so much, @sunchao . |
|
It should have been released from Maven. Can you try this again? |
|
Ya, I can see it. |
|
Could you rebase to the |
| leveldbjni-all/1.8//leveldbjni-all-1.8.jar | ||
| libfb303/0.9.3//libfb303-0.9.3.jar | ||
| libthrift/0.12.0//libthrift-0.12.0.jar | ||
| libthrift/0.16.0//libthrift-0.16.0.jar |
It's great news. Finally. :)
This is good.
| Currently, Hive SerDes and UDFs are based on built-in Hive, | ||
| and Spark SQL can be connected to different versions of Hive Metastore | ||
| (from 0.12.0 to 2.3.9 and 3.0.0 to 3.1.3. Also see [Interacting with Different Versions of Hive Metastore](sql-data-sources-hive-tables.html#interacting-with-different-versions-of-hive-metastore)). | ||
| (from 2.0.0 to 2.3.10 and 3.0.0 to 3.1.3. Also see [Interacting with Different Versions of Hive Metastore](sql-data-sources-hive-tables.html#interacting-with-different-versions-of-hive-metastore)). |
Thank you for fixing 0.12.0 together here.
typo before?
Yes, this is a leftover that we missed when dropping support for Hive versions below 2.x.
| Hive.getWithoutRegisterFns(hiveConf) | ||
| } catch { | ||
| // SPARK-37069: not all Hive versions have the above method (e.g., Hive 2.3.9 has it but | ||
| // SPARK-37069: not all Hive versions have the above method (e.g., Hive 2.3.10 has it but |
Let me revert this because we don't need to change this.
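The `Hive.getWithoutRegisterFns(hiveConf)` call above is wrapped in a try/catch precisely because not every Hive version ships that method. A minimal, hypothetical Java sketch of the same call-and-fallback pattern (not Spark's actual code; `newApi` and `oldApi` are illustrative stand-ins for a newer and an older library entry point):

```java
public class VersionFallback {
    // Hypothetical newer API. When the linked library version lacks the
    // method, invoking it at runtime surfaces as NoSuchMethodError; here we
    // simulate that by throwing it directly.
    static String newApi() { throw new NoSuchMethodError("newApi"); }

    // Hypothetical older API that all supported versions provide.
    static String oldApi() { return "old"; }

    // Try the newer entry point first; fall back to the older one if the
    // linked library version does not provide it.
    static String getClient() {
        try {
            return newApi();
        } catch (NoSuchMethodError e) {
            return oldApi();
        }
    }

    public static void main(String[] args) {
        System.out.println(getClient()); // falls back to the old API here
    }
}
```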
| } | ||
|
|
||
| // Extract major.minor for testing Spark 3.1.x and 3.0.x with metastore 2.3.9 and Java 11. | ||
| // Extract major.minor for testing Spark 3.1.x and 3.0.x with metastore 2.3.10 and Java 11. |
Instead of updating this, it seems that we had better remove this comment because this is very outdated in many ways.
- We no longer test Spark 3.1.x and 3.0.x
- We no longer test Java 11
Or, simply let's revert from this PR to reduce diff size.
removed
|
cc @viirya , too |
|
Hive UT failure is due to I can see the file in |
|
Hive 2.3.10 jars should be available on Google Maven Central Mirror now, re-triggered CI |
|
Thank you! |
dongjoon-hyun
left a comment
+1, LGTM (Pending CIs).
Thank you for re-triggering the failed YARN test and PySpark tests in order to make sure.
|
Also, cc @cloud-fan and @HyukjinKwon. This fixes not only the Hive dependency but also a long-standing |
| commons-crypto/1.1.0//commons-crypto-1.1.0.jar | ||
| commons-dbcp/1.4//commons-dbcp-1.4.jar | ||
| commons-io/2.16.1//commons-io-2.16.1.jar | ||
| commons-lang/2.6//commons-lang-2.6.jar |
This seems to be too hasty or to have a bug.
Hi, @pan3793 .
It seems that we hit the Maven/SBT dependency difference issue again.
Maven CIs fail after this PR due to commons-lang:2.6.
- https://github.com/apache/spark/actions/runs/9032639456
- https://github.com/apache/spark/actions/runs/9033374547/attempts/1
java.lang.NoClassDefFoundError: org/apache/commons/lang/StringUtils
at org.apache.hadoop.hive.contrib.udf.example.GenericUDFTrim2.performOp(GenericUDFTrim2.java:43)
at org.apache.hadoop.hive.ql.udf.generic.GenericUDFBaseTrim.evaluate(GenericUDFBaseTrim.java:75)
at org.apache.hadoop.hive.ql.udf.generic.GenericUDF.initializeAndFoldConstants(GenericUDF.java:170)
at org.apache.spark.sql.hive.HiveGenericUDFEvaluator.returnInspector$lzycompute(hiveUDFEvaluators.scala:118)
at org.apache.spark.sql.hive.HiveGenericUDFEvaluator.returnInspector(hiveUDFEvaluators.scala:117)
at org.apache.spark.sql.hive.HiveGenericUDF.dataType$lzycompute(hiveUDFs.scala:132)
at org.apache.spark.sql.hive.HiveGenericUDF.dataType(hiveUDFs.scala:132)
at org.apache.spark.sql.hive.HiveUDFExpressionBuilder$.makeHiveFunctionExpression(HiveSessionStateBuilder.scala:184)
at org.apache.spark.sql.hive.HiveUDFExpressionBuilder$.$anonfun$makeExpression$1(HiveSessionStateBuilder.scala:164)
at org.apache.spark.util.Utils$.withContextClassLoader(Utils.scala:185)
|
I locally verified that the failure of |
|
It turns out that Apache Spark is unable to support all legacy Hive UDF jar files after this PR. Let me make a follow-up because it's a breaking change we cannot endure. BTW, I'm not sure how SBT works, but Maven dependency management is always stricter and more correct. |
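For readers hitting the same `NoClassDefFoundError`, a quick way to check whether the legacy `commons-lang` 2.x classes are visible to the current JVM is a `Class.forName` probe. This is a generic, stdlib-only sketch (not part of Spark); on a bare JVM without the `commons-lang:2.6` jar, the second probe fails:

```java
public class ClasspathProbe {
    // Returns true if the named class can be loaded by the current classloader.
    static boolean present(String className) {
        try {
            Class.forName(className);
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // java.lang.String is always loadable; the legacy commons-lang 2.x
        // class is only loadable when commons-lang:2.6 is on the classpath.
        System.out.println(present("java.lang.String"));
        System.out.println(present("org.apache.commons.lang.StringUtils"));
    }
}
```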
…ort legacy Hive UDF jars

### What changes were proposed in this pull request?

This PR aims to add `commons-lang:commons-lang:2.6` back to support legacy Hive UDF jars. This is a partial revert of SPARK-47018.

### Why are the changes needed?

Recently, we dropped `commons-lang:commons-lang` during the Hive upgrade.

- #46468

However, only Apache Hive 2.3.10 or 4.0.0 dropped it. In other words, Hive 2.0.0 ~ 2.3.9 and Hive 3.0.0 ~ 3.1.3 require it. As a result, all existing UDF jars built against those versions still require `commons-lang:commons-lang`.

- apache/hive#4892

For example, Apache Hive 3.1.3 code:

- https://github.com/apache/hive/blob/af7059e2bdc8b18af42e0b7f7163b923a0bfd424/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFTrim.java#L21

```
import org.apache.commons.lang.StringUtils;
```

- https://github.com/apache/hive/blob/af7059e2bdc8b18af42e0b7f7163b923a0bfd424/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFTrim.java#L42

```
return StringUtils.strip(val, " ");
```

As a result, Maven CIs are broken.

- https://github.com/apache/spark/actions/runs/9032639456/job/24825599546 (Maven / Java 17)
- https://github.com/apache/spark/actions/runs/9033374547/job/24835284769 (Maven / Java 21)

The root cause is that the existing test UDF jar `hive-test-udfs.jar` was built from old Hive (before 2.3.10) libraries which require `commons-lang:commons-lang:2.6`.

```
HiveUDFDynamicLoadSuite:
- Spark should be able to run Hive UDF using jar regardless of current thread context classloader (UDF
20:21:25.129 WARN org.apache.spark.SparkContext: The JAR file:///home/runner/work/spark/spark/sql/hive/src/test/noclasspath/hive-test-udfs.jar at spark://localhost:33327/jars/hive-test-udfs.jar has been added already. Overwriting of added jar is not supported in the current version.
*** RUN ABORTED ***
A needed class was not found. This could be due to an error in your runpath.
Missing class: org/apache/commons/lang/StringUtils
java.lang.NoClassDefFoundError: org/apache/commons/lang/StringUtils
  at org.apache.hadoop.hive.contrib.udf.example.GenericUDFTrim2.performOp(GenericUDFTrim2.java:43)
  at org.apache.hadoop.hive.ql.udf.generic.GenericUDFBaseTrim.evaluate(GenericUDFBaseTrim.java:75)
  at org.apache.hadoop.hive.ql.udf.generic.GenericUDF.initializeAndFoldConstants(GenericUDF.java:170)
  at org.apache.spark.sql.hive.HiveGenericUDFEvaluator.returnInspector$lzycompute(hiveUDFEvaluators.scala:118)
  at org.apache.spark.sql.hive.HiveGenericUDFEvaluator.returnInspector(hiveUDFEvaluators.scala:117)
  at org.apache.spark.sql.hive.HiveGenericUDF.dataType$lzycompute(hiveUDFs.scala:132)
  at org.apache.spark.sql.hive.HiveGenericUDF.dataType(hiveUDFs.scala:132)
  at org.apache.spark.sql.hive.HiveUDFExpressionBuilder$.makeHiveFunctionExpression(HiveSessionStateBuilder.scala:184)
  at org.apache.spark.sql.hive.HiveUDFExpressionBuilder$.$anonfun$makeExpression$1(HiveSessionStateBuilder.scala:164)
  at org.apache.spark.util.Utils$.withContextClassLoader(Utils.scala:185)
  ...
Cause: java.lang.ClassNotFoundException: org.apache.commons.lang.StringUtils
  at java.base/java.net.URLClassLoader.findClass(URLClassLoader.java:445)
  at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:593)
  at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:526)
  at org.apache.hadoop.hive.contrib.udf.example.GenericUDFTrim2.performOp(GenericUDFTrim2.java:43)
  at org.apache.hadoop.hive.ql.udf.generic.GenericUDFBaseTrim.evaluate(GenericUDFBaseTrim.java:75)
  at org.apache.hadoop.hive.ql.udf.generic.GenericUDF.initializeAndFoldConstants(GenericUDF.java:170)
  at org.apache.spark.sql.hive.HiveGenericUDFEvaluator.returnInspector$lzycompute(hiveUDFEvaluators.scala:118)
  at org.apache.spark.sql.hive.HiveGenericUDFEvaluator.returnInspector(hiveUDFEvaluators.scala:117)
  at org.apache.spark.sql.hive.HiveGenericUDF.dataType$lzycompute(hiveUDFs.scala:132)
  at org.apache.spark.sql.hive.HiveGenericUDF.dataType(hiveUDFs.scala:132)
  ...
```

### Does this PR introduce _any_ user-facing change?

Yes, to support the existing customer UDF jars.

### How was this patch tested?

Manually.

```
$ build/mvn -Dtest=none -DwildcardSuites=org.apache.spark.sql.hive.HiveUDFDynamicLoadSuite test
...
HiveUDFDynamicLoadSuite:
14:21:56.034 WARN org.apache.hadoop.hive.metastore.ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 2.3.0
14:21:56.035 WARN org.apache.hadoop.hive.metastore.ObjectStore: setMetaStoreSchemaVersion called but recording version is disabled: version = 2.3.0, comment = Set by MetaStore dongjoon127.0.0.1
14:21:56.041 WARN org.apache.hadoop.hive.metastore.ObjectStore: Failed to get database default, returning NoSuchObjectException
- Spark should be able to run Hive UDF using jar regardless of current thread context classloader (UDF
14:21:57.576 WARN org.apache.spark.SparkContext: The JAR file:///Users/dongjoon/APACHE/spark-merge/sql/hive/src/test/noclasspath/hive-test-udfs.jar at spark://localhost:55526/jars/hive-test-udfs.jar has been added already. Overwriting of added jar is not supported in the current version.
- Spark should be able to run Hive UDF using jar regardless of current thread context classloader (GENERIC_UDF
14:21:58.314 WARN org.apache.spark.SparkContext: The JAR file:///Users/dongjoon/APACHE/spark-merge/sql/hive/src/test/noclasspath/hive-test-udfs.jar at spark://localhost:55526/jars/hive-test-udfs.jar has been added already. Overwriting of added jar is not supported in the current version.
- Spark should be able to run Hive UDF using jar regardless of current thread context classloader (GENERIC_UDAF
14:21:58.943 WARN org.apache.spark.SparkContext: The JAR file:///Users/dongjoon/APACHE/spark-merge/sql/hive/src/test/noclasspath/hive-test-udfs.jar at spark://localhost:55526/jars/hive-test-udfs.jar has been added already. Overwriting of added jar is not supported in the current version.
- Spark should be able to run Hive UDF using jar regardless of current thread context classloader (UDAF
14:21:59.333 WARN org.apache.hadoop.hive.ql.session.SessionState: METASTORE_FILTER_HOOK will be ignored, since hive.security.authorization.manager is set to instance of HiveAuthorizerFactory.
14:21:59.364 WARN org.apache.hadoop.hive.conf.HiveConf: HiveConf of name hive.internal.ss.authz.settings.applied.marker does not exist
14:21:59.370 WARN org.apache.hadoop.hive.metastore.HiveMetaStore: Location: file:/Users/dongjoon/APACHE/spark-merge/sql/hive/target/tmp/warehouse-49291492-9d48-4360-a354-ace73a2c76ce/src specified for non-external table:src
14:21:59.718 WARN org.apache.hadoop.hive.metastore.ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
14:21:59.770 WARN org.apache.spark.SparkContext: The JAR file:///Users/dongjoon/APACHE/spark-merge/sql/hive/src/test/noclasspath/hive-test-udfs.jar at spark://localhost:55526/jars/hive-test-udfs.jar has been added already. Overwriting of added jar is not supported in the current version.
- Spark should be able to run Hive UDF using jar regardless of current thread context classloader (GENERIC_UDTF
14:22:00.403 WARN org.apache.hadoop.hive.common.FileUtils: File file:/Users/dongjoon/APACHE/spark-merge/sql/hive/target/tmp/warehouse-49291492-9d48-4360-a354-ace73a2c76ce/src does not exist; Force to delete it.
14:22:00.404 ERROR org.apache.hadoop.hive.common.FileUtils: Failed to delete file:/Users/dongjoon/APACHE/spark-merge/sql/hive/target/tmp/warehouse-49291492-9d48-4360-a354-ace73a2c76ce/src
14:22:00.441 WARN org.apache.hadoop.hive.conf.HiveConf: HiveConf of name hive.internal.ss.authz.settings.applied.marker does not exist
14:22:00.453 WARN org.apache.hadoop.hive.ql.session.SessionState: METASTORE_FILTER_HOOK will be ignored, since hive.security.authorization.manager is set to instance of HiveAuthorizerFactory.
14:22:00.537 WARN org.apache.hadoop.hive.conf.HiveConf: HiveConf of name hive.internal.ss.authz.settings.applied.marker does not exist
Run completed in 8 seconds, 612 milliseconds.
Total number of tests run: 5
Suites: completed 2, aborted 0
Tests: succeeded 5, failed 0, canceled 0, ignored 0, pending 0
All tests passed.
```

### Was this patch authored or co-authored using generative AI tooling?

Closes #46528 from dongjoon-hyun/SPARK-48236.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
|
@pan3793 The libthrift upgrade to 0.16 is causing query failures on tables with a large number of partitions. Can you please check the issue reported in SPARK-49489? |
|
@shrprasa HIVE-26633 is missing in 2.3.10. Unfortunately, Hive 2.3 is EOL, so I don't think there's a chance we'll get an official fixed version. Some approaches I think could address the issue
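As background for the `hive.thrift.client.max.message.size` discussion: the Hive conf description quoted below says 0 or -1 means the Thrift library default, and the upper limit is 2147483648 bytes (2 GiB). A hypothetical, stdlib-only Java sketch of parsing such a human-readable size string and clamping it to that limit (`parseSize` is an illustration, not the actual Hive/Spark parser):

```java
public class MaxMessageSize {
    // Upper limit from the Hive conf description: 2147483648 bytes (2 GiB).
    static final long UPPER_LIMIT = 2147483648L;

    // Parse a size such as "1kb", "100mb", or "4gb" into bytes, clamping
    // to the upper limit. Suffix handling is intentionally minimal.
    static long parseSize(String s) {
        s = s.trim().toLowerCase();
        long mult = 1L;
        if (s.endsWith("kb")) { mult = 1024L; s = s.substring(0, s.length() - 2); }
        else if (s.endsWith("mb")) { mult = 1024L * 1024; s = s.substring(0, s.length() - 2); }
        else if (s.endsWith("gb")) { mult = 1024L * 1024 * 1024; s = s.substring(0, s.length() - 2); }
        long bytes = Long.parseLong(s.trim()) * mult;
        return Math.min(bytes, UPPER_LIMIT);
    }

    public static void main(String[] args) {
        System.out.println(parseSize("1kb"));   // 1024
        System.out.println(parseSize("100mb")); // 104857600
        System.out.println(parseSize("4gb"));   // clamped to 2147483648
    }
}
```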
|
…essage.size`

### What changes were proposed in this pull request?

Partly port HIVE-26633 for the Spark HMS client: respect `hive.thrift.client.max.message.size` if present and the value is positive.

> Thrift client configuration for max message size. 0 or -1 will use the default defined in the Thrift library. The upper limit is 2147483648 bytes (or 2gb).

Note: it's a Hive configuration; I follow the convention of not documenting it on the Spark side.

### Why are the changes needed?

1. THRIFT-5237 (0.14.0) changes the max thrift message size from 2GiB to 100MiB
2. HIVE-25098 (4.0.0) upgrades Thrift from 0.13.0 to 0.14.1
3. HIVE-25996 (2.3.10) backports HIVE-25098 to branch-2.3
4. HIVE-26633 (4.0.0) introduces `hive.thrift.client.max.message.size`
5. SPARK-47018 (4.0.0) upgrades Hive from 2.3.9 to 2.3.10

Thus, Spark's HMS client does not respect `hive.thrift.client.max.message.size` and has a fixed max thrift message size of 100MiB; users may hit the "MaxMessageSize reached" exception on accessing Hive tables with a large number of partitions. See discussion in #46468 (comment)

### Does this PR introduce _any_ user-facing change?

No, it tackles a regression introduced by an unreleased change, namely SPARK-47018. The added code only takes effect when the user configures `hive.thrift.client.max.message.size` explicitly.

### How was this patch tested?

This must be tested manually, as the current Spark UT does not cover the remote HMS cases. I constructed a test case in a testing Hadoop cluster with a remote HMS. Firstly, create a table with a large number of partitions.

```
$ spark-sql --num-executors=6 --executor-cores=4 --executor-memory=1g \
    --conf spark.hive.exec.dynamic.partition.mode=nonstrict \
    --conf spark.hive.exec.max.dynamic.partitions=1000000

spark-sql (default)> CREATE TABLE p PARTITIONED BY (year, month, day) STORED AS PARQUET
AS SELECT /*+ REPARTITION(200) */ * FROM (
  (SELECT CAST(id AS STRING) AS year FROM range(2000, 2100)) JOIN
  (SELECT CAST(id AS STRING) AS month FROM range(1, 13)) JOIN
  (SELECT CAST(id AS STRING) AS day FROM range(1, 31)) JOIN
  (SELECT 'this is some data' AS data)
);
```

Then try to tune `hive.thrift.client.max.message.size` and run a query that would trigger a `getPartitions` thrift call. For example, when set to `1kb`, it throws `TTransportException: MaxMessageSize reached`, and the exception disappears after boosting the value.

```
$ spark-sql --conf spark.hive.thrift.client.max.message.size=1kb
spark-sql (default)> SHOW PARTITIONS p;
...
2025-02-20 15:18:49 WARN RetryingMetaStoreClient: MetaStoreClient lost connection. Attempting to reconnect (1 of 1) after 1s. listPartitionNames
org.apache.thrift.transport.TTransportException: MaxMessageSize reached
  at org.apache.thrift.transport.TEndpointTransport.checkReadBytesAvailable(TEndpointTransport.java:81) ~[libthrift-0.16.0.jar:0.16.0]
  at org.apache.thrift.protocol.TProtocol.checkReadBytesAvailable(TProtocol.java:67) ~[libthrift-0.16.0.jar:0.16.0]
  at org.apache.thrift.protocol.TBinaryProtocol.readListBegin(TBinaryProtocol.java:297) ~[libthrift-0.16.0.jar:0.16.0]
  at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$get_partition_names_result$get_partition_names_resultStandardScheme.read(ThriftHiveMetastore.java) ~[hive-metastore-2.3.10.jar:2.3.10]
  at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$get_partition_names_result$get_partition_names_resultStandardScheme.read(ThriftHiveMetastore.java) ~[hive-metastore-2.3.10.jar:2.3.10]
  at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$get_partition_names_result.read(ThriftHiveMetastore.java) ~[hive-metastore-2.3.10.jar:2.3.10]
  at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:88) ~[libthrift-0.16.0.jar:0.16.0]
  at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_get_partition_names(ThriftHiveMetastore.java:2458) ~[hive-metastore-2.3.10.jar:2.3.10]
  at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.get_partition_names(ThriftHiveMetastore.java:2443) ~[hive-metastore-2.3.10.jar:2.3.10]
  at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.listPartitionNames(HiveMetaStoreClient.java:1487) ~[hive-metastore-2.3.10.jar:2.3.10]
  at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:?]
  at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77) ~[?:?]
  at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:?]
  at java.base/java.lang.reflect.Method.invoke(Method.java:569) ~[?:?]
  at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:173) ~[hive-metastore-2.3.10.jar:2.3.10]
  at jdk.proxy2/jdk.proxy2.$Proxy54.listPartitionNames(Unknown Source) ~[?:?]
  at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:?]
  at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77) ~[?:?]
  at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:?]
  at java.base/java.lang.reflect.Method.invoke(Method.java:569) ~[?:?]
  at org.apache.hadoop.hive.metastore.HiveMetaStoreClient$SynchronizedHandler.invoke(HiveMetaStoreClient.java:2349) ~[hive-metastore-2.3.10.jar:2.3.10]
  at jdk.proxy2/jdk.proxy2.$Proxy54.listPartitionNames(Unknown Source) ~[?:?]
  at org.apache.hadoop.hive.ql.metadata.Hive.getPartitionNames(Hive.java:2461) ~[hive-exec-2.3.10-core.jar:2.3.10]
  at org.apache.spark.sql.hive.client.Shim_v2_0.getPartitionNames(HiveShim.scala:976) ~[spark-hive_2.13-4.1.0-SNAPSHOT.jar:4.1.0-SNAPSHOT]
  ...
```

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #50022 from pan3793/SPARK-49489.

Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: yangjie01 <yangjie01@baidu.com>
### What changes were proposed in this pull request?

This PR aims to bump Spark's built-in Hive from 2.3.9 to Hive 2.3.10, with two additional changes:

- due to API breaking changes of Thrift, `libthrift` is upgraded from `0.12` to `0.16`.
- remove version management of `commons-lang:2.6`; it comes from Hive transitive deps, and Hive 2.3.10 drops it in apache/hive#4892

This is the first part of apache#45372

### Why are the changes needed?

Bump Hive to the latest version of 2.3, prepare for upgrading Guava, and drop vulnerable dependencies like Jackson 1.x / Jodd.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass GA. (wait for sunchao to complete the 2.3.10 release to make jars visible on Maven Central)

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#45372
Closes apache#46468 from pan3793/SPARK-47018.

Lead-authored-by: Cheng Pan <chengpan@apache.org>
Co-authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
(cherry picked from commit 2d609bf)
hive.metastore.schema.verification is not enabled so recording the schema version 2.3.0 14:21:56.035 WARN org.apache.hadoop.hive.metastore.ObjectStore: setMetaStoreSchemaVersion called but recording version is disabled: version = 2.3.0, comment = Set by MetaStore dongjoon127.0.0.1 14:21:56.041 WARN org.apache.hadoop.hive.metastore.ObjectStore: Failed to get database default, returning NoSuchObjectException - Spark should be able to run Hive UDF using jar regardless of current thread context classloader (UDF 14:21:57.576 WARN org.apache.spark.SparkContext: The JAR file:///Users/dongjoon/APACHE/spark-merge/sql/hive/src/test/noclasspath/hive-test-udfs.jar at spark://localhost:55526/jars/hive-test-udfs.jar has been added already. Overwriting of added jar is not supported in the current version. - Spark should be able to run Hive UDF using jar regardless of current thread context classloader (GENERIC_UDF 14:21:58.314 WARN org.apache.spark.SparkContext: The JAR file:///Users/dongjoon/APACHE/spark-merge/sql/hive/src/test/noclasspath/hive-test-udfs.jar at spark://localhost:55526/jars/hive-test-udfs.jar has been added already. Overwriting of added jar is not supported in the current version. - Spark should be able to run Hive UDF using jar regardless of current thread context classloader (GENERIC_UDAF 14:21:58.943 WARN org.apache.spark.SparkContext: The JAR file:///Users/dongjoon/APACHE/spark-merge/sql/hive/src/test/noclasspath/hive-test-udfs.jar at spark://localhost:55526/jars/hive-test-udfs.jar has been added already. Overwriting of added jar is not supported in the current version. - Spark should be able to run Hive UDF using jar regardless of current thread context classloader (UDAF 14:21:59.333 WARN org.apache.hadoop.hive.ql.session.SessionState: METASTORE_FILTER_HOOK will be ignored, since hive.security.authorization.manager is set to instance of HiveAuthorizerFactory. 
14:21:59.364 WARN org.apache.hadoop.hive.conf.HiveConf: HiveConf of name hive.internal.ss.authz.settings.applied.marker does not exist 14:21:59.370 WARN org.apache.hadoop.hive.metastore.HiveMetaStore: Location: file:/Users/dongjoon/APACHE/spark-merge/sql/hive/target/tmp/warehouse-49291492-9d48-4360-a354-ace73a2c76ce/src specified for non-external table:src 14:21:59.718 WARN org.apache.hadoop.hive.metastore.ObjectStore: Failed to get database global_temp, returning NoSuchObjectException 14:21:59.770 WARN org.apache.spark.SparkContext: The JAR file:///Users/dongjoon/APACHE/spark-merge/sql/hive/src/test/noclasspath/hive-test-udfs.jar at spark://localhost:55526/jars/hive-test-udfs.jar has been added already. Overwriting of added jar is not supported in the current version. - Spark should be able to run Hive UDF using jar regardless of current thread context classloader (GENERIC_UDTF 14:22:00.403 WARN org.apache.hadoop.hive.common.FileUtils: File file:/Users/dongjoon/APACHE/spark-merge/sql/hive/target/tmp/warehouse-49291492-9d48-4360-a354-ace73a2c76ce/src does not exist; Force to delete it. 14:22:00.404 ERROR org.apache.hadoop.hive.common.FileUtils: Failed to delete file:/Users/dongjoon/APACHE/spark-merge/sql/hive/target/tmp/warehouse-49291492-9d48-4360-a354-ace73a2c76ce/src 14:22:00.441 WARN org.apache.hadoop.hive.conf.HiveConf: HiveConf of name hive.internal.ss.authz.settings.applied.marker does not exist 14:22:00.453 WARN org.apache.hadoop.hive.ql.session.SessionState: METASTORE_FILTER_HOOK will be ignored, since hive.security.authorization.manager is set to instance of HiveAuthorizerFactory. 14:22:00.537 WARN org.apache.hadoop.hive.conf.HiveConf: HiveConf of name hive.internal.ss.authz.settings.applied.marker does not exist Run completed in 8 seconds, 612 milliseconds. Total number of tests run: 5 Suites: completed 2, aborted 0 Tests: succeeded 5, failed 0, canceled 0, ignored 0, pending 0 All tests passed. 
``` ### Was this patch authored or co-authored using generative AI tooling? Closes apache#46528 from dongjoon-hyun/SPARK-48236. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 5b3b8a9) * fix --------- Co-authored-by: Cheng Pan <chengpan@apache.org> Co-authored-by: Dongjoon Hyun <dhyun@apple.com>
(Referenced commit, cherry picked from 37b3522) [SPARK-49489] HMS client respects `hive.thrift.client.max.message.size`: partly ports HIVE-26633 so the Spark HMS client honors the setting when present and positive (a positive value is capped at 2147483648 bytes, i.e. 2 GiB; 0 or -1 uses the Thrift library default). Background: THRIFT-5237 (0.14.0) cut the default max thrift message size from 2 GiB to 100 MiB, HIVE-25996 backported the Thrift upgrade to Hive 2.3.10, and SPARK-47018 picked that up, so without this change Spark's HMS client had a fixed 100 MiB limit and queries such as `SHOW PARTITIONS` on tables with many partitions could fail with `TTransportException: MaxMessageSize reached` (see discussion in apache#46468). Verified manually against a remote HMS: with `spark-sql --conf spark.hive.thrift.client.max.message.size=1kb` the exception reproduces, and it disappears after boosting the value. Closes apache#50022. Authored-by: Cheng Pan <chengpan@apache.org> Signed-off-by: yangjie01 <yangjie01@baidu.com>
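The size rule behind HIVE-26633's `hive.thrift.client.max.message.size` is: a positive configured value is used, capped at the Thrift hard limit of 2147483648 bytes (2 GiB); zero or a negative value falls back to the library default, which has been 100 MiB since THRIFT-5237. A minimal sketch of that clamping logic (illustrative only, not the actual Hive code; the class and method names here are made up):

```java
public class ThriftMaxMessageSize {
    // Thrift library default since THRIFT-5237 (was 2 GiB before)
    static final long THRIFT_DEFAULT = 100L * 1024 * 1024;
    // Documented upper limit for the Hive setting: 2147483648 bytes (2 GiB)
    static final long UPPER_LIMIT = 2147483648L;

    // Resolve the effective max message size from a configured byte count;
    // 0 or a negative value means "use the Thrift library default".
    static long effectiveMaxMessageSize(long configured) {
        if (configured <= 0) {
            return THRIFT_DEFAULT;
        }
        return Math.min(configured, UPPER_LIMIT);
    }
}
```

Under this rule, a user-supplied `1kb` yields a 1024-byte limit (hence the quick `MaxMessageSize reached` reproduction above), while anything above 2 GiB is silently clamped to the cap.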
What changes were proposed in this pull request?
This PR aims to bump Spark's built-in Hive from 2.3.9 to Hive 2.3.10, with two additional changes:
- due to API breaking changes of Thrift, `libthrift` is upgraded from `0.12` to `0.16`.
- remove version management of `commons-lang:2.6`; it comes from Hive transitive deps, and Hive 2.3.10 drops it in HIVE-7145 ([2.3] Remove dependence on apache commons-lang, hive#4892).

This is the first part of #45372
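One consequence of dropping `commons-lang:2.6` (later addressed by SPARK-48236): UDF jars built against Hive releases that predate 2.3.10 still compile against `org.apache.commons.lang.StringUtils`, so whether those classes resolve depends entirely on the runtime classpath. A minimal, hypothetical probe (not part of this PR) for checking that resolvability:

```java
// Sketch: probe whether the commons-lang 2.x classes that pre-2.3.10
// Hive UDF jars were compiled against are resolvable at runtime.
public class CommonsLangProbe {
    static boolean isPresent(String className) {
        try {
            Class.forName(className);
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // commons-lang 2.x (legacy) vs commons-lang3 live in different packages,
        // so having one on the classpath says nothing about the other.
        System.out.println("commons-lang 2.x: "
            + isPresent("org.apache.commons.lang.StringUtils"));
        System.out.println("commons-lang3:    "
            + isPresent("org.apache.commons.lang3.StringUtils"));
    }
}
```

If the legacy class is absent, loading such a UDF fails with `NoClassDefFoundError: org/apache/commons/lang/StringUtils` rather than a clean error message, which is why the dependency was restored for compatibility.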
Why are the changes needed?
Bump Hive to the latest version of 2.3, prepare for upgrading Guava, and dropping vulnerable dependencies like Jackson 1.x / Jodd
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Pass GA. (wait for @sunchao to complete the 2.3.10 release to make jars visible on Maven Central)
Was this patch authored or co-authored using generative AI tooling?
No.
Closes #45372