Conversation

@pan3793 (Member) commented May 8, 2024

What changes were proposed in this pull request?

This PR aims to bump Spark's built-in Hive from 2.3.9 to Hive 2.3.10, with two additional changes:

- due to API breaking changes of Thrift, `libthrift` is upgraded from `0.12` to `0.16`.
- remove version management of `commons-lang:2.6`; it comes from Hive transitive deps, and Hive 2.3.10 drops it in apache/hive#4892.

This is the first part of #45372
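For context, the kind of source-level break driving the `libthrift` bump looks like the following minimal sketch (not code from this PR; the host/port and usage are illustrative):

```scala
import org.apache.thrift.TConfiguration
import org.apache.thrift.transport.TSocket
// In libthrift 0.12 this class was org.apache.thrift.transport.TFramedTransport;
// since 0.14 it lives in the `layered` package, one of the breaking changes that
// ties the libthrift upgrade to the Hive 2.3.10 bump.
import org.apache.thrift.transport.layered.TFramedTransport

object ThriftApiBreakSketch {
  def main(args: Array[String]): Unit = {
    // 0.14+ transports carry a TConfiguration, which also owns the max message
    // size limit discussed later in this thread (SPARK-49489).
    val socket = new TSocket(new TConfiguration(), "localhost", 9083)
    val framed = new TFramedTransport(socket)
    framed.close()
  }
}
```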

Why are the changes needed?

Bump Hive to the latest 2.3 release, preparing for the Guava upgrade and for dropping vulnerable dependencies such as Jackson 1.x and Jodd.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Passes GA. (Waiting for @sunchao to complete the 2.3.10 release to make the jars visible on Maven Central.)

Was this patch authored or co-authored using generative AI tooling?

No.

Closes #45372

@pan3793 (Member, Author) commented May 8, 2024

@sunchao please ping me after you finish the jar release :)

@dongjoon-hyun (Member)

Thank you, @pan3793 and @sunchao.

@dongjoon-hyun (Member)

Just checking in. Is there any update, @pan3793 and @sunchao?

@sunchao (Member) commented May 9, 2024

@dongjoon-hyun @pan3793 I'm going to release it today.

@dongjoon-hyun (Member)

Great! Thank you so much, @sunchao.

@sunchao (Member) commented May 9, 2024

It should be released on Maven now. Can you try this again?

@dongjoon-hyun (Member)

Could you rebase onto the master branch once more, @pan3793?

 leveldbjni-all/1.8//leveldbjni-all-1.8.jar
 libfb303/0.9.3//libfb303-0.9.3.jar
-libthrift/0.12.0//libthrift-0.12.0.jar
+libthrift/0.16.0//libthrift-0.16.0.jar
@dongjoon-hyun (Member) commented May 9, 2024

It's great news. Finally. :)

Member

This is good.

 Currently, Hive SerDes and UDFs are based on built-in Hive,
 and Spark SQL can be connected to different versions of Hive Metastore
-(from 0.12.0 to 2.3.9 and 3.0.0 to 3.1.3. Also see [Interacting with Different Versions of Hive Metastore](sql-data-sources-hive-tables.html#interacting-with-different-versions-of-hive-metastore)).
+(from 2.0.0 to 2.3.10 and 3.0.0 to 3.1.3. Also see [Interacting with Different Versions of Hive Metastore](sql-data-sources-hive-tables.html#interacting-with-different-versions-of-hive-metastore)).
Member

Thank you for fixing 0.12.0 together here.

Member

typo before?

Member

Yes, this is a leftover that we missed when dropping support for Hive versions below 2.x.
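For reference, the metastore-version flexibility this doc change describes is driven by two Spark settings; a minimal sketch, with an illustrative version value:

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch: connect Spark SQL to an external Hive Metastore of a chosen
// version (3.1.3 here is illustrative). "maven" tells Spark to download the
// matching metastore client jars instead of using the built-in Hive client.
val spark = SparkSession.builder()
  .appName("external-hms-sketch")
  .config("spark.sql.hive.metastore.version", "3.1.3")
  .config("spark.sql.hive.metastore.jars", "maven")
  .enableHiveSupport()
  .getOrCreate()
```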

 Hive.getWithoutRegisterFns(hiveConf)
 } catch {
-// SPARK-37069: not all Hive versions have the above method (e.g., Hive 2.3.9 has it but
+// SPARK-37069: not all Hive versions have the above method (e.g., Hive 2.3.10 has it but
Member

Let me revert this because we don't need to change this.

 }

-// Extract major.minor for testing Spark 3.1.x and 3.0.x with metastore 2.3.9 and Java 11.
+// Extract major.minor for testing Spark 3.1.x and 3.0.x with metastore 2.3.10 and Java 11.
Member

Instead of updating this, we had better remove this comment because it is outdated in several ways:

  • We don't test Spark 3.1.x and 3.0.x anymore
  • We don't test Java 11 anymore

Member

Or, let's simply revert this change from the PR to reduce the diff size.

Member Author

removed

@dongjoon-hyun dongjoon-hyun marked this pull request as ready for review May 9, 2024 17:46
@dongjoon-hyun (Member)

cc @viirya, too

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-47018][BUILD][SQL][HIVE] Bump built-in Hive to 2.3.10 [SPARK-47018][BUILD][SQL] Bump built-in Hive to 2.3.10 May 9, 2024
@dongjoon-hyun (Member) commented May 9, 2024

The Hive UT failure is because the Google Maven cache seems to be a little slow to sync with Maven Central.

[info] - ADD JAR command 2 *** FAILED *** (154 milliseconds)
[info]   java.io.FileNotFoundException: https://maven-central.storage-download.googleapis.com/maven2/org/apache/hive/hcatalog/hive-hcatalog-core/2.3.10/hive-hcatalog-core-2.3.10.jar

I can see the file in Maven Central.

@pan3793 (Member, Author) commented May 10, 2024

Hive 2.3.10 jars should be available on the Google Maven Central mirror now; I re-triggered CI.

@dongjoon-hyun (Member)

Thank you!

@dongjoon-hyun (Member) left a comment

+1, LGTM (Pending CIs).

Thank you for re-triggering the failed YARN test and PySpark tests in order to make sure.

@dongjoon-hyun (Member)

Merged to master!

Thank you so much, @pan3793 and @sunchao.

From now on, many people will use Hive 2.3.10. I believe we can build more confidence before the Apache Spark 4.0.0 release.

@dongjoon-hyun (Member)

Also, cc @cloud-fan and @HyukjinKwon

This fixes not only the Hive dependency but also a long-standing libthrift library issue.

commons-crypto/1.1.0//commons-crypto-1.1.0.jar
commons-dbcp/1.4//commons-dbcp-1.4.jar
commons-io/2.16.1//commons-io-2.16.1.jar
commons-lang/2.6//commons-lang-2.6.jar
Member

This seems to have been too hasty, or there is a bug.

@dongjoon-hyun (Member) left a comment

Hi, @pan3793.

It seems that we hit the Maven/SBT dependency difference issue again.

Maven CIs fail after this PR due to `commons-lang:2.6`:

  java.lang.NoClassDefFoundError: org/apache/commons/lang/StringUtils
  at org.apache.hadoop.hive.contrib.udf.example.GenericUDFTrim2.performOp(GenericUDFTrim2.java:43)
  at org.apache.hadoop.hive.ql.udf.generic.GenericUDFBaseTrim.evaluate(GenericUDFBaseTrim.java:75)
  at org.apache.hadoop.hive.ql.udf.generic.GenericUDF.initializeAndFoldConstants(GenericUDF.java:170)
  at org.apache.spark.sql.hive.HiveGenericUDFEvaluator.returnInspector$lzycompute(hiveUDFEvaluators.scala:118)
  at org.apache.spark.sql.hive.HiveGenericUDFEvaluator.returnInspector(hiveUDFEvaluators.scala:117)
  at org.apache.spark.sql.hive.HiveGenericUDF.dataType$lzycompute(hiveUDFs.scala:132)
  at org.apache.spark.sql.hive.HiveGenericUDF.dataType(hiveUDFs.scala:132)
  at org.apache.spark.sql.hive.HiveUDFExpressionBuilder$.makeHiveFunctionExpression(HiveSessionStateBuilder.scala:184)
  at org.apache.spark.sql.hive.HiveUDFExpressionBuilder$.$anonfun$makeExpression$1(HiveSessionStateBuilder.scala:164)
  at org.apache.spark.util.Utils$.withContextClassLoader(Utils.scala:185)

@dongjoon-hyun (Member)

I verified locally that the HiveUDFDynamicLoadSuite failure is consistent:

$ build/mvn -Dtest=none -DwildcardSuites=org.apache.spark.sql.hive.HiveUDFDynamicLoadSuite test
- Spark should be able to run Hive UDF using jar regardless of current thread context classloader (UDF
13:33:42.810 WARN org.apache.spark.SparkContext: The JAR file:///Users/dongjoon/APACHE/spark-merge/sql/hive/src/test/noclasspath/hive-test-udfs.jar at spark://localhost:55031/jars/hive-test-udfs.jar has been added already. Overwriting of added jar is not supported in the current version.

*** RUN ABORTED ***
A needed class was not found. This could be due to an error in your runpath. Missing class: org/apache/commons/lang/StringUtils
  java.lang.NoClassDefFoundError: org/apache/commons/lang/StringUtils
  at org.apache.hadoop.hive.contrib.udf.example.GenericUDFTrim2.performOp(GenericUDFTrim2.java:43)
  at org.apache.hadoop.hive.ql.udf.generic.GenericUDFBaseTrim.evaluate(GenericUDFBaseTrim.java:75)
  at org.apache.hadoop.hive.ql.udf.generic.GenericUDF.initializeAndFoldConstants(GenericUDF.java:170)
  at org.apache.spark.sql.hive.HiveGenericUDFEvaluator.returnInspector$lzycompute(hiveUDFEvaluators.scala:118)
  at org.apache.spark.sql.hive.HiveGenericUDFEvaluator.returnInspector(hiveUDFEvaluators.scala:117)
  at org.apache.spark.sql.hive.HiveGenericUDF.dataType$lzycompute(hiveUDFs.scala:132)
  at org.apache.spark.sql.hive.HiveGenericUDF.dataType(hiveUDFs.scala:132)
  at org.apache.spark.sql.hive.HiveUDFExpressionBuilder$.makeHiveFunctionExpression(HiveSessionStateBuilder.scala:184)
  at org.apache.spark.sql.hive.HiveUDFExpressionBuilder$.$anonfun$makeExpression$1(HiveSessionStateBuilder.scala:164)
  at org.apache.spark.util.Utils$.withContextClassLoader(Utils.scala:185)
  ...
  Cause: java.lang.ClassNotFoundException: org.apache.commons.lang.StringUtils
  at java.base/java.net.URLClassLoader.findClass(URLClassLoader.java:445)
  at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:592)
  at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:525)
  at org.apache.hadoop.hive.contrib.udf.example.GenericUDFTrim2.performOp(GenericUDFTrim2.java:43)
  at org.apache.hadoop.hive.ql.udf.generic.GenericUDFBaseTrim.evaluate(GenericUDFBaseTrim.java:75)
  at org.apache.hadoop.hive.ql.udf.generic.GenericUDF.initializeAndFoldConstants(GenericUDF.java:170)
  at org.apache.spark.sql.hive.HiveGenericUDFEvaluator.returnInspector$lzycompute(hiveUDFEvaluators.scala:118)
  at org.apache.spark.sql.hive.HiveGenericUDFEvaluator.returnInspector(hiveUDFEvaluators.scala:117)
  at org.apache.spark.sql.hive.HiveGenericUDF.dataType$lzycompute(hiveUDFs.scala:132)
  at org.apache.spark.sql.hive.HiveGenericUDF.dataType(hiveUDFs.scala:132)
  ...

@dongjoon-hyun (Member) commented May 10, 2024

It turns out that Apache Spark is unable to support all legacy Hive UDF jar files after this PR. Let me make a follow-up because it's a breaking change we cannot endure. BTW, I'm not sure how SBT works, but Maven dependency management is always more strict and correct.

dongjoon-hyun added a commit that referenced this pull request May 10, 2024
[SPARK-48236][BUILD] Add `commons-lang:commons-lang:2.6` back to support legacy Hive UDF jars

### What changes were proposed in this pull request?

This PR aims to add `commons-lang:commons-lang:2.6` back to support legacy Hive UDF jars. This is a partial revert of SPARK-47018.

### Why are the changes needed?

Recently, we dropped `commons-lang:commons-lang` during the Hive upgrade.
- #46468

However, only Apache Hive 2.3.10 and 4.0.0 dropped it. In other words, Hive 2.0.0 ~ 2.3.9 and Hive 3.0.0 ~ 3.1.3 require it. As a result, all existing UDF jars built against those versions still require `commons-lang:commons-lang`.

- apache/hive#4892

For example, Apache Hive 3.1.3 code:
- https://github.com/apache/hive/blob/af7059e2bdc8b18af42e0b7f7163b923a0bfd424/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFTrim.java#L21
```
import org.apache.commons.lang.StringUtils;
```

- https://github.com/apache/hive/blob/af7059e2bdc8b18af42e0b7f7163b923a0bfd424/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFTrim.java#L42
```
return StringUtils.strip(val, " ");
```

As a result, Maven CIs are broken.
- https://github.com/apache/spark/actions/runs/9032639456/job/24825599546 (Maven / Java 17)
- https://github.com/apache/spark/actions/runs/9033374547/job/24835284769 (Maven / Java 21)

The root cause is that the existing test UDF jar `hive-test-udfs.jar` was built from old Hive (before 2.3.10) libraries, which require `commons-lang:commons-lang:2.6`.
```
HiveUDFDynamicLoadSuite:
- Spark should be able to run Hive UDF using jar regardless of current thread context classloader (UDF
20:21:25.129 WARN org.apache.spark.SparkContext: The JAR file:///home/runner/work/spark/spark/sql/hive/src/test/noclasspath/hive-test-udfs.jar at spark://localhost:33327/jars/hive-test-udfs.jar has been added already. Overwriting of added jar is not supported in the current version.

*** RUN ABORTED ***
A needed class was not found. This could be due to an error in your runpath. Missing class: org/apache/commons/lang/StringUtils
  java.lang.NoClassDefFoundError: org/apache/commons/lang/StringUtils
  at org.apache.hadoop.hive.contrib.udf.example.GenericUDFTrim2.performOp(GenericUDFTrim2.java:43)
  at org.apache.hadoop.hive.ql.udf.generic.GenericUDFBaseTrim.evaluate(GenericUDFBaseTrim.java:75)
  at org.apache.hadoop.hive.ql.udf.generic.GenericUDF.initializeAndFoldConstants(GenericUDF.java:170)
  at org.apache.spark.sql.hive.HiveGenericUDFEvaluator.returnInspector$lzycompute(hiveUDFEvaluators.scala:118)
  at org.apache.spark.sql.hive.HiveGenericUDFEvaluator.returnInspector(hiveUDFEvaluators.scala:117)
  at org.apache.spark.sql.hive.HiveGenericUDF.dataType$lzycompute(hiveUDFs.scala:132)
  at org.apache.spark.sql.hive.HiveGenericUDF.dataType(hiveUDFs.scala:132)
  at org.apache.spark.sql.hive.HiveUDFExpressionBuilder$.makeHiveFunctionExpression(HiveSessionStateBuilder.scala:184)
  at org.apache.spark.sql.hive.HiveUDFExpressionBuilder$.$anonfun$makeExpression$1(HiveSessionStateBuilder.scala:164)
  at org.apache.spark.util.Utils$.withContextClassLoader(Utils.scala:185)
  ...
  Cause: java.lang.ClassNotFoundException: org.apache.commons.lang.StringUtils
  at java.base/java.net.URLClassLoader.findClass(URLClassLoader.java:445)
  at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:593)
  at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:526)
  at org.apache.hadoop.hive.contrib.udf.example.GenericUDFTrim2.performOp(GenericUDFTrim2.java:43)
  at org.apache.hadoop.hive.ql.udf.generic.GenericUDFBaseTrim.evaluate(GenericUDFBaseTrim.java:75)
  at org.apache.hadoop.hive.ql.udf.generic.GenericUDF.initializeAndFoldConstants(GenericUDF.java:170)
  at org.apache.spark.sql.hive.HiveGenericUDFEvaluator.returnInspector$lzycompute(hiveUDFEvaluators.scala:118)
  at org.apache.spark.sql.hive.HiveGenericUDFEvaluator.returnInspector(hiveUDFEvaluators.scala:117)
  at org.apache.spark.sql.hive.HiveGenericUDF.dataType$lzycompute(hiveUDFs.scala:132)
  at org.apache.spark.sql.hive.HiveGenericUDF.dataType(hiveUDFs.scala:132)
  ...
```
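In build-tool terms, the fix amounts to pinning the old coordinates again; a one-line sbt-style sketch (illustrative only; the actual change edits Spark's `pom.xml` and dependency manifests):

```scala
// build.sbt sketch (illustrative): keep commons-lang 2.6 on the runtime classpath
// so legacy Hive UDF jars that reference org.apache.commons.lang.StringUtils resolve.
libraryDependencies += "commons-lang" % "commons-lang" % "2.6"
```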

### Does this PR introduce _any_ user-facing change?

To support the existing customer UDF jars.

### How was this patch tested?

Manually.

```
$ build/mvn -Dtest=none -DwildcardSuites=org.apache.spark.sql.hive.HiveUDFDynamicLoadSuite test
...
HiveUDFDynamicLoadSuite:
14:21:56.034 WARN org.apache.hadoop.hive.metastore.ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 2.3.0

14:21:56.035 WARN org.apache.hadoop.hive.metastore.ObjectStore: setMetaStoreSchemaVersion called but recording version is disabled: version = 2.3.0, comment = Set by MetaStore dongjoon127.0.0.1

14:21:56.041 WARN org.apache.hadoop.hive.metastore.ObjectStore: Failed to get database default, returning NoSuchObjectException

- Spark should be able to run Hive UDF using jar regardless of current thread context classloader (UDF
14:21:57.576 WARN org.apache.spark.SparkContext: The JAR file:///Users/dongjoon/APACHE/spark-merge/sql/hive/src/test/noclasspath/hive-test-udfs.jar at spark://localhost:55526/jars/hive-test-udfs.jar has been added already. Overwriting of added jar is not supported in the current version.

- Spark should be able to run Hive UDF using jar regardless of current thread context classloader (GENERIC_UDF
14:21:58.314 WARN org.apache.spark.SparkContext: The JAR file:///Users/dongjoon/APACHE/spark-merge/sql/hive/src/test/noclasspath/hive-test-udfs.jar at spark://localhost:55526/jars/hive-test-udfs.jar has been added already. Overwriting of added jar is not supported in the current version.

- Spark should be able to run Hive UDF using jar regardless of current thread context classloader (GENERIC_UDAF
14:21:58.943 WARN org.apache.spark.SparkContext: The JAR file:///Users/dongjoon/APACHE/spark-merge/sql/hive/src/test/noclasspath/hive-test-udfs.jar at spark://localhost:55526/jars/hive-test-udfs.jar has been added already. Overwriting of added jar is not supported in the current version.

- Spark should be able to run Hive UDF using jar regardless of current thread context classloader (UDAF
14:21:59.333 WARN org.apache.hadoop.hive.ql.session.SessionState: METASTORE_FILTER_HOOK will be ignored, since hive.security.authorization.manager is set to instance of HiveAuthorizerFactory.

14:21:59.364 WARN org.apache.hadoop.hive.conf.HiveConf: HiveConf of name hive.internal.ss.authz.settings.applied.marker does not exist

14:21:59.370 WARN org.apache.hadoop.hive.metastore.HiveMetaStore: Location: file:/Users/dongjoon/APACHE/spark-merge/sql/hive/target/tmp/warehouse-49291492-9d48-4360-a354-ace73a2c76ce/src specified for non-external table:src

14:21:59.718 WARN org.apache.hadoop.hive.metastore.ObjectStore: Failed to get database global_temp, returning NoSuchObjectException

14:21:59.770 WARN org.apache.spark.SparkContext: The JAR file:///Users/dongjoon/APACHE/spark-merge/sql/hive/src/test/noclasspath/hive-test-udfs.jar at spark://localhost:55526/jars/hive-test-udfs.jar has been added already. Overwriting of added jar is not supported in the current version.

- Spark should be able to run Hive UDF using jar regardless of current thread context classloader (GENERIC_UDTF
14:22:00.403 WARN org.apache.hadoop.hive.common.FileUtils: File file:/Users/dongjoon/APACHE/spark-merge/sql/hive/target/tmp/warehouse-49291492-9d48-4360-a354-ace73a2c76ce/src does not exist; Force to delete it.

14:22:00.404 ERROR org.apache.hadoop.hive.common.FileUtils: Failed to delete file:/Users/dongjoon/APACHE/spark-merge/sql/hive/target/tmp/warehouse-49291492-9d48-4360-a354-ace73a2c76ce/src

14:22:00.441 WARN org.apache.hadoop.hive.conf.HiveConf: HiveConf of name hive.internal.ss.authz.settings.applied.marker does not exist

14:22:00.453 WARN org.apache.hadoop.hive.ql.session.SessionState: METASTORE_FILTER_HOOK will be ignored, since hive.security.authorization.manager is set to instance of HiveAuthorizerFactory.

14:22:00.537 WARN org.apache.hadoop.hive.conf.HiveConf: HiveConf of name hive.internal.ss.authz.settings.applied.marker does not exist

Run completed in 8 seconds, 612 milliseconds.
Total number of tests run: 5
Suites: completed 2, aborted 0
Tests: succeeded 5, failed 0, canceled 0, ignored 0, pending 0
All tests passed.
```

### Was this patch authored or co-authored using generative AI tooling?

Closes #46528 from dongjoon-hyun/SPARK-48236.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
@shrprasa (Contributor)

@pan3793 The libthrift upgrade to 0.16 is causing query failures on tables with a large number of partitions. Can you please check the issue reported in SPARK-49489?

@pan3793 (Member, Author) commented Feb 20, 2025

@shrprasa HIVE-26633 was missed in 2.3.10. Unfortunately, Hive 2.3 is EOL, so I don't think there's a chance we'll get an official fixed version.

Some approaches that I think could address the issue:

  1. as @wangyum suggested, maintain an internal patched version of Hive 2.3 with HIVE-26633
  2. SPARK-49827 ([SPARK-49827][SQL] Fetching all partitions from hive metastore in batches #48337) to avoid producing large thrift frames; a rough sketch follows below (I would appreciate it if you, @shrprasa, could test whether it helps)
  3. investigate whether we can use reflection or some bytecode-level technique on the Spark side to change the frame size threshold
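A rough sketch of approach 2, assuming Hive's `getPartitionNames`/`getPartitionsByNames` client APIs; the batch size is an illustrative assumption:

```scala
import scala.jdk.CollectionConverters._
import org.apache.hadoop.hive.ql.metadata.{Hive, Partition, Table}

// Sketch of batched partition fetching (in the spirit of SPARK-49827): list the
// cheap partition names first, then resolve the heavyweight Partition objects in
// fixed-size batches so no single thrift response approaches the 100 MiB default.
def getPartitionsInBatches(hive: Hive, table: Table, batchSize: Int = 1000): Seq[Partition] = {
  val names = hive
    .getPartitionNames(table.getDbName, table.getTableName, (-1).toShort)
    .asScala
  names.grouped(batchSize).flatMap { batch =>
    hive.getPartitionsByNames(table, batch.asJava).asScala
  }.toSeq
}
```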

LuciferYang pushed a commit that referenced this pull request Mar 3, 2025
…essage.size`

### What changes were proposed in this pull request?

Partly port HIVE-26633 for Spark HMS client - respect `hive.thrift.client.max.message.size` if present and the value is positive.

> Thrift client configuration for max message size. 0 or -1 will use the default defined in the Thrift library. The upper limit is 2147483648 bytes (or 2gb).

Note: it's a Hive configuration; I follow the convention of not documenting it on the Spark side.

### Why are the changes needed?

1. THRIFT-5237 (0.14.0) changes the max thrift message size from 2GiB to 100MiB
2. HIVE-25098 (4.0.0) upgrades Thrift from 0.13.0 to 0.14.1
3. HIVE-25996 (2.3.10) backports HIVE-25098 to branch-2.3
4. HIVE-26633 (4.0.0) introduces `hive.thrift.client.max.message.size`
5. SPARK-47018 (4.0.0) upgrades Hive from 2.3.9 to 2.3.10

Thus, Spark's HMS client does not respect `hive.thrift.client.max.message.size` and has a fixed max thrift message size of 100 MiB, so users may hit the "MaxMessageSize reached" exception when accessing Hive tables with a large number of partitions.
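A self-contained sketch of the idea (not the actual patch; `parseBytes` and the wiring are illustrative assumptions):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.thrift.TConfiguration
import org.apache.thrift.transport.TSocket

object MetastoreSocketSketch {
  // Hypothetical size parser for values like "1kb" or "512mb"; a real patch
  // would reuse existing Hive/Hadoop utilities instead.
  private def parseBytes(s: String): Long = {
    val t = s.trim.toLowerCase
    if (t.endsWith("kb")) t.dropRight(2).trim.toLong * 1024L
    else if (t.endsWith("mb")) t.dropRight(2).trim.toLong * 1024L * 1024L
    else if (t.endsWith("gb")) t.dropRight(2).trim.toLong * 1024L * 1024L * 1024L
    else t.toLong
  }

  // Build a metastore socket whose TConfiguration honors a positive
  // hive.thrift.client.max.message.size; 0, -1, or unset falls back to the
  // libthrift default (100 MiB since THRIFT-5237).
  def openMetastoreSocket(conf: Configuration, host: String, port: Int): TSocket = {
    val configured = Option(conf.get("hive.thrift.client.max.message.size"))
      .map(parseBytes)
      .filter(_ > 0)
      .map(b => math.min(b, Int.MaxValue.toLong).toInt) // upper limit is 2 GiB
    val tconf = configured match {
      case Some(maxMessageSize) =>
        new TConfiguration(
          maxMessageSize,
          TConfiguration.DEFAULT_MAX_FRAME_SIZE,
          TConfiguration.DEFAULT_RECURSION_DEPTH)
      case None => new TConfiguration() // library defaults
    }
    new TSocket(tconf, host, port)
  }
}
```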

See discussion in #46468 (comment)

### Does this PR introduce _any_ user-facing change?

No, it tackles the regression introduced by an unreleased change, namely SPARK-47018. The added code only takes effect when the user configures `hive.thrift.client.max.message.size` explicitly.

### How was this patch tested?

This must be tested manually, as the current Spark UT does not cover the remote HMS cases.

I constructed a test case in a testing Hadoop cluster with a remote HMS.

Firstly, create a table with a large number of partitions.
```
$ spark-sql --num-executors=6 --executor-cores=4 --executor-memory=1g \
    --conf spark.hive.exec.dynamic.partition.mode=nonstrict \
    --conf spark.hive.exec.max.dynamic.partitions=1000000
spark-sql (default)> CREATE TABLE p PARTITIONED BY (year, month, day) STORED AS PARQUET AS
SELECT /*+ REPARTITION(200) */ * FROM (
  (SELECT CAST(id AS STRING) AS year FROM range(2000, 2100)) JOIN
  (SELECT CAST(id AS STRING) AS month FROM range(1, 13)) JOIN
  (SELECT CAST(id AS STRING) AS day FROM range(1, 31)) JOIN
  (SELECT 'this is some data' AS data)
);
```

Then tune `hive.thrift.client.max.message.size` and run a query that triggers a `getPartitions` thrift call. For example, when it is set to `1kb`, the query throws `TTransportException: MaxMessageSize reached`, and the exception disappears after boosting the value.
```
$ spark-sql --conf spark.hive.thrift.client.max.message.size=1kb
spark-sql (default)> SHOW PARTITIONS p;
...
2025-02-20 15:18:49 WARN RetryingMetaStoreClient: MetaStoreClient lost connection. Attempting to reconnect (1 of 1) after 1s. listPartitionNames
org.apache.thrift.transport.TTransportException: MaxMessageSize reached
	at org.apache.thrift.transport.TEndpointTransport.checkReadBytesAvailable(TEndpointTransport.java:81) ~[libthrift-0.16.0.jar:0.16.0]
	at org.apache.thrift.protocol.TProtocol.checkReadBytesAvailable(TProtocol.java:67) ~[libthrift-0.16.0.jar:0.16.0]
	at org.apache.thrift.protocol.TBinaryProtocol.readListBegin(TBinaryProtocol.java:297) ~[libthrift-0.16.0.jar:0.16.0]
	at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$get_partition_names_result$get_partition_names_resultStandardScheme.read(ThriftHiveMetastore.java) ~[hive-metastore-2.3.10.jar:2.3.10]
	at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$get_partition_names_result$get_partition_names_resultStandardScheme.read(ThriftHiveMetastore.java) ~[hive-metastore-2.3.10.jar:2.3.10]
	at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$get_partition_names_result.read(ThriftHiveMetastore.java) ~[hive-metastore-2.3.10.jar:2.3.10]
	at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:88) ~[libthrift-0.16.0.jar:0.16.0]
	at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_get_partition_names(ThriftHiveMetastore.java:2458) ~[hive-metastore-2.3.10.jar:2.3.10]
	at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.get_partition_names(ThriftHiveMetastore.java:2443) ~[hive-metastore-2.3.10.jar:2.3.10]
	at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.listPartitionNames(HiveMetaStoreClient.java:1487) ~[hive-metastore-2.3.10.jar:2.3.10]
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:?]
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77) ~[?:?]
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:?]
	at java.base/java.lang.reflect.Method.invoke(Method.java:569) ~[?:?]
	at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:173) ~[hive-metastore-2.3.10.jar:2.3.10]
	at jdk.proxy2/jdk.proxy2.$Proxy54.listPartitionNames(Unknown Source) ~[?:?]
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:?]
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77) ~[?:?]
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:?]
	at java.base/java.lang.reflect.Method.invoke(Method.java:569) ~[?:?]
	at org.apache.hadoop.hive.metastore.HiveMetaStoreClient$SynchronizedHandler.invoke(HiveMetaStoreClient.java:2349) ~[hive-metastore-2.3.10.jar:2.3.10]
	at jdk.proxy2/jdk.proxy2.$Proxy54.listPartitionNames(Unknown Source) ~[?:?]
	at org.apache.hadoop.hive.ql.metadata.Hive.getPartitionNames(Hive.java:2461) ~[hive-exec-2.3.10-core.jar:2.3.10]
	at org.apache.spark.sql.hive.client.Shim_v2_0.getPartitionNames(HiveShim.scala:976) ~[spark-hive_2.13-4.1.0-SNAPSHOT.jar:4.1.0-SNAPSHOT]
...
```

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #50022 from pan3793/SPARK-49489.

Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: yangjie01 <yangjie01@baidu.com>
LuciferYang pushed a commit that referenced this pull request Mar 3, 2025
…essage.size`

(cherry picked from commit 2ea5621)
Signed-off-by: yangjie01 <yangjie01@baidu.com>
senthh pushed a commit to acceldata-io/spark3 that referenced this pull request May 23, 2025

Closes apache#46468 from pan3793/SPARK-47018.

Lead-authored-by: Cheng Pan <chengpan@apache.org>
Co-authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>

(cherry picked from commit 2d609bf)
shubhluck pushed a commit to acceldata-io/spark3 that referenced this pull request Sep 3, 2025

Closes apache#46468 from pan3793/SPARK-47018.

Lead-authored-by: Cheng Pan <chengpan@apache.org>
Co-authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>

(cherry picked from commit 2d609bf)
turboFei pushed a commit to turboFei/spark that referenced this pull request Nov 6, 2025
…pache#334)

* [SPARK-47018][BUILD][SQL] Bump built-in Hive to 2.3.10


(cherry picked from commit 2d609bf)

* [SPARK-48236][BUILD] Add `commons-lang:commons-lang:2.6` back to support legacy Hive UDF jars


(cherry picked from commit 5b3b8a9)

* fix

---------

Co-authored-by: Cheng Pan <chengpan@apache.org>
Co-authored-by: Dongjoon Hyun <dhyun@apple.com>
zifeif2 pushed a commit to zifeif2/spark that referenced this pull request Nov 14, 2025
…essage.size`

(cherry picked from commit 37b3522)
Signed-off-by: yangjie01 <yangjie01@baidu.com>
LaurentGoderre pushed a commit to LaurentGoderre/spark that referenced this pull request Dec 11, 2025
…essage.size`

(cherry picked from commit 2ea5621)
Signed-off-by: yangjie01 <yangjie01@baidu.com>
