Conversation

@pan3793 (Member) commented May 8, 2024

What changes were proposed in this pull request?

This PR aims to bump Spark's built-in Hive from 2.3.9 to Hive 2.3.10, with two additional changes:

- due to API breaking changes of Thrift, `libthrift` is upgraded from `0.12` to `0.16`.
- remove version management of `commons-lang:2.6`; it comes from Hive transitive deps, and Hive 2.3.10 drops it in apache/hive#4892.

This is the first part of #45372
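For context, the kind of source-level break driving the `libthrift` bump looks like the following minimal sketch (not code from this PR; the host/port and usage are illustrative):

```scala
import org.apache.thrift.TConfiguration
import org.apache.thrift.transport.TSocket
// In libthrift 0.12 this class was org.apache.thrift.transport.TFramedTransport;
// since 0.14 it lives in the `layered` package, one of the breaking changes that
// ties the libthrift upgrade to the Hive 2.3.10 bump.
import org.apache.thrift.transport.layered.TFramedTransport

object ThriftApiBreakSketch {
  def main(args: Array[String]): Unit = {
    // 0.14+ transports carry a TConfiguration, which also owns the max message
    // size limit discussed later in this thread (SPARK-49489).
    val socket = new TSocket(new TConfiguration(), "localhost", 9083)
    val framed = new TFramedTransport(socket)
    framed.close()
  }
}
```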

Why are the changes needed?

Bump Hive to the latest 2.3 release, preparing for the Guava upgrade and for dropping vulnerable dependencies such as Jackson 1.x and Jodd.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Passes GA. (Waiting for @sunchao to complete the 2.3.10 release to make the jars visible on Maven Central.)

Was this patch authored or co-authored using generative AI tooling?

No.

Closes #45372

@pan3793 (Member, Author) commented May 8, 2024

@sunchao please ping me after you finish the jar release :)

@dongjoon-hyun (Member)

Thank you, @pan3793 and @sunchao.

@dongjoon-hyun (Member)

Just checking in. Is there any update, @pan3793 and @sunchao?

@sunchao (Member) commented May 9, 2024

@dongjoon-hyun @pan3793 I'm going to release it today.

@dongjoon-hyun (Member)

Great! Thank you so much, @sunchao.

@sunchao (Member) commented May 9, 2024

It should be released on Maven now. Can you try this again?

@dongjoon-hyun (Member)

Could you rebase onto the master branch once more, @pan3793?

 leveldbjni-all/1.8//leveldbjni-all-1.8.jar
 libfb303/0.9.3//libfb303-0.9.3.jar
-libthrift/0.12.0//libthrift-0.12.0.jar
+libthrift/0.16.0//libthrift-0.16.0.jar
@dongjoon-hyun (Member) commented May 9, 2024

It's great news. Finally. :)

Member

This is good.

 Currently, Hive SerDes and UDFs are based on built-in Hive,
 and Spark SQL can be connected to different versions of Hive Metastore
-(from 0.12.0 to 2.3.9 and 3.0.0 to 3.1.3. Also see [Interacting with Different Versions of Hive Metastore](sql-data-sources-hive-tables.html#interacting-with-different-versions-of-hive-metastore)).
+(from 2.0.0 to 2.3.10 and 3.0.0 to 3.1.3. Also see [Interacting with Different Versions of Hive Metastore](sql-data-sources-hive-tables.html#interacting-with-different-versions-of-hive-metastore)).
Member

Thank you for fixing 0.12.0 together here.

Member

typo before?

Member

Yes, this is a leftover that we missed when dropping support for Hive versions below 2.x.
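For reference, the metastore-version flexibility this doc change describes is driven by two Spark settings; a minimal sketch, with an illustrative version value:

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch: connect Spark SQL to an external Hive Metastore of a chosen
// version (3.1.3 here is illustrative). "maven" tells Spark to download the
// matching metastore client jars instead of using the built-in Hive client.
val spark = SparkSession.builder()
  .appName("external-hms-sketch")
  .config("spark.sql.hive.metastore.version", "3.1.3")
  .config("spark.sql.hive.metastore.jars", "maven")
  .enableHiveSupport()
  .getOrCreate()
```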

 Hive.getWithoutRegisterFns(hiveConf)
 } catch {
-// SPARK-37069: not all Hive versions have the above method (e.g., Hive 2.3.9 has it but
+// SPARK-37069: not all Hive versions have the above method (e.g., Hive 2.3.10 has it but
Member

Let me revert this because we don't need to change this.

 }

-// Extract major.minor for testing Spark 3.1.x and 3.0.x with metastore 2.3.9 and Java 11.
+// Extract major.minor for testing Spark 3.1.x and 3.0.x with metastore 2.3.10 and Java 11.
Member

Instead of updating this, we had better remove this comment because it is outdated in several ways:

  • We don't test Spark 3.1.x and 3.0.x anymore
  • We don't test Java 11 anymore

Member

Or, let's simply revert this change from the PR to reduce the diff size.

Member Author

removed

@dongjoon-hyun dongjoon-hyun marked this pull request as ready for review May 9, 2024 17:46
@dongjoon-hyun (Member)

cc @viirya, too

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-47018][BUILD][SQL][HIVE] Bump built-in Hive to 2.3.10 [SPARK-47018][BUILD][SQL] Bump built-in Hive to 2.3.10 May 9, 2024
@dongjoon-hyun (Member) commented May 9, 2024

The Hive UT failure is because the Google Maven cache seems to be a little slow to sync with Maven Central.

[info] - ADD JAR command 2 *** FAILED *** (154 milliseconds)
[info]   java.io.FileNotFoundException: https://maven-central.storage-download.googleapis.com/maven2/org/apache/hive/hcatalog/hive-hcatalog-core/2.3.10/hive-hcatalog-core-2.3.10.jar

I can see the file in Maven Central.

@pan3793 (Member, Author) commented May 10, 2024

Hive 2.3.10 jars should be available on the Google Maven Central mirror now; I re-triggered CI.

@dongjoon-hyun (Member)

Thank you!

@dongjoon-hyun (Member) left a comment

+1, LGTM (Pending CIs).

Thank you for re-triggering the failed YARN test and PySpark tests in order to make sure.

@dongjoon-hyun (Member)

Merged to master!

Thank you so much, @pan3793 and @sunchao.

From now on, many people will use Hive 2.3.10. I believe we can build more confidence before the Apache Spark 4.0.0 release.

@dongjoon-hyun (Member)

Also, cc @cloud-fan and @HyukjinKwon

This fixes not only the Hive dependency but also a long-standing libthrift library issue.

commons-crypto/1.1.0//commons-crypto-1.1.0.jar
commons-dbcp/1.4//commons-dbcp-1.4.jar
commons-io/2.16.1//commons-io-2.16.1.jar
commons-lang/2.6//commons-lang-2.6.jar
Member

This seems to have been too hasty, or there is a bug.

@dongjoon-hyun (Member) left a comment

Hi, @pan3793.

It seems that we hit the Maven/SBT dependency difference issue again.

Maven CIs fail after this PR due to `commons-lang:2.6`:

  java.lang.NoClassDefFoundError: org/apache/commons/lang/StringUtils
  at org.apache.hadoop.hive.contrib.udf.example.GenericUDFTrim2.performOp(GenericUDFTrim2.java:43)
  at org.apache.hadoop.hive.ql.udf.generic.GenericUDFBaseTrim.evaluate(GenericUDFBaseTrim.java:75)
  at org.apache.hadoop.hive.ql.udf.generic.GenericUDF.initializeAndFoldConstants(GenericUDF.java:170)
  at org.apache.spark.sql.hive.HiveGenericUDFEvaluator.returnInspector$lzycompute(hiveUDFEvaluators.scala:118)
  at org.apache.spark.sql.hive.HiveGenericUDFEvaluator.returnInspector(hiveUDFEvaluators.scala:117)
  at org.apache.spark.sql.hive.HiveGenericUDF.dataType$lzycompute(hiveUDFs.scala:132)
  at org.apache.spark.sql.hive.HiveGenericUDF.dataType(hiveUDFs.scala:132)
  at org.apache.spark.sql.hive.HiveUDFExpressionBuilder$.makeHiveFunctionExpression(HiveSessionStateBuilder.scala:184)
  at org.apache.spark.sql.hive.HiveUDFExpressionBuilder$.$anonfun$makeExpression$1(HiveSessionStateBuilder.scala:164)
  at org.apache.spark.util.Utils$.withContextClassLoader(Utils.scala:185)

@dongjoon-hyun (Member)

I verified locally that the HiveUDFDynamicLoadSuite failure is consistent:

$ build/mvn -Dtest=none -DwildcardSuites=org.apache.spark.sql.hive.HiveUDFDynamicLoadSuite test
- Spark should be able to run Hive UDF using jar regardless of current thread context classloader (UDF
13:33:42.810 WARN org.apache.spark.SparkContext: The JAR file:///Users/dongjoon/APACHE/spark-merge/sql/hive/src/test/noclasspath/hive-test-udfs.jar at spark://localhost:55031/jars/hive-test-udfs.jar has been added already. Overwriting of added jar is not supported in the current version.

*** RUN ABORTED ***
A needed class was not found. This could be due to an error in your runpath. Missing class: org/apache/commons/lang/StringUtils
  java.lang.NoClassDefFoundError: org/apache/commons/lang/StringUtils
  at org.apache.hadoop.hive.contrib.udf.example.GenericUDFTrim2.performOp(GenericUDFTrim2.java:43)
  at org.apache.hadoop.hive.ql.udf.generic.GenericUDFBaseTrim.evaluate(GenericUDFBaseTrim.java:75)
  at org.apache.hadoop.hive.ql.udf.generic.GenericUDF.initializeAndFoldConstants(GenericUDF.java:170)
  at org.apache.spark.sql.hive.HiveGenericUDFEvaluator.returnInspector$lzycompute(hiveUDFEvaluators.scala:118)
  at org.apache.spark.sql.hive.HiveGenericUDFEvaluator.returnInspector(hiveUDFEvaluators.scala:117)
  at org.apache.spark.sql.hive.HiveGenericUDF.dataType$lzycompute(hiveUDFs.scala:132)
  at org.apache.spark.sql.hive.HiveGenericUDF.dataType(hiveUDFs.scala:132)
  at org.apache.spark.sql.hive.HiveUDFExpressionBuilder$.makeHiveFunctionExpression(HiveSessionStateBuilder.scala:184)
  at org.apache.spark.sql.hive.HiveUDFExpressionBuilder$.$anonfun$makeExpression$1(HiveSessionStateBuilder.scala:164)
  at org.apache.spark.util.Utils$.withContextClassLoader(Utils.scala:185)
  ...
  Cause: java.lang.ClassNotFoundException: org.apache.commons.lang.StringUtils
  at java.base/java.net.URLClassLoader.findClass(URLClassLoader.java:445)
  at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:592)
  at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:525)
  at org.apache.hadoop.hive.contrib.udf.example.GenericUDFTrim2.performOp(GenericUDFTrim2.java:43)
  at org.apache.hadoop.hive.ql.udf.generic.GenericUDFBaseTrim.evaluate(GenericUDFBaseTrim.java:75)
  at org.apache.hadoop.hive.ql.udf.generic.GenericUDF.initializeAndFoldConstants(GenericUDF.java:170)
  at org.apache.spark.sql.hive.HiveGenericUDFEvaluator.returnInspector$lzycompute(hiveUDFEvaluators.scala:118)
  at org.apache.spark.sql.hive.HiveGenericUDFEvaluator.returnInspector(hiveUDFEvaluators.scala:117)
  at org.apache.spark.sql.hive.HiveGenericUDF.dataType$lzycompute(hiveUDFs.scala:132)
  at org.apache.spark.sql.hive.HiveGenericUDF.dataType(hiveUDFs.scala:132)
  ...

@dongjoon-hyun (Member) commented May 10, 2024

It turns out that Apache Spark is unable to support all legacy Hive UDF jar files after this PR. Let me make a follow-up because it's a breaking change we cannot endure. BTW, I'm not sure how SBT works, but Maven dependency management is always more strict and correct.

dongjoon-hyun added a commit that referenced this pull request May 10, 2024
[SPARK-48236][BUILD] Add `commons-lang:commons-lang:2.6` back to support legacy Hive UDF jars

### What changes were proposed in this pull request?

This PR aims to add `commons-lang:commons-lang:2.6` back to support legacy Hive UDF jars. This is a partial revert of SPARK-47018.

### Why are the changes needed?

Recently, we dropped `commons-lang:commons-lang` during the Hive upgrade.
- #46468

However, only Apache Hive 2.3.10 and 4.0.0 dropped it. In other words, Hive 2.0.0 ~ 2.3.9 and Hive 3.0.0 ~ 3.1.3 require it. As a result, all existing UDF jars built against those versions still require `commons-lang:commons-lang`.

- apache/hive#4892

For example, Apache Hive 3.1.3 code:
- https://github.com/apache/hive/blob/af7059e2bdc8b18af42e0b7f7163b923a0bfd424/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFTrim.java#L21
```
import org.apache.commons.lang.StringUtils;
```

- https://github.com/apache/hive/blob/af7059e2bdc8b18af42e0b7f7163b923a0bfd424/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFTrim.java#L42
```
return StringUtils.strip(val, " ");
```

As a result, Maven CIs are broken.
- https://github.com/apache/spark/actions/runs/9032639456/job/24825599546 (Maven / Java 17)
- https://github.com/apache/spark/actions/runs/9033374547/job/24835284769 (Maven / Java 21)

The root cause is that the existing test UDF jar `hive-test-udfs.jar` was built from old Hive (before 2.3.10) libraries, which require `commons-lang:commons-lang:2.6`.
```
HiveUDFDynamicLoadSuite:
- Spark should be able to run Hive UDF using jar regardless of current thread context classloader (UDF
20:21:25.129 WARN org.apache.spark.SparkContext: The JAR file:///home/runner/work/spark/spark/sql/hive/src/test/noclasspath/hive-test-udfs.jar at spark://localhost:33327/jars/hive-test-udfs.jar has been added already. Overwriting of added jar is not supported in the current version.

*** RUN ABORTED ***
A needed class was not found. This could be due to an error in your runpath. Missing class: org/apache/commons/lang/StringUtils
  java.lang.NoClassDefFoundError: org/apache/commons/lang/StringUtils
  at org.apache.hadoop.hive.contrib.udf.example.GenericUDFTrim2.performOp(GenericUDFTrim2.java:43)
  at org.apache.hadoop.hive.ql.udf.generic.GenericUDFBaseTrim.evaluate(GenericUDFBaseTrim.java:75)
  at org.apache.hadoop.hive.ql.udf.generic.GenericUDF.initializeAndFoldConstants(GenericUDF.java:170)
  at org.apache.spark.sql.hive.HiveGenericUDFEvaluator.returnInspector$lzycompute(hiveUDFEvaluators.scala:118)
  at org.apache.spark.sql.hive.HiveGenericUDFEvaluator.returnInspector(hiveUDFEvaluators.scala:117)
  at org.apache.spark.sql.hive.HiveGenericUDF.dataType$lzycompute(hiveUDFs.scala:132)
  at org.apache.spark.sql.hive.HiveGenericUDF.dataType(hiveUDFs.scala:132)
  at org.apache.spark.sql.hive.HiveUDFExpressionBuilder$.makeHiveFunctionExpression(HiveSessionStateBuilder.scala:184)
  at org.apache.spark.sql.hive.HiveUDFExpressionBuilder$.$anonfun$makeExpression$1(HiveSessionStateBuilder.scala:164)
  at org.apache.spark.util.Utils$.withContextClassLoader(Utils.scala:185)
  ...
  Cause: java.lang.ClassNotFoundException: org.apache.commons.lang.StringUtils
  at java.base/java.net.URLClassLoader.findClass(URLClassLoader.java:445)
  at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:593)
  at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:526)
  at org.apache.hadoop.hive.contrib.udf.example.GenericUDFTrim2.performOp(GenericUDFTrim2.java:43)
  at org.apache.hadoop.hive.ql.udf.generic.GenericUDFBaseTrim.evaluate(GenericUDFBaseTrim.java:75)
  at org.apache.hadoop.hive.ql.udf.generic.GenericUDF.initializeAndFoldConstants(GenericUDF.java:170)
  at org.apache.spark.sql.hive.HiveGenericUDFEvaluator.returnInspector$lzycompute(hiveUDFEvaluators.scala:118)
  at org.apache.spark.sql.hive.HiveGenericUDFEvaluator.returnInspector(hiveUDFEvaluators.scala:117)
  at org.apache.spark.sql.hive.HiveGenericUDF.dataType$lzycompute(hiveUDFs.scala:132)
  at org.apache.spark.sql.hive.HiveGenericUDF.dataType(hiveUDFs.scala:132)
  ...
```
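In build-tool terms, the fix amounts to pinning the old coordinates again; a one-line sbt-style sketch (illustrative only; the actual change edits Spark's `pom.xml` and dependency manifests):

```scala
// build.sbt sketch (illustrative): keep commons-lang 2.6 on the runtime classpath
// so legacy Hive UDF jars that reference org.apache.commons.lang.StringUtils resolve.
libraryDependencies += "commons-lang" % "commons-lang" % "2.6"
```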

### Does this PR introduce _any_ user-facing change?

To support the existing customer UDF jars.

### How was this patch tested?

Manually.

```
$ build/mvn -Dtest=none -DwildcardSuites=org.apache.spark.sql.hive.HiveUDFDynamicLoadSuite test
...
HiveUDFDynamicLoadSuite:
14:21:56.034 WARN org.apache.hadoop.hive.metastore.ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 2.3.0

14:21:56.035 WARN org.apache.hadoop.hive.metastore.ObjectStore: setMetaStoreSchemaVersion called but recording version is disabled: version = 2.3.0, comment = Set by MetaStore dongjoon127.0.0.1

14:21:56.041 WARN org.apache.hadoop.hive.metastore.ObjectStore: Failed to get database default, returning NoSuchObjectException

- Spark should be able to run Hive UDF using jar regardless of current thread context classloader (UDF
14:21:57.576 WARN org.apache.spark.SparkContext: The JAR file:///Users/dongjoon/APACHE/spark-merge/sql/hive/src/test/noclasspath/hive-test-udfs.jar at spark://localhost:55526/jars/hive-test-udfs.jar has been added already. Overwriting of added jar is not supported in the current version.

- Spark should be able to run Hive UDF using jar regardless of current thread context classloader (GENERIC_UDF
14:21:58.314 WARN org.apache.spark.SparkContext: The JAR file:///Users/dongjoon/APACHE/spark-merge/sql/hive/src/test/noclasspath/hive-test-udfs.jar at spark://localhost:55526/jars/hive-test-udfs.jar has been added already. Overwriting of added jar is not supported in the current version.

- Spark should be able to run Hive UDF using jar regardless of current thread context classloader (GENERIC_UDAF
14:21:58.943 WARN org.apache.spark.SparkContext: The JAR file:///Users/dongjoon/APACHE/spark-merge/sql/hive/src/test/noclasspath/hive-test-udfs.jar at spark://localhost:55526/jars/hive-test-udfs.jar has been added already. Overwriting of added jar is not supported in the current version.

- Spark should be able to run Hive UDF using jar regardless of current thread context classloader (UDAF
14:21:59.333 WARN org.apache.hadoop.hive.ql.session.SessionState: METASTORE_FILTER_HOOK will be ignored, since hive.security.authorization.manager is set to instance of HiveAuthorizerFactory.

14:21:59.364 WARN org.apache.hadoop.hive.conf.HiveConf: HiveConf of name hive.internal.ss.authz.settings.applied.marker does not exist

14:21:59.370 WARN org.apache.hadoop.hive.metastore.HiveMetaStore: Location: file:/Users/dongjoon/APACHE/spark-merge/sql/hive/target/tmp/warehouse-49291492-9d48-4360-a354-ace73a2c76ce/src specified for non-external table:src

14:21:59.718 WARN org.apache.hadoop.hive.metastore.ObjectStore: Failed to get database global_temp, returning NoSuchObjectException

14:21:59.770 WARN org.apache.spark.SparkContext: The JAR file:///Users/dongjoon/APACHE/spark-merge/sql/hive/src/test/noclasspath/hive-test-udfs.jar at spark://localhost:55526/jars/hive-test-udfs.jar has been added already. Overwriting of added jar is not supported in the current version.

- Spark should be able to run Hive UDF using jar regardless of current thread context classloader (GENERIC_UDTF
14:22:00.403 WARN org.apache.hadoop.hive.common.FileUtils: File file:/Users/dongjoon/APACHE/spark-merge/sql/hive/target/tmp/warehouse-49291492-9d48-4360-a354-ace73a2c76ce/src does not exist; Force to delete it.

14:22:00.404 ERROR org.apache.hadoop.hive.common.FileUtils: Failed to delete file:/Users/dongjoon/APACHE/spark-merge/sql/hive/target/tmp/warehouse-49291492-9d48-4360-a354-ace73a2c76ce/src

14:22:00.441 WARN org.apache.hadoop.hive.conf.HiveConf: HiveConf of name hive.internal.ss.authz.settings.applied.marker does not exist

14:22:00.453 WARN org.apache.hadoop.hive.ql.session.SessionState: METASTORE_FILTER_HOOK will be ignored, since hive.security.authorization.manager is set to instance of HiveAuthorizerFactory.

14:22:00.537 WARN org.apache.hadoop.hive.conf.HiveConf: HiveConf of name hive.internal.ss.authz.settings.applied.marker does not exist

Run completed in 8 seconds, 612 milliseconds.
Total number of tests run: 5
Suites: completed 2, aborted 0
Tests: succeeded 5, failed 0, canceled 0, ignored 0, pending 0
All tests passed.
```

### Was this patch authored or co-authored using generative AI tooling?

Closes #46528 from dongjoon-hyun/SPARK-48236.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
@shrprasa (Contributor)

@pan3793 The libthrift upgrade to 0.16 is causing query failures on tables with a large number of partitions. Can you please check the issue reported in SPARK-49489?

@pan3793 (Member, Author) commented Feb 20, 2025

@shrprasa HIVE-26633 was missed in 2.3.10. Unfortunately, Hive 2.3 is EOL, so I don't think there's a chance we'll get an official fixed version.

Some approaches that I think could address the issue:

  1. as @wangyum suggested, maintain an internal patched version of Hive 2.3 with HIVE-26633
  2. SPARK-49827 ([SPARK-49827][SQL] Fetching all partitions from hive metastore in batches #48337) to avoid producing large thrift frames; a rough sketch follows below (I would appreciate it if you, @shrprasa, could test whether it helps)
  3. investigate whether we can use reflection or some bytecode-level technique on the Spark side to change the frame size threshold
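A rough sketch of approach 2, assuming Hive's `getPartitionNames`/`getPartitionsByNames` client APIs; the batch size is an illustrative assumption:

```scala
import scala.jdk.CollectionConverters._
import org.apache.hadoop.hive.ql.metadata.{Hive, Partition, Table}

// Sketch of batched partition fetching (in the spirit of SPARK-49827): list the
// cheap partition names first, then resolve the heavyweight Partition objects in
// fixed-size batches so no single thrift response approaches the 100 MiB default.
def getPartitionsInBatches(hive: Hive, table: Table, batchSize: Int = 1000): Seq[Partition] = {
  val names = hive
    .getPartitionNames(table.getDbName, table.getTableName, (-1).toShort)
    .asScala
  names.grouped(batchSize).flatMap { batch =>
    hive.getPartitionsByNames(table, batch.asJava).asScala
  }.toSeq
}
```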

LuciferYang pushed a commit that referenced this pull request Mar 3, 2025
…essage.size`

### What changes were proposed in this pull request?

Partly port HIVE-26633 for Spark HMS client - respect `hive.thrift.client.max.message.size` if present and the value is positive.

> Thrift client configuration for max message size. 0 or -1 will use the default defined in the Thrift library. The upper limit is 2147483648 bytes (or 2gb).

Note: it's a Hive configuration; I follow the convention of not documenting it on the Spark side.

### Why are the changes needed?

1. THRIFT-5237 (0.14.0) changes the max thrift message size from 2GiB to 100MiB
2. HIVE-25098 (4.0.0) upgrades Thrift from 0.13.0 to 0.14.1
3. HIVE-25996 (2.3.10) backports HIVE-25098 to branch-2.3
4. HIVE-26633 (4.0.0) introduces `hive.thrift.client.max.message.size`
5. SPARK-47018 (4.0.0) upgrades Hive from 2.3.9 to 2.3.10

Thus, Spark's HMS client does not respect `hive.thrift.client.max.message.size` and has a fixed max thrift message size of 100 MiB, so users may hit the "MaxMessageSize reached" exception when accessing Hive tables with a large number of partitions.
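A self-contained sketch of the idea (not the actual patch; `parseBytes` and the wiring are illustrative assumptions):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.thrift.TConfiguration
import org.apache.thrift.transport.TSocket

object MetastoreSocketSketch {
  // Hypothetical size parser for values like "1kb" or "512mb"; a real patch
  // would reuse existing Hive/Hadoop utilities instead.
  private def parseBytes(s: String): Long = {
    val t = s.trim.toLowerCase
    if (t.endsWith("kb")) t.dropRight(2).trim.toLong * 1024L
    else if (t.endsWith("mb")) t.dropRight(2).trim.toLong * 1024L * 1024L
    else if (t.endsWith("gb")) t.dropRight(2).trim.toLong * 1024L * 1024L * 1024L
    else t.toLong
  }

  // Build a metastore socket whose TConfiguration honors a positive
  // hive.thrift.client.max.message.size; 0, -1, or unset falls back to the
  // libthrift default (100 MiB since THRIFT-5237).
  def openMetastoreSocket(conf: Configuration, host: String, port: Int): TSocket = {
    val configured = Option(conf.get("hive.thrift.client.max.message.size"))
      .map(parseBytes)
      .filter(_ > 0)
      .map(b => math.min(b, Int.MaxValue.toLong).toInt) // upper limit is 2 GiB
    val tconf = configured match {
      case Some(maxMessageSize) =>
        new TConfiguration(
          maxMessageSize,
          TConfiguration.DEFAULT_MAX_FRAME_SIZE,
          TConfiguration.DEFAULT_RECURSION_DEPTH)
      case None => new TConfiguration() // library defaults
    }
    new TSocket(tconf, host, port)
  }
}
```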

See discussion in #46468 (comment)

### Does this PR introduce _any_ user-facing change?

No, it tackles the regression introduced by an unreleased change, namely SPARK-47018. The added code only takes effect when the user configures `hive.thrift.client.max.message.size` explicitly.

### How was this patch tested?

This must be tested manually, as the current Spark UT does not cover the remote HMS cases.

I constructed a test case in a testing Hadoop cluster with a remote HMS.

Firstly, create a table with a large number of partitions.
```
$ spark-sql --num-executors=6 --executor-cores=4 --executor-memory=1g \
    --conf spark.hive.exec.dynamic.partition.mode=nonstrict \
    --conf spark.hive.exec.max.dynamic.partitions=1000000
spark-sql (default)> CREATE TABLE p PARTITIONED BY (year, month, day) STORED AS PARQUET AS
SELECT /*+ REPARTITION(200) */ * FROM (
  (SELECT CAST(id AS STRING) AS year FROM range(2000, 2100)) JOIN
  (SELECT CAST(id AS STRING) AS month FROM range(1, 13)) JOIN
  (SELECT CAST(id AS STRING) AS day FROM range(1, 31)) JOIN
  (SELECT 'this is some data' AS data)
);
```

Then tune `hive.thrift.client.max.message.size` and run a query that triggers a `getPartitions` thrift call. For example, when it is set to `1kb`, the query throws `TTransportException: MaxMessageSize reached`, and the exception disappears after boosting the value.
```
$ spark-sql --conf spark.hive.thrift.client.max.message.size=1kb
spark-sql (default)> SHOW PARTITIONS p;
...
2025-02-20 15:18:49 WARN RetryingMetaStoreClient: MetaStoreClient lost connection. Attempting to reconnect (1 of 1) after 1s. listPartitionNames
org.apache.thrift.transport.TTransportException: MaxMessageSize reached
	at org.apache.thrift.transport.TEndpointTransport.checkReadBytesAvailable(TEndpointTransport.java:81) ~[libthrift-0.16.0.jar:0.16.0]
	at org.apache.thrift.protocol.TProtocol.checkReadBytesAvailable(TProtocol.java:67) ~[libthrift-0.16.0.jar:0.16.0]
	at org.apache.thrift.protocol.TBinaryProtocol.readListBegin(TBinaryProtocol.java:297) ~[libthrift-0.16.0.jar:0.16.0]
	at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$get_partition_names_result$get_partition_names_resultStandardScheme.read(ThriftHiveMetastore.java) ~[hive-metastore-2.3.10.jar:2.3.10]
	at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$get_partition_names_result$get_partition_names_resultStandardScheme.read(ThriftHiveMetastore.java) ~[hive-metastore-2.3.10.jar:2.3.10]
	at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$get_partition_names_result.read(ThriftHiveMetastore.java) ~[hive-metastore-2.3.10.jar:2.3.10]
	at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:88) ~[libthrift-0.16.0.jar:0.16.0]
	at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_get_partition_names(ThriftHiveMetastore.java:2458) ~[hive-metastore-2.3.10.jar:2.3.10]
	at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.get_partition_names(ThriftHiveMetastore.java:2443) ~[hive-metastore-2.3.10.jar:2.3.10]
	at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.listPartitionNames(HiveMetaStoreClient.java:1487) ~[hive-metastore-2.3.10.jar:2.3.10]
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:?]
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77) ~[?:?]
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:?]
	at java.base/java.lang.reflect.Method.invoke(Method.java:569) ~[?:?]
	at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:173) ~[hive-metastore-2.3.10.jar:2.3.10]
	at jdk.proxy2/jdk.proxy2.$Proxy54.listPartitionNames(Unknown Source) ~[?:?]
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:?]
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77) ~[?:?]
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:?]
	at java.base/java.lang.reflect.Method.invoke(Method.java:569) ~[?:?]
	at org.apache.hadoop.hive.metastore.HiveMetaStoreClient$SynchronizedHandler.invoke(HiveMetaStoreClient.java:2349) ~[hive-metastore-2.3.10.jar:2.3.10]
	at jdk.proxy2/jdk.proxy2.$Proxy54.listPartitionNames(Unknown Source) ~[?:?]
	at org.apache.hadoop.hive.ql.metadata.Hive.getPartitionNames(Hive.java:2461) ~[hive-exec-2.3.10-core.jar:2.3.10]
	at org.apache.spark.sql.hive.client.Shim_v2_0.getPartitionNames(HiveShim.scala:976) ~[spark-hive_2.13-4.1.0-SNAPSHOT.jar:4.1.0-SNAPSHOT]
...
```

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #50022 from pan3793/SPARK-49489.

Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: yangjie01 <yangjie01@baidu.com>
LuciferYang pushed a commit that referenced this pull request Mar 3, 2025
…essage.size`

(cherry picked from commit 2ea5621)
Signed-off-by: yangjie01 <yangjie01@baidu.com>
senthh pushed a commit to acceldata-io/spark3 that referenced this pull request May 23, 2025

Closes apache#46468 from pan3793/SPARK-47018.

Lead-authored-by: Cheng Pan <chengpan@apache.org>
Co-authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>

(cherry picked from commit 2d609bf)
shubhluck pushed a commit to acceldata-io/spark3 that referenced this pull request Sep 3, 2025

Closes apache#46468 from pan3793/SPARK-47018.

Lead-authored-by: Cheng Pan <chengpan@apache.org>
Co-authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>

(cherry picked from commit 2d609bf)
turboFei pushed a commit to turboFei/spark that referenced this pull request Nov 6, 2025
…pache#334)

* [SPARK-47018][BUILD][SQL] Bump built-in Hive to 2.3.10


(cherry picked from commit 2d609bf)

* [SPARK-48236][BUILD] Add `commons-lang:commons-lang:2.6` back to support legacy Hive UDF jars


(cherry picked from commit 5b3b8a9)

* fix

---------

Co-authored-by: Cheng Pan <chengpan@apache.org>
Co-authored-by: Dongjoon Hyun <dhyun@apple.com>
zifeif2 pushed a commit to zifeif2/spark that referenced this pull request Nov 14, 2025
…essage.size`

(cherry picked from commit 37b3522)
Signed-off-by: yangjie01 <yangjie01@baidu.com>
LaurentGoderre pushed a commit to LaurentGoderre/spark that referenced this pull request Dec 11, 2025
…essage.size`

(cherry picked from commit 2ea5621)
Signed-off-by: yangjie01 <yangjie01@baidu.com>
