Initial Support for Spark 4.0 preview #11257

Closed
huaxingao wants to merge 10 commits from the spark_4.0_preview2 branch

Conversation

huaxingao (Contributor)

No description provided.

@huaxingao huaxingao marked this pull request as draft October 4, 2024 16:44
@huaxingao huaxingao force-pushed the spark_4.0_preview2 branch 4 times, most recently from bf0feef to 3d99252 Compare October 4, 2024 18:16
@huaxingao huaxingao force-pushed the spark_4.0_preview2 branch 2 times, most recently from d0dd067 to f2d9c5e Compare October 8, 2024 16:18
@huaxingao huaxingao changed the title from "Initial Support for Spark 4.0 preview2" to "Initial Support for Spark 4.0 preview" on Oct 8, 2024
@@ -95,7 +95,7 @@ jobs:
runs-on: ubuntu-22.04
strategy:
matrix:
jvm: [11, 17, 21]
jvm: [17, 21]
Contributor Author (huaxingao):

This is for build-checks. We can't run build-checks on Java 11 because Spark 4.0 is built with Java 17.

@@ -108,7 +108,7 @@ jobs:
runs-on: ubuntu-22.04
strategy:
matrix:
jvm: [11, 17, 21]
jvm: [17, 21]
Contributor Author (huaxingao):

This is for build-javadoc. We can't run build-javadoc on Java 11 because Spark 4.0 is built with Java 17.

Member:

Is there a reason to do build-javadoc on more than one JVM anyway?

@huaxingao (Contributor, Author):

CI for preview1 passed.
CI for preview2 failed.
Trying SNAPSHOT to see if some of the Spark issues in preview2 have been fixed in SNAPSHOT.

@@ -137,6 +139,7 @@ hadoop2-common = { module = "org.apache.hadoop:hadoop-common", version.ref = "ha
hadoop2-hdfs = { module = "org.apache.hadoop:hadoop-hdfs", version.ref = "hadoop2" }
hadoop2-mapreduce-client-core = { module = "org.apache.hadoop:hadoop-mapreduce-client-core", version.ref = "hadoop2" }
hadoop2-minicluster = { module = "org.apache.hadoop:hadoop-minicluster", version.ref = "hadoop2" }
hadoop34-minicluster = { module = "org.apache.hadoop:hadoop-minicluster", version.ref = "hadoop34" }
Member:

after hadoop3

Contributor Author (huaxingao):

Changed

@@ -30,7 +30,7 @@ import org.apache.iceberg.NullOrder
import org.apache.iceberg.SortDirection
import org.apache.iceberg.expressions.Term
import org.apache.iceberg.spark.Spark3Util
import org.apache.spark.sql.AnalysisException
import org.apache.spark.sql.catalyst.analysis.IcebergAnalysisException
Member:

Why are we switching to our internal Exception class here?

Contributor Author (huaxingao):

Because Spark 4.0 no longer allows the construction of a new AnalysisException with just a string message. I actually have a separate PR for this. We can probably merge this change in Spark 3.5, although it's not strictly required there.
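
For illustration, a minimal sketch of the kind of call-site change involved, assuming IcebergAnalysisException accepts a plain message string (the surrounding class, method, and message below are hypothetical, not the PR's code):

    import org.apache.spark.sql.catalyst.analysis.IcebergAnalysisException;

    class AnalysisErrors {
      // In Spark 3.5 this could be `throw new AnalysisException("...")`, but the
      // string-only constructor is no longer available in Spark 4.0, so the
      // Iceberg-owned exception type carries the message instead.
      static void failUnresolved(String name) throws IcebergAnalysisException {
        throw new IcebergAnalysisException("Cannot resolve: " + name);
      }
    }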

@@ -260,7 +267,12 @@ private void checkUpdate(RowLevelOperationMode mode, String cond) {
DistributionMode.NONE.modeName());

Dataset<Row> changeDF = spark.table(tableName).where(cond).limit(2).select("id");
changeDF.coalesce(1).writeTo(tableName(CHANGES_TABLE_NAME)).create();
Member:

Why are we rethrowing here?

Contributor Author (huaxingao):

I added the catch block to make the code compile for preview2. Preview1 doesn't need this. After switching to Preview2, I actually got a bunch of CI failures due to TableAlreadyExistsException in a few test suites. Preview1 works fine. I am still trying to figure out which change between Preview1 and Preview2 caused the behavior changes for TableAlreadyExistsException.
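
As a rough sketch of the shape of that compile fix (not the exact PR code), assuming preview2 surfaces TableAlreadyExistsException as a checked exception from create(); the helper method, wrapper exception, and message are illustrative:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.catalyst.analysis.TableAlreadyExistsException;

    // Fragment meant to live inside the existing test suite; spark, tableName,
    // and the tableName(...)/CHANGES_TABLE_NAME helpers come from that suite.
    void createChangesTable(String cond) {
      Dataset<Row> changeDF = spark.table(tableName).where(cond).limit(2).select("id");
      try {
        changeDF.coalesce(1).writeTo(tableName(CHANGES_TABLE_NAME)).create();
      } catch (TableAlreadyExistsException e) {
        throw new RuntimeException("Changes table already exists", e);
      }
    }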

def expr(node: ColumnNode): Expression = {
  node match {
    case ExpressionColumnNode(expression, _) => expression
    case node => throw SparkException.internalError("Unsupported ColumnNode: " + node)
Member:

Should we be throwing a Spark internal error here? Seems like this is our issue?

Contributor Author (huaxingao):

I will change this to an Iceberg Exception. I am not making the change in this round because I want to try Preview 1 to see if the other changes can pass the CI. I will fix this later when I try Preview 2.

@@ -108,14 +108,14 @@ public static Object[][] parameters() {
SparkCatalogConfig.SPARK.implementation(),
SparkCatalogConfig.SPARK.properties(),
PARQUET,
ImmutableMap.of(COMPRESSION_CODEC, "zstd", COMPRESSION_LEVEL, "1")
Member:

nit: We are only partially alphabetizing here

This is the kind of thing I do love, but it should be:

gzip
snappy
zstd

I would probably just skip this change for now and do it in another PR

Contributor Author (huaxingao):

The reason I wanted to move ImmutableMap.of(COMPRESSION_CODEC, "zstd", COMPRESSION_LEVEL, "1") after gzip is that the new Hadoop version uses CompressionLevel to initialize a GzipCompressor, and this COMPRESSION_LEVEL of "1" is carried over to gzip. However, "1" is not a valid compression level for gzip, so it throws an exception:

    org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 6.0 failed 1 times, most recent failure: Lost task 1.0 in stage 6.0 (TID 7) (192.168.50.141 executor driver): java.lang.IllegalArgumentException: No enum constant org.apache.hadoop.io.compress.zlib.ZlibCompressor.CompressionLevel.1
    	at java.base/java.lang.Enum.valueOf(Enum.java:273)
    	at org.apache.hadoop.conf.Configuration.getEnum(Configuration.java:1786)
    	at org.apache.hadoop.io.compress.zlib.ZlibFactory.getCompressionLevel(ZlibFactory.java:165)
    	at org.apache.hadoop.io.compress.zlib.BuiltInGzipCompressor.init(BuiltInGzipCompressor.java:157)
    	at org.apache.hadoop.io.compress.zlib.BuiltInGzipCompressor.<init>(BuiltInGzipCompressor.java:67)
    	at org.apache.hadoop.io.compress.GzipCodec.createCompressor(GzipCodec.java:64)
    	at org.apache.hadoop.io.compress.CodecPool.getCompressor(CodecPool.java:152)
    	at org.apache.hadoop.io.compress.CodecPool.getCompressor(CodecPool.java:168)
    	at org.apache.parquet.hadoop.CodecFactory$HeapBytesCompressor.<init>(CodecFactory.java:157)
    	at org.apache.parquet.hadoop.CodecFactory.createCompressor(CodecFactory.java:219)
    	at org.apache.parquet.hadoop.CodecFactory.getCompressor(CodecFactory.java:202)
    	at org.apache.iceberg.parquet.ParquetWriter.<init>(ParquetWriter.java:90)
    	at org.apache.iceberg.parquet.Parquet$WriteBuilder.build(Parquet.java:360)

I thought this over; rather than switching the order, it's better to unset COMPRESSION_CODEC and COMPRESSION_LEVEL for each test.
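
A minimal sketch of that per-test cleanup, assuming the codec and level end up as ordinary table properties and the suite's sql(...) helper is available (the hook name and the literal write.parquet.* keys below are illustrative):

    import org.junit.jupiter.api.AfterEach;

    // Hypothetical cleanup hook: clears any compression settings left on the table
    // so a level meant for zstd does not leak into a later gzip-parameterized run.
    @AfterEach
    public void unsetCompressionProperties() {
      sql(
          "ALTER TABLE %s UNSET TBLPROPERTIES IF EXISTS ('%s', '%s')",
          tableName, "write.parquet.compression-codec", "write.parquet.compression-level");
    }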

Contributor Author (huaxingao):

I should probably open a separate PR and fix this in Spark 3.5 too. WDYT?

Member:

Yep sorry, didn't mean to bikeshed on this, obviously it's not important to this PR :)

@huaxingao (Contributor, Author):

@RussellSpitzer Thanks for your review! I have addressed the comments and switched back to Preview1, along with reverting a few changes I made for Preview2/snapshot. I switched back to Preview1 to test if my changes can pass the CI, since I haven't made Preview2 work yet. Could you please take another look when you have time? Thanks!

@RussellSpitzer (Member):

I'm ready for V2 :) Let me know when you have those changes up. I'm trying to review this on a per commit basis because the diff is so large :)

@huaxingao (Contributor, Author):

@RussellSpitzer There are some conflicting files. If I rebase, it will also pick up changes for Spark 3.5, so I opened a new PR. I will ping you for review after I resolve all the test failures in the new PR. Thanks!

@huaxingao huaxingao closed this Nov 19, 2024