Conversation

@wypoon (Contributor) commented Feb 13, 2025

This fixes the TODO in #4479.
Use the ReadLimit passed in to SparkMicroBatchStream::latestOffset(Offset, ReadLimit). In testing this, a bug was found in SparkMicroBatchStream::getDefaultReadLimit() and fixed.

Use the ReadLimit passed in to SparkMicroBatchStream::latestOffset.
In addition, fix a bug.
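
For orientation, here is a sketch of how getDefaultReadLimit() composes the two caps, reconstructed from the diff and review comments below. The field names maxFilesPerMicroBatch and maxRecordsPerMicroBatch come from the diff; the surrounding structure is an assumption based on Spark's ReadLimit factory methods, not verbatim source.

  // Sketch of SparkMicroBatchStream#getDefaultReadLimit(); reconstructed, not verbatim.
  @Override
  public ReadLimit getDefaultReadLimit() {
    if (maxFilesPerMicroBatch != Integer.MAX_VALUE
        && maxRecordsPerMicroBatch != Integer.MAX_VALUE) {
      ReadLimit[] readLimits = new ReadLimit[2];
      readLimits[0] = ReadLimit.maxFiles(maxFilesPerMicroBatch);
      // The fixed line: the row cap must be built from maxRecordsPerMicroBatch,
      // not maxFilesPerMicroBatch (the bug this PR corrects).
      readLimits[1] = ReadLimit.maxRows(maxRecordsPerMicroBatch);
      return ReadLimit.compositeLimit(readLimits);
    } else if (maxFilesPerMicroBatch != Integer.MAX_VALUE) {
      return ReadLimit.maxFiles(maxFilesPerMicroBatch);
    } else if (maxRecordsPerMicroBatch != Integer.MAX_VALUE) {
      return ReadLimit.maxRows(maxRecordsPerMicroBatch);
    } else {
      return ReadLimit.allAvailable();
    }
  }
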
@github-actions bot added the spark label Feb 13, 2025
Comment on lines -461 to +505
-      readLimits[1] = ReadLimit.maxRows(maxFilesPerMicroBatch);
+      readLimits[1] = ReadLimit.maxRows(maxRecordsPerMicroBatch);
Contributor Author

Bug!

Contributor

Thank you for catching this! It got missed because we don't take the ReadLimit we get from the latestOffset API, but rather use the configs that are set earlier in the constructor!
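
To make the fix concrete, here is a minimal sketch of the intended shape of latestOffset(Offset, ReadLimit). The helper getMaxRows(readLimit) is quoted later in this thread; getMaxFiles is my assumed counterpart, and the scan-planning body is elided.

  @Override
  public Offset latestOffset(Offset startOffset, ReadLimit limit) {
    // Derive the admission caps from the ReadLimit Spark passed in,
    // instead of the values captured from SparkReadConf in the constructor.
    int maxFiles = getMaxFiles(limit);
    long maxRows = getMaxRows(limit);
    // ... plan the micro-batch, admitting files until either cap is reached,
    // then return the resulting end offset (planning logic unchanged) ...
    return null; // placeholder for the computed end offset
  }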

  @TestTemplate
- public void testReadStreamOnIcebergTableWithMultipleSnapshots_WithNumberOfFiles_1()
-     throws Exception {
+ public void testReadStreamWithMaxFiles1() throws Exception {
Contributor Author

I renamed a few tests to be more concise. The old names were unwieldy and did not conform to Java naming conventions.

Comment on lines +227 to +232
assertThat(
microBatchCount(
ImmutableMap.of(
SparkReadOptions.STREAMING_MAX_FILES_PER_MICRO_BATCH, "1",
SparkReadOptions.STREAMING_MAX_ROWS_PER_MICRO_BATCH, "2")))
.isEqualTo(6);
Contributor Author

This fails without the fix to SparkMicroBatchStream::getDefaultReadLimit(), as Spark then calls SparkMicroBatchStream::latestOffset(Offset, ReadLimit) with a CompositeReadLimit where one of the ReadLimits is a ReadMaxRows(1).
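
To spell out the arithmetic: with max files set to 1 and max rows set to 2, the pre-fix code built the row cap from the file cap. A small illustration using Spark's ReadLimit factories, with constants chosen to match the test options above:

  import org.apache.spark.sql.connector.read.streaming.ReadLimit;

  // What the buggy getDefaultReadLimit() produced: the row cap picked up the
  // file cap's value, so Spark admitted at most one row per micro-batch.
  ReadLimit buggy =
      ReadLimit.compositeLimit(
          new ReadLimit[] {ReadLimit.maxFiles(1), ReadLimit.maxRows(1)});

  // What it should, and after the fix does, produce.
  ReadLimit fixed =
      ReadLimit.compositeLimit(
          new ReadLimit[] {ReadLimit.maxFiles(1), ReadLimit.maxRows(2)});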

@wypoon (Contributor, Author) commented Feb 14, 2025

@singhpk234 @jackye1995 @RussellSpitzer this is a small fix; can you please review?

@singhpk234 (Contributor) left a comment

Mostly LGTM with a minor suggestion. Thanks @wypoon!

Comment on lines 325 to 330
for (int i = 0; i < limits.length; i++) {
  ReadLimit limit = limits[i];
  if (limit instanceof ReadMaxFiles) {
    return ((ReadMaxFiles) limit).maxFiles();
  }
}
Contributor

[minor] Can we use this?

Suggested change
-    for (int i = 0; i < limits.length; i++) {
-      ReadLimit limit = limits[i];
-      if (limit instanceof ReadMaxFiles) {
-        return ((ReadMaxFiles) limit).maxFiles();
-      }
-    }
+    for (ReadLimit limit : limits) {
+      if (limit instanceof ReadMaxFiles) {
+        return ((ReadMaxFiles) limit).maxFiles();
+      }
+    }

Contributor Author

Adopted.
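
For context, the quoted loop lives inside a helper that unpacks a possibly composite limit. A sketch of that shape follows; CompositeReadLimit.getReadLimits() is Spark's API, while the helper name getMaxFiles is inferred from this thread:

  import org.apache.spark.sql.connector.read.streaming.CompositeReadLimit;
  import org.apache.spark.sql.connector.read.streaming.ReadLimit;
  import org.apache.spark.sql.connector.read.streaming.ReadMaxFiles;

  // Returns the file cap carried by the limit, or Integer.MAX_VALUE if none applies.
  private static int getMaxFiles(ReadLimit readLimit) {
    if (readLimit instanceof ReadMaxFiles) {
      return ((ReadMaxFiles) readLimit).maxFiles();
    }

    if (readLimit instanceof CompositeReadLimit) {
      for (ReadLimit limit : ((CompositeReadLimit) readLimit).getReadLimits()) {
        if (limit instanceof ReadMaxFiles) {
          return ((ReadMaxFiles) limit).maxFiles();
        }
      }
    }

    return Integer.MAX_VALUE; // no file-based cap in effect
  }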

@wypoon (Contributor, Author) commented Feb 14, 2025

Thanks @singhpk234.

@singhpk234 (Contributor) left a comment

LGTM. Thanks @wypoon!

@wypoon (Contributor, Author) commented Feb 18, 2025

@jackye1995 @RussellSpitzer can you please review?

@wypoon (Contributor, Author) commented Mar 6, 2025

@szehon-ho @aokolnychyi would you mind reviewing this?

@wypoon (Contributor, Author) commented Mar 12, 2025

@RussellSpitzer would you mind reviewing this when you have some time? It is a small change which @singhpk234 has already reviewed and approved.

@wypoon (Contributor, Author) commented Mar 25, 2025

@huaxingao would you mind reviewing this, since you're a Spark expert? It's a small change.

@wypoon (Contributor, Author) commented Apr 21, 2025

@RussellSpitzer @huaxingao can you please review this?

@sririshindra (Contributor) left a comment

LGTM

@huaxingao (Contributor)

@wypoon Thanks for the PR — the changes look good to me. I have a question about the tests. It seems that a test like testReadStreamWithMaxRows2() would pass with both the original implementation (using readConf.maxRecordsPerMicroBatch()) and the new logic (using getMaxRows(readLimit)), since both return the same static value.
Shall we add a test that would fail under the old implementation but pass with the new one?

@wypoon (Contributor, Author) commented May 17, 2025

Hi @huaxingao, thank you for reviewing this!
You are correct that the tests would pass with the original implementation, except for testReadStreamWithCompositeReadLimit, which would fail due to the bug in SparkMicroBatchStream::getDefaultReadLimit().
As I understand it, SparkMicroBatchStream implements getDefaultReadLimit() (built from the configuration options), and Spark calls latestOffset(Offset, ReadLimit) with that ReadLimit. In principle, Spark can call latestOffset(Offset, ReadLimit) with any ReadLimit, and SparkMicroBatchStream should respond according to whatever it receives. As long as Spark passes the ReadLimit given by getDefaultReadLimit(), there is no behavioral difference between the original implementation (which ignores the ReadLimit passed in and just uses the one derived from the configuration options) and the one in this PR. Technically, though, we should not rely on that assumption, and should use the ReadLimit passed in.
Do you know if and how Spark would pass in a different ReadLimit than what is returned by getDefaultReadLimit()?
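
For reference, the contract under discussion, paraphrased from Spark's SupportsAdmissionControl interface (see org.apache.spark.sql.connector.read.streaming in Spark for the authoritative definition; the comments are mine):

  public interface SupportsAdmissionControl extends SparkDataStream {
    // The limit Spark falls back to; Iceberg builds this from the
    // streaming max-files/max-rows read options.
    default ReadLimit getDefaultReadLimit() {
      return ReadLimit.allAvailable();
    }

    // Spark passes a ReadLimit on every micro-batch; a source should honor
    // whatever it receives rather than assume it equals getDefaultReadLimit().
    Offset latestOffset(Offset startOffset, ReadLimit limit);
  }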

@huaxingao (Contributor)

@wypoon I looked into the Spark side and didn’t see a way to dynamically change the ReadLimit. So it seems there’s no easy way to write a regression test that fails without the fix and passes with it. I’ll go ahead and approve the PR and leave it open for a couple of days in case anyone else wants to review it.

@wypoon (Contributor, Author) commented May 19, 2025

Thanks @huaxingao!
This PR predates the Spark 4.0 support, so rather than update this PR, I opened #13095 to port the change to Spark 4.0.

@wypoon wypoon changed the title from "Spark: Structured Streaming read limit support follow-up" to "Spark 3.5: Structured Streaming read limit support follow-up" May 19, 2025
@huaxingao merged commit 8a38f5a into apache:main May 20, 2025
31 checks passed
@huaxingao (Contributor)

Merged. Thanks @wypoon for the PR! Thanks @singhpk234 @sririshindra for reviewing!

pvary pushed a commit that referenced this pull request May 21, 2025
devendra-nr pushed a commit to devendra-nr/iceberg that referenced this pull request Dec 8, 2025
…12260)

* Spark: Structured Streaming read limit support follow-up

Use the ReadLimit passed in to SparkMicroBatchStream::latestOffset.
In addition, fix a bug.

* Use enhanced for loop.
devendra-nr pushed a commit to devendra-nr/iceberg that referenced this pull request Dec 8, 2025