Spark 3.5: Structured Streaming read limit support follow-up #12260
Conversation
Use the ReadLimit passed in to SparkMicroBatchStream::latestOffset. In addition, fix a bug.
```diff
-      readLimits[1] = ReadLimit.maxRows(maxFilesPerMicroBatch);
+      readLimits[1] = ReadLimit.maxRows(maxRecordsPerMicroBatch);
```
Bug!
Thank you for catching this! It got missed because we don't take the ReadLimit we get from the latestOffset API, but rather the configs that are set earlier in the constructor!
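For context, here is a rough sketch (not the exact Iceberg source) of how getDefaultReadLimit() builds the composite limit from the two configured caps; the field names and the Integer.MAX_VALUE sentinel for "no cap" are assumptions. The fix is on the maxRows line.

```java
// Sketch only; assumes int fields maxFilesPerMicroBatch and maxRecordsPerMicroBatch,
// with Integer.MAX_VALUE meaning "no cap". ReadLimit is
// org.apache.spark.sql.connector.read.streaming.ReadLimit.
@Override
public ReadLimit getDefaultReadLimit() {
  boolean filesCapped = maxFilesPerMicroBatch != Integer.MAX_VALUE;
  boolean rowsCapped = maxRecordsPerMicroBatch != Integer.MAX_VALUE;
  if (filesCapped && rowsCapped) {
    ReadLimit[] readLimits = new ReadLimit[2];
    readLimits[0] = ReadLimit.maxFiles(maxFilesPerMicroBatch);
    // the fix: build the row limit from the row cap, not the file cap
    readLimits[1] = ReadLimit.maxRows(maxRecordsPerMicroBatch);
    return ReadLimit.compositeLimit(readLimits);
  } else if (filesCapped) {
    return ReadLimit.maxFiles(maxFilesPerMicroBatch);
  } else if (rowsCapped) {
    return ReadLimit.maxRows(maxRecordsPerMicroBatch);
  } else {
    return ReadLimit.allAvailable();
  }
}
```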
```diff
   @TestTemplate
-  public void testReadStreamOnIcebergTableWithMultipleSnapshots_WithNumberOfFiles_1()
-      throws Exception {
+  public void testReadStreamWithMaxFiles1() throws Exception {
```
I renamed a few tests to be more concise. The old names were unwieldy and did not conform to Java naming conventions.
```java
    assertThat(
            microBatchCount(
                ImmutableMap.of(
                    SparkReadOptions.STREAMING_MAX_FILES_PER_MICRO_BATCH, "1",
                    SparkReadOptions.STREAMING_MAX_ROWS_PER_MICRO_BATCH, "2")))
        .isEqualTo(6);
```
This fails without the fix to SparkMicroBatchStream::getDefaultReadLimit(), as Spark then calls SparkMicroBatchStream::latestOffset(Offset, ReadLimit) with a CompositeReadLimit where one of the ReadLimits is a ReadMaxRows(1).
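For reference, this is roughly the kind of streaming read that exercises this path (the table identifier is hypothetical): with both options set, the source reports a composite default read limit and Spark hands a CompositeReadLimit back to latestOffset(Offset, ReadLimit).

```java
// Illustrative usage; assumes an existing SparkSession named spark and an Iceberg table "db.table".
// Dataset and Row are org.apache.spark.sql classes; SparkReadOptions is
// org.apache.iceberg.spark.SparkReadOptions.
Dataset<Row> stream =
    spark
        .readStream()
        .format("iceberg")
        .option(SparkReadOptions.STREAMING_MAX_FILES_PER_MICRO_BATCH, "1")
        .option(SparkReadOptions.STREAMING_MAX_ROWS_PER_MICRO_BATCH, "2")
        .load("db.table");
```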
@singhpk234 @jackye1995 @RussellSpitzer this is a small fix; can you please review?
singhpk234 left a comment
Mostly LGTM with a minor suggestion. Thanks @wypoon!
```java
    for (int i = 0; i < limits.length; i++) {
      ReadLimit limit = limits[i];
      if (limit instanceof ReadMaxFiles) {
        return ((ReadMaxFiles) limit).maxFiles();
      }
    }
```
[minor] can we use this?
```diff
-    for (int i = 0; i < limits.length; i++) {
-      ReadLimit limit = limits[i];
-      if (limit instanceof ReadMaxFiles) {
-        return ((ReadMaxFiles) limit).maxFiles();
-      }
-    }
+    for (ReadLimit limit : limits) {
+      if (limit instanceof ReadMaxFiles) {
+        return ((ReadMaxFiles) limit).maxFiles();
+      }
+    }
```
Adopted.
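For reference, a sketch of what the helper might look like after adopting the suggestion; the CompositeReadLimit unwrapping and the Integer.MAX_VALUE fallback are assumptions, not copied from the PR.

```java
// Sketch: extract the file cap from a possibly composite ReadLimit.
// ReadLimit, CompositeReadLimit, and ReadMaxFiles are from
// org.apache.spark.sql.connector.read.streaming.
private static int getMaxFiles(ReadLimit readLimit) {
  ReadLimit[] limits =
      readLimit instanceof CompositeReadLimit
          ? ((CompositeReadLimit) readLimit).getReadLimits()
          : new ReadLimit[] {readLimit};
  for (ReadLimit limit : limits) {
    if (limit instanceof ReadMaxFiles) {
      return ((ReadMaxFiles) limit).maxFiles();
    }
  }
  return Integer.MAX_VALUE; // no file cap requested by this ReadLimit
}
```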
Thanks @singhpk234.
singhpk234 left a comment
LGTM, thanks @wypoon!
@jackye1995 @RussellSpitzer can you please review?
@szehon-ho @aokolnychyi would you mind reviewing this?
@RussellSpitzer would you mind reviewing this when you have some time? It is a small change which @singhpk234 has already reviewed and approved.
@huaxingao would you mind reviewing this, since you're a Spark expert? It's a small change.
@RussellSpitzer @huaxingao can you please review this?
sririshindra left a comment
LGTM
@wypoon Thanks for the PR; the changes look good to me. I have a question about the tests. It seems that a test like testReadStreamWithMaxRows2() would pass with both the original implementation (using readConf.maxRecordsPerMicroBatch()) and the new logic (using getMaxRows(readLimit)), since both return the same static value.
Hi @huaxingao, thank you for reviewing this!
@wypoon I looked into the Spark side and didn't see a way to dynamically change the …
Thanks @huaxingao!
Merged. Thanks @wypoon for the PR! Thanks @singhpk234 @sririshindra for reviewing!
Referenced commits:
- (#12260) Spark: Structured Streaming read limit support follow-up. Use the ReadLimit passed in to SparkMicroBatchStream::latestOffset. In addition, fix a bug. Use enhanced for loop.
- (#13099) backports apache#12260 to Spark 3.4.
This fixes the TODO in #4479.
Use the ReadLimit passed in to SparkMicroBatchStream::latestOffset(Offset, ReadLimit). In testing this, a bug was found in SparkMicroBatchStream::getDefaultReadLimit() and fixed.
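A matching sketch for the row cap referred to above as getMaxRows(readLimit) (again a sketch of the shape, not the exact code): latestOffset(Offset, ReadLimit) can then derive both caps from the limit Spark passes in rather than from the constructor-time configs.

```java
// Sketch: extract the row cap from a possibly composite ReadLimit.
// ReadLimit, CompositeReadLimit, and ReadMaxRows are from
// org.apache.spark.sql.connector.read.streaming.
private static long getMaxRows(ReadLimit readLimit) {
  ReadLimit[] limits =
      readLimit instanceof CompositeReadLimit
          ? ((CompositeReadLimit) readLimit).getReadLimits()
          : new ReadLimit[] {readLimit};
  for (ReadLimit limit : limits) {
    if (limit instanceof ReadMaxRows) {
      return ((ReadMaxRows) limit).maxRows();
    }
  }
  return Long.MAX_VALUE; // no row cap requested by this ReadLimit
}
```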