[SPARK-25237][SQL] Remove updateBytesReadWithFileSize in FileScanRDD #22324
Conversation
@srowen I reworked this because the original author is inactive; can you check? (BTW, it's OK if the credit for this commit goes to the original author.)
class FileSourceSuite extends SharedSQLContext {
  test("SPARK-25237 compute correct input metrics in FileScanRDD") {
Shall we move this suite into FileBasedDataSourceSuite?
ok
We can credit multiple people now though :-)
Oh, I see.
Test build #95645 has finished for PR 22324 at commit
Test build #95650 has finished for PR 22324 at commit
retest this please
Test build #95655 has finished for PR 22324 at commit
ping @srowen @HyukjinKwon
    try {
      spark.read.csv(path).limit(1).collect()
      sparkContext.listenerBus.waitUntilEmpty(1000L)
      assert(bytesReads.sum === 7860)
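For context, a minimal sketch of the listener setup this fragment presumably sits inside, assuming a SharedSQLContext-style suite where `spark` and `sparkContext` are in scope; only the `try` block above comes from the diff, the buffer and listener wiring here are assumptions:

```scala
import scala.collection.mutable.ArrayBuffer
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

// Assumed setup around the fragment above: record how many bytes each task
// reports as read, so the test can sum them once the query finishes.
val bytesReads = new ArrayBuffer[Long]()
val bytesReadListener = new SparkListener() {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    bytesReads += taskEnd.taskMetrics.inputMetrics.bytesRead
  }
}
sparkContext.addSparkListener(bytesReadListener)
// ... the try block from the diff runs here, followed by
// sparkContext.removeSparkListener(bytesReadListener) in a finally clause.
```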
So the sum should be 10*2 + 90*3 + 900*4 = 3890. That's also the size of the CSV file that's written when I try it locally. When I run this code without the change here, I get 7820 + 7820 = 15640. So this is better! But I wonder why it ends up thinking it reads about twice the bytes?
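A quick sanity check of that arithmetic, assuming the test data is the integers 0 until 1000 written one value per line (how the data is generated, e.g. via `spark.range(1000).repartition(1).write.csv(path)`, is an assumption here):

```scala
// Expected on-disk size if 0..999 are written one per line as digits + '\n'.
val expectedBytes =
  10  * 2 +  // 0..9     -> 1 digit  + newline = 2 bytes each
  90  * 3 +  // 10..99   -> 2 digits + newline = 3 bytes each
  900 * 4    // 100..999 -> 3 digits + newline = 4 bytes each
assert(expectedBytes == 3890)
```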
In this test, Spark runs with local[2] and each scan thread points to the same CSV file. Since each thread gets the file size through the Hadoop APIs, the total bytesRead becomes 2 * the file size, IIUC.
7860 / 2 = 3930, which is 40 bytes more than expected, but I'm willing to believe there's a good reason for that somewhere in how it gets read. Clearly it's much better than the answer of 15640, so I'm willing to believe this is fixing something.
Yeah, actually the file size is 3890, but the Hadoop API (FileSystem.getAllStatistics) reports that number (3930). I haven't looked into the Hadoop code yet, so I don't know why. I'll dig into it later.
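For readers unfamiliar with that API, a hedged sketch of how per-thread bytes read can be pulled from Hadoop's global FileSystem statistics; the helper name is made up, and this only mirrors conceptually what Spark's input-metrics callback does:

```scala
import org.apache.hadoop.fs.FileSystem
import scala.collection.JavaConverters._

// Hypothetical helper: sum the bytes read by the current thread across all
// registered FileSystem instances, via the global (deprecated) statistics.
def threadBytesRead(): Long =
  FileSystem.getAllStatistics.asScala
    .map(_.getThreadStatistics.getBytesRead)
    .sum
```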
> In this test, Spark runs with local[2] and each scan thread points to the same CSV file. Since each thread gets the file size through the Hadoop APIs, the total bytesRead becomes 2 * the file size, IIUC.
I am afraid that's not the case: CSV infers the schema first, which loads the first row in the path, and only then does the actual read. That's why the input bytes read is doubled. It may be more reasonable to just write and read a text file (see the sketch below).
As for 3930 = 3890 + 40, the extra 40 bytes come from the .crc checksum file; Hadoop uses ChecksumFileSystem internally.
And one more thing: this test case may be inaccurate. If the task completes successfully, all the data is consumed, updateBytesReadWithFileSize is a no-op, and updateBytesRead() in the close function will update the correct size.
FYI @maropu
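A rough sketch of the text-file variant suggested above, which avoids the extra schema-inference pass that CSV triggers; `path`, `sparkContext`, and `bytesReads` are assumed to come from the same test setup as before, with `path` pointing at a fresh temporary directory:

```scala
// Write the data as plain text and read it back as text: spark.read.text uses
// a fixed schema, so no extra inference pass inflates bytesRead.
spark.range(1000).selectExpr("CAST(id AS STRING) AS value")
  .repartition(1)
  .write.text(path)

spark.read.text(path).limit(1).collect()
sparkContext.listenerBus.waitUntilEmpty(1000L)
// bytesReads.sum should now reflect a single pass over the file
// (plus the ~40-byte .crc checksum file that ChecksumFileSystem reads).
```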
Ah, I see. Can you make a PR to fix that?
## What changes were proposed in this pull request?

This pr removed the method `updateBytesReadWithFileSize` in `FileScanRDD` because it computes input metrics by file size supported in Hadoop 2.5 and earlier. The current Spark does not support the versions, so it causes wrong input metric numbers. This is rework from #22232.

Closes #22232

## How was this patch tested?

Added tests in `FileBasedDataSourceSuite`.

Closes #22324 from maropu/pr22232-2.

Lead-authored-by: dujunling <dujunling@huawei.com>
Co-authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
(cherry picked from commit ed249db)
Signed-off-by: Sean Owen <sean.owen@databricks.com>
Merged to master/2.4
If I find the reason why the numbers are different, I'll make a PR under a new JIRA.
This pr removed the method `updateBytesReadWithFileSize` in `FileScanRDD` because it computes input metrics by file size supported in Hadoop 2.5 and earlier. The current Spark does not support the versions, so it causes wrong input metric numbers. This is rework from apache#22232. Closes apache#22232

Added tests in `FileBasedDataSourceSuite`.

Closes apache#22324 from maropu/pr22232-2.

Lead-authored-by: dujunling <dujunling@huawei.com>
Co-authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
(cherry picked from commit ed249db)

Ref: LIHADOOP-41272
RB=1446834 BUG=LIHADOOP-41272 G=superfriends-reviewers R=fli,mshen,yezhou,edlu A=fli
What changes were proposed in this pull request?
This PR removes the method `updateBytesReadWithFileSize` in `FileScanRDD`, which computes input metrics from the file size and is only needed for Hadoop 2.5 and earlier. The current Spark does not support those versions, so the method produces wrong input metric numbers. This is a rework of #22232.
Closes #22232
How was this patch tested?
Added tests in `FileBasedDataSourceSuite`.
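For context, here is a rough, self-contained model of the fallback described above. The method name matches the one the description says was removed, but the body and surrounding types are simplified assumptions for illustration, not the actual Spark source:

```scala
// Illustrative model only: the kind of file-size fallback removed from
// FileScanRDD. On Hadoop 2.5 and earlier there was no usable bytes-read
// statistic, so the whole file length stood in for the real metric; on the
// Hadoop versions Spark still supports, the FileSystem statistics already
// provide bytes read, so this fallback only skews the numbers.
object RemovedFallbackSketch extends App {
  final case class PartitionedFile(filePath: String, length: Long)

  final class InputMetrics {
    private var _bytesRead = 0L
    def incBytesRead(v: Long): Unit = _bytesRead += v
    def bytesRead: Long = _bytesRead
  }

  val inputMetrics = new InputMetrics
  var currentFile: PartitionedFile = PartitionedFile("part-00000.csv", length = 3890L)

  // Shape of the removed method: charge the full file length to the metric.
  def updateBytesReadWithFileSize(): Unit = {
    if (currentFile != null) {
      inputMetrics.incBytesRead(currentFile.length)
    }
  }

  updateBytesReadWithFileSize()
  println(inputMetrics.bytesRead) // 3890, regardless of how much was actually read
}
```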