Conversation

@maropu (Member) commented Sep 4, 2018

What changes were proposed in this pull request?

This PR removes the method `updateBytesReadWithFileSize` from `FileScanRDD`. That method computed input metrics from the file size, a fallback needed only for Hadoop 2.5 and earlier. Current Spark no longer supports those Hadoop versions, so the fallback only causes wrong input metric numbers.
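For context, a rough sketch of the shape such a file-size fallback takes (an illustration only, not quoted from the Spark source; `currentFile` and `inputMetrics` stand in for the fields such an iterator would use):

```scala
// Illustrative only: a file-size based fallback of this shape adds the whole
// file length to the input metrics, regardless of how many bytes were
// actually read, which is why it can report wrong numbers.
private def updateBytesReadWithFileSize(): Unit = {
  if (currentFile != null) {
    inputMetrics.incBytesRead(currentFile.length)
  }
}
```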

This is a rework of #22232.

Closes #22232

How was this patch tested?

Added tests in FileBasedDataSourceSuite.

dujunling and others added 4 commits August 25, 2018 14:28
@maropu (Member, Author) commented Sep 4, 2018

@srowen I reworked this because the original author is inactive; can you check? (BTW, it's fine if the credit for this commit goes to the original author.)


class FileSourceSuite extends SharedSQLContext {

  test("SPARK-25237 compute correct input metrics in FileScanRDD") {
Member commented:

Shall we move this suite into FileBasedDataSourceSuite?

@maropu (Member, Author) replied:

ok

@HyukjinKwon (Member) commented:

we can credit to multiple people now though :-)

@maropu (Member, Author) commented Sep 4, 2018

oh, I see.

@SparkQA commented Sep 4, 2018

Test build #95645 has finished for PR 22324 at commit 510d729.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class FileSourceSuite extends SharedSQLContext

@SparkQA commented Sep 4, 2018

Test build #95650 has finished for PR 22324 at commit bc05a35.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu (Member, Author) commented Sep 4, 2018

retest this please

@SparkQA commented Sep 4, 2018

Test build #95655 has finished for PR 22324 at commit bc05a35.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu (Member, Author) commented Sep 5, 2018

ping @srowen @HyukjinKwon

    try {
      spark.read.csv(path).limit(1).collect()
      sparkContext.listenerBus.waitUntilEmpty(1000L)
      assert(bytesReads.sum === 7860)
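For context, `bytesReads` in the hunk above is presumably accumulated by a task-end listener along these lines (a sketch only, not the exact test code; `sparkContext` is the suite's context):

```scala
import scala.collection.mutable.ArrayBuffer

import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

// Collect the input bytes read reported by each finished task.
val bytesReads = new ArrayBuffer[Long]()
sparkContext.addSparkListener(new SparkListener() {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    bytesReads += taskEnd.taskMetrics.inputMetrics.bytesRead
  }
})
```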
@srowen (Member) commented Sep 5, 2018

So the sum should be 10*2 + 90*3 + 900*4 = 3890. That's also the size of the CSV file that's written, when I try it locally. When I run this code without the change here, I get 7820 + 7820 = 15640. So this is better, but I wonder why it ends up thinking it reads about twice as many bytes?
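A hedged reconstruction of the data that arithmetic implies: 1000 integer values written as CSV, one per line, each line ending in a newline (`spark` and `path` are assumed to be in scope; this is not necessarily the exact test setup):

```scala
import spark.implicits._

// 0..9     ->  10 rows * (1 digit  + '\n') =   20 bytes
// 10..99   ->  90 rows * (2 digits + '\n') =  270 bytes
// 100..999 -> 900 rows * (3 digits + '\n') = 3600 bytes
//                                    total = 3890 bytes
spark.range(0, 1000).map(_.toString).write.csv(path)
```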

@maropu (Member, Author) replied:

In this test, Spark runs with local[2] and each scan thread points to the same CSV file. Since each thread gets the file size through the Hadoop APIs, the total bytesRead becomes 2 * the file size, IIUC.

Member replied:

7860 / 2 = 3930, which is 40 bytes more than expected, but I'm willing to believe there's a good reason for that somewhere in how the file gets read. Clearly it's much better than the answer of 15640, so I'm willing to believe this is fixing something.

@maropu (Member, Author) replied:

Yea, the actual file size is 3890, but the Hadoop API (`FileSystem.getAllStatistics`) reports that number (3930). I haven't looked into the Hadoop code yet, so I don't know why. I'll dig into it later.
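For reference, a minimal sketch of totalling bytes read through the API named above (the call is deprecated in newer Hadoop releases, but it is the one the comment refers to):

```scala
import scala.collection.JavaConverters._

import org.apache.hadoop.fs.FileSystem

// Total bytes read across every FileSystem scheme registered in this JVM.
val totalBytesRead: Long = FileSystem.getAllStatistics.asScala.map(_.getBytesRead).sum
```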

Contributor commented:

> In this test, Spark runs with local[2] and each scan thread points to the same CSV file. Since each thread gets the file size through the Hadoop APIs, the total bytesRead becomes 2 * the file size, IIUC.

I'm afraid that's not the case: CSV infers the schema first, which tries to load the first row from the path, and then the actual read happens. That's why the input bytes read is doubled. It may be more reasonable to just write and read a text file.
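A hedged sketch of that suggestion: plain text needs no schema-inference pass, so the file is scanned only once (`spark` and `path` assumed to be in scope):

```scala
import spark.implicits._

// Write the rows as a text file and read them back; unlike csv, text does not
// trigger an extra pass over the file to work out a schema.
spark.range(0, 1000).map(_.toString).write.text(path)
spark.read.text(path).limit(1).collect()
```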

As for 3930 = 3890 + 40, the extra 40 bytes are the size of the .crc file; Hadoop uses ChecksumFileSystem internally.
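A worked check of that 40-byte figure, assuming ChecksumFileSystem's default `.crc` layout (an 8-byte header followed by one 4-byte CRC per 512-byte chunk of data):

```scala
val dataSize = 3890                                                 // bytes in the CSV file
val bytesPerChecksum = 512                                          // Hadoop's default chunk size
val chunks = (dataSize + bytesPerChecksum - 1) / bytesPerChecksum   // = 8 chunks
val crcFileSize = 8 + 4 * chunks                                    // header + CRCs = 40 bytes
val totalReported = dataSize + crcFileSize                          // = 3930, the number reported above
```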

And one more thing: this test case may be inaccurate. If the task completes successfully, all the data is consumed, `updateBytesReadWithFileSize` is a no-op, and `updateBytesRead()` in the close function updates the correct size.
FYI @maropu

@maropu (Member, Author) replied:

Ah, I see. Can you make a PR to fix that?

asfgit pushed a commit that referenced this pull request Sep 7, 2018
## What changes were proposed in this pull request?
This PR removes the method `updateBytesReadWithFileSize` from `FileScanRDD`. That method computed input metrics from the file size, a fallback needed only for Hadoop 2.5 and earlier. Current Spark no longer supports those Hadoop versions, so the fallback only causes wrong input metric numbers.

This is rework from #22232.

Closes #22232

## How was this patch tested?
Added tests in `FileBasedDataSourceSuite`.

Closes #22324 from maropu/pr22232-2.

Lead-authored-by: dujunling <dujunling@huawei.com>
Co-authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
(cherry picked from commit ed249db)
Signed-off-by: Sean Owen <sean.owen@databricks.com>
@srowen (Member) commented Sep 7, 2018

Merged to master/2.4

@asfgit closed this in ed249db Sep 7, 2018
@maropu (Member, Author) commented Sep 7, 2018

If I find the reason why the numbers are different, I'll open a PR under a new JIRA.
#22324 (comment)

otterc pushed a commit to linkedin/spark that referenced this pull request Mar 22, 2023
This PR removes the method `updateBytesReadWithFileSize` from `FileScanRDD`. That method computed input metrics from the file size, a fallback needed only for Hadoop 2.5 and earlier. Current Spark no longer supports those Hadoop versions, so the fallback only causes wrong input metric numbers.

This is rework from apache#22232.

Closes apache#22232

Added tests in `FileBasedDataSourceSuite`.

Closes apache#22324 from maropu/pr22232-2.

Lead-authored-by: dujunling <dujunling@huawei.com>
Co-authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
(cherry picked from commit ed249db)

Ref: LIHADOOP-41272

RB=1446834
BUG=LIHADOOP-41272
G=superfriends-reviewers
R=fli,mshen,yezhou,edlu
A=fli