Conversation

@maropu (Member) commented Sep 4, 2018

What changes were proposed in this pull request?

This PR removes the method `updateBytesReadWithFileSize` from `FileScanRDD`. That method computed input metrics from the file size, a fallback needed only for Hadoop 2.5 and earlier. Current Spark no longer supports those Hadoop versions, so the fallback only causes wrong input metric numbers.
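For context, a rough sketch of the shape such a file-size fallback takes (an illustration only, not quoted from the Spark source; `currentFile` and `inputMetrics` stand in for the fields such an iterator would use):

```scala
// Illustrative only: a file-size based fallback of this shape adds the whole
// file length to the input metrics, regardless of how many bytes were
// actually read, which is why it can report wrong numbers.
private def updateBytesReadWithFileSize(): Unit = {
  if (currentFile != null) {
    inputMetrics.incBytesRead(currentFile.length)
  }
}
```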

This is a rework of #22232.

Closes #22232

How was this patch tested?

Added tests in FileBasedDataSourceSuite.

dujunling and others added 4 commits August 25, 2018 14:28
@maropu (Member, Author) commented Sep 4, 2018

@srowen I reworked this because the original author is inactive; can you check? (BTW, it's fine if the credit for this commit goes to the original author.)


class FileSourceSuite extends SharedSQLContext {

  test("SPARK-25237 compute correct input metrics in FileScanRDD") {
Member commented:

Shall we move this suite into FileBasedDataSourceSuite?

@maropu (Member, Author) replied:

ok

@HyukjinKwon (Member) commented:

we can credit to multiple people now though :-)

@maropu (Member, Author) commented Sep 4, 2018

oh, I see.

@SparkQA commented Sep 4, 2018

Test build #95645 has finished for PR 22324 at commit 510d729.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class FileSourceSuite extends SharedSQLContext

@SparkQA commented Sep 4, 2018

Test build #95650 has finished for PR 22324 at commit bc05a35.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu (Member, Author) commented Sep 4, 2018

retest this please

@SparkQA commented Sep 4, 2018

Test build #95655 has finished for PR 22324 at commit bc05a35.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu (Member, Author) commented Sep 5, 2018

ping @srowen @HyukjinKwon

    try {
      spark.read.csv(path).limit(1).collect()
      sparkContext.listenerBus.waitUntilEmpty(1000L)
      assert(bytesReads.sum === 7860)
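For context, `bytesReads` in the hunk above is presumably accumulated by a task-end listener along these lines (a sketch only, not the exact test code; `sparkContext` is the suite's context):

```scala
import scala.collection.mutable.ArrayBuffer

import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

// Collect the input bytes read reported by each finished task.
val bytesReads = new ArrayBuffer[Long]()
sparkContext.addSparkListener(new SparkListener() {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    bytesReads += taskEnd.taskMetrics.inputMetrics.bytesRead
  }
})
```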
@srowen (Member) commented Sep 5, 2018

So the sum should be 10*2 + 90*3 + 900*4 = 3890. That's also the size of the CSV file that's written, when I try it locally. When I run this code without the change here, I get 7820 + 7820 = 15640. So this is better, but I wonder why it ends up thinking it reads about twice as many bytes?
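A hedged reconstruction of the data that arithmetic implies: 1000 integer values written as CSV, one per line, each line ending in a newline (`spark` and `path` are assumed to be in scope; this is not necessarily the exact test setup):

```scala
import spark.implicits._

// 0..9     ->  10 rows * (1 digit  + '\n') =   20 bytes
// 10..99   ->  90 rows * (2 digits + '\n') =  270 bytes
// 100..999 -> 900 rows * (3 digits + '\n') = 3600 bytes
//                                    total = 3890 bytes
spark.range(0, 1000).map(_.toString).write.csv(path)
```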

@maropu (Member, Author) replied:

In this test, Spark runs with local[2] and each scan thread points to the same CSV file. Since each thread gets the file size through the Hadoop APIs, the total bytesRead becomes 2 * the file size, IIUC.

Member replied:

7860 / 2 = 3930, which is 40 bytes more than expected, but I'm willing to believe there's a good reason for that somewhere in how the file gets read. Clearly it's much better than the answer of 15640, so I'm willing to believe this is fixing something.

@maropu (Member, Author) replied:

Yea, the actual file size is 3890, but the Hadoop API (`FileSystem.getAllStatistics`) reports that number (3930). I haven't looked into the Hadoop code yet, so I don't know why. I'll dig into it later.
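For reference, a minimal sketch of totalling bytes read through the API named above (the call is deprecated in newer Hadoop releases, but it is the one the comment refers to):

```scala
import scala.collection.JavaConverters._

import org.apache.hadoop.fs.FileSystem

// Total bytes read across every FileSystem scheme registered in this JVM.
val totalBytesRead: Long = FileSystem.getAllStatistics.asScala.map(_.getBytesRead).sum
```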

Contributor commented:

> In this test, Spark runs with local[2] and each scan thread points to the same CSV file. Since each thread gets the file size through the Hadoop APIs, the total bytesRead becomes 2 * the file size, IIUC.

I'm afraid that's not the case: CSV infers the schema first, which tries to load the first row from the path, and then the actual read happens. That's why the input bytes read is doubled. It may be more reasonable to just write and read a text file.
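A hedged sketch of that suggestion: plain text needs no schema-inference pass, so the file is scanned only once (`spark` and `path` assumed to be in scope):

```scala
import spark.implicits._

// Write the rows as a text file and read them back; unlike csv, text does not
// trigger an extra pass over the file to work out a schema.
spark.range(0, 1000).map(_.toString).write.text(path)
spark.read.text(path).limit(1).collect()
```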

As for 3930 = 3890 + 40, the extra 40 bytes are the size of the .crc file; Hadoop uses ChecksumFileSystem internally.
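A worked check of that 40-byte figure, assuming ChecksumFileSystem's default `.crc` layout (an 8-byte header followed by one 4-byte CRC per 512-byte chunk of data):

```scala
val dataSize = 3890                                                 // bytes in the CSV file
val bytesPerChecksum = 512                                          // Hadoop's default chunk size
val chunks = (dataSize + bytesPerChecksum - 1) / bytesPerChecksum   // = 8 chunks
val crcFileSize = 8 + 4 * chunks                                    // header + CRCs = 40 bytes
val totalReported = dataSize + crcFileSize                          // = 3930, the number reported above
```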

And one more thing: this test case may be inaccurate. If the task completes successfully, all the data is consumed, `updateBytesReadWithFileSize` is a no-op, and `updateBytesRead()` in the close function updates the correct size.
FYI @maropu

@maropu (Member, Author) replied:

Ah, I see. Can you make a PR to fix that?

asfgit pushed a commit that referenced this pull request Sep 7, 2018
## What changes were proposed in this pull request?
This PR removes the method `updateBytesReadWithFileSize` from `FileScanRDD`. That method computed input metrics from the file size, a fallback needed only for Hadoop 2.5 and earlier. Current Spark no longer supports those Hadoop versions, so the fallback only causes wrong input metric numbers.

This is rework from #22232.

Closes #22232

## How was this patch tested?
Added tests in `FileBasedDataSourceSuite`.

Closes #22324 from maropu/pr22232-2.

Lead-authored-by: dujunling <dujunling@huawei.com>
Co-authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
(cherry picked from commit ed249db)
Signed-off-by: Sean Owen <sean.owen@databricks.com>
@srowen (Member) commented Sep 7, 2018

Merged to master/2.4

@asfgit closed this in ed249db Sep 7, 2018
@maropu (Member, Author) commented Sep 7, 2018

If I find the reason why the numbers are different, I'll open a PR under a new JIRA.
#22324 (comment)

otterc pushed a commit to linkedin/spark that referenced this pull request Mar 22, 2023
This PR removes the method `updateBytesReadWithFileSize` from `FileScanRDD`. That method computed input metrics from the file size, a fallback needed only for Hadoop 2.5 and earlier. Current Spark no longer supports those Hadoop versions, so the fallback only causes wrong input metric numbers.

This is rework from apache#22232.

Closes apache#22232

Added tests in `FileBasedDataSourceSuite`.

Closes apache#22324 from maropu/pr22232-2.

Lead-authored-by: dujunling <dujunling@huawei.com>
Co-authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
(cherry picked from commit ed249db)

Ref: LIHADOOP-41272

RB=1446834
BUG=LIHADOOP-41272
G=superfriends-reviewers
R=fli,mshen,yezhou,edlu
A=fli