Conversation

@rickyma rickyma commented Mar 6, 2024

What changes were proposed in this pull request?

Verify the number of written records to enhance data accuracy (see the sketch below).
Make sure all data records are sent by clients.
Make sure bugs like #714 can never be reintroduced into the code.
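
A minimal sketch of the idea, using hypothetical field and method names (the real check is added to WriteBufferManager in this PR; this is only an illustration of the counting technique):

```java
// Hypothetical sketch of the record-count verification described above; the field and
// method names are illustrative, not the actual WriteBufferManager API.
public class WriteBufferManagerSketch {
  // Incremented once per record handed to addRecord() by the Spark writer.
  private long addedRecordCount = 0L;
  // Incremented once per record actually serialized into shuffle blocks.
  private long writtenRecordCount = 0L;

  public void addRecord(Object key, Object value) {
    addedRecordCount++;
    // ... serialize the record into the per-partition buffer, possibly spilling blocks ...
    writtenRecordCount++;
  }

  // Called when the task finishes writing: if any record was silently dropped
  // (e.g. by a concurrency bug like the one in #714), fail fast instead of losing data.
  public void verifyWrittenRecords() {
    if (addedRecordCount != writtenRecordCount) {
      throw new IllegalStateException(
          "Potential data loss: expected " + addedRecordCount
              + " records but only wrote " + writtenRecordCount);
    }
  }
}
```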

Why are the changes needed?

A follow-up PR for #848.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Existing UTs.

codecov-commenter commented Mar 6, 2024

Codecov Report

Attention: Patch coverage is 85.71429%, with 1 line in your changes missing coverage. Please review.

Project coverage is 54.57%. Comparing base (d8aedf3) to head (1c4f86b).

| Files | Patch % | Lines |
| --- | --- | --- |
| ...pache/spark/shuffle/writer/WriteBufferManager.java | 75.00% | 1 Missing ⚠️ |
Additional details and impacted files
@@             Coverage Diff              @@
##             master    #1558      +/-   ##
============================================
+ Coverage     53.65%   54.57%   +0.91%     
- Complexity     2818     2819       +1     
============================================
  Files           436      416      -20     
  Lines         24549    22195    -2354     
  Branches       2080     2080              
============================================
- Hits          13172    12113    -1059     
+ Misses        10554     9330    -1224     
+ Partials        823      752      -71     


github-actions bot commented Mar 6, 2024

Test Results

2 312 files  ±0  2 312 suites  ±0   4h 36m 26s ⏱️ +50s
  823 tests ±0    822 ✅ ±0   1 💤 ±0  0 ❌ ±0 
9 697 runs  ±0  9 683 ✅ ±0  14 💤 ±0  0 ❌ ±0 

Results for commit 1c4f86b. ± Comparison against base commit d8aedf3.

♻️ This comment has been updated with latest results.

rickyma commented Mar 6, 2024

PTAL @jerqi @zuston

@zuston zuston left a comment


LGTM overall. Left some minor comments.

And I hope this could be covered by test cases, e.g. throwing an exception when the record counts are not the same.

@rickyma rickyma requested a review from zuston March 6, 2024 07:02
rickyma commented Mar 6, 2024

> And I hope this could be covered by test cases, e.g. throwing an exception when the record counts are not the same.

I think the existing UTs have already covered this, because this is a base method that is used in many places.
If anything bad is introduced into the code in the future, the UTs won't pass.

jerqi commented Mar 6, 2024

Maybe record count isn't enough. We should check block count, too.

rickyma commented Mar 6, 2024

> Maybe record count isn't enough. We should check block count, too.

Block IDs are checked in the method checkBlockSendResult. If any block fails or is lost, we will know.
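
For context, a simplified sketch of that block-level check. The method name checkBlockSendResult comes from the discussion above; the signature and exception handling here are assumptions, not the actual Spark client code:

```java
import java.util.Set;

// Simplified illustration of the block-level check; the real checkBlockSendResult in the
// Spark client waits on send callbacks and tracks successful/failed block IDs.
public class BlockSendResultChecker {
  public void checkBlockSendResult(
      Set<Long> expectedBlockIds, Set<Long> successBlockIds, Set<Long> failedBlockIds) {
    if (!failedBlockIds.isEmpty()) {
      // Some blocks were rejected or failed on the shuffle server side.
      throw new RuntimeException("Failed to send blocks: " + failedBlockIds);
    }
    if (!successBlockIds.containsAll(expectedBlockIds)) {
      // Blocks that were neither acknowledged nor reported failed are treated as lost.
      throw new RuntimeException("Some blocks were lost or timed out while sending");
    }
  }
}
```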

zuston commented Mar 6, 2024

> And I hope this could be covered by test cases, e.g. throwing an exception when the record counts are not the same.

> I think the existing UTs have already covered this, because this is a base method that is used in many places. If anything bad is introduced into the code in the future, the UTs won't pass.

No, I hope you could mock throwing an exception when the record counts are not the same.

rickyma commented Mar 6, 2024

> And I hope this could be covered by test cases, e.g. throwing an exception when the record counts are not the same.

> I think the existing UTs have already covered this, because this is a base method that is used in many places. If anything bad is introduced into the code in the future, the UTs won't pass.

> No, I hope you could mock throwing an exception when the record counts are not the same.

I think it's not easy to mock this, and it might not be very meaningful. The past issue was a critical bug in the code itself, caused by concurrency problems, which led to data loss even after calling the method WriteBuffer.addRecord. If we reverted the code to the previous problematic version while keeping the check from this PR, the unit tests would definitely fail.

The fact that the unit tests pass with the current check actually indicates that the current code is working correctly.

@zuston zuston changed the title [#808][FOLLOWUP] improvement(spark): Verify the number of written records to enhance data accuracy [#808] improvement(spark): Verify the number of written records to enhance data correctness Mar 7, 2024
@zuston zuston merged commit ec4251d into apache:master Mar 7, 2024
zuston commented Mar 7, 2024

Merged. Thanks @rickyma

@rickyma rickyma deleted the pr-848-followup branch May 5, 2024 08:33
zuston added a commit that referenced this pull request May 30, 2024
…1756)

### What changes were proposed in this pull request?

1. When the spill ratio is `1.0`, the calculation of the target spill size is skipped to avoid a potential race condition, because `usedBytes` and `inSendBytes` are not thread safe. This guarantees that all data is flushed to the shuffle server at the end of the task.
2. Add a check on the `bufferManager`'s remaining buffer.

### Why are the changes needed?

Due to #1670, the partial data held by the bufferManager will not be flushed to the shuffle servers in some corner cases.
This change makes the task fail fast rather than silently lose data, thanks to the check introduced in #1558.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests.
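
A rough sketch of the two changes described in the commit message above. The field names mirror the description, but the class and method signatures are illustrative assumptions, not the actual WriteBufferManager implementation:

```java
// Illustrative sketch only: spillRatio, usedBytes and inSendBytes mirror the commit
// description, but this is not the actual WriteBufferManager code.
public class SpillSketch {
  private final double spillRatio;
  private long usedBytes;    // updated concurrently in the real code, not thread safe
  private long inSendBytes;  // updated concurrently in the real code, not thread safe

  public SpillSketch(double spillRatio) {
    this.spillRatio = spillRatio;
  }

  // Point 1: when the spill ratio is 1.0, skip the target-size calculation entirely so the
  // racy usedBytes/inSendBytes snapshot cannot shrink the amount of data that gets flushed.
  public long targetSpillSize() {
    if (spillRatio >= 1.0) {
      return Long.MAX_VALUE; // flush everything that is currently buffered
    }
    return (long) ((usedBytes - inSendBytes) * spillRatio);
  }

  // Point 2: after the final flush, verify nothing is left in the buffers so the task
  // fails fast instead of silently losing data.
  public void checkBufferRemaining(long remainingBytes) {
    if (remainingBytes > 0) {
      throw new IllegalStateException(
          "Buffer still holds " + remainingBytes + " bytes after the final flush");
    }
  }
}
```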
zuston added a commit to zuston/incubator-uniffle that referenced this pull request Dec 9, 2024