
Conversation

@summaryzb (Contributor)

What changes were proposed in this pull request?

Records added to WriterBuffer are split across byte array buffers, so every byte array is fully used.

Why are the changes needed?

Previously, if for example every record is 2k long, then each time we add a record we create a new buffer with the default 3k length and wrap the previous buffer as a WrappedBuffer, which wastes 1k of memory in every WrappedBuffer.
Applying this PR saves that memory.
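
To make the idea concrete, here is a minimal, self-contained sketch (illustration only, not the actual WriterBuffer code; the 3k buffer size and the names are assumptions): records are copied into fixed-size byte arrays and split across the boundary, so every array ends up fully used instead of each record wrapping its own partially filled buffer.

import java.util.ArrayList;
import java.util.List;

// Sketch of the splitting strategy: fixed-size byte arrays are packed back to
// back, and a record that does not fit is split across the current and the
// next array.
public class SplitBufferSketch {
  private static final int BUFFER_SIZE = 3 * 1024; // assumed default buffer size

  private final List<byte[]> buffers = new ArrayList<>();
  private byte[] current;
  private int nextOffset;

  public void addRecord(byte[] record) {
    int copied = 0;
    while (copied < record.length) {
      if (current == null || nextOffset == current.length) {
        current = new byte[BUFFER_SIZE];
        buffers.add(current);
        nextOffset = 0;
      }
      int toCopy = Math.min(record.length - copied, current.length - nextOffset);
      System.arraycopy(record, copied, current, nextOffset, toCopy);
      copied += toCopy;
      nextOffset += toCopy;
    }
  }

  public static void main(String[] args) {
    SplitBufferSketch sketch = new SplitBufferSketch();
    for (int i = 0; i < 3; i++) {
      sketch.addRecord(new byte[2 * 1024]); // three 2k records
    }
    // 6k of data now fills two 3k arrays completely, instead of occupying
    // three 3k arrays with 1k wasted in each.
    System.out.println("buffers allocated: " + sketch.buffers.size()); // prints 2
  }
}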

Does this PR introduce any user-facing change?

No

How was this patch tested?

Passes the existing unit tests.

@summaryzb (Contributor Author)

@colinmjj @jerqi PTAL

@codecov-commenter commented Aug 12, 2022

Codecov Report

Merging #157 (e65178f) into master (f49b566) will increase coverage by 0.04%.
The diff coverage is 100.00%.

@@             Coverage Diff              @@
##             master     #157      +/-   ##
============================================
+ Coverage     57.38%   57.43%   +0.04%     
+ Complexity     1208     1205       -3     
============================================
  Files           150      150              
  Lines          8209     8196      -13     
  Branches        775      771       -4     
============================================
- Hits           4711     4707       -4     
+ Misses         3255     3249       -6     
+ Partials        243      240       -3     
Impacted Files                                        | Coverage Δ
...pache/spark/shuffle/writer/WriteBufferManager.java | 82.67% <100.00%> (-0.79%) ⬇️
.../org/apache/spark/shuffle/writer/WriterBuffer.java | 100.00% <100.00%> (+6.52%) ⬆️
...apache/uniffle/coordinator/ApplicationManager.java | 83.70% <0.00%> (+2.88%) ⬆️


@jerqi requested a review from colinmjj August 15, 2022 02:17
}
int require = calculateMemoryCost(length);
int hasCopied = 0;
if (require > 0 && buffer != null && buffer.length - nextOffset > 0) {


How about:

// comments
if (require > 0) {
  // comments
  if (buffer != null) {
     int hasCopied = xxx;
     // comments
     if (hasCopied > 0) {
     }
  }
}

@summaryzb (Contributor Author)


Followed this suggestion.

@colinmjj

@summaryzb For this optimization, I think there is one more System.arraycopy call per added record, and previous tests showed that extra copies impact performance a lot.
For this PR, I think the goal is to make the block size larger than before. To reduce memory pressure on the client side, how about changing the parameters of the flush strategy instead?

@summaryzb (Contributor Author)

> @summaryzb For this optimization, I think there is one more System.arraycopy call per added record, and previous tests showed that extra copies impact performance a lot. For this PR, I think the goal is to make the block size larger than before. To reduce memory pressure on the client side, how about changing the parameters of the flush strategy instead?

Yeah, it does add one more System.arraycopy when a record splits across writer buffers, but it reduces the number of System.arraycopy calls when all the buffers are transferred into one byte array. Additionally, we should consider the total number of System.arraycopy calls for a given total number of bytes written, rather than per buffer.

@colinmjj commented Aug 15, 2022

> > @summaryzb For this optimization, I think there is one more System.arraycopy call per added record, and previous tests showed that extra copies impact performance a lot. For this PR, I think the goal is to make the block size larger than before. To reduce memory pressure on the client side, how about changing the parameters of the flush strategy instead?
>
> Yeah, it does add one more System.arraycopy when a record splits across writer buffers, but it reduces the number of System.arraycopy calls when all the buffers are transferred into one byte array. Additionally, we should consider the total number of System.arraycopy calls for a given total number of bytes written, rather than per buffer.

For the case of buffer = 3k and record length = 2.8k, after inserting 1000 records:
with the current implementation, System.arraycopy is called about 2000 times (including insert & merge);
with this PR, it is called about 2800 times.
I agree that memory is wasted in your example, but I prefer to avoid a possible performance regression even at the cost of more memory.
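
For what it's worth, a rough back-of-the-envelope model (my own reading of the discussion, not numbers taken from the PR) lands in the same ballpark as the figures above:

// Rough estimate of System.arraycopy calls for 1000 records of ~2.8k going
// through 3k buffers. Assumptions: the current code does one copy per record
// at insert time plus one copy per buffer at merge time; the split strategy
// adds one extra copy whenever a record straddles a buffer boundary.
public class ArraycopyEstimate {
  public static void main(String[] args) {
    final int records = 1000;
    final int recordLen = 2867;     // ~2.8k
    final int bufferLen = 3 * 1024; // 3k

    // Current implementation: every record sits in its own (wrapped) buffer.
    int currentCopies = records /* inserts */ + records /* merge */;

    // Split strategy: buffers are packed back to back.
    long totalBytes = (long) records * recordLen;
    long buffersUsed = (totalBytes + bufferLen - 1) / bufferLen; // ~934
    long straddlingRecords = buffersUsed - 1;                    // ~933 need a 2nd copy
    long splitCopies = (records + straddlingRecords) + buffersUsed;

    System.out.println("current implementation ~ " + currentCopies); // ~2000
    System.out.println("with this PR           ~ " + splitCopies);   // ~2870
  }
}

Under this simplified model the copy counts come out to roughly 2000 versus roughly 2870, in line with the estimates above (the exact figure depends on how merge copies are counted), while the number of allocated 3k buffers drops from 1000 to about 934.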

@summaryzb (Contributor Author)

> but I prefer to avoid a possible performance regression even at the cost of more memory.

Well, how about adding a config option that makes this PR an optional strategy, while keeping the current implementation as the default?
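
A minimal sketch of what such an opt-in switch could look like; the property key and the enum below are purely hypothetical and do not exist in Uniffle:

import java.util.Properties;

// Hypothetical opt-in switch between the two buffering strategies.
// The property key and the enum are illustrations only, not real Uniffle config.
public final class BufferStrategySelector {
  enum Strategy { WRAP_PER_RECORD, SPLIT_ACROSS_BUFFERS }

  static Strategy fromConf(Properties conf) {
    // Default keeps today's behavior; the split strategy would be opt-in.
    boolean split = Boolean.parseBoolean(
        conf.getProperty("spark.rss.writer.buffer.split.enabled", "false"));
    return split ? Strategy.SPLIT_ACROSS_BUFFERS : Strategy.WRAP_PER_RECORD;
  }

  public static void main(String[] args) {
    Properties conf = new Properties();
    conf.setProperty("spark.rss.writer.buffer.split.enabled", "true");
    System.out.println(fromConf(conf)); // SPLIT_ACROSS_BUFFERS
  }
}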

@colinmjj

In the Spark client, all memory is requested from the executor, so there shouldn't be any critical problem such as a memory leak or OOM.
Can you show a case where this PR improves an application, for example an X% performance improvement with this PR?

@summaryzb (Contributor Author)

No obvious benefit can be gained from this PR, so I will close it.
