Skip to content

Conversation

@zuston
Copy link
Member

@zuston zuston commented Nov 29, 2022

What changes were proposed in this pull request?

Fix potential memory leak when encountering disk unhealthy

Why are the changes needed?

When encountering disk unhealthy and exceed the timeout of pendingDataShuffleFlushEvent, it will release memory. But in current codebase, it wont release the data reference and cause the memory leak.

Does this PR introduce any user-facing change?

No

How was this patch tested?

@zuston
Copy link
Member Author

zuston commented Nov 29, 2022

@xianjingfeng @jerqi PTAL . Does this is a critical bug? I found the shuffle-server is always oom when encountering disk unhealthy(like reaching the highwatermark).

@roryqi roryqi requested a review from xianjingfeng November 29, 2022 03:34
@roryqi
Copy link
Contributor

roryqi commented Nov 29, 2022

Could you explain why this cause memory leak?

@codecov-commenter
Copy link

codecov-commenter commented Nov 29, 2022

Codecov Report

Merging #370 (86f8320) into master (ad51341) will increase coverage by 0.72%.
The diff coverage is 100.00%.

@@             Coverage Diff              @@
##             master     #370      +/-   ##
============================================
+ Coverage     58.01%   58.73%   +0.72%     
- Complexity     1361     1581     +220     
============================================
  Files           171      193      +22     
  Lines          9006    10843    +1837     
  Branches        787      947     +160     
============================================
+ Hits           5225     6369    +1144     
- Misses         3449     4099     +650     
- Partials        332      375      +43     
Impacted Files Coverage Δ
...org/apache/uniffle/server/ShuffleFlushManager.java 79.14% <100.00%> (+0.22%) ⬆️
...org/apache/spark/shuffle/writer/AddBlockEvent.java 0.00% <0.00%> (ø)
.../org/apache/spark/shuffle/writer/WriterBuffer.java 93.18% <0.00%> (ø)
...java/org/apache/hadoop/mapred/SortWriteBuffer.java 90.90% <0.00%> (ø)
...pache/hadoop/mapreduce/task/reduce/RssFetcher.java 90.90% <0.00%> (ø)
...pache/spark/shuffle/writer/WriteBufferManager.java 83.82% <0.00%> (ø)
.../java/org/apache/spark/shuffle/RssSparkConfig.java 93.61% <0.00%> (ø)
...pache/hadoop/mapreduce/task/reduce/RssShuffle.java 0.00% <0.00%> (ø)
...che/spark/shuffle/writer/BufferManagerOptions.java 75.67% <0.00%> (ø)
...org/apache/spark/shuffle/RssSparkShuffleUtils.java 68.42% <0.00%> (ø)
... and 14 more

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@zuston
Copy link
Member Author

zuston commented Nov 29, 2022

Could you explain why this cause memory leak?

It releases memory but not release the data reference. Right?

@xianjingfeng
Copy link
Member

Could you explain why this cause memory leak?

It releases memory but not release the data reference. Right?

It seems yes. I think it is the reason of #164

Copy link
Member

@xianjingfeng xianjingfeng left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Contributor

@roryqi roryqi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. LGTM, thanks @zuston @xianjingfeng

@roryqi roryqi merged commit 0f0d3b6 into apache:master Nov 29, 2022
@zuston
Copy link
Member Author

zuston commented Nov 29, 2022

Maybe this PR should be cherry-picked to release-0.6.0-RC2.

@roryqi
Copy link
Contributor

roryqi commented Nov 29, 2022

Maybe this PR should be cherry-picked to release-0.6.0-RC2.

You can cherry-pick this pr to branch-0.6. And reply the mail to the release manager in the dev@uniffle.apache.org.

@kaijchen
Copy link
Member

Maybe this PR should be cherry-picked to release-0.6.0-RC2.

Thanks @zuston for the fix. Yes I think this is an important fix, I'll backport this into branch-0.6 and start a v0.6.0-rc3 release.

kaijchen pushed a commit that referenced this pull request Nov 30, 2022
### What changes were proposed in this pull request?

Fix potential memory leak when encountering disk unhealthy

### Why are the changes needed?

When encountering disk unhealthy and exceed the timeout of pendingDataShuffleFlushEvent, it will release memory. But in current codebase, it wont release the data reference and cause the memory leak.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?
No need.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants