@frankliee frankliee commented Jul 14, 2022

What changes were proposed in this pull request?

Rewrite MapReduce's MergerManager to spill sorted segments to HDFS.
The new manager returns a merge-sorted iterator that reads these HDFS segments.
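
To illustrate what a "merge-sorted iterator" over sorted segments means, here is a minimal, self-contained sketch of a lazy k-way merge. It is not the code from this PR: the class name `SegmentMergeSketch`, the `Cursor` helper, and the use of in-memory iterators in place of HDFS segment readers are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.PriorityQueue;

// Hypothetical illustration only; the real change reads spilled segments from HDFS.
public class SegmentMergeSketch {

    /** One cursor per sorted segment: the current head record plus the rest of the segment. */
    private static final class Cursor<T> {
        T head;
        final Iterator<T> rest;

        Cursor(T head, Iterator<T> rest) {
            this.head = head;
            this.rest = rest;
        }
    }

    /** Lazily merges already-sorted segments into a single sorted stream. */
    static <T extends Comparable<T>> Iterator<T> mergeSorted(List<Iterator<T>> segments) {
        // Min-heap ordered by each segment's current head record.
        PriorityQueue<Cursor<T>> heap =
            new PriorityQueue<>((a, b) -> a.head.compareTo(b.head));
        for (Iterator<T> segment : segments) {
            if (segment.hasNext()) {
                heap.add(new Cursor<>(segment.next(), segment));
            }
        }
        return new Iterator<T>() {
            @Override
            public boolean hasNext() {
                return !heap.isEmpty();
            }

            @Override
            public T next() {
                Cursor<T> smallest = heap.poll();
                if (smallest == null) {
                    throw new NoSuchElementException();
                }
                T result = smallest.head;
                // Advance the segment we just read from and put it back on the heap.
                if (smallest.rest.hasNext()) {
                    smallest.head = smallest.rest.next();
                    heap.add(smallest);
                }
                return result;
            }
        };
    }

    public static void main(String[] args) {
        List<Iterator<Integer>> segments = new ArrayList<>();
        segments.add(Arrays.asList(1, 4, 9).iterator());
        segments.add(Arrays.asList(2, 3, 10).iterator());
        segments.add(Arrays.asList(5).iterator());

        List<Integer> merged = new ArrayList<>();
        mergeSorted(segments).forEachRemaining(merged::add);
        System.out.println(merged); // prints [1, 2, 3, 4, 5, 9, 10]
    }
}
```

Only one record per segment is held in memory at a time, which is why this shape suits segments that live on remote storage.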

Why are the changes needed?

In cloud environments, machines may have very limited disk space and disk performance.
This PR allows reduce tasks to spill data to remote storage (e.g., HDFS).

Does this PR introduce any user-facing change?

Yes.

| Property Name | Default | Description |
| --- | --- | --- |
| mapreduce.rss.reduce.remote.spill.enable | false | Whether to use remote spill |
| mapreduce.rss.reduce.remote.spill.attempt.inc | 1 | Number of additional reduce attempts to allow, because spilling to HDFS fails more easily than spilling to local disk |
| mapreduce.rss.reduce.remote.spill.replication | 1 | Replication factor for data spilled to HDFS |
| mapreduce.rss.reduce.remote.spill.retries | 5 | Number of retries when spilling data to HDFS |
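
As a rough sketch of how these properties might be supplied to a job, the snippet below renders them as `-D key=value` arguments. It assumes the job driver goes through Hadoop's `ToolRunner`/`GenericOptionsParser`, so that `-D` options are honored. The class name `RemoteSpillOptionsSketch` and the override values are hypothetical; only the property names and defaults come from the list above.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; not part of this PR.
public class RemoteSpillOptionsSketch {

    /** Property names and default values as listed in this PR description. */
    static Map<String, String> defaults() {
        Map<String, String> settings = new LinkedHashMap<>();
        settings.put("mapreduce.rss.reduce.remote.spill.enable", "false");
        settings.put("mapreduce.rss.reduce.remote.spill.attempt.inc", "1");
        settings.put("mapreduce.rss.reduce.remote.spill.replication", "1");
        settings.put("mapreduce.rss.reduce.remote.spill.retries", "5");
        return settings;
    }

    /** Renders each setting as a pair of command-line arguments: "-D", "key=value". */
    static List<String> toGenericOptions(Map<String, String> settings) {
        List<String> args = new ArrayList<>();
        for (Map.Entry<String, String> entry : settings.entrySet()) {
            args.add("-D");
            args.add(entry.getKey() + "=" + entry.getValue());
        }
        return args;
    }

    public static void main(String[] args) {
        Map<String, String> settings = defaults();
        // Example overrides (assumed values): turn the feature on, keep two replicas.
        settings.put("mapreduce.rss.reduce.remote.spill.enable", "true");
        settings.put("mapreduce.rss.reduce.remote.spill.replication", "2");
        System.out.println(String.join(" ", toGenericOptions(settings)));
    }
}
```

The same keys could equally be set in the job's configuration files; the command-line form is used here only because it keeps the example self-contained.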

How was this patch tested?

New unit tests (UT) and integration tests (IT) covering remote spill.

Co-authored-by: roryqi <roryqi@tencent.com>

codecov-commenter commented Jul 14, 2022

Codecov Report

Merging #55 (08077bb) into master (aa02ee6) will increase coverage by 0.67%.
The diff coverage is n/a.

@@             Coverage Diff              @@
##             master      #55      +/-   ##
============================================
+ Coverage     54.89%   55.56%   +0.67%     
+ Complexity     1092      991     -101     
============================================
  Files           146      135      -11     
  Lines          7775     6736    -1039     
  Branches        749      647     -102     
============================================
- Hits           4268     3743     -525     
+ Misses         3270     2782     -488     
+ Partials        237      211      -26     
Impacted Files Coverage Δ
...storage/handler/impl/DataSkippableReadHandler.java 81.25% <0.00%> (-3.13%) ⬇️
.../java/org/apache/hadoop/mapreduce/RssMRConfig.java
...n/java/org/apache/hadoop/mapreduce/RssMRUtils.java
...pache/hadoop/mapreduce/task/reduce/RssShuffle.java
...apache/hadoop/mapreduce/v2/app/RssMRAppMaster.java
...rg/apache/hadoop/mapred/RssMapOutputCollector.java
.../hadoop/mapreduce/task/reduce/RssEventFetcher.java
.../hadoop/mapreduce/task/reduce/RssBypassWriter.java
...g/apache/hadoop/mapred/SortWriteBufferManager.java
...pache/hadoop/mapreduce/task/reduce/RssFetcher.java
... and 4 more

jerqi commented Jul 14, 2022

> What changes were proposed in this pull request?
>
> Rewrite Mapreduce's MergerManager to spill sorted segments to HDFS, It returns a merge-sorted iterator to read these HDFS segments.
>
> Why are the changes needed?
>
> In cloud, machines may have very limited disk space and performance. This PR allows to spill data to remote storage (e.g., hdfs)
>
> Does this PR introduce any user-facing change?
>
> Yes. rss.reduce.remote.spill.enable (default false)
>
> How was this patch tested?
>
> New UT and IT with remote spill.
>
> Co-authored-by: roryqi roryqi@tencent.com

Because this PR introduces a user-facing change, we should update the documentation.
We should also supply performance test results.

@frankliee frankliee force-pushed the mr-remote-spill-apache branch from 08077bb to 4537e5a Compare July 15, 2022 07:16
frankliee and others added 2 commits July 15, 2022 16:17
Add RssInMemoryMerger

We need to write in-memory data to HDFS

Yes

UT

Co-authored-by: roryqi <roryqi@tencent.com>

jerqi commented Jul 15, 2022

> What changes were proposed in this pull request?
>
> Rewrite Mapreduce's MergerManager to spill sorted segments to HDFS, It returns a merge-sorted iterator to read these HDFS segments.
>
> Why are the changes needed?
>
> In cloud, machines may have very limited disk space and performance. This PR allows to spill data to remote storage (e.g., hdfs)
>
> Does this PR introduce any user-facing change?
>
> Yes. rss.reduce.remote.spill.enable (default false)
>
> How was this patch tested?
>
> New UT and IT with remote spill.
>
> Co-authored-by: roryqi roryqi@tencent.com

Please update your description and the documentation. This PR introduces another configuration option.

@frankliee frankliee force-pushed the mr-remote-spill-apache branch from 4537e5a to f4540ad Compare July 15, 2022 09:08

jerqi commented Jul 15, 2022

LGTM except for the PR's description and documentation.

@frankliee
Contributor Author

> > What changes were proposed in this pull request?
> >
> > Rewrite Mapreduce's MergerManager to spill sorted segments to HDFS, It returns a merge-sorted iterator to read these HDFS segments.
> >
> > Why are the changes needed?
> >
> > In cloud, machines may have very limited disk space and performance. This PR allows to spill data to remote storage (e.g., hdfs)
> >
> > Does this PR introduce any user-facing change?
> >
> > Yes. rss.reduce.remote.spill.enable (default false)
> >
> > How was this patch tested?
> >
> > New UT and IT with remote spill.
> > Co-authored-by: roryqi roryqi@tencent.com
>
> Please update your description and the documentation. This PR introduces another configuration option.

The documentation has been updated.

@jerqi jerqi left a comment

LGTM

@frankliee frankliee changed the title from "[Feature][MR] Support remote spill" to "[Experimental Feature] MR Supports Remote Spill" on Jul 15, 2022
@frankliee frankliee merged commit f4ce2ed into apache:master Jul 15, 2022