[SPARK-32701][CORE][DOCS] mapreduce.fileoutputcommitter.algorithm.version default value #29541

waleedfateem · 2020-08-25T20:27:09Z

The current documentation states that the default value of spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version is 1. That is not entirely accurate: Spark does not set this configuration anywhere; the value is inherited from Hadoop's FileOutputCommitter class.

What changes were proposed in this pull request?

This change clarifies that the default value depends entirely on the Hadoop version of the runtime environment.

Why are the changes needed?

An application will use algorithm version 1 in some environments, but the exact same application, without any changes, will use version 2 in environments running Hadoop 3.0 and later. This can have serious consequences in certain scenarios; for example, two tasks can partially overwrite each other's output if speculation is enabled. See also the following JIRA:
https://issues.apache.org/jira/browse/MAPREDUCE-7282
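
To make the environment dependence concrete, here is a minimal sketch of how the effective algorithm version could be resolved. It assumes only what the description above states: the default is 1 before Hadoop 3.0 and 2 from Hadoop 3.0 onward, unless the application sets the property explicitly. The helper names `hadoop_default_version` and `effective_version` are hypothetical and exist neither in Spark nor in Hadoop; they only model the behavior.

```python
# Illustrative model only: these helpers are hypothetical, not Spark or Hadoop APIs.
CONF_KEY = "spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version"


def hadoop_default_version(hadoop_version: str) -> int:
    """Default assumed from the PR description: 1 before Hadoop 3.0, 2 from 3.0 on."""
    major = int(hadoop_version.split(".")[0])
    return 2 if major >= 3 else 1


def effective_version(spark_conf: dict, hadoop_version: str) -> int:
    """Return the explicit setting if present, else the Hadoop-inherited default."""
    explicit = spark_conf.get(CONF_KEY)
    if explicit is not None:
        return int(explicit)
    return hadoop_default_version(hadoop_version)


if __name__ == "__main__":
    unset = {}                # application does not set the property
    pinned = {CONF_KEY: "1"}  # application pins version 1 explicitly
    for hv in ("2.7.4", "3.2.0"):
        print(hv, effective_version(unset, hv), effective_version(pinned, hv))
```

In a real deployment, the equivalent of the `pinned` case is passing `--conf spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=1` to `spark-submit`, which keeps the behavior the same regardless of the Hadoop version underneath.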

Does this PR introduce any user-facing change?

Yes. The configuration page previously stated explicitly that the default version of the FileOutputCommitter algorithm was v1. The default column now reads "Dependent on environment", and the description column explains the details.

How was this patch tested?

Checked the changes locally in a browser.

docs/configuration.md

HyukjinKwon · 2020-08-26T04:05:25Z

ok to test

SparkQA · 2020-08-26T04:23:56Z

Test build #127911 has finished for PR 29541 at commit ae0f0c5.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-08-26T15:37:37Z

Test build #127928 has finished for PR 29541 at commit 7e8d0f8.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

srowen · 2020-08-27T14:06:19Z

Merged to master/3.0

…sion default value

The current documentation states that the default value of spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version is 1 which is not entirely true since this configuration isn't set anywhere in Spark but rather inherited from the Hadoop FileOutputCommitter class.

### What changes were proposed in this pull request?

I'm submitting this change, to clarify that the default value will entirely depend on the Hadoop version of the runtime environment.

### Why are the changes needed?

An application would end up using algorithm version 1 on certain environments but without any changes the same exact application will use version 2 on environments running Hadoop 3.0 and later. This can have pretty bad consequences in certain scenarios, for example, two tasks can partially overwrite their output if speculation is enabled. Also, please refer to the following JIRA:
https://issues.apache.org/jira/browse/MAPREDUCE-7282

### Does this PR introduce _any_ user-facing change?

Yes. Configuration page content was modified where previously we explicitly highlighted that the default version for the FileOutputCommitter algorithm was v1, this now has changed to "Dependent on environment" with additional information in the description column to elaborate.

### How was this patch tested?

Checked changes locally in browser

Closes #29541 from waleedfateem/SPARK-32701.

Authored-by: waleedfateem <waleed.fateem@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
(cherry picked from commit 8749b2b)
Signed-off-by: Sean Owen <srowen@gmail.com>

probot-autolabeler bot added the DOCS label Aug 25, 2020

wangyum reviewed Aug 26, 2020

docs/configuration.md (outdated, resolved)

[SPARK-32701][CORE][DOCS] mapreduce.fileoutputcommitter.algorithm.ver…

7e8d0f8

…sion default value Modified configuration docs to clarify that the default value of spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version is actually dependent on the Hadoop version used by the runtime.

waleedfateem force-pushed the SPARK-32701 branch from ae0f0c5 to 7e8d0f8 Compare August 26, 2020 15:25

srowen approved these changes Aug 27, 2020

srowen closed this in 8749b2b Aug 27, 2020

dongjoon-hyun mentioned this pull request Sep 28, 2020

[SPARK-33019][CORE] Use spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=1 by default #29895

Closed
