Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MAPREDUCE-7474. Improve Manifest committer resilience (#6716) #6824

Conversation

steveloughran
Copy link
Contributor

backport of #6716

Improve task commit resilience everywhere
and add an option to reduce delete IO requests on
job cleanup (relevant for ABFS and HDFS).

Task Commit Resilience

Task manifest saving is re-attempted on failure; the number of attempts made is configurable with the option:

mapreduce.manifest.committer.manifest.save.attempts

  • The default is 5.
  • The minimum is 1; asking for less is ignored.
  • A retry policy adds 500ms of sleep per attempt.
  • Move from classic rename() to commitFile() to rename the file, after calling getFileStatus() to get its length and possibly etag. This becomes a rename() on gcs/hdfs anyway, but on abfs it does reach the ResilientCommitByRename callbacks in abfs, which report on the outcome to the caller...which is then logged at WARN.
  • New statistic task_stage_save_summary_file to distinguish from other saving operations (job success/report file). This is only saved to the manifest on task commit retries, and provides statistics on all previous unsuccessful attempts to save the manifests
  • test changes to match the codepath changes, including improvements in fault injection.

Directory size for deletion

New option

mapreduce.manifest.committer.cleanup.parallel.delete.base.first

This attempts an initial attempt at deleting the base dir, only falling back to parallel deletes if there's a timeout.

This option is disabled by default; Consider enabling it for abfs to reduce IO load. Consult the documentation for more details.

Success file printing

The command to print a JSON _SUCCESS file from this committer and any S3A committer is now something which can be invoked from the mapred command:

mapred successfile

Contributed by Steve Loughran

How was this patch tested?

yetus's work, if happy will validate on abfs.

For code changes:

  • Does the title or this PR starts with the corresponding JIRA issue id (e.g. 'HADOOP-17799. Your PR title ...')?
  • Object storage: have the integration tests been executed and the endpoint declared according to the connector-specific documentation?
  • If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?
  • If applicable, have you updated the LICENSE, LICENSE-binary, NOTICE-binary files?

Improve task commit resilience everywhere
and add an option to reduce delete IO requests on
job cleanup (relevant for ABFS and HDFS).

Task Commit Resilience
----------------------

Task manifest saving is re-attempted on failure; the number of 
attempts made is configurable with the option:

  mapreduce.manifest.committer.manifest.save.attempts

* The default is 5.
* The minimum is 1; asking for less is ignored.
* A retry policy adds 500ms of sleep per attempt.
* Move from classic rename() to commitFile() to rename the file,
  after calling getFileStatus() to get its length and possibly etag.
  This becomes a rename() on gcs/hdfs anyway, but on abfs it does reach
  the ResilientCommitByRename callbacks in abfs, which report on
  the outcome to the caller...which is then logged at WARN.
* New statistic task_stage_save_summary_file to distinguish from
  other saving operations (job success/report file).
  This is only saved to the manifest on task commit retries, and
  provides statistics on all previous unsuccessful attempts to save
  the manifests
+ test changes to match the codepath changes, including improvements
  in fault injection.

Directory size for deletion
---------------------------

New option

  mapreduce.manifest.committer.cleanup.parallel.delete.base.first

This attempts an initial attempt at deleting the base dir, only falling
back to parallel deletes if there's a timeout.

This option is disabled by default; Consider enabling it for abfs to
reduce IO load. Consult the documentation for more details.

Success file printing
---------------------

The command to print a JSON _SUCCESS file from this committer and
any S3A committer is now something which can be invoked from
the mapred command:

  mapred successfile <path to file>

Contributed by Steve Loughran
@hadoop-yetus
Copy link

💔 -1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 6m 58s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 1s No case conflicting files found.
+0 🆗 codespell 0m 1s codespell was not available.
+0 🆗 detsecrets 0m 1s detect-secrets was not available.
+0 🆗 shelldocs 0m 1s Shelldocs was not available.
+0 🆗 markdownlint 0m 0s markdownlint was not available.
+0 🆗 xmllint 0m 0s xmllint was not available.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
+1 💚 test4tests 0m 0s The patch appears to include 11 new or modified test files.
_ branch-3.4 Compile Tests _
+0 🆗 mvndep 4m 3s Maven dependency ordering for branch
+1 💚 mvninstall 30m 45s branch-3.4 passed
+1 💚 compile 9m 33s branch-3.4 passed with JDK Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1
+1 💚 compile 8m 57s branch-3.4 passed with JDK Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06
+1 💚 checkstyle 2m 16s branch-3.4 passed
+1 💚 mvnsite 2m 19s branch-3.4 passed
+1 💚 javadoc 2m 3s branch-3.4 passed with JDK Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1
+1 💚 javadoc 1m 54s branch-3.4 passed with JDK Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06
+1 💚 spotbugs 4m 0s branch-3.4 passed
+1 💚 shadedclient 21m 56s branch has no errors when building and testing our client artifacts.
_ Patch Compile Tests _
+0 🆗 mvndep 0m 22s Maven dependency ordering for patch
+1 💚 mvninstall 1m 15s the patch passed
+1 💚 compile 8m 57s the patch passed with JDK Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1
+1 💚 javac 8m 57s the patch passed
+1 💚 compile 8m 32s the patch passed with JDK Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06
+1 💚 javac 8m 32s the patch passed
+1 💚 blanks 0m 0s The patch has no blanks issues.
-0 ⚠️ checkstyle 2m 8s /results-checkstyle-root.txt root: The patch generated 1 new + 22 unchanged - 0 fixed = 23 total (was 22)
+1 💚 mvnsite 2m 6s the patch passed
+1 💚 shellcheck 0m 10s No new issues.
+1 💚 javadoc 2m 1s the patch passed with JDK Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1
+1 💚 javadoc 1m 57s the patch passed with JDK Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06
+1 💚 spotbugs 4m 28s the patch passed
+1 💚 shadedclient 21m 57s patch has no errors when building and testing our client artifacts.
_ Other Tests _
+1 💚 unit 6m 33s hadoop-mapreduce-client-core in the patch passed.
-1 ❌ unit 119m 38s /patch-unit-hadoop-mapreduce-project.txt hadoop-mapreduce-project in the patch failed.
-1 ❌ unit 0m 34s /patch-unit-hadoop-tools_hadoop-azure.txt hadoop-azure in the patch failed.
+0 🆗 asflicense 0m 35s ASF License check generated no output?
281m 57s
Reason Tests
Failed junit tests hadoop.mapreduce.v2.TestMRJobs
Subsystem Report/Notes
Docker ClientAPI=1.45 ServerAPI=1.45 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6824/1/artifact/out/Dockerfile
GITHUB PR #6824
Optional Tests dupname asflicense codespell detsecrets shellcheck shelldocs compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle markdownlint xmllint
uname Linux 09ebb8b2ae97 5.15.0-106-generic #116-Ubuntu SMP Wed Apr 17 09:17:56 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/bin/hadoop.sh
git revision branch-3.4 / a9b079f
Default Java Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06
Multi-JDK versions /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06
Test Results https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6824/1/testReport/
Max. process+thread count 1603 (vs. ulimit of 5500)
modules C: hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core hadoop-mapreduce-project hadoop-tools/hadoop-azure U: .
Console output https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6824/1/console
versions git=2.25.1 maven=3.6.3 spotbugs=4.2.2 shellcheck=0.7.0
Powered by Apache Yetus 0.14.0 https://yetus.apache.org

This message was automatically generated.

@steveloughran
Copy link
Contributor Author

./hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/output/committer/manifest/stages/AbstractJobOrTaskStage.java:935: public final boolean recovered;:26: Variable 'recovered' must be private and have accessor methods. [VisibilityModifier]

I'm gong to have to do a quick followup, aren't I? doc should be tuned too

@steveloughran
Copy link
Contributor Author

update: ignoring yetus. the field is final, this is essentially a case class.

@steveloughran steveloughran merged commit affb725 into apache:branch-3.4 May 15, 2024
1 of 5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants