MAPREDUCE-7474. Improve Manifest committer resilience (#6716) #6824
Improve task commit resilience everywhere
and add an option to reduce delete IO requests on
job cleanup (relevant for ABFS and HDFS).

Task Commit Resilience
----------------------

Task manifest saving is re-attempted on failure; the number of attempts made is configurable with the option:

  mapreduce.manifest.committer.manifest.save.attempts

* The default is 5.
* The minimum is 1; asking for less is ignored.
* A retry policy adds 500ms of sleep per attempt.
* Move from classic rename() to commitFile() to rename the file, after calling getFileStatus() to get its length and possibly etag. This becomes a rename() on gcs/hdfs anyway, but on abfs it does reach the ResilientCommitByRename callbacks in abfs, which report the outcome to the caller, which then logs it at WARN.
* New statistic task_stage_save_summary_file to distinguish from other saving operations (job success/report file). This is only saved to the manifest on task commit retries, and provides statistics on all previous unsuccessful attempts to save the manifests.

+ test changes to match the codepath changes, including improvements in fault injection.

Directory size for deletion
---------------------------

New option

  mapreduce.manifest.committer.cleanup.parallel.delete.base.first

This makes an initial attempt at deleting the base dir, only falling back to parallel deletes if there's a timeout.

This option is disabled by default; consider enabling it for abfs to reduce IO load. Consult the documentation for more details.

Success file printing
---------------------

The command to print a JSON _SUCCESS file from this committer and any S3A committer is now something which can be invoked from the mapred command:

  mapred successfile <path to file>

Contributed by Steve Loughran
💔 -1 overall
./hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/output/committer/manifest/stages/AbstractJobOrTaskStage.java:935: public final boolean recovered;:26: Variable 'recovered' must be private and have accessor methods. [VisibilityModifier]
I'm going to have to do a quick followup, aren't I? The doc should be tuned too.
Update: ignoring Yetus. The field is final; this is essentially a case class.
backport of #6716
Improve task commit resilience everywhere
and add an option to reduce delete IO requests on
job cleanup (relevant for ABFS and HDFS).
Task Commit Resilience
Task manifest saving is re-attempted on failure; the number of attempts made is configurable with the option:
mapreduce.manifest.committer.manifest.save.attempts
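The retry behaviour described in the commit message (default of 5 attempts, minimum of 1, 500ms of sleep per attempt) can be sketched roughly as below. This is a minimal illustration, not the committer's code: the class `SaveRetrySketch`, the `SaveOperation` interface and all method names are hypothetical. It also assumes that "asking for less is ignored" means the value is clamped to 1, and that "500ms of sleep per attempt" means the pause grows by 500ms with each failed attempt; the actual retry policy may differ.

```java
import java.io.IOException;

/** Illustrative only: all names here are hypothetical, not the committer's API. */
final class SaveRetrySketch {

  /** Stand-in for the operation which saves a task manifest. */
  interface SaveOperation {
    void save() throws IOException;
  }

  /** Sleep added per attempt, as stated in the commit message. */
  static final long SLEEP_PER_ATTEMPT_MS = 500;

  /** Assumption: a requested value below 1 is treated as 1. */
  static int effectiveAttempts(int requested) {
    return Math.max(1, requested);
  }

  /** Assumption: the pause is proportional to the number of failures so far. */
  static long sleepBeforeRetryMs(int failedAttempts) {
    return SLEEP_PER_ATTEMPT_MS * failedAttempts;
  }

  /**
   * Run the save operation until it succeeds or the attempts run out.
   * @return the number of attempts used
   * @throws IOException the failure of the final attempt
   */
  static int saveWithRetries(SaveOperation op, int requestedAttempts)
      throws IOException, InterruptedException {
    int attempts = effectiveAttempts(requestedAttempts);
    IOException last = null;
    for (int attempt = 1; attempt <= attempts; attempt++) {
      try {
        op.save();
        return attempt;
      } catch (IOException e) {
        last = e;
        if (attempt < attempts) {
          // Pause before the next attempt; no pause after the final failure.
          Thread.sleep(sleepBeforeRetryMs(attempt));
        }
      }
    }
    // attempts >= 1, so at least one failure was recorded before reaching here.
    throw last;
  }
}
```

With this sketch, a save that fails once and then succeeds reports two attempts used, and a request for zero attempts still makes one attempt.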
Directory size for deletion
New option
mapreduce.manifest.committer.cleanup.parallel.delete.base.first
This makes an initial attempt at deleting the base dir, only falling back to parallel deletes if there's a timeout.
This option is disabled by default; consider enabling it for abfs to reduce IO load. Consult the documentation for more details.
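The cleanup strategy can be pictured as the sketch below: try a single delete of the base directory, and only if that times out delete the child directories in parallel before retrying the base. Everything here is hypothetical and for illustration only: the class `CleanupSketch`, the `Deleter` interface, the use of `java.util.concurrent.TimeoutException` as the timeout signal, and the use of a parallel stream are assumptions, not the committer's implementation.

```java
import java.util.List;
import java.util.concurrent.TimeoutException;

/** Illustrative only: all names here are hypothetical, not the committer's API. */
final class CleanupSketch {

  /** Which path the cleanup took. */
  enum Outcome { BASE_DELETE, PARALLEL_DELETE }

  /** Stand-in for a recursive directory delete which may time out. */
  interface Deleter {
    void delete(String path) throws TimeoutException;
  }

  /**
   * Delete the job's base directory.
   * @param baseFirst models the "delete base first" option being enabled
   */
  static Outcome cleanup(String baseDir, List<String> taskAttemptDirs,
      Deleter deleter, boolean baseFirst) throws TimeoutException {
    if (baseFirst) {
      try {
        // One request for the whole tree: cheapest when it works.
        deleter.delete(baseDir);
        return Outcome.BASE_DELETE;
      } catch (TimeoutException e) {
        // The tree was too large for one call; fall back to parallel deletes.
      }
    }
    // Delete each child directory in parallel to shrink the tree.
    taskAttemptDirs.parallelStream().forEach(dir -> {
      try {
        deleter.delete(dir);
      } catch (TimeoutException e) {
        throw new IllegalStateException("delete timed out: " + dir, e);
      }
    });
    // The base directory is now small enough to delete directly.
    deleter.delete(baseDir);
    return Outcome.PARALLEL_DELETE;
  }
}
```

The trade-off the sketch shows is the one the description names: when the single delete succeeds, only one request is issued; the parallel path costs more requests but copes with directory trees too large to delete in one call.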
Success file printing
The command to print a JSON _SUCCESS file from this committer and any S3A committer is now something which can be invoked from the mapred command:
mapred successfile <path to file>
Contributed by Steve Loughran
How was this patch tested?
Relying on Yetus; if it is happy, I will validate on abfs.