perf(efm): enhanced flexibility mode in genetics etl cluster #63
Context
Introducing the new SuSiE credible sets from gwas_catalog (gs://gwas_catalog_sumstats_susie/credible_set_clean) resulted in ColocStep failures. The job run on the otg-etl cluster took ~4h and did not finish in that time - see job.
Most of the error logs show that executors were lost during job execution.
The otg-etl cluster uses an autoscaling policy that caps the ratio of secondaryWorkers to primaryWorkers at 100:2.
Caution
The Coloc step requires many intensive shuffle operations, so as the input dataset grows, the cost of shuffling grows with it and the run time increases. This becomes a problem when shuffle partitions are stored on executors that are later lost, as described in the EFM (Enhanced Flexibility Mode) description. Losing a worker forces the affected tasks to restart, which lengthens the run time.
Furthermore, worker loss cannot be predicted in advance due to the nature of preemptible (secondary) workers.
To cope with lost shuffle partitions, EFM can be used.
EFM makes the cluster save shuffle partitions only on primary workers. This means we have to adjust the disk size of the workers and change the autoscaling policy, as EFM does not support graceful decommissioning of workers.
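As a rough illustration of what enabling EFM involves, the sketch below builds a plain Dataproc-style cluster config dictionary. Only the dataproc:efm.spark.shuffle=primary-workers property is taken from this PR; the helper name, the worker counts and the disk size are hypothetical placeholders, not the values used on otg-etl.

```python
from typing import Any

# Property from this PR: keep Spark shuffle data on primary workers only.
EFM_PROPERTY = {"dataproc:efm.spark.shuffle": "primary-workers"}


def build_efm_cluster_config(
    num_primary_workers: int = 2,        # hypothetical default
    num_secondary_workers: int = 0,      # hypothetical default
    primary_boot_disk_gb: int = 1000,    # hypothetical; sized to hold shuffle data
) -> dict[str, Any]:
    """Return a cluster config dict with EFM primary-worker shuffle enabled."""
    return {
        "software_config": {"properties": dict(EFM_PROPERTY)},
        "worker_config": {
            "num_instances": num_primary_workers,
            # Primary workers hold all shuffle partitions under EFM,
            # so their disks need enough headroom for that data.
            "disk_config": {"boot_disk_size_gb": primary_boot_disk_gb},
        },
        "secondary_worker_config": {"num_instances": num_secondary_workers},
    }


config = build_efm_cluster_config(primary_boot_disk_gb=2000)
```

The point of the sketch is the coupling between the two settings: once shuffle data lives only on primary workers, their disk size becomes the parameter to tune.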
Changes
The tweaks described above were added to the existing Dataproc cluster setup to accommodate the shuffle operations in the Coloc step:
dataproc:efm.spark.shuffle=primary-workers property
All of the above comes from reading the documentation on EFM
create_cluster function to accommodate all parameters from ClusterGenerator
Additionally prerequisites in the node configuration, if the tasks were commented out.
Note
The Coloc step succeeded in 1h 20 min with EFM enabled - see job - although we still experienced executor losses.
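The create_cluster change listed under Changes amounts to forwarding arbitrary keyword arguments to ClusterGenerator. A minimal sketch of that pattern, assuming hypothetical defaults and a hypothetical helper name (this is not the function from the PR), could look like this:

```python
from typing import Any

# Hypothetical defaults; only the EFM property itself comes from this PR.
DEFAULT_CLUSTER_KWARGS: dict[str, Any] = {
    "num_workers": 2,
    "worker_disk_size": 500,
    "properties": {"dataproc:efm.spark.shuffle": "primary-workers"},
}


def cluster_generator_kwargs(**overrides: Any) -> dict[str, Any]:
    """Merge caller overrides into the defaults without dropping the EFM property."""
    merged = {**DEFAULT_CLUSTER_KWARGS, **overrides}
    # Merge the properties dicts key by key so that a caller passing extra
    # properties does not silently remove the EFM setting.
    merged["properties"] = {
        **DEFAULT_CLUSTER_KWARGS["properties"],
        **overrides.get("properties", {}),
    }
    return merged


kwargs = cluster_generator_kwargs(num_workers=4)
# With Airflow's Google provider installed, these could be passed on as
# ClusterGenerator(project_id=..., **kwargs).make(), assuming the installed
# provider version accepts these keyword names.
```

Merging properties separately is a design choice of this sketch: a plain dictionary update would let any caller-supplied properties argument replace the EFM property wholesale.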