feat(genetics_etl): data freeze 10 #82
Closed
+66
−61
Context
Successful run of the genetics ETL on 2024.11.22, including a full run in gs://ot_orchestration/releases/24.11_freeze10
Minor fix to the create_cluster function: the default value (num_workers == 1) fails to build the cluster, so there must be more than one worker.
Patched eqtl_catalogue dataset
L2G model training skipped (the existing model in GCS was used)
Tweaks to the colocalisation infrastructure that allowed the eCaviar step to run successfully
eCaviar step successful run - https://console.cloud.google.com/dataproc/jobs/bcfb9168-d753-466c-bd59-7ade6ce3ad60/monitoring?region=europe-west1&project=open-targets-genetics-dev
Coloc step run - https://console.cloud.google.com/dataproc/jobs/4ffde9db-b624-4c89-8f47-22ab52f60177/monitoring?region=europe-west1&project=open-targets-genetics-dev
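The worker-count constraint behind the create_cluster fix can be illustrated with a small guard. This is only a sketch: the helper name validate_num_workers and the error message are hypothetical and are not taken from this PR or from the actual create_cluster implementation; the only fact assumed from the description is that a cluster requested with a single worker fails to build.

```python
MIN_WORKERS = 2  # assumption from the PR description: one worker fails to build the cluster


def validate_num_workers(num_workers: int) -> int:
    """Return num_workers unchanged, or raise if it is too small to build a cluster.

    Hypothetical helper for illustration; not the real create_cluster code.
    """
    if num_workers < MIN_WORKERS:
        raise ValueError(
            f"num_workers must be at least {MIN_WORKERS}, got {num_workers}"
        )
    return num_workers
```

Calling such a guard before requesting the cluster would turn the build failure into an early, explicit error; changing the default to a value above one, as this PR does, avoids the problem in the first place.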
I have experimented with the configuration of both steps:
Note
In summary, I think that EFM, although beneficial for shuffling in long-running tasks, is causing disk space issues because the partitions are stored only on the primary workers, whereas the no-EFM approach can use more disk space for partition storage. Although not tested in the context of this PR, the assumption is that coloc should also run without EFM but with increased executor memory, just like eCaviar.
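The two setups compared in the note above can be sketched as Dataproc cluster properties. This is a rough illustration rather than the configuration used in this PR: the helper name shuffle_properties and the "16g" memory value are hypothetical, while dataproc:efm.spark.shuffle=primary-worker is the documented Dataproc property for primary-worker shuffle and spark:spark.executor.memory sets the Spark executor memory.

```python
from typing import Dict


def shuffle_properties(use_efm: bool, executor_memory: str) -> Dict[str, str]:
    """Build cluster properties for a run with or without EFM.

    Hypothetical helper; the memory values passed in are placeholders.
    """
    properties = {"spark:spark.executor.memory": executor_memory}
    if use_efm:
        # With EFM, shuffle data is written to the primary workers only.
        properties["dataproc:efm.spark.shuffle"] = "primary-worker"
    return properties


# No-EFM variant with increased executor memory (placeholder value).
no_efm = shuffle_properties(use_efm=False, executor_memory="16g")
```

Keeping the EFM switch and the executor memory in one place makes it easier to run both steps with the same no-EFM settings once that assumption has been tested for coloc.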
Note
Because the freeze9 coloc and eCaviar steps worked without any workarounds, and the variantIndex has not changed much between freeze9 and freeze10 (freeze10 contains eQTL credible sets patched by @addramir), the most likely explanation is that the overlap skewness is caused by the eQTL Catalogue rather than FinnGen.
Warning
The colocalisation is still not perfect: the successful runs were very fragile, and more than half of the tasks in the last overlap stages were failing.