Unified Pipeline #84

javfg · 2024-11-25T10:20:34Z

This PR adds the first (almost) working version of the unified pipeline DAG.

There is a lot of stuff still pending, but we are able to run the whole PIS->Ontoform->ETL->Gentropy->ETL process, except for a few steps that require manual intervention:

- `etl_disease`: pending implementation of `hpo` and `hpo_phenotypes` in Ontoform.
- `gentropy_variant_annotation`: service account problems in Google Cloud.
- Both `gentropy_coloc` steps: pending fixes in Gentropy.

I'll make a short summary of the changes in the PR to make it easier to review:

- `src/ot_orchestration/dags/config/etl.conf`: ETL config file (with some path changes), nothing new to review.
- `src/ot_orchestration/dags/config/gentropy.yaml`: Gentropy config file. This is the part of `genetics_etl.yaml` that contains the configuration for Gentropy itself (more on this below).
- `src/ot_orchestration/dags/config/pis.yaml`: PIS config file, no need to review.
- `src/ot_orchestration/dags/config/unified_pipeline.py`: the Config class for the Unified Pipeline DAG. It builds a running configuration from the config files of each part.
- `src/ot_orchestration/dags/config/unified_pipeline.yaml`: the configuration file for the Unified Pipeline. The aim is that we only need to edit this file to create a new release.
- `src/ot_orchestration/dags/unified_pipeline.py`: the main file for the pipeline DAG, and the most important part to review.
- `src/ot_orchestration/utils/dataproc.py`: added a `project_id` parameter. This set of methods must be converted into Operators.
- `src/ot_orchestration/utils/utils.py`: added some naming functions and a HOCON file parser for the ETL config.
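
For reviewers who want a mental model of the "edit one file per release" idea, here is a minimal sketch of a templated config load. It is not the code in this PR: the `render_config` helper, the `${name}` placeholder syntax, and the `bucket`/`release` keys are all assumptions made for illustration, using only the standard library.

```python
from string import Template


def render_config(text: str, values: dict[str, str]) -> str:
    """Fill ${name} placeholders in raw config text before it is parsed.

    Hypothetical helper: raises KeyError if a placeholder has no value,
    so a missing release setting fails early instead of at run time.
    """
    return Template(text).substitute(values)


# Hypothetical YAML fragment and release values, not taken from the PR.
fragment = "release_uri: gs://${bucket}/${release}/output"
rendered = render_config(fragment, {"bucket": "example-bucket", "release": "24.12"})
print(rendered)
```

With this shape, the release-specific values live in one place and every other config file only refers to them by name.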

javfg · 2024-11-27T13:39:11Z

I'd say we're ready to merge this, before it grows even bigger.

We can keep adding on top later as we do fixes/improvements.

Gentropy step `variant_annotation` takes about 40 minutes and does not need the cluster, so by the time things are ready for the next step, the cluster is dead. Parametrizing the TTL and passing 1 hour ensures the cluster stays alive.
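
A minimal sketch of the TTL reasoning, assuming the cluster is removed by an idle-delete timer. The `lifecycle_config` / `idle_delete_ttl` nesting mirrors the Dataproc lifecycle settings, but the helper names below are hypothetical and this is not necessarily how the DAG wires the value in.

```python
from datetime import timedelta


def lifecycle_config(idle_ttl: timedelta) -> dict:
    """Build a Dataproc-style lifecycle block with an idle-delete TTL (hypothetical helper)."""
    if idle_ttl <= timedelta(0):
        raise ValueError("idle_ttl must be positive")
    seconds = int(idle_ttl.total_seconds())
    return {"lifecycle_config": {"idle_delete_ttl": {"seconds": seconds}}}


def survives_gap(idle_ttl: timedelta, gap: timedelta) -> bool:
    """Return True if a cluster left idle for `gap` is still alive under `idle_ttl`."""
    return gap < idle_ttl


gap = timedelta(minutes=40)  # approximate duration of variant_annotation, from this PR
ttl = timedelta(hours=1)     # TTL value passed in this PR
config = lifecycle_config(ttl)
print(config, survives_gap(ttl, gap))
```

The point of parametrizing is that the margin (here about 20 minutes) can be widened from configuration if the step gets slower.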

Data is being partitioned too much: we are getting about 7.2k files of roughly 600 kB each for the colocalisation steps, which slows down the Elasticsearch ingestion. This adds a coalescing step after Spark is done to reduce those numbers.
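
To put the numbers above in perspective, here is a rough back-of-the-envelope sketch of how many files the same data would need at a larger target size. The 128 MB target and the `target_file_count` helper are assumptions for illustration; the PR does not state which target the coalescing step uses.

```python
import math


def target_file_count(n_files: int, avg_file_bytes: int, target_file_bytes: int) -> int:
    """Estimate how many output files hold the same data at a given target size."""
    total_bytes = n_files * avg_file_bytes
    return max(1, math.ceil(total_bytes / target_file_bytes))


# Figures from this PR: about 7.2k files of about 600 kB (decimal units).
n_files = 7_200
avg_file_bytes = 600_000
# Assumed target size per file; not taken from the PR.
target_file_bytes = 128_000_000

total_gb = n_files * avg_file_bytes / 1e9
count = target_file_count(n_files, avg_file_bytes, target_file_bytes)
print(f"{total_gb:.2f} GB in {n_files} files -> about {count} files")
```

Under these assumptions the roughly 4.3 GB of output fits in a few dozen files instead of thousands, which is the kind of reduction the coalescing step is after.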

project-defiant

Massive work @javfg, all makes sense. I have some minor comments, please take a look. Thank you for this!

src/ot_orchestration/dags/config/genetics_etl.yaml

src/ot_orchestration/dags/config/gentropy.yaml

src/ot_orchestration/dags/config/unified_pipeline.yaml

src/ot_orchestration/dags/unified_pipeline.py

src/ot_orchestration/utils/utils.py

src/ot_orchestration/dags/config/etl.conf

javfg mentioned this pull request Nov 25, 2024

Pipeline unification opentargets/issues#3394

Closed

28 tasks

javfg marked this pull request as draft November 25, 2024 12:26

project-defiant mentioned this pull request Nov 26, 2024

feat(genetics_etl): data freeze 10 #82

Closed

javfg force-pushed the unified-orchestrator branch from 096115e to 31a4013 Compare November 27, 2024 09:55

javfg marked this pull request as ready for review November 27, 2024 13:39

javfg added 24 commits November 28, 2024 14:27

feat: clarify unified pipeline configuration

3f42676

fix: ensure containers without envs work

f06857e

feat: improved labels

8320f31

feat: cluster and vm name generators

1184528

fix: unified orchestrator becomes unified pipeline

adba1c4

feat: extract pis env vars

7ee0862

feat: add ontoform

5e9b2c2

feat: add genetics steps to dependency graph

9496295

feat: complete dependency generation

d6192f3

feat: gentropy tasks and dependencies skeleton

27b6d00

feat: templating yaml load

3ef80ec

feat: templating hocon load

5b943bf

feat: project id argument for cluster

faf537c

fix: make yamlformat behave

d97d1b4

fix: multiple config changes

fddf34d

feat: unified orchestrator

c4d04d0

chore: change uo to up

a2f7c27

chore: format yaml files

7bf7115

fix: gentropy fixes

fc5fd45

fix: gentropy topology and settings

b0012a4

fix: parametrize cluster ttl

9fca954

Gentropy step `variant_annotation` takes about 40 minutes and does not need the cluster, so by the time things are ready for the next step, the cluster is dead. Parametrizing the TTL and passing 1 hour ensures the cluster stays alive.

fix: config changes from freeze10

24260a4

feat: configurable gentropy version

2005c1a

feat: crude labels for gentropy resources

e81c8ea

javfg added 7 commits November 28, 2024 14:27

fix: vep job name bug

93ee9b6

feat: parametrize vep version

870f65e

fix: coalesce data partitions

1b86b3d

Data is being partitioned too much: we are getting about 7.2k files of roughly 600 kB each for the colocalisation steps, which slows down the Elasticsearch ingestion. This adds a coalescing step after Spark is done to reduce those numbers.

fix: some config changes

a763128

chore: update images

5e2d433

fix: typing

d9e3517

fix: correct executor topology

d1dfe69

javfg force-pushed the unified-orchestrator branch from 93043db to d1dfe69 Compare November 28, 2024 14:28

project-defiant reviewed Nov 28, 2024

project-defiant approved these changes Nov 29, 2024

javfg merged commit 0b8ac8d into dev Nov 29, 2024
2 checks passed

javfg deleted the unified-orchestrator branch November 29, 2024 13:55

javfg mentioned this pull request Nov 29, 2024

fix: small fixes for the unified pipeline #86

Merged
