Unified Pipeline #84

Merged
merged 31 commits into from
Nov 29, 2024

javfg
Member

@javfg javfg commented Nov 25, 2024

This PR adds the first (almost) working version of the unified pipeline DAG.

There is a lot of stuff still pending, but we are able to run the whole PIS->Ontoform->ETL->Gentropy->ETL process, except for a few steps that require manual intervention:

  • etl_disease — pending implementation of hpo and hpo_phenotypes in Ontoform
  • gentropy_variant_annotation — service account problems in Google Cloud
  • both gentropy_coloc steps — pending fixes in Gentropy

I'll make a short summary of the changes in the PR to make it easier to review:

src/ot_orchestration/dags/config/etl.conf — ETL config file (with some path changes), no need to review.
src/ot_orchestration/dags/config/gentropy.yaml — Gentropy config file. This is the part of the genetics_etl.yaml that contains the configuration for Gentropy itself (more on this below).
src/ot_orchestration/dags/config/pis.yaml — PIS config file, no need to review.
src/ot_orchestration/dags/config/unified_pipeline.py — This is the Config class for the Unified Pipeline DAG. It builds a running configuration based on all the config files for each part.
src/ot_orchestration/dags/config/unified_pipeline.yaml — This is the configuration file for the Unified Pipeline. The aim is that we only need to edit this file to create a new release.
src/ot_orchestration/dags/unified_pipeline.py — This is the main file for the pipeline DAG, the most important part to review.
src/ot_orchestration/utils/dataproc.py — Added a project_id parameter. This set of methods must be converted into Operators.
src/ot_orchestration/utils/utils.py — Added some naming functions and a HOCON file parser for the ETL config.
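
To illustrate the "edit one file to create a new release" idea, here is a minimal sketch of templated config loading. It is not the code in this PR: the function name `render_config`, the `${release_uri}` placeholder, and the bucket path are all hypothetical, and it uses only the standard library's `string.Template`. Note that HOCON itself uses `${...}` for substitutions, so a real HOCON loader would likely need a different placeholder syntax.

```python
from string import Template


def render_config(template_text: str, release_values: dict[str, str]) -> str:
    """Fill ${placeholders} in a config file's text with release-level values.

    Hypothetical helper: the real Config class may work quite differently.
    """
    # substitute() raises KeyError for a placeholder with no value, so a
    # missing release setting fails loudly instead of leaking into a job.
    return Template(template_text).substitute(release_values)


# Assumed release-level settings, standing in for unified_pipeline.yaml.
release = {"release_uri": "gs://example-bucket/releases/24.12"}

# Assumed fragment of a per-step config file that refers to the release.
step_template = 'output_path: "${release_uri}/output"\n'

rendered = render_config(step_template, release)
print(rendered, end="")
```

With this shape, cutting a new release would only mean changing the values in `release`; the per-step templates stay untouched.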

@javfg javfg mentioned this pull request Nov 25, 2024
28 tasks
@javfg javfg marked this pull request as draft November 25, 2024 12:26
@javfg javfg force-pushed the unified-orchestrator branch from 096115e to 31a4013 Compare November 27, 2024 09:55
@javfg
Member Author

javfg commented Nov 27, 2024

I'd say we're ready to merge this, before it grows even bigger.

We can keep adding on top later as we do fixes/improvements.

@javfg javfg marked this pull request as ready for review November 27, 2024 13:39
Data is being partitioned too much: we are getting about 7.2k files of
roughly 600 kB each for the colocalisation steps. This slows down the
Elasticsearch ingestion.

This adds a coalescing step after Spark is done to reduce those
numbers.
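
As a rough sanity check on those numbers, the sketch below estimates how many output files a coalescing step could aim for. The 128 MiB target file size and the function name `target_file_count` are assumptions for illustration; the commit does not say which target the step uses or how it is implemented.

```python
import math


def target_file_count(
    n_files: int,
    avg_file_bytes: int,
    target_file_bytes: int = 128 * 1024**2,  # assumed target, not from the PR
) -> int:
    """Estimate how many files to coalesce into for a given target file size."""
    total_bytes = n_files * avg_file_bytes
    # Always keep at least one file, even for an empty dataset.
    return max(1, math.ceil(total_bytes / target_file_bytes))


# About 7.2k files of about 600 kB each, as reported for the coloc steps.
n = target_file_count(7_200, 600_000)
print(n)
```

If the step runs inside Spark, a number like this would typically be passed to `DataFrame.coalesce` before writing.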
@javfg javfg force-pushed the unified-orchestrator branch from 93043db to d1dfe69 Compare November 28, 2024 14:28
Collaborator

@project-defiant project-defiant left a comment

Massive work @javfg, all makes sense. I have some minor comments, please take a look. Thank you for this!

src/ot_orchestration/dags/unified_pipeline.py (resolved)
src/ot_orchestration/utils/utils.py (resolved)
src/ot_orchestration/utils/utils.py (resolved)
src/ot_orchestration/dags/config/etl.conf (resolved)
@javfg javfg merged commit 0b8ac8d into dev Nov 29, 2024
2 checks passed
@javfg javfg deleted the unified-orchestrator branch November 29, 2024 13:55
javfg added a commit that referenced this pull request Nov 29, 2024
* feat: clarify unified pipeline configuration

* fix: ensure containers without envs work

* feat: improved labels

* feat: cluster and vm name generators

* fix: unified orchestrator becomes unified pipeline

* feat: extract pis env vars

* feat: add ontoform

* feat: add genetics steps to dependency graph

* feat: complete dependency generation

* feat: gentropy tasks and dependencies skeleton

* feat: templating yaml load

* feat: templating hocon load

* feat: project id argument for cluster

* fix: make yamlformat behave

* fix: multiple config changes

* feat: unified orchestrator

* chore: change uo to up

* chore: format yaml files

* fix: gentropy fixes

* fix: gentropy topology and settings

* fix: parametrize cluster ttl

Gentropy step `variant_annotation` is taking about 40 minutes and
does not need the cluster, so by the time things are ready for the
next step, the cluster is dead.

By parametrizing the TTL and passing 1 hour, we ensure the cluster stays alive.

* fix: config changes from freeze10

* feat: configurable gentropy version

* feat: crude labels for gentropy resources

* fix: vep job name bug

* feat: parametrize vep version

* fix: coalesce data partitions

Data is being partitioned too much: we are getting about 7.2k files of
roughly 600 kB each for the colocalisation steps. This slows down the
Elasticsearch ingestion.

This adds a coalescing step after Spark is done to reduce those
numbers.

* fix: some config changes

* chore: update images

* fix: typing

* fix: correct executor topology
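
For the `fix: parametrize cluster ttl` commit above, a minimal sketch of what a parametrized idle TTL could look like. It assumes the cluster is configured through a Dataproc-style config dict with a `lifecycle_config.idle_delete_ttl` duration; the helper names `lifecycle_config` and `ttl_outlasts_gap` are hypothetical and not taken from this repository.

```python
def lifecycle_config(idle_ttl_seconds: int = 3600) -> dict:
    """Build the lifecycle section of a Dataproc-style cluster config.

    Assumed shape: the idle TTL is expressed as a duration in seconds.
    """
    return {"lifecycle_config": {"idle_delete_ttl": {"seconds": idle_ttl_seconds}}}


def ttl_outlasts_gap(idle_ttl_seconds: int, idle_gap_minutes: int) -> bool:
    """Check that the cluster survives a stretch in which it receives no jobs."""
    return idle_ttl_seconds > idle_gap_minutes * 60


# variant_annotation takes about 40 minutes and does not use the cluster.
cfg = lifecycle_config(3600)
print(cfg, ttl_outlasts_gap(3600, 40))
```

A one hour TTL leaves about 20 minutes of slack over the 40 minute gap described in the commit message.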