refactor: genetic dags #26

project-defiant · 2024-09-23T11:14:24Z

Context

This PR closes #27

The aim of this PR is to unify the existing DAGs (except genetics_etl) so that they reuse the generate_dag approach already implemented for genetics_etl, which creates the topology of the DAG from a configuration file.

This process streamlines the dependency management and allows for better understanding of the dependencies between the DAG steps.

The previous implementation had configuration distributed across multiple files in the config directory. As a result, the configuration was not isolated per DAG, and understanding the overall process required heavy lookups into the nested structures of both the configs and the DAG code.

Merging the configuration of multiple gentropy steps and extracting it as a single entity, the dag config, stored under src/ot_orchestration/dags/config/*.yaml, should make each process more readable and explicit.
Listing the nodes and their prerequisites there means that in most cases one can skip reading the logic of the DAG itself and focus on the process definition maintained in the dag config.
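
As a rough illustration of the idea (not the actual generate_dag implementation), the sketch below derives an execution order from a list of nodes and their prerequisites using only the standard library. The config keys `id` and `prerequisites`, the node names, and the `resolve_order` helper are assumptions made for this example; the real dag config schema may differ.

```python
from graphlib import TopologicalSorter

# Hypothetical dag config, as it might look after parsing a YAML file.
# The keys "id" and "prerequisites" are assumed for illustration only.
dag_config = {
    "nodes": [
        {"id": "harmonise", "prerequisites": ["ingest"]},
        {"id": "ingest", "prerequisites": []},
        {"id": "qc", "prerequisites": ["harmonise"]},
    ]
}


def resolve_order(config: dict) -> list[str]:
    """Return node ids in an order that respects every prerequisite."""
    nodes = config["nodes"]
    known = {node["id"] for node in nodes}
    graph = {}
    for node in nodes:
        prerequisites = node.get("prerequisites", [])
        missing = set(prerequisites) - known
        if missing:
            raise ValueError(
                f"{node['id']} depends on unknown nodes: {sorted(missing)}"
            )
        # TopologicalSorter expects a mapping of node -> its predecessors.
        graph[node["id"]] = set(prerequisites)
    # static_order() raises graphlib.CycleError on circular dependencies.
    return list(TopologicalSorter(graph).static_order())


execution_order = resolve_order(dag_config)
print(execution_order)  # ['ingest', 'harmonise', 'qc']
```

With this kind of layout, the order of the steps can be read directly from the config without following the DAG code.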

Things implemented:

Refactoring of ukb_ppp_eur_harmonisation DAG
Refactoring of gwas_curation_update DAG
Refactoring of gwas_catalog_preprocess DAG
Refactoring of gnomad_ingestion DAG
Deprecation of gwas_catalog_harmonisation DAG -> its content is under development in the gwas_catalog_pipeline DAG
Refactoring of finngen_ukb_meta_harmonisation DAG
Refactoring of finngen_ingestion DAG + addition of extra parameter sample_size
Refactoring of eqtl_ingestion DAG
New set of tests for utils
Refactoring of dataproc related functions
Fixes to development process:

Allow the shell to be inferred from an environment variable, so that running make dev no longer populates the bashrc file with junk lines,
Remove sourcing of poetry shell as default from the setup script,
Fix duplicate pre-commit call that caused pre-commit to run twice.

project-defiant · 2024-09-23T12:50:58Z

@javfg ready to review!

javfg

It all looks pretty good, except for that little comment.

I'm starting to worry a bit about the organization of the code for this part, though.

I'll explain:

It seems to me we're extracting all the structure of the DAGs into external functions and utilities, and I wonder how good that is. I know it saves repetition, but I am not against repeating certain things if that means the DAG structure is more explicit. I feel like repetition could even be a good thing.

I'm still not knowledgeable enough about the genetics DAGs so I assume this is the way to go. But to me it looks a bit strange that although we've extracted most of the structure from the python files, we still need one for each and we don't get the benefit of making it flexible enough that we can have a single DAG that just runs config files. Not sure we would even want that (I think I don't). :)

In any case, this was just food for thought. Good job! 🎉

src/ot_orchestration/dags/eqtl_ingestion.py

project-defiant requested a review from javfg September 23, 2024 11:15

Szymon Szyszkowski added 16 commits September 23, 2024 14:07

refactor: unified project name variables

d41da98

fix: drop old exported names that no longer exist

b1d3924

fix: added missing hash to dev command

527c351

chore: do not activate poetry shell on setup

8b880c9

chore: enable zsh shell in dev env

61d6b78

refactor: updated type definitions

66f65d3

feat: functions to handle structured node configs

9cf2d8d

chore: tests for new functions

fbd5319

feat: added path property for gcs path obj

37c15b4

refactor: changes to dataproc

c535dcb

refactor: gwas_catalog old dags

74de473

refactor: ukbppp harmonisation dag

b11b3f0

refactor: eqtl_catalogue dag

43e3d32

refactor: finngen ingestion dag

ec52f87

refactor: unify dag structures

0541742

fix: import in tests

0a7f7a7

project-defiant force-pushed the szsz-sync-latest-gentropy-dags branch from 2d2a60a to 0a7f7a7 Compare September 23, 2024 12:07

fix: removed import

6b99a1e

project-defiant changed the title ~~Szsz sync latest gentropy dags~~ refactor: genetic dags Sep 23, 2024

Szymon Szyszkowski added 4 commits September 23, 2024 13:30

fix: refactor tests

fd4e9e6

chore: drop import

112c95f

chore: drop typing_extensions

cea50f8

chore: yaml format

ebdf0bf

javfg approved these changes Sep 23, 2024

src/ot_orchestration/dags/eqtl_ingestion.py (outdated, resolved)

refactor: use gcspath instead of string methods

3a356f8

javfg approved these changes Sep 24, 2024

project-defiant merged commit 7888537 into dev Sep 24, 2024
2 checks passed

project-defiant deleted the szsz-sync-latest-gentropy-dags branch September 24, 2024 08:19
