Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor: genetic dags #26

Merged
merged 22 commits into from
Sep 24, 2024
Merged

refactor: genetic dags #26

merged 22 commits into from
Sep 24, 2024

Conversation

project-defiant
Copy link
Collaborator

@project-defiant project-defiant commented Sep 23, 2024

Context

This PR closes #27

The aim of this PR is to unify existing dags (except the genetics_etl) to reuse existing approach for generate_dag logic implemented for the genetics_etl that creates the topology of the dag based on configuration file.

This process streamlines the dependency management and allows for better understanding of the dependencies between the DAG steps.

Previous implementation had configuration distributed accross multiple files in the config directory. This way the configuration was not isolated for each DAG, resulting in heavy lookup into the nested structures of the configs and dags code to understand the overall processes.

By merging configuration of multiple gentropy steps and extracting this config as a single entity called dag config that is stored under the src/ot_orchestration/dags/config/*.yaml should increase the readability and verbosity of each process.
Enabiling the nodes and prerequisites in most cases allows to skip on reading the logic of the DAG itself and focus on the process definition maintained in the dag config.

Things implemented:

  1. Refactoring of ukb_ppp_eur_harmonisation DAG
  2. Refactoring of gwas_curation_update DAG
  3. Refactoring of gwas_catalog_preprocess DAG
  4. Refactoring of gnomad_ingestion DAG
  5. Deprecation of gwas_catalog_harmonisation DAG -> the content is under development of gwas_catalog_pipeline DAG
  6. Refactoring of finngen_ukb_meta_harmonisation DAG
  7. Refactoring of finngen_ingestion DAG + addition of extra parameter sample_size
  8. Refactoring of eqtl_ingestion DAG.
  9. New bunch of tests for utils
  10. Refactoring of dataproc releated functions.
  11. Fixes to development process
  • Allow for shell to be inferred from env variable, so after running make dev the bashrc file is not populated with junk lines,
  • Remove sourcing of poetry shell as default from setup script,
  • Fix duplicate pre-commit call that causes the pre-commit to run twice

@project-defiant project-defiant force-pushed the szsz-sync-latest-gentropy-dags branch from 2d2a60a to 0a7f7a7 Compare September 23, 2024 12:07
@project-defiant project-defiant changed the title Szsz sync latest gentropy dags refactor: genetic dags Sep 23, 2024
@project-defiant
Copy link
Collaborator Author

@javfg ready to review!

Copy link
Member

@javfg javfg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It all looks pretty good, except for that little comment.

I'm starting to worry a bit about the organization of the code for this part, though.

I'll explain:

It seems to me we're extracting all the structure of the DAGs into external functions and utilities, and I wonder how good that is. I know it saves repetition, but I am not against repeating certain things if that means the DAG structure is more explicit. I feel like repetition could even be a good thing.

I'm still not knowledgeable enough about the genetics DAGs so I assume this is the way to go. But to me it looks a bit strange that although we've extracted most of the structure from the python files, we still need one for each and we don't get the benefit of making it flexible enough that we can have a single DAG that just runs config files. Not sure we would even want that (I think I don't). :)

In any case, this was just food for thought. Good job! 🎉

src/ot_orchestration/dags/eqtl_ingestion.py Outdated Show resolved Hide resolved
@project-defiant project-defiant merged commit 7888537 into dev Sep 24, 2024
2 checks passed
@project-defiant project-defiant deleted the szsz-sync-latest-gentropy-dags branch September 24, 2024 08:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Unify existing gentropy DAGs
2 participants