refactor: genetic dags #26
2d2a60a to 0a7f7a7
@javfg ready to review!
It all looks pretty good, except for that little comment.
I'm starting to worry a bit about the organization of the code for this part, though.
I'll explain:
It seems to me we're extracting all the structure of the DAGs into external functions and utilities, and I wonder how good that is. I know it saves repetition, but I am not against repeating certain things if that means the DAG structure is more explicit. I feel like repetition could even be a good thing.
I'm still not knowledgeable enough about the genetics DAGs, so I assume this is the way to go. But it looks a bit strange to me that, although we've extracted most of the structure from the Python files, we still need one for each, and we don't get the benefit of making it flexible enough to have a single DAG that just runs config files. Not sure we would even want that (I think I don't). :)
In any case, this was just food for thought. Good job! 🎉
Context
This PR closes #27
The aim of this PR is to unify the existing DAGs (except `genetics_etl`) so that they reuse the `generate_dag` logic implemented for `genetics_etl`, which creates the topology of the DAG from a configuration file. This streamlines dependency management and makes the dependencies between the DAG steps easier to understand.
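To illustrate the idea of deriving a DAG topology from configuration, here is a minimal sketch. It is not the repository's `generate_dag` implementation: the config keys (`nodes`, `id`, `prerequisites`), the node names, and the helper functions `build_topology` and `execution_order` are all assumptions made for this example, and the config is written as the dict a YAML file might load into.

```python
from graphlib import TopologicalSorter

# Hypothetical dag config; key names and node ids are assumptions for illustration.
dag_config = {
    "nodes": [
        {"id": "harmonise", "prerequisites": ["ingest"]},
        {"id": "ingest", "prerequisites": []},
        {"id": "qc", "prerequisites": ["harmonise"]},
        {"id": "report", "prerequisites": ["harmonise", "qc"]},
    ]
}


def build_topology(config: dict) -> dict[str, set[str]]:
    """Map each node id to the set of node ids it depends on."""
    nodes = config["nodes"]
    known = {node["id"] for node in nodes}
    topology = {}
    for node in nodes:
        prerequisites = set(node.get("prerequisites", []))
        # Fail early if a prerequisite names a node that is not in the config.
        missing = prerequisites - known
        if missing:
            raise ValueError(
                f"{node['id']} depends on unknown nodes: {sorted(missing)}"
            )
        topology[node["id"]] = prerequisites
    return topology


def execution_order(config: dict) -> list[str]:
    """Return one valid run order; graphlib raises CycleError on cycles."""
    return list(TopologicalSorter(build_topology(config)).static_order())


print(execution_order(dag_config))
```

In an orchestrator, the same node-to-prerequisites mapping would presumably be used to wire task dependencies rather than to print an order, but the point is the same: the dependencies can be read from the config alone.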
The previous implementation had configuration distributed across multiple files in the `config` directory. The configuration was therefore not isolated per DAG, which meant heavy lookups into the nested structures of the configs and the DAG code to understand the overall processes. Merging the configuration of multiple gentropy steps and extracting it as a single entity called the `dag config`, stored under `src/ot_orchestration/dags/config/*.yaml`, should increase the readability and explicitness of each process. Enabling the nodes and prerequisites in the config means that, in most cases, one can skip reading the logic of the DAG itself and focus on the process definition maintained in the `dag config`.

Things implemented:
- `ukb_ppp_eur_harmonisation` DAG
- `gwas_curation_update` DAG
- `gwas_catalog_preprocess` DAG
- `gnomad_ingestion` DAG
- `gwas_catalog_harmonisation` DAG -> the content is under development of `gwas_catalog_pipeline` DAG
- `finngen_ukb_meta_harmonisation` DAG
- `finngen_ingestion` DAG + addition of extra parameter `sample_size`
- `eqtl_ingestion` DAG.
- `dataproc` related functions.
- `make dev`: the bashrc file is not populated with junk lines,