Unified Pipeline #84
Merged
28 tasks
javfg force-pushed the unified-orchestrator branch from 096115e to 31a4013 on November 27, 2024 09:55
I'd say we're ready to merge this before it grows even bigger. We can keep adding on top later as we do fixes/improvements.
Gentropy step `variant_annotation` takes about 40 minutes and does not need the cluster, so by the time things are ready for the next step, the cluster is dead. By parametrizing the TTL and passing 1 hour, we ensure the cluster stays alive.
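As a rough illustration of the TTL parametrization described above, here is a minimal sketch that builds a partial cluster config with a configurable idle-delete TTL. The helper `build_cluster_config`, its signature, and its 30 minute default are hypothetical and not taken from this PR. The `lifecycle_config.idle_delete_ttl` shape is an assumption modelled on the Dataproc cluster lifecycle settings, so check it against the API before relying on it.

```python
from datetime import timedelta


def build_cluster_config(idle_delete_ttl: timedelta = timedelta(minutes=30)) -> dict:
    """Return a partial cluster config with a parametrized idle-delete TTL.

    Hypothetical helper: the name, signature and default are illustrative only.
    """
    if idle_delete_ttl <= timedelta(0):
        raise ValueError("idle_delete_ttl must be positive")
    return {
        "lifecycle_config": {
            # Assumed field shape: a duration expressed in whole seconds.
            "idle_delete_ttl": {"seconds": int(idle_delete_ttl.total_seconds())},
        }
    }


# variant_annotation runs for about 40 minutes without touching the cluster,
# so a 1 hour TTL leaves headroom before the next step needs it.
config = build_cluster_config(idle_delete_ttl=timedelta(hours=1))
```

Passing the TTL as an argument rather than hard-coding it lets each pipeline run pick a value that covers its longest cluster-idle step.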
Data is being partitioned too much: we are getting about 7.2k files of roughly 600 kB each for the colocalisation steps. This slows down the Elasticsearch ingestion. This adds a coalescing step after Spark is done to reduce those numbers.
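To give a feel for the numbers, here is a small sketch that estimates how many files the coalescing step could aim for. The function `target_file_count` and the 128 MiB target size are assumptions made for illustration; the PR does not state which target it uses or how the coalescing itself is implemented.

```python
import math


def target_file_count(n_files: int, avg_file_bytes: int, target_file_bytes: int) -> int:
    """Estimate how many output files to coalesce into for a given target file size."""
    if n_files < 0 or avg_file_bytes < 0:
        raise ValueError("n_files and avg_file_bytes must be non-negative")
    if target_file_bytes <= 0:
        raise ValueError("target_file_bytes must be positive")
    total_bytes = n_files * avg_file_bytes
    # Keep at least one output file, and never more files than we started with.
    return max(1, min(n_files, math.ceil(total_bytes / target_file_bytes)))


# About 7.2k files of 600 kB each, coalesced towards an assumed 128 MiB per file.
n_target = target_file_count(7_200, 600_000, 128 * 1024**2)
print(n_target)  # 33
```

With these assumed values, roughly 4.3 GB of output would end up in a few dozen files instead of several thousand.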
javfg force-pushed the unified-orchestrator branch from 93043db to d1dfe69 on November 28, 2024 14:28
Massive work @javfg, all makes sense. I have some minor comments, please take a look. Thank you for this!
project-defiant approved these changes on Nov 29, 2024
javfg added a commit that referenced this pull request on Nov 29, 2024
* feat: clarify unified pipeline configuration
* fix: ensure containers without envs work
* feat: improved labels
* feat: cluster and vm name generators
* fix: unified orchestrator becomes unified pipeline
* feat: extract pis env vars
* feat: add ontoform
* feat: add genetics steps to dependency graph
* feat: complete dependency generation
* feat: gentropy tasks and dependencies skeleton
* feat: templating yaml load
* feat: templating hocon load
* feat: project id argument for cluster
* fix: make yamlformat behave
* fix: multiple config changes
* feat: unified orchestrator
* chore: change uo to up
* chore: format yaml files
* fix: gentropy fixes
* fix: gentropy topology and settings
* fix: parametrize cluster ttl

  Gentropy step `variant_annotation` is taking about 40 minutes and does not need the cluster, so by the time things are ready for the next step, the cluster is dead. Parametrizing the ttl and passing 1 hour, we ensure things stay alive.
* fix: config changes from freeze10
* feat: configurable gentropy version
* feat: crude labels for gentropy resources
* fix: vep job name bug
* feat: parametrize vep version
* fix: coalesce data partitions

  Data is being partitioned too much, we are getting about 7.2k 600kB files for colocalisation steps. This slows down the elasticsearch ingestion. This adds a coalescing step after spark is done to reduce those numbers.
* fix: some config changes
* chore: update images
* fix: typing
* fix: correct executor topology
This PR adds the first (almost) working version of the unified pipeline DAG.
There is a lot of stuff still pending, but we are able to run the whole PIS->Ontoform->ETL->Gentropy->ETL process, except for a few steps that require manual intervention:
* `etl_disease`: pending implementation of `hpo` and `hpo_phenotypes` in Ontoform
* `gentropy_variant_annotation`: service account problems in Google Cloud
* `gentropy_coloc` steps: pending fixes in Gentropy

I'll make a short summary of the changes in the PR to make it easier to review:
* `src/ot_orchestration/dags/config/etl.conf`: ETL config file (with some path changes), nothing new to review.
* `src/ot_orchestration/dags/config/gentropy.yaml`: Gentropy config file. This is the part of `genetics_etl.yaml` that contains the configuration for Gentropy itself (more on this below).
* `src/ot_orchestration/dags/config/pis.yaml`: PIS config file, no need to review.
* `src/ot_orchestration/dags/config/unified_pipeline.py`: This is the Config class for the Unified Pipeline DAG. It builds a running configuration based on all the config files for each part.
* `src/ot_orchestration/dags/config/unified_pipeline.yaml`: This is the configuration file for the Unified Pipeline. The aim is that we only need to edit this file to create a new release.
* `src/ot_orchestration/dags/unified_pipeline.py`: This is the main file for the pipeline DAG, the most important part to review.
* `src/ot_orchestration/utils/dataproc.py`: Added a `project_id` parameter. This set of methods must be converted into Operators.
* `src/ot_orchestration/utils/utils.py`: Added some naming functions and a HOCON file parser for the ETL config.
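To make the idea of a single release-level config feeding every part more concrete, here is a minimal sketch of how a running configuration could be assembled. It is not the Config class from this PR: `render`, `build_running_config`, the `${name}` placeholder syntax, and all values below are hypothetical, and plain dicts stand in for the parsed YAML and HOCON files.

```python
from string import Template
from typing import Any


def render(value: Any, params: dict[str, str]) -> Any:
    """Recursively substitute ${name} placeholders in every string of a config tree."""
    if isinstance(value, str):
        return Template(value).substitute(params)
    if isinstance(value, dict):
        return {key: render(item, params) for key, item in value.items()}
    if isinstance(value, list):
        return [render(item, params) for item in value]
    return value


def build_running_config(
    pipeline: dict[str, Any], parts: dict[str, dict[str, Any]]
) -> dict[str, Any]:
    """Combine per-part configs into one config, filling placeholders from the pipeline config."""
    params = {key: str(item) for key, item in pipeline.items()}
    return {name: render(config, params) for name, config in parts.items()}


# Hypothetical values standing in for the contents of the real config files.
pipeline = {"release": "25.01", "bucket": "gs://example-bucket"}
parts = {
    "pis": {"output": "${bucket}/${release}/input"},
    "etl": {"input": "${bucket}/${release}/input", "steps": ["disease", "target"]},
}
running = build_running_config(pipeline, parts)
```

In this sketch, changing `release` in the pipeline-level dict is enough to redirect every part, which mirrors the stated aim of editing only one file per release.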