-
Notifications
You must be signed in to change notification settings - Fork 9
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
chore: tidying up gwas catalog ingestion and process configuration #426
Conversation
…_gwas_configuration
…s/gentropy into ds_gwas_configuration
…_gwas_configuration
…s/gentropy into ds_gwas_configuration
for more information, see https://pre-commit.ci
Codecov ReportAttention:
Additional details and impacted files@@ Coverage Diff @@
## dev #426 +/- ##
==========================================
+ Coverage 85.67% 85.95% +0.28%
==========================================
Files 89 96 +7
Lines 2101 2628 +527
==========================================
+ Hits 1800 2259 +459
- Misses 301 369 +68
|
@@ -1,10 +1,33 @@ | |||
# Release specific configuration: | |||
release_version: "24.01" | |||
version: "XX.XX" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This needed to be kept to make sure other, so far unchanged parameters still works. Need to be refactor later.
release_folder: gs://genetics_etl_python_playground/releases/${datasets.release_version} | ||
|
||
inputs: gs://genetics_etl_python_playground/input | ||
outputs: gs://genetics_etl_python_playground/output/python_etl/parquet/${datasets.version} | ||
|
||
## Datasets: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not exactly sure how detailed these configs should be. I assume all the files the ETL DAG has to have access to should be here. However this makes a bit tricy to make sure the file parameters are consistent across this config and the DAG definition.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unfortunately, YAML config and DAG configuration are at the moment two separate entities that don't talk to each other
study_locus_input_path: ??? | ||
ld_index_path: ??? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This path is always the same, added to main config.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
makes sense. Looks like a previous error
@@ -14,7 +14,9 @@ | |||
CLUSTER_NAME = "otg-gwascatalog-harmonisation" | |||
AUTOSCALING = "gwascatalog-harmonisation" | |||
|
|||
SUMMARY_STATS_BUCKET_NAME = "open-targets-gwas-summary-stats" | |||
SUMMARY_STATS_BUCKET_NAME = "gwas_catalog_data" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to try this, not 100% sure if all these changes are correct.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changes make sense. It's just a matter of running it
SUMSTATS = "gs://open-targets-gwas-summary-stats/harmonised" | ||
MANIFESTS_PATH = f"{RELEASEBUCKET}/manifests/" | ||
# Setting up bucket name and output object names: | ||
GWAS_CATALOG_BUCKET_NAME = "gwas_catalog_data" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I moved all configuration outside the actual step definitions to make sure no parameters/output path is obscure. However it would be nice to abstract these paths further because these outputs needs to be kept consistent with parameters the other DAGs use/produce.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agree
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It all makes sense. Thoughts:
- It's a lot of changes in the airflow layer. We are probably missing something but we will only know by running
- The
gentropy
changes look solid and you picked some bugs. No concerns - The
update_GWAS_Catalog_data.sh
looks very patchy. Unless I'm missing some operation I would probably rely on a GCSToGCSOperator to do the copying within GCP.
study_locus_input_path: ??? | ||
ld_index_path: ??? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
makes sense. Looks like a previous error
@@ -14,7 +14,9 @@ | |||
CLUSTER_NAME = "otg-gwascatalog-harmonisation" | |||
AUTOSCALING = "gwascatalog-harmonisation" | |||
|
|||
SUMMARY_STATS_BUCKET_NAME = "open-targets-gwas-summary-stats" | |||
SUMMARY_STATS_BUCKET_NAME = "gwas_catalog_data" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changes make sense. It's just a matter of running it
SUMSTATS = "gs://open-targets-gwas-summary-stats/harmonised" | ||
MANIFESTS_PATH = f"{RELEASEBUCKET}/manifests/" | ||
# Setting up bucket name and output object names: | ||
GWAS_CATALOG_BUCKET_NAME = "gwas_catalog_data" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agree
I totally agree, when this script was written there was no airflow in place. It should be refactored and moved into airflow. |
…_gwas_configuration
Updates:
gs://gwas_catalog_data
harmonised_summary_statistics
, ~5.7TB) and pre-harmonised (raw_summary_statistics
, ~7.1TB) summary statistics are moved here.gs://gwas_catalog_data/curated_inputs/
update_GWAS_Catalog_data.sh
script in the utils folder.manifests/GWAS_Catalog_curated_data_update.log
. This file contains GWAS Catalog release date, version etc.GWAS Catalog bucket structure:
The data in the
study_index
,credible_set_datasets
,manifests
andstudy_locus_datasets
folders are regenerated by the gwas pre-process dag. The content of these folders can be propagated upon running a release.Contenst of the manifests folder
gwas_catalog_data_update.log
- the log file generated upon refreshing curated GWAS Catalog data.gwas_catalog_harmonised_sumstats_list.txt
- list of studies with harmonised summary statistics we have ingested.gwas_catalog_study_curation.tsv
- curation table we generated in-house for studies with summary statisticsgwas_catalog_curated_included_studies
- list of study ids that eligible for ingestion in the curated path.gwas_catalog_curation_excluded_studies
- studies that were excluded from ingestion in the curated path + annotation on why the exclusion happened.gwas_catalog_summary_statistics_excluded_studies
- study ids that were excluded from summary statistics ingestion.gwas_catalog_summary_statistics_included_studies
- study ids that were eligible for summary statistics ingestion.