chore: tidying up gwas catalog ingestion and process configuration #426

DSuveges · 2024-01-17T15:14:08Z

Updates:

New location for all gwas catalog data: gs://gwas_catalog_data
All harmonised (harmonised_summary_statistics, ~5.7TB) and pre-harmonised (raw_summary_statistics, ~7.1TB) summary statistics have been moved here.
Curated data is located under gs://gwas_catalog_data/curated_inputs/
These files are updated by calling the update_GWAS_Catalog_data.sh script in the utils folder.
The fetched files are no longer versioned or time-stamped. They have constant names that the main configuration can refer to (no updates are required).
The version log generated upon data update is uploaded: manifests/GWAS_Catalog_curated_data_update.log. This file contains the GWAS Catalog release date, version etc.
The update script also saves a snapshot of the study curation file from the curation repo into the manifest folder.
All file names use underscores now.
The gwas preprocess dag runs.
The gwas harmonisation dag was also updated, but I'm not sure if it runs, as I could not update the raw sumstats folder (no scrum access).
As the business logic of the process did not change, I didn't do any deep QC on the results.

GWAS Catalog bucket structure:

gs://gwas_catalog_data/credible_set_datasets/
gs://gwas_catalog_data/credible_set_datasets/gwas_catalog_curated
gs://gwas_catalog_data/credible_set_datasets/gwas_catalog_summary_stats
gs://gwas_catalog_data/curated_inputs/
gs://gwas_catalog_data/harmonised_summary_statistics/
gs://gwas_catalog_data/manifests/
gs://gwas_catalog_data/raw_summary_statistics/
gs://gwas_catalog_data/study_index/
gs://gwas_catalog_data/study_locus_datasets/
gs://gwas_catalog_data/study_locus_datasets/gwas_catalog_curated_associations/
gs://gwas_catalog_data/study_locus_datasets/gwas_catalog_curated_associations_ld_clumped/
gs://gwas_catalog_data/study_locus_datasets/gwas_catalog_summary_stats_ld_clumped/
gs://gwas_catalog_data/study_locus_datasets/gwas_catalog_summary_stats_window_clumped/

The data in the study_index, credible_set_datasets, manifests and study_locus_datasets folders are regenerated by the gwas pre-process dag. The content of these folders can be propagated upon running a release.
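
As a rough illustration of the split described above, the sketch below tells whether a given URI sits in a folder that the pre-process dag regenerates. The bucket and folder names are taken from the listing; the constant and function names (REGENERATED_FOLDERS, is_regenerated) are hypothetical and not part of the codebase.

```python
# Bucket name as used in this PR.
GWAS_CATALOG_BUCKET_NAME = "gwas_catalog_data"

# Top-level folders regenerated by the gwas pre-process dag (from the description above).
REGENERATED_FOLDERS = frozenset(
    {"study_index", "credible_set_datasets", "manifests", "study_locus_datasets"}
)


def is_regenerated(uri: str) -> bool:
    """Return True if the URI points inside a folder the pre-process dag regenerates."""
    prefix = f"gs://{GWAS_CATALOG_BUCKET_NAME}/"
    if not uri.startswith(prefix):
        # Anything outside the GWAS Catalog bucket is out of scope here.
        return False
    # The first path component after the bucket is the top-level folder.
    top_level = uri[len(prefix):].split("/", 1)[0]
    return top_level in REGENERATED_FOLDERS


if __name__ == "__main__":
    for example in (
        "gs://gwas_catalog_data/study_index/",
        "gs://gwas_catalog_data/raw_summary_statistics/",
    ):
        print(example, is_regenerated(example))
```

Folders that return False here (curated_inputs, raw_summary_statistics, harmonised_summary_statistics) are inputs rather than outputs of the pre-process dag.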

Contents of the manifests folder:

gwas_catalog_data_update.log - the log file generated upon refreshing curated GWAS Catalog data.
gwas_catalog_harmonised_sumstats_list.txt - list of studies with harmonised summary statistics we have ingested.
gwas_catalog_study_curation.tsv - curation table we generated in-house for studies with summary statistics.
gwas_catalog_curated_included_studies - list of study ids that are eligible for ingestion in the curated path.
gwas_catalog_curation_excluded_studies - studies that were excluded from ingestion in the curated path + annotation on why the exclusion happened.
gwas_catalog_summary_statistics_excluded_studies - study ids that were excluded from summary statistics ingestion.
gwas_catalog_summary_statistics_included_studies - study ids that were eligible for summary statistics ingestion.
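
One sanity check these manifests make possible is confirming that no study appears in both an included and an excluded list. The sketch below assumes the two summary statistics manifests are plain text with one study id per line; that format is an assumption, not something stated in this PR, and the function names and study ids are made up for illustration.

```python
from __future__ import annotations


def parse_study_ids(text: str) -> set[str]:
    """Return the non-empty, stripped lines of a manifest as a set of study ids."""
    return {line.strip() for line in text.splitlines() if line.strip()}


def overlapping_studies(included_text: str, excluded_text: str) -> set[str]:
    """Study ids present in both lists; this is expected to be empty."""
    return parse_study_ids(included_text) & parse_study_ids(excluded_text)


if __name__ == "__main__":
    # Made-up ids, standing in for the contents of the two manifest files.
    included = "GCST000001\nGCST000002\n"
    excluded = "GCST000003\n"
    overlap = overlapping_studies(included, excluded)
    if overlap:
        print("Studies listed as both included and excluded:", sorted(overlap))
    else:
        print("No overlap between included and excluded studies.")
```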


codecov-commenter · 2024-01-17T22:59:18Z

Codecov Report

Attention: 176 lines in your changes are missing coverage. Please review.

Comparison is base (42b366c) 85.67% compared to head (07f32e2) 85.95%.
Report is 81 commits behind head on dev.

Additional details and impacted files

@@            Coverage Diff             @@
##              dev     #426      +/-   ##
==========================================
+ Coverage   85.67%   85.95%   +0.28%     
==========================================
  Files          89       96       +7     
  Lines        2101     2628     +527     
==========================================
+ Hits         1800     2259     +459     
- Misses        301      369      +68

Files	Coverage Δ
src/airflow/dags/common_airflow.py	`90.38% <100.00%> (ø)`
src/airflow/dags/dag_preprocess.py	`100.00% <ø> (ø)`
src/airflow/dags/finngen_preprocess.py	`100.00% <100.00%> (ø)`
src/airflow/dags/gwas_curation_update.py	`100.00% <100.00%> (ø)`
src/gentropy/__init__.py	`100.00% <ø> (ø)`
src/gentropy/assets/__init__.py	`100.00% <ø> (ø)`
src/gentropy/assets/data/__init__.py	`100.00% <ø> (ø)`
src/gentropy/assets/schemas/__init__.py	`100.00% <ø> (ø)`
src/gentropy/cli.py	`91.66% <100.00%> (ø)`
src/gentropy/common/Liftover.py	`80.64% <ø> (ø)`
... and 82 more


DSuveges · 2024-01-18T09:26:29Z

config/datasets/ot_gcp.yaml

@@ -1,10 +1,33 @@
 # Release specific configuration:
 release_version: "24.01"
+version: "XX.XX"


This needed to be kept to make sure other, so far unchanged parameters still work. It needs to be refactored later.

DSuveges · 2024-01-18T09:28:22Z

config/datasets/ot_gcp.yaml

 release_folder: gs://genetics_etl_python_playground/releases/${datasets.release_version}

 inputs: gs://genetics_etl_python_playground/input
 outputs: gs://genetics_etl_python_playground/output/python_etl/parquet/${datasets.version}

+## Datasets:


I'm not exactly sure how detailed these configs should be. I assume all the files the ETL DAG needs access to should be here. However, this makes it a bit tricky to keep the file parameters consistent across this config and the DAG definition.

Unfortunately, the YAML config and the DAG configuration are at the moment two separate entities that don't talk to each other.
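
Until the two are linked, one way to catch drift is a small check that compares the paths each side declares. This is a minimal sketch only: it assumes both sides can be reduced to a flat mapping of name to path, and the keys, the example values and the find_mismatches helper are all hypothetical.

```python
from __future__ import annotations


def _normalise(path: str | None) -> str | None:
    """Ignore trailing slashes so 'x/' and 'x' compare as equal."""
    return None if path is None else path.rstrip("/")


def find_mismatches(
    config_paths: dict[str, str], dag_paths: dict[str, str]
) -> dict[str, tuple[str | None, str | None]]:
    """Return keys whose paths differ, or that are missing on one side."""
    mismatches = {}
    for key in sorted(set(config_paths) | set(dag_paths)):
        config_value = config_paths.get(key)
        dag_value = dag_paths.get(key)
        if _normalise(config_value) != _normalise(dag_value):
            mismatches[key] = (config_value, dag_value)
    return mismatches


if __name__ == "__main__":
    # Hypothetical values standing in for the parsed YAML and the DAG constants.
    from_yaml = {
        "study_index": "gs://gwas_catalog_data/study_index",
        "manifests": "gs://gwas_catalog_data/manifests/",
    }
    from_dag = {
        "study_index": "gs://gwas_catalog_data/study_index/",
        "manifests": "gs://some_other_bucket/manifests/",
    }
    for name, (yaml_value, dag_value) in find_mismatches(from_yaml, from_dag).items():
        print(f"{name}: config={yaml_value} dag={dag_value}")
```

A check like this could run as a unit test, so a path changed on one side only fails before the DAG is deployed.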

DSuveges · 2024-01-18T09:29:20Z

config/step/ot_ld_based_clumping.yaml

 study_locus_input_path: ???
-ld_index_path: ???


This path is always the same, so I added it to the main config.


DSuveges · 2024-01-18T09:31:20Z

src/airflow/dags/gwas_catalog_harmonisation.py

@@ -14,7 +14,9 @@
 CLUSTER_NAME = "otg-gwascatalog-harmonisation"
 AUTOSCALING = "gwascatalog-harmonisation"

-SUMMARY_STATS_BUCKET_NAME = "open-targets-gwas-summary-stats"
+SUMMARY_STATS_BUCKET_NAME = "gwas_catalog_data"


We need to try this; I'm not 100% sure all these changes are correct.


DSuveges · 2024-01-18T09:33:25Z

src/airflow/dags/gwas_catalog_preprocess.py

-SUMSTATS = "gs://open-targets-gwas-summary-stats/harmonised"
-MANIFESTS_PATH = f"{RELEASEBUCKET}/manifests/"
+# Setting up bucket name and output object names:
+GWAS_CATALOG_BUCKET_NAME = "gwas_catalog_data"


I moved all configuration outside the actual step definitions to make sure no parameter or output path is obscure. However, it would be nice to abstract these paths further, because these outputs need to be kept consistent with the parameters the other DAGs use or produce.
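
One possible shape for that abstraction is a single shared object that every DAG imports instead of re-declaring strings. The sketch below is an assumption about how this could look, not what the PR implements: the GwasCatalogPaths class and its attribute names are hypothetical, while the bucket and folder names come from the bucket structure in the description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GwasCatalogPaths:
    """Single source of truth for GWAS Catalog locations (hypothetical helper)."""

    bucket: str = "gwas_catalog_data"

    def uri(self, *parts: str) -> str:
        """Build a gs:// URI inside the bucket from path components."""
        return "/".join([f"gs://{self.bucket}", *parts])

    @property
    def raw_sumstats(self) -> str:
        return self.uri("raw_summary_statistics")

    @property
    def harmonised_sumstats(self) -> str:
        return self.uri("harmonised_summary_statistics")

    @property
    def manifests(self) -> str:
        return self.uri("manifests")

    @property
    def study_index(self) -> str:
        return self.uri("study_index")


if __name__ == "__main__":
    paths = GwasCatalogPaths()
    # Both the harmonisation and the pre-process DAG would read from the same object.
    print(paths.raw_sumstats)
    print(paths.harmonised_sumstats)
    print(paths.uri("study_locus_datasets", "gwas_catalog_curated_associations"))
```

Freezing the dataclass keeps a DAG from mutating the shared paths by accident, and a different bucket (for example a test bucket) can be swapped in by constructing GwasCatalogPaths(bucket=...).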

d0choa

It all makes sense. Thoughts:

There are a lot of changes in the Airflow layer. We are probably missing something, but we will only know by running it.
The gentropy changes look solid and you picked up some bugs. No concerns.
The update_GWAS_Catalog_data.sh looks very patchy. Unless I'm missing some operation, I would probably rely on a GCSToGCSOperator to do the copying within GCP.

d0choa · 2024-01-18T10:39:38Z

config/step/ot_ld_based_clumping.yaml

 study_locus_input_path: ???
-ld_index_path: ???


makes sense. Looks like a previous error

d0choa · 2024-01-18T10:43:16Z

src/airflow/dags/gwas_catalog_harmonisation.py

@@ -14,7 +14,9 @@
 CLUSTER_NAME = "otg-gwascatalog-harmonisation"
 AUTOSCALING = "gwascatalog-harmonisation"

-SUMMARY_STATS_BUCKET_NAME = "open-targets-gwas-summary-stats"
+SUMMARY_STATS_BUCKET_NAME = "gwas_catalog_data"


changes make sense. It's just a matter of running it

d0choa · 2024-01-18T10:44:26Z

src/airflow/dags/gwas_catalog_preprocess.py

-SUMSTATS = "gs://open-targets-gwas-summary-stats/harmonised"
-MANIFESTS_PATH = f"{RELEASEBUCKET}/manifests/"
+# Setting up bucket name and output object names:
+GWAS_CATALOG_BUCKET_NAME = "gwas_catalog_data"


DSuveges · 2024-01-18T11:11:01Z

The update_GWAS_Catalog_data.sh looks very patchy. Unless I'm missing some operation I would probably rely on a GCSToGCSOperator to do the copying within GCP.

I totally agree. When this script was written, there was no Airflow in place. It should be refactored and moved into Airflow.
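
For reference, a minimal sketch of what such a copy task might look like with GCSToGCSOperator from the Airflow Google provider. It assumes the file to copy already sits in a GCS bucket; whether that holds for everything the shell script fetches is not established in this thread. The source bucket, object names and helper functions are hypothetical; only the destination bucket and the curated_inputs folder come from this PR.

```python
def curated_copy_kwargs(source_bucket: str, source_object: str, file_name: str) -> dict:
    """Build the operator arguments for copying one file into curated_inputs/."""
    return {
        # Dots are replaced only to keep the task id easy to read.
        "task_id": f"copy_{file_name.replace('.', '_')}",
        "source_bucket": source_bucket,
        "source_object": source_object,
        "destination_bucket": "gwas_catalog_data",
        "destination_object": f"curated_inputs/{file_name}",
    }


def make_copy_task(**kwargs):
    """Create the Airflow task; call this inside a DAG definition."""
    # Imported lazily so the helper above can be used without Airflow installed.
    from airflow.providers.google.cloud.transfers.gcs_to_gcs import GCSToGCSOperator

    return GCSToGCSOperator(**kwargs)


if __name__ == "__main__":
    # Hypothetical source location, for illustration only.
    kwargs = curated_copy_kwargs(
        source_bucket="some_staging_bucket",
        source_object="incoming/study_curation.tsv",
        file_name="study_curation.tsv",
    )
    print(kwargs)
```

Inside a DAG, make_copy_task(**kwargs) would replace the corresponding copy step of the shell script; any step that downloads from outside GCP would still need a different operator.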


DSuveges added 8 commits January 17, 2024 12:10

chore: updaing configs to the propsed release folder structure

64fef48

chore: configuration of GWAS Catalog ingestion clean up

0eb9df7

chore: resolve conflict

e21d9b3

Merge branch 'dev' into ds_gwas_configuration

8af469d

chore: updatig more gwas catalog related configs

023cabc

Merge branch 'dev' of https://github.com/opentargets/gentropy into ds_gwas_configuration

9cb52ba

Merge branch 'ds_gwas_configuration' of https://github.com/opentargets/gentropy into ds_gwas_configuration

cdadcaf

Merge branch 'dev' into ds_gwas_configuration

87176b9

DSuveges changed the title ~~Chore: tidying up gwas catalog ingestion and process configuration~~ chore: tidying up gwas catalog ingestion and process configuration Jan 17, 2024

DSuveges and others added 4 commits January 17, 2024 22:53

chore: finalising dags

5ae2718

Merge branch 'dev' of https://github.com/opentargets/gentropy into ds_gwas_configuration

5fd0687

Merge branch 'ds_gwas_configuration' of https://github.com/opentargets/gentropy into ds_gwas_configuration

ec8bd6f

[pre-commit.ci] auto fixes from pre-commit.com hooks

21a480d

for more information, see https://pre-commit.ci

DSuveges added 2 commits January 18, 2024 09:16

chore: finalising dag config

c164d92

Merge branch 'ds_gwas_configuration' of https://github.com/opentargets/gentropy into ds_gwas_configuration

0b0b925

DSuveges commented Jan 18, 2024


DSuveges added 4 commits January 18, 2024 09:34

refactor: reverting ad-hoc changes

7312617

chore: truning full DAG on

2ee51ac

fix: docs updated so mkdocs won't fail

faa6d56

Merge branch 'dev' into ds_gwas_configuration

07f32e2

DSuveges marked this pull request as ready for review January 18, 2024 10:03

DSuveges requested a review from d0choa January 18, 2024 10:03

d0choa approved these changes Jan 18, 2024


Merge branch 'dev' of https://github.com/opentargets/gentropy into ds_gwas_configuration

0984635

DSuveges merged commit 2d8b08b into dev Jan 18, 2024
3 checks passed

DSuveges mentioned this pull request Jan 18, 2024

Prototyping release data folder structure opentargets/issues#3193

Closed

4 tasks

DSuveges deleted the ds_gwas_configuration branch January 18, 2024 15:29
