chore: tidying up gwas catalog ingestion and process configuration #426

Merged: DSuveges merged 19 commits into dev from ds_gwas_configuration on Jan 18, 2024

DSuveges
Contributor

@DSuveges DSuveges commented Jan 17, 2024

Updates:

  • New location for all gwas catalog data: gs://gwas_catalog_data
  • All harmonised (harmonised_summary_statistics, ~5.7TB) and pre-harmonised (raw_summary_statistics, ~7.1TB) summary statistics are moved here.
  • Curated data is located under gs://gwas_catalog_data/curated_inputs/
  • These files are updated by calling the update_GWAS_Catalog_data.sh script in the utils folder.
  • The fetched files are no longer versioned or time-stamped. They have constant names that the main configuration can refer to (no configuration update is required).
  • The version log generated upon data update is uploaded to manifests/GWAS_Catalog_curated_data_update.log. This file contains the GWAS Catalog release date, version, etc.
  • The update script also saves a snapshot of the study curation file from the curation repo into the manifests folder.
  • All file names now use underscores.
  • The gwas preprocess dag runs.
  • The gwas harmonisation dag was also updated, but I'm not sure if it runs, as I could not update the raw sumstats folder (no scrum access).
  • As the business logic of the process did not change, I didn't do any deep QC on the results.

GWAS Catalog bucket structure:

gs://gwas_catalog_data/credible_set_datasets/
gs://gwas_catalog_data/credible_set_datasets/gwas_catalog_curated
gs://gwas_catalog_data/credible_set_datasets/gwas_catalog_summary_stats
gs://gwas_catalog_data/curated_inputs/
gs://gwas_catalog_data/harmonised_summary_statistics/
gs://gwas_catalog_data/manifests/
gs://gwas_catalog_data/raw_summary_statistics/
gs://gwas_catalog_data/study_index/
gs://gwas_catalog_data/study_locus_datasets/
gs://gwas_catalog_data/study_locus_datasets/gwas_catalog_curated_associations/
gs://gwas_catalog_data/study_locus_datasets/gwas_catalog_curated_associations_ld_clumped/
gs://gwas_catalog_data/study_locus_datasets/gwas_catalog_summary_stats_ld_clumped/
gs://gwas_catalog_data/study_locus_datasets/gwas_catalog_summary_stats_window_clumped/

The data in the study_index, credible_set_datasets, manifests and study_locus_datasets folders are regenerated by the gwas pre-process dag. The content of these folders can be propagated upon running a release.
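
To make the distinction between regenerated and static folders concrete, here is a minimal sketch that classifies a path in the bucket. The folder names are taken from the listing above; the helper name `is_regenerated` and the constant `REGENERATED_FOLDERS` are hypothetical and not part of the codebase.

```python
from urllib.parse import urlparse

# Top-level folders that, per the description above, the gwas pre-process
# dag regenerates. The constant name is an assumption for illustration.
REGENERATED_FOLDERS = frozenset(
    {"study_index", "credible_set_datasets", "manifests", "study_locus_datasets"}
)


def is_regenerated(uri: str, bucket: str = "gwas_catalog_data") -> bool:
    """Return True if `uri` sits under a folder the pre-process dag rewrites."""
    parsed = urlparse(uri)
    if parsed.scheme != "gs" or parsed.netloc != bucket:
        raise ValueError(f"Not an object in gs://{bucket}: {uri}")
    # The first path component is the top-level folder in the bucket.
    top_level = parsed.path.lstrip("/").split("/", 1)[0]
    return top_level in REGENERATED_FOLDERS
```

A check like this could guard against accidentally overwriting the summary statistics or curated inputs, which are not rebuilt by the dag.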

Contents of the manifests folder:

  • gwas_catalog_data_update.log - the log file generated upon refreshing curated GWAS Catalog data.
  • gwas_catalog_harmonised_sumstats_list.txt - list of studies with harmonised summary statistics we have ingested.
  • gwas_catalog_study_curation.tsv - curation table we generated in-house for studies with summary statistics
  • gwas_catalog_curated_included_studies - list of study ids that are eligible for ingestion in the curated path.
  • gwas_catalog_curation_excluded_studies - studies that were excluded from ingestion in the curated path + annotation on why the exclusion happened.
  • gwas_catalog_summary_statistics_excluded_studies - study ids that were excluded from summary statistics ingestion.
  • gwas_catalog_summary_statistics_included_studies - study ids that were eligible for summary statistics ingestion.

@DSuveges DSuveges changed the title Chore: tidying up gwas catalog ingestion and process configuration chore: tidying up gwas catalog ingestion and process configuration Jan 17, 2024

codecov-commenter commented Jan 17, 2024

Codecov Report

Attention: 176 lines in your changes are missing coverage. Please review.

Comparison is base (42b366c) 85.67% compared to head (07f32e2) 85.95%.
Report is 81 commits behind head on dev.

Additional details and impacted files

@@            Coverage Diff             @@
##              dev     #426      +/-   ##
==========================================
+ Coverage   85.67%   85.95%   +0.28%     
==========================================
  Files          89       96       +7     
  Lines        2101     2628     +527     
==========================================
+ Hits         1800     2259     +459     
- Misses        301      369      +68     
Files                                     Coverage Δ
src/airflow/dags/common_airflow.py        90.38% <100.00%> (ø)
src/airflow/dags/dag_preprocess.py        100.00% <ø> (ø)
src/airflow/dags/finngen_preprocess.py    100.00% <100.00%> (ø)
src/airflow/dags/gwas_curation_update.py  100.00% <100.00%> (ø)
src/gentropy/__init__.py                  100.00% <ø> (ø)
src/gentropy/assets/__init__.py           100.00% <ø> (ø)
src/gentropy/assets/data/__init__.py      100.00% <ø> (ø)
src/gentropy/assets/schemas/__init__.py   100.00% <ø> (ø)
src/gentropy/cli.py                       91.66% <100.00%> (ø)
src/gentropy/common/Liftover.py           80.64% <ø> (ø)
... and 82 more

@@ -1,10 +1,33 @@
# Release specific configuration:
release_version: "24.01"
version: "XX.XX"
Contributor Author

This needed to be kept to make sure other, so far unchanged, parameters still work. It needs to be refactored later.

release_folder: gs://genetics_etl_python_playground/releases/${datasets.release_version}

inputs: gs://genetics_etl_python_playground/input
outputs: gs://genetics_etl_python_playground/output/python_etl/parquet/${datasets.version}

## Datasets:
Contributor Author

I'm not exactly sure how detailed these configs should be. I assume all the files the ETL DAG has to have access to should be here. However, this makes it a bit tricky to make sure the file parameters are consistent across this config and the DAG definition.

Collaborator

unfortunately, YAML config and DAG configuration are at the moment two separate entities that don't talk to each other
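
Until the two are linked, one possible stopgap is a small check that compares the paths both sides declare. The sketch below is only an illustration: the function name `find_mismatches` and the key `harmonised_sumstats` are hypothetical, and it assumes both configurations can be flattened into plain string mappings.

```python
from __future__ import annotations


def find_mismatches(
    yaml_config: dict[str, str], dag_constants: dict[str, str]
) -> dict[str, tuple[str, str]]:
    """Return keys present in both mappings whose paths differ.

    Trailing slashes are ignored so that 'gs://b/x' and 'gs://b/x/' match.
    """
    shared = yaml_config.keys() & dag_constants.keys()
    return {
        key: (yaml_config[key], dag_constants[key])
        for key in sorted(shared)
        if yaml_config[key].rstrip("/") != dag_constants[key].rstrip("/")
    }


# Hypothetical example: the YAML side was updated, the DAG side was not.
yaml_side = {"harmonised_sumstats": "gs://gwas_catalog_data/harmonised_summary_statistics/"}
dag_side = {"harmonised_sumstats": "gs://open-targets-gwas-summary-stats/harmonised"}
mismatches = find_mismatches(yaml_side, dag_side)
```

Run as a unit test, a check of this kind would flag a path that was changed in only one of the two places.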

study_locus_input_path: ???
ld_index_path: ???
Contributor Author

This path is always the same, added to main config.

Collaborator

makes sense. Looks like a previous error

@@ -14,7 +14,9 @@
CLUSTER_NAME = "otg-gwascatalog-harmonisation"
AUTOSCALING = "gwascatalog-harmonisation"

-SUMMARY_STATS_BUCKET_NAME = "open-targets-gwas-summary-stats"
+SUMMARY_STATS_BUCKET_NAME = "gwas_catalog_data"
Contributor Author

We need to try this; I'm not 100% sure all these changes are correct.

Collaborator

changes make sense. It's just a matter of running it

SUMSTATS = "gs://open-targets-gwas-summary-stats/harmonised"
MANIFESTS_PATH = f"{RELEASEBUCKET}/manifests/"
# Setting up bucket name and output object names:
GWAS_CATALOG_BUCKET_NAME = "gwas_catalog_data"
Contributor Author

I moved all configuration outside the actual step definitions to make sure no parameter or output path is obscured. However, it would be nice to abstract these paths further, because these outputs need to be kept consistent with the parameters the other DAGs use or produce.

Collaborator

agree
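
One way the further abstraction discussed in this thread might look is a single immutable object that derives every path from the bucket name, so that all DAGs import the same definitions. This is a sketch under assumptions: the class `GwasCatalogPaths` and its property names are hypothetical, and only the bucket and folder names come from this pull request.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GwasCatalogPaths:
    """Hypothetical single source of truth for GWAS Catalog bucket paths."""

    bucket: str = "gwas_catalog_data"

    def _uri(self, *parts: str) -> str:
        # Join the bucket root and the folder parts into a gs:// URI.
        return "/".join((f"gs://{self.bucket}", *parts))

    @property
    def curated_inputs(self) -> str:
        return self._uri("curated_inputs")

    @property
    def manifests(self) -> str:
        return self._uri("manifests")

    @property
    def harmonised_summary_statistics(self) -> str:
        return self._uri("harmonised_summary_statistics")

    @property
    def study_index(self) -> str:
        return self._uri("study_index")


paths = GwasCatalogPaths()
```

Because the dataclass is frozen, a DAG cannot silently mutate a path at runtime, and pointing a test run at another bucket is a one-argument change, for example `GwasCatalogPaths(bucket="some_test_bucket")`.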

@DSuveges DSuveges marked this pull request as ready for review January 18, 2024 10:03
@DSuveges DSuveges requested a review from d0choa January 18, 2024 10:03
Collaborator

@d0choa d0choa left a comment

It all makes sense. Thoughts:

  • It's a lot of changes in the airflow layer. We are probably missing something, but we will only know by running it.
  • The gentropy changes look solid and you picked up some bugs. No concerns.
  • The update_GWAS_Catalog_data.sh looks very patchy. Unless I'm missing some operation I would probably rely on a GCSToGCSOperator to do the copying within GCP.


@DSuveges
Contributor Author

  • The update_GWAS_Catalog_data.sh looks very patchy. Unless I'm missing some operation I would probably rely on a GCSToGCSOperator to do the copying within GCP.

I totally agree. When this script was written, there was no Airflow in place. It should be refactored and moved into Airflow.
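
As a rough sketch of what that refactor could look like for the bucket-to-bucket copying part, the helper below builds the keyword arguments for a GCSToGCSOperator task. The keyword names are parameters of the operator in the apache-airflow-providers-google package; the helper `gcs_copy_kwargs` and the source bucket `example_staging_bucket` are hypothetical, and the operator call is left as a comment so the snippet runs without Airflow installed.

```python
def gcs_copy_kwargs(
    task_id: str,
    source_bucket: str,
    source_object: str,
    destination_object: str,
    destination_bucket: str = "gwas_catalog_data",
) -> dict:
    """Build keyword arguments for a bucket-to-bucket copy task."""
    return {
        "task_id": task_id,
        "source_bucket": source_bucket,
        "source_object": source_object,
        "destination_bucket": destination_bucket,
        "destination_object": destination_object,
    }


# Hypothetical source location; only the destination comes from this PR.
kwargs = gcs_copy_kwargs(
    task_id="copy_study_curation",
    source_bucket="example_staging_bucket",
    source_object="gwas_catalog_study_curation.tsv",
    destination_object="manifests/gwas_catalog_study_curation.tsv",
)

# Inside a DAG definition this would become:
# from airflow.providers.google.cloud.transfers.gcs_to_gcs import GCSToGCSOperator
# copy_study_curation = GCSToGCSOperator(**kwargs)
```

Any step of the shell script that fetches data from outside GCP would still need a different operator or task.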

@DSuveges DSuveges merged commit 2d8b08b into dev Jan 18, 2024
3 checks passed
@DSuveges DSuveges deleted the ds_gwas_configuration branch January 18, 2024 15:29

3 participants