feat(credible set qc dag): added dag and docs #59

Merged
merged 1 commit into from
Oct 25, 2024
17 changes: 17 additions & 0 deletions docs/credible_set_qc/README.md
@@ -0,0 +1,17 @@
## Credible set QC DAG

Credible set QC is a set of operations performed on the `StudyLocus` datasets originally fine-mapped by Open Targets. It is run to:

- Ensure the p-value of each locus meets the predefined threshold.
- Repartition the credible sets, because the output of the batch job contains separate files per locus, which results in slow queries.
- Ensure no duplicated loci exist in the clean credible sets.
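
The three operations above can be illustrated with a minimal, plain-Python sketch. It is not the gentropy implementation: the `Locus` class, the `qc_credible_sets` and `repartition` helpers, the "at or below the threshold" comparison, and the use of a single identifier to detect duplicates are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Locus:
    """Hypothetical, simplified stand-in for one StudyLocus row."""

    study_locus_id: str
    p_value: float


def qc_credible_sets(loci, p_value_threshold):
    """Keep loci at or below the threshold and drop repeated identifiers."""
    seen = set()
    clean = []
    for locus in loci:
        if locus.p_value > p_value_threshold:
            continue  # fails the significance filter (assumed comparison)
        if locus.study_locus_id in seen:
            continue  # duplicate of a locus that was already kept
        seen.add(locus.study_locus_id)
        clean.append(locus)
    return clean


def repartition(rows, n_partitions):
    """Spread rows round-robin over a fixed number of partitions."""
    partitions = [[] for _ in range(n_partitions)]
    for index, row in enumerate(rows):
        partitions[index % n_partitions].append(row)
    return partitions


loci = [Locus("a", 1e-8), Locus("a", 1e-8), Locus("b", 0.5), Locus("c", 1e-6)]
clean = qc_credible_sets(loci, p_value_threshold=1e-5)
parts = repartition(clean, n_partitions=2)
```

In this toy input, locus `b` is dropped by the p-value filter and the second `a` is dropped as a duplicate, so two loci remain and are written to a fixed number of partitions rather than one file each.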

![credible_set_qc](credible_set_qc.svg)

The DAG contains the following steps:

- QC of credible sets coming from the `gwas_catalog_sumstats_susie` bucket
- QC of credible sets coming from the `ukb_ppp_eur_data` bucket

> [!NOTE]
> The output of each step is written to its target bucket under the _credible_set_clean_ prefix.
62 changes: 62 additions & 0 deletions docs/credible_set_qc/credible_set_qc.svg
6 changes: 6 additions & 0 deletions docs/datasources/gwas_catalog_data/README.md
@@ -260,6 +260,7 @@ Bucket `gs://gwas_catalog_sumstats_susie` contains:

```
gs://gwas_catalog_sumstats_susie/credible_set_datasets/
gs://gwas_catalog_sumstats_susie/credible_set_clean/
gs://gwas_catalog_sumstats_susie/finemapping_logs/
gs://gwas_catalog_sumstats_susie/finemapping_manifests/
gs://gwas_catalog_sumstats_susie/study_index/
@@ -324,6 +325,11 @@ The output of finemapping can be found under the:
- `gs://gwas_catalog_sumstats_susie/finemapping_manifests/` - manifests used during the fine mapping job
- `gs://gwas_catalog_sumstats_susie/finemapping_logs/` - logs from the individual finemapping tasks

### Credible set QC

After the finemapping is performed, the QC DAG is run. For more detail, see [credible set qc dag](../../credible_set_qc/README.md).
The final credible sets are collected in `gs://gwas_catalog_sumstats_susie/credible_set_clean/`.

#### Parametrization of google batch finemapping job

The configuration of the Google Batch infrastructure and the individual step parameters can be found in the `gwas_catalog_sumstats_susie_finemapping.yaml` file.
6 changes: 6 additions & 0 deletions docs/datasources/ukb_ppp_eur_data/README.md
@@ -8,6 +8,7 @@ Data stored under `gs://ukb_ppp_eur_data` bucket comes with following structure

```
gs://ukb_ppp_eur_data/credible_set_datasets/susie
gs://ukb_ppp_eur_data/credible_set_clean/
gs://ukb_ppp_eur_data/docs/
gs://ukb_ppp_eur_data/finemapping_logs/
gs://ukb_ppp_eur_data/finemapping_manifests/
@@ -109,6 +110,11 @@ The output of finemapping can be found under the:
- `gs://ukb_ppp_eur_data/finemapping_manifests/` - manifests used during the fine mapping job
- `gs://ukb_ppp_eur_data/finemapping_logs/` - logs from the individual finemapping tasks

### Credible set QC

After the finemapping is performed, the QC DAG is run. For more detail, see [credible set qc dag](../../credible_set_qc/README.md).
The final credible sets are collected in `gs://ukb_ppp_eur_data/credible_set_clean/`.

#### Parametrization of google batch finemapping job

The configuration of the Google Batch infrastructure and the individual step parameters can be found in the `ukb_ppp_eur_finemapping.yaml` file.
34 changes: 34 additions & 0 deletions src/ot_orchestration/dags/config/credible_set_qc.yaml
@@ -0,0 +1,34 @@
dataproc:
  python_main_module: gs://genetics_etl_python_playground/initialisation/gentropy/dev/cli.py
  cluster_metadata:
    PACKAGE: gs://genetics_etl_python_playground/initialisation/gentropy/dev/gentropy-0.0.0-py3-none-any.whl
  cluster_init_script: gs://genetics_etl_python_playground/initialisation/gentropy/dev/install_dependencies_on_cluster.sh
  cluster_name: otg-credible-set-qc
  autoscaling_policy: otg-etl

nodes:
  - id: gwas_catalog_sumstats_susie_credible_set_qc
    kind: Task
    prerequisites: []
    params:
      step: credible_set_qc
      step.credible_sets_path: gs://gwas_catalog_sumstats_susie/credible_set_datasets
      step.output_path: gs://gwas_catalog_sumstats_susie/credible_set_clean
      step.p_value_threshold: 1.0e-5
      step.purity_min_r2: 0.01
      step.n_partitions: 200
      step.session.write_mode: overwrite
      step.session.start_hail: true

  - id: ukb_ppp_eur_data_credible_set_qc
    kind: Task
    prerequisites: []
    params:
      step: credible_set_qc
      step.credible_sets_path: gs://ukb_ppp_eur_data/credible_set_datasets/susie
      step.output_path: gs://ukb_ppp_eur_data/credible_set_clean
      step.p_value_threshold: 1.0e-5
      step.purity_min_r2: 0.01
      step.n_partitions: 50
      step.session.write_mode: overwrite
      step.session.start_hail: true
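
How `submit_gentropy_step` consumes each node's `params` is not shown in this diff. One plausible reading is that every dotted key becomes a `key=value` override passed to the step. The sketch below illustrates that reading only; `to_overrides` is a hypothetical helper, and the lower-casing of booleans is an assumption, not confirmed behaviour of ot_orchestration or gentropy.

```python
def to_overrides(params):
    """Render a flat params mapping as `key=value` strings (hypothetical helper)."""
    overrides = []
    for key, value in params.items():
        if isinstance(value, bool):
            # Assumption: booleans are written YAML-style (true/false).
            value = str(value).lower()
        overrides.append(f"{key}={value}")
    return overrides


# A subset of the ukb_ppp_eur_data node from the config above.
node = {
    "id": "ukb_ppp_eur_data_credible_set_qc",
    "params": {
        "step": "credible_set_qc",
        "step.output_path": "gs://ukb_ppp_eur_data/credible_set_clean",
        "step.n_partitions": 50,
        "step.session.start_hail": True,
    },
}

overrides = to_overrides(node["params"])
print(overrides)
```

Keeping the parameters flat and dotted in the YAML means a node can be read top to bottom as the argument list of one step, without nesting.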
41 changes: 41 additions & 0 deletions src/ot_orchestration/dags/credible_set_qc.py
@@ -0,0 +1,41 @@
"""Airflow DAG for the credible set qc."""

from __future__ import annotations

from pathlib import Path

from airflow.models.dag import DAG

from ot_orchestration.utils import chain_dependencies, read_yaml_config
from ot_orchestration.utils.common import shared_dag_args, shared_dag_kwargs
from ot_orchestration.utils.dataproc import (
    generate_dataproc_task_chain,
    submit_gentropy_step,
)

CONFIG_FILE_PATH = Path(__file__).parent / "config" / "credible_set_qc.yaml"
config = read_yaml_config(CONFIG_FILE_PATH)

with DAG(
    dag_id=Path(__file__).stem,
    description="Open Targets Genetics — CredibleSet QC ",
    default_args=shared_dag_args,
    **shared_dag_kwargs,
):
    tasks = {}
    for step in config["nodes"]:
        task = submit_gentropy_step(
            cluster_name=config["dataproc"]["cluster_name"],
            step_name=step["id"],
            python_main_module=config["dataproc"]["python_main_module"],
            params=step["params"],
        )
        tasks[step["id"]] = task

    chain_dependencies(nodes=config["nodes"], tasks_or_task_groups=tasks)
    dag = generate_dataproc_task_chain(
        cluster_name=config["dataproc"]["cluster_name"],
        cluster_init_script=config["dataproc"]["cluster_init_script"],
        cluster_metadata=config["dataproc"]["cluster_metadata"],
        tasks=list(tasks.values()),
    )
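
The DAG above wires tasks together from each node's `prerequisites` list. The internals of `chain_dependencies` are not part of this diff, so the following is only a sketch of the general idea, using the standard-library `graphlib` module; `execution_order` and the `collect_results` node are hypothetical and exist purely for illustration.

```python
from graphlib import TopologicalSorter


def execution_order(nodes):
    """Return node ids in an order that respects each node's prerequisites."""
    # TopologicalSorter expects a mapping of node -> set of predecessors.
    graph = {node["id"]: set(node["prerequisites"]) for node in nodes}
    return list(TopologicalSorter(graph).static_order())


# The two QC nodes from the config have no prerequisites.
nodes = [
    {"id": "gwas_catalog_sumstats_susie_credible_set_qc", "prerequisites": []},
    {"id": "ukb_ppp_eur_data_credible_set_qc", "prerequisites": []},
]

# Hypothetical downstream node, added only to show how a dependency is ordered.
extended = nodes + [
    {
        "id": "collect_results",
        "prerequisites": [
            "gwas_catalog_sumstats_susie_credible_set_qc",
            "ukb_ppp_eur_data_credible_set_qc",
        ],
    }
]

order = execution_order(extended)
print(order)
```

Because both QC nodes declare `prerequisites: []`, neither depends on the other, so they can be scheduled independently once the cluster is available.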