feat(credible set qc dag): added dag and docs (#59)
project-defiant authored Oct 25, 2024
1 parent 0ef3545 commit 1a4a537
Showing 6 changed files with 166 additions and 0 deletions.
17 changes: 17 additions & 0 deletions docs/credible_set_qc/README.md
@@ -0,0 +1,17 @@
## Credible set qc dag

Credible set qc is a set of operations performed on the `StudyLocus` datasets originally finemapped by OpenTargets to:

- Ensure that the pValue of each locus meets the pre-defined threshold.
- Repartition the credible sets: the batch job writes separate files for each locus, which results in slow queries.
- Ensure that no duplicated loci exist in the clean credible sets.
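
The filtering and deduplication above can be sketched in plain Python. This is only an illustration, not the gentropy implementation: the field names `study_locus_id` and `p_value` are hypothetical, the sketch assumes that "meeting the threshold" means a p-value at or below it, and repartitioning is left out because it only changes how the output files are laid out.

```python
from __future__ import annotations


def qc_credible_sets(loci: list[dict], p_value_threshold: float = 1.0e-5) -> list[dict]:
    """Keep loci that pass the p-value threshold and drop duplicated loci.

    Field names (`study_locus_id`, `p_value`) are hypothetical.
    """
    seen_ids: set[str] = set()
    clean: list[dict] = []
    for locus in loci:
        # Assumption: a locus passes when its p-value is <= the threshold.
        if locus["p_value"] > p_value_threshold:
            continue
        # Assumption: a duplicate is a repeated locus identifier.
        if locus["study_locus_id"] in seen_ids:
            continue
        seen_ids.add(locus["study_locus_id"])
        clean.append(locus)
    return clean


example = [
    {"study_locus_id": "a", "p_value": 1e-9},
    {"study_locus_id": "a", "p_value": 1e-9},  # duplicate of the first locus
    {"study_locus_id": "b", "p_value": 1e-3},  # fails the threshold
    {"study_locus_id": "c", "p_value": 1e-6},
]
print([locus["study_locus_id"] for locus in qc_credible_sets(example)])
```

With the example input, only loci `a` and `c` remain.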

![credible_set_qc](credible_set_qc.svg)

The dag contains the following steps:

- qc of credible sets coming from the `gwas_catalog_sumstats_susie` bucket
- qc of credible sets coming from the `ukb_ppp_eur_data` bucket

> [!NOTE]
> Each step writes its output to its own bucket under the _credible_set_clean_ prefix.
62 changes: 62 additions & 0 deletions docs/credible_set_qc/credible_set_qc.svg
6 changes: 6 additions & 0 deletions docs/datasources/gwas_catalog_data/README.md
@@ -260,6 +260,7 @@ Bucket `gs://gwas_catalog_sumstats_susie` contains:

```
gs://gwas_catalog_sumstats_susie/credible_set_datasets/
gs://gwas_catalog_sumstats_susie/credible_set_clean/
gs://gwas_catalog_sumstats_susie/finemapping_logs/
gs://gwas_catalog_sumstats_susie/finemapping_manifests/
gs://gwas_catalog_sumstats_susie/study_index/
@@ -324,6 +325,11 @@ The output of finemapping can be found under the:
- `gs://gwas_catalog_sumstats_susie/finemapping_manifests/` - manifests used during the fine mapping job
- `gs://gwas_catalog_sumstats_susie/finemapping_logs/` - logs from the individual finemapping tasks

### Credible set qc

After the finemapping is performed, the qc dag is run. For more detail, see [credible set qc dag](../../credible_set_qc/README.md).
The final credible sets are collected in `gs://gwas_catalog_sumstats_susie/credible_set_clean/`.

#### Parametrization of google batch finemapping job

The configuration of the google batch infrastructure and individual step parameters can be found in the `gwas_catalog_sumstats_susie_finemapping.yaml` file.
6 changes: 6 additions & 0 deletions docs/datasources/ukb_ppp_eur_data/README.md
@@ -8,6 +8,7 @@ Data stored under `gs://ukb_ppp_eur_data` bucket comes with following structure

```
gs://ukb_ppp_eur_data/credible_set_datasets/susie
gs://ukb_ppp_eur_data/credible_set_clean/
gs://ukb_ppp_eur_data/docs/
gs://ukb_ppp_eur_data/finemapping_logs/
gs://ukb_ppp_eur_data/finemapping_manifests/
@@ -109,6 +110,11 @@ The output of finemapping can be found under the:
- `gs://ukb_ppp_eur_data/finemapping_manifests/` - manifests used during the fine mapping job
- `gs://ukb_ppp_eur_data/finemapping_logs/` - logs from the individual finemapping tasks

### Credible set qc

After the finemapping is performed, the qc dag is run. For more detail, see [credible set qc dag](../../credible_set_qc/README.md).
The final credible sets are collected in `gs://ukb_ppp_eur_data/credible_set_clean/`.

#### Parametrization of google batch finemapping job

The configuration of the google batch infrastructure and individual step parameters can be found in the `ukb_ppp_eur_finemapping.yaml` file.
34 changes: 34 additions & 0 deletions src/ot_orchestration/dags/config/credible_set_qc.yaml
@@ -0,0 +1,34 @@
dataproc:
  python_main_module: gs://genetics_etl_python_playground/initialisation/gentropy/dev/cli.py
  cluster_metadata:
    PACKAGE: gs://genetics_etl_python_playground/initialisation/gentropy/dev/gentropy-0.0.0-py3-none-any.whl
  cluster_init_script: gs://genetics_etl_python_playground/initialisation/gentropy/dev/install_dependencies_on_cluster.sh
  cluster_name: otg-credible-set-qc
  autoscaling_policy: otg-etl

nodes:
  - id: gwas_catalog_sumstats_susie_credible_set_qc
    kind: Task
    prerequisites: []
    params:
      step: credible_set_qc
      step.credible_sets_path: gs://gwas_catalog_sumstats_susie/credible_set_datasets
      step.output_path: gs://gwas_catalog_sumstats_susie/credible_set_clean
      step.p_value_threshold: 1.0e-5
      step.purity_min_r2: 0.01
      step.n_partitions: 200
      step.session.write_mode: overwrite
      step.session.start_hail: true

  - id: ukb_ppp_eur_data_credible_set_qc
    kind: Task
    prerequisites: []
    params:
      step: credible_set_qc
      step.credible_sets_path: gs://ukb_ppp_eur_data/credible_set_datasets/susie
      step.output_path: gs://ukb_ppp_eur_data/credible_set_clean
      step.p_value_threshold: 1.0e-5
      step.purity_min_r2: 0.01
      step.n_partitions: 50
      step.session.write_mode: overwrite
      step.session.start_hail: true
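
The dotted `step.*` keys in each node's `params` look like Hydra-style configuration overrides. As a rough sketch only, the snippet below shows how such a mapping could be flattened into `key=value` command-line arguments. The helper `params_to_overrides` is hypothetical, and this commit does not show how `submit_gentropy_step` actually builds its arguments.

```python
from __future__ import annotations


def params_to_overrides(params: dict) -> list[str]:
    """Flatten a node's `params` mapping into `key=value` strings.

    Hypothetical helper; assumes Hydra-style overrides are expected.
    """
    overrides = []
    for key, value in params.items():
        # Render booleans in lowercase, as they appear in the YAML file.
        if isinstance(value, bool):
            value = str(value).lower()
        overrides.append(f"{key}={value}")
    return overrides


# A subset of the parameters of the `ukb_ppp_eur_data_credible_set_qc` node.
node_params = {
    "step": "credible_set_qc",
    "step.output_path": "gs://ukb_ppp_eur_data/credible_set_clean",
    "step.n_partitions": 50,
    "step.session.start_hail": True,
}
for override in params_to_overrides(node_params):
    print(override)
```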
41 changes: 41 additions & 0 deletions src/ot_orchestration/dags/credible_set_qc.py
@@ -0,0 +1,41 @@
"""Airflow DAG for the credible set qc."""

from __future__ import annotations

from pathlib import Path

from airflow.models.dag import DAG

from ot_orchestration.utils import chain_dependencies, read_yaml_config
from ot_orchestration.utils.common import shared_dag_args, shared_dag_kwargs
from ot_orchestration.utils.dataproc import (
    generate_dataproc_task_chain,
    submit_gentropy_step,
)

CONFIG_FILE_PATH = Path(__file__).parent / "config" / "credible_set_qc.yaml"
config = read_yaml_config(CONFIG_FILE_PATH)

with DAG(
    dag_id=Path(__file__).stem,
    description="Open Targets Genetics — CredibleSet QC ",
    default_args=shared_dag_args,
    **shared_dag_kwargs,
):
    tasks = {}
    for step in config["nodes"]:
        task = submit_gentropy_step(
            cluster_name=config["dataproc"]["cluster_name"],
            step_name=step["id"],
            python_main_module=config["dataproc"]["python_main_module"],
            params=step["params"],
        )
        tasks[step["id"]] = task

    chain_dependencies(nodes=config["nodes"], tasks_or_task_groups=tasks)
    dag = generate_dataproc_task_chain(
        cluster_name=config["dataproc"]["cluster_name"],
        cluster_init_script=config["dataproc"]["cluster_init_script"],
        cluster_metadata=config["dataproc"]["cluster_metadata"],
        tasks=[t for t in tasks.values()],
    )
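
Both nodes in `credible_set_qc.yaml` declare `prerequisites: []`, so the two qc tasks have no dependency on each other. The sketch below illustrates how a `prerequisites` list could be turned into upstream/downstream edges. The function `dependency_edges` is hypothetical and is not the implementation of `chain_dependencies`, which is not part of this commit.

```python
from __future__ import annotations


def dependency_edges(nodes: list[dict]) -> list[tuple[str, str]]:
    """Return (upstream, downstream) pairs derived from `prerequisites`.

    Hypothetical helper, shown only to illustrate the config semantics.
    """
    known_ids = {node["id"] for node in nodes}
    edges = []
    for node in nodes:
        for upstream in node.get("prerequisites", []):
            if upstream not in known_ids:
                raise ValueError(f"Unknown prerequisite: {upstream}")
            edges.append((upstream, node["id"]))
    return edges


# The two nodes from the config: no prerequisites, hence no edges.
qc_nodes = [
    {"id": "gwas_catalog_sumstats_susie_credible_set_qc", "prerequisites": []},
    {"id": "ukb_ppp_eur_data_credible_set_qc", "prerequisites": []},
]
print(dependency_edges(qc_nodes))

# A made-up pair of nodes where the second depends on the first.
chained_nodes = [
    {"id": "first", "prerequisites": []},
    {"id": "second", "prerequisites": ["first"]},
]
print(dependency_edges(chained_nodes))
```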
