feat: adding logic to flag gwas catalog studies based on curation #347
Conversation
Codecov Report

Additional details and impacted files:

```
@@            Coverage Diff             @@
##              dev     #347      +/-   ##
==========================================
+ Coverage   85.67%   85.84%   +0.17%
==========================================
  Files          89       96       +7
  Lines        2101     2593     +492
==========================================
+ Hits         1800     2226     +426
- Misses        301      367      +66
```
…ts/genetics_etl_python into ds_3173_study_curation
* feat: draft of gwas catalog preprocess inclusion dag
* ci: new changelog and release notes templates (#357)

Templates for CHANGELOG and release notes. To be fully tested on the next release.

Co-authored-by: David Ochoa <dogcaesar@gmail.com>
Co-authored-by: David Ochoa <ochoa@ebi.ac.uk>
This PR is really complex because it touches many things, and the processes are similar to one another.
I've left quite a lot of comments, happy to go through them in person.
Many of them are about making the process more interpretable.
What I understand from the PR is:
- For associations, the moment we want to use the blacklist is when generating StudyLocus (see the sketch below).
- The study index will contain all studies, and in two separate files we'll keep track of a whitelist and a blacklist.
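A rough sketch of how that blacklist could be applied when generating StudyLocus; the function and column names here are assumptions for illustration, not the PR's actual code:

```python
from pyspark.sql import DataFrame


def exclude_blacklisted_studies(associations: DataFrame, blacklist: DataFrame) -> DataFrame:
    """Drop association rows whose studyId appears in the curated blacklist.

    Assumes both dataframes expose a `studyId` column; an anti-join keeps only
    associations from studies that are not blacklisted.
    """
    return associations.join(blacklist.select("studyId"), on="studyId", how="left_anti")
```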
src/otg/common/session.py (outdated)

```diff
@@ -124,21 +124,22 @@ def _create_merged_config(

     def read_parquet(
         self: Session,
-        path: str,
+        path: str | list[str],
         schema: StructType,
         **kwargs: bool | float | int | str | None,
     ) -> DataFrame:
         """Reads parquet dataset with a provided schema.
```
"""Reads parquet dataset with a provided schema. | |
"""Reads a parquet or a list of parquet files with a provided schema. |
Fixed.
src/otg/common/session.py
Outdated
schema: StructType, | ||
**kwargs: bool | float | int | str | None, | ||
) -> DataFrame: | ||
"""Reads parquet dataset with a provided schema. | ||
|
||
Args: | ||
path (str): parquet dataset path | ||
path (str | list[str]): parquet dataset path |
Suggested change:

```diff
-            path (str | list[str]): parquet dataset path
+            path (str | list[str]): path to the parquet file or list of parquet files
```
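A minimal sketch of how `read_parquet` could handle either a single path or a list of paths, following the signature in the diff above; it is written as a free function over a plain `SparkSession` rather than the project's `Session` wrapper, so details may differ from the actual implementation:

```python
from __future__ import annotations

from pyspark.sql import DataFrame, SparkSession
from pyspark.sql.types import StructType


def read_parquet(
    spark: SparkSession,
    path: str | list[str],
    schema: StructType,
    **kwargs: bool | float | int | str | None,
) -> DataFrame:
    """Reads a parquet file or a list of parquet files with a provided schema."""
    # Normalise a single path to a list so both cases go through the same reader call.
    paths = [path] if isinstance(path, str) else path
    return spark.read.schema(schema).options(**kwargs).parquet(*paths)
```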
src/otg/dataset/dataset.py (outdated)

```diff
         **kwargs: bool | float | int | str | None,
     ) -> Self:
         """Reads a parquet file into a Dataset with a given schema.

         Args:
             session (Session): Spark session
-            path (str): Path to the parquet file
+            path (str | list[str]): Path to the parquet file
```
Suggested change:

```diff
-            path (str | list[str]): Path to the parquet file
+            path (str | list[str]): Path to the parquet file or list of parquet files
```
src/otg/dataset/dataset.py (outdated)

```diff
@@ -72,14 +72,14 @@ def get_schema(cls: type[Self]) -> StructType:
     def from_parquet(
         cls: type[Self],
         session: Session,
-        path: str,
+        path: str | list[str],
         **kwargs: bool | float | int | str | None,
     ) -> Self:
         """Reads a parquet file into a Dataset with a given schema.
```
"""Reads a parquet file into a Dataset with a given schema. | |
"""Reads a parquet or a list of parquet files into a Dataset with a given schema. |
src/otg/dataset/study_index.py (outdated)

```diff
@@ -139,3 +139,66 @@ def study_type_lut(self: StudyIndex) -> DataFrame:
             DataFrame: A dataframe containing `studyId` and `studyType` columns.
         """
         return self.df.select("studyId", "studyType")
+
+    def get_eligible_gwas_study_ids(self: StudyIndex) -> list[str]:
```
This assumes the `qualityControls` column is present, which is not always the case. If it is not, we'd filter out studies that are actually eligible.
I'd suggest adding a check that it exists, something like:

```python
# Assumes pyspark.sql.functions is imported as `f` in this module.
filtered_df = self.df.filter(f.col("studyType") == "gwas")
if "qualityControls" in self.df.columns:
    filtered_df = filtered_df.filter(
        (f.size(f.col("qualityControls")) == 0) | (f.col("qualityControls").isNull())
    )
return [row["studyId"] for row in filtered_df.distinct().collect()]
```
Yes, that's right.
```python
]

curation_columns = [
    "studyId",
```
These'd need to be changed according to the PR opentargets/curation#17 (review)
```python
assert isinstance(
    mock_gwas_study_index.annotate_from_study_curation(mock_study_curation),
    StudyIndexGWASCatalog,
), f"When applied None to curation function the returned type was: {type(mock_gwas_study_index.annotate_from_study_curation(mock_study_curation))}"
```
), f"When applied None to curation function the returned type was: {type(mock_gwas_study_index.annotate_from_study_curation(mock_study_curation))}" | |
), f"When applied a study metadata table to curation function the returned type was: {type(mock_gwas_study_index.annotate_from_study_curation(mock_study_curation))}" |
```python
zero_return_count = mock_gwas_study_index.annotate_from_study_curation(
    None
).df.count()
return_count = mock_gwas_study_index.annotate_from_study_curation(
```
`return_count` and `zero_return_count` are the same. Is this intended?
The `zero_return_count` asserts that the number of returned studies won't change even if there's no curation table provided to the curation function. Might be an overly cautious test, but that's why it's tested under the same function.
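A minimal sketch of the invariant being described, reusing the fixture and method names from the test diff above:

```python
def test_study_count_unchanged_without_curation(mock_gwas_study_index):
    """Annotating without a curation table should not change the number of studies."""
    original_count = mock_gwas_study_index.df.count()
    annotated_count = mock_gwas_study_index.annotate_from_study_curation(None).df.count()
    assert annotated_count == original_count
```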
```python
    )
]

assert expected == observed
```
Up to here you are testing `annotate_from_study_curation`. Ideally you could group them in a test class.
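One possible grouping along those lines; the class name is illustrative, and the fixtures and imports are the ones already used in the tests above:

```python
class TestAnnotateFromStudyCuration:
    """Groups the assertions about annotate_from_study_curation in one place."""

    def test_returns_gwas_catalog_study_index(
        self, mock_gwas_study_index, mock_study_curation
    ):
        result = mock_gwas_study_index.annotate_from_study_curation(mock_study_curation)
        assert isinstance(result, StudyIndexGWASCatalog)

    def test_curation_preserves_study_count(
        self, mock_gwas_study_index, mock_study_curation
    ):
        annotated = mock_gwas_study_index.annotate_from_study_curation(mock_study_curation)
        assert annotated.df.count() == mock_gwas_study_index.df.count()
```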
I think this is fine.
"metadata": {} | ||
}, | ||
{ | ||
"name": "analysisFlags", |
This name can change depending on opentargets/curation#17 (review).
Sorry for the stream of comments, I did the review from VSCode and something got stuck.
…ts/genetics_etl_python into ds_3173_study_curation
@ireneisdoomed I went through this PR with Daniel. We identified a couple of things we really want to fix now and some that will come in follow-up PRs. There are a bunch of stylistic suggestions (e.g. variable names) that we are not sure improve much, so we might skip those for now. There is definitely some refactoring material in this PR, but there is enough critical logic that we'll try to merge it for now.
As discussed before, there is a lot of business logic here. Some parts are more robust than others, but there is an overall benefit in merging this logic and working on separate PRs for further improvements.
There is also some semantic debate about what exactly counts as metadata and how we distinguish our curation of study metadata from the GWAS Catalog's own association curation. Let's try not to forget about this, because it might be confusing for people starting to work on the project.
We went through the PR with David and addressed these comments where necessary.
The main feature of this PR is to add logic to manage GWAS Catalog study curation, but it has some ripple effects on various pieces of the infrastructure.
Main bits being touched: