feat: adding logic to flag gwas catalog studies based on curation #347
Conversation
Codecov Report

Additional details and impacted files:

```
@@            Coverage Diff             @@
##              dev     #347      +/-   ##
==========================================
+ Coverage   85.67%   85.84%   +0.17%
==========================================
  Files          89       96       +7
  Lines        2101     2593     +492
==========================================
+ Hits         1800     2226     +426
- Misses        301      367      +66
```
…ts/genetics_etl_python into ds_3173_study_curation
* feat: draft of gwas catalog preprocess inclusion dag
* ci: new changelog and release notes templates (#357)

Templates for CHANGELOG and release notes. To be fully tested on the next release.

Co-authored-by: David Ochoa <dogcaesar@gmail.com>
Co-authored-by: David Ochoa <ochoa@ebi.ac.uk>
This PR is really complex because it touches many things, and the processes are similar to one another.
I've left quite a lot of comments, happy to go through them in person.
Many of them are about making the process more interpretable.
What I understand from the PR is:
- For associations, the moment we want to use the blacklist is when generating StudyLocus (see the sketch below).
- The study index will contain all studies, and in two separate files we'll keep track of a whitelist and a blacklist.
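A rough sketch of how that blacklist could be applied when generating StudyLocus; the function and column names here are assumptions for illustration, not the PR's actual code:

```python
from pyspark.sql import DataFrame


def exclude_blacklisted_studies(associations: DataFrame, blacklist: DataFrame) -> DataFrame:
    """Drop association rows whose studyId appears in the curated blacklist.

    Assumes both dataframes expose a `studyId` column; an anti-join keeps only
    associations from studies that are not blacklisted.
    """
    return associations.join(blacklist.select("studyId"), on="studyId", how="left_anti")
```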
src/otg/common/session.py (outdated)

```diff
@@ -124,21 +124,22 @@ def _create_merged_config(

     def read_parquet(
         self: Session,
-        path: str,
+        path: str | list[str],
         schema: StructType,
         **kwargs: bool | float | int | str | None,
     ) -> DataFrame:
         """Reads parquet dataset with a provided schema.
```
"""Reads parquet dataset with a provided schema. | |
"""Reads a parquet or a list of parquet files with a provided schema. |
Fixed.
src/otg/common/session.py
Outdated
schema: StructType, | ||
**kwargs: bool | float | int | str | None, | ||
) -> DataFrame: | ||
"""Reads parquet dataset with a provided schema. | ||
|
||
Args: | ||
path (str): parquet dataset path | ||
path (str | list[str]): parquet dataset path |
Suggested change:

```diff
-            path (str | list[str]): parquet dataset path
+            path (str | list[str]): path to the parquet file or list of parquet files
```
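A minimal sketch of how `read_parquet` could handle either a single path or a list of paths, following the signature in the diff above; it is written as a free function over a plain `SparkSession` rather than the project's `Session` wrapper, so details may differ from the actual implementation:

```python
from __future__ import annotations

from pyspark.sql import DataFrame, SparkSession
from pyspark.sql.types import StructType


def read_parquet(
    spark: SparkSession,
    path: str | list[str],
    schema: StructType,
    **kwargs: bool | float | int | str | None,
) -> DataFrame:
    """Reads a parquet file or a list of parquet files with a provided schema."""
    # Normalise a single path to a list so both cases go through the same reader call.
    paths = [path] if isinstance(path, str) else path
    return spark.read.schema(schema).options(**kwargs).parquet(*paths)
```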
src/otg/dataset/dataset.py (outdated)

```diff
         **kwargs: bool | float | int | str | None,
     ) -> Self:
         """Reads a parquet file into a Dataset with a given schema.

         Args:
             session (Session): Spark session
-            path (str): Path to the parquet file
+            path (str | list[str]): Path to the parquet file
```
Suggested change:

```diff
-            path (str | list[str]): Path to the parquet file
+            path (str | list[str]): Path to the parquet file or list of parquet files
```
src/otg/dataset/dataset.py (outdated)

```diff
@@ -72,14 +72,14 @@ def get_schema(cls: type[Self]) -> StructType:
     def from_parquet(
         cls: type[Self],
         session: Session,
-        path: str,
+        path: str | list[str],
         **kwargs: bool | float | int | str | None,
     ) -> Self:
         """Reads a parquet file into a Dataset with a given schema.
```
"""Reads a parquet file into a Dataset with a given schema. | |
"""Reads a parquet or a list of parquet files into a Dataset with a given schema. |
src/otg/dataset/study_index.py (outdated)

```diff
@@ -139,3 +139,66 @@ def study_type_lut(self: StudyIndex) -> DataFrame:
             DataFrame: A dataframe containing `studyId` and `studyType` columns.
         """
         return self.df.select("studyId", "studyType")
+
+    def get_eligible_gwas_study_ids(self: StudyIndex) -> list[str]:
```
This assumes the `qualityControls` column is present, which is not always the case. If it is not, we'd filter out studies that are actually eligible.
I'd suggest adding a check that it exists, something like:

```python
# Assumes pyspark.sql.functions is imported as `f` in this module.
filtered_df = self.df.filter(f.col("studyType") == "gwas")
if "qualityControls" in self.df.columns:
    filtered_df = filtered_df.filter(
        (f.size(f.col("qualityControls")) == 0) | (f.col("qualityControls").isNull())
    )
return [row["studyId"] for row in filtered_df.distinct().collect()]
```
Yes, that's right.
```python
]

curation_columns = [
    "studyId",
```
These'd need to be changed according to the PR opentargets/curation#17 (review)
```python
assert isinstance(
    mock_gwas_study_index.annotate_from_study_curation(mock_study_curation),
    StudyIndexGWASCatalog,
), f"When applied None to curation function the returned type was: {type(mock_gwas_study_index.annotate_from_study_curation(mock_study_curation))}"
```
), f"When applied None to curation function the returned type was: {type(mock_gwas_study_index.annotate_from_study_curation(mock_study_curation))}" | |
), f"When applied a study metadata table to curation function the returned type was: {type(mock_gwas_study_index.annotate_from_study_curation(mock_study_curation))}" |
```python
zero_return_count = mock_gwas_study_index.annotate_from_study_curation(
    None
).df.count()
return_count = mock_gwas_study_index.annotate_from_study_curation(
```
`return_count` and `zero_return_count` are the same. Is this intended?
The `zero_return_count` asserts that the number of returned studies won't change even if there's no curation table provided to the curation function. Might be an overly cautious test, but that's why it's tested under the same function.
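A minimal sketch of the invariant being described, reusing the fixture and method names from the test diff above:

```python
def test_study_count_unchanged_without_curation(mock_gwas_study_index):
    """Annotating without a curation table should not change the number of studies."""
    original_count = mock_gwas_study_index.df.count()
    annotated_count = mock_gwas_study_index.annotate_from_study_curation(None).df.count()
    assert annotated_count == original_count
```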
```python
    )
]

assert expected == observed
```
Up to here you are testing `annotate_from_study_curation`. Ideally you could group them in a test class.
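One possible grouping along those lines; the class name is illustrative, and the fixtures and imports are the ones already used in the tests above:

```python
class TestAnnotateFromStudyCuration:
    """Groups the assertions about annotate_from_study_curation in one place."""

    def test_returns_gwas_catalog_study_index(
        self, mock_gwas_study_index, mock_study_curation
    ):
        result = mock_gwas_study_index.annotate_from_study_curation(mock_study_curation)
        assert isinstance(result, StudyIndexGWASCatalog)

    def test_curation_preserves_study_count(
        self, mock_gwas_study_index, mock_study_curation
    ):
        annotated = mock_gwas_study_index.annotate_from_study_curation(mock_study_curation)
        assert annotated.df.count() == mock_gwas_study_index.df.count()
```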
I think this is fine.
"metadata": {} | ||
}, | ||
{ | ||
"name": "analysisFlags", |
This name can change depending on opentargets/curation#17 (review).
Sorry for the stream of comments, I did the review from VSCode and something got stuck.
…ts/genetics_etl_python into ds_3173_study_curation
@ireneisdoomed I went through this PR with Daniel. We identified a couple of things we really want to fix now and some that will come in follow-up PRs. There are a bunch of stylistic suggestions (e.g. variable names) that we are not sure improve much, so we might skip those for now. There is definitely some refactoring material in this PR, but there is enough critical logic that we'll try to merge it for now.
As discussed before, there is a lot of business logic here. Some parts are more robust than others, but there is an overall benefit in merging this logic and working on separate PRs for further improvements.
There is also some semantic debate about what exactly counts as metadata and how we distinguish our curation of study metadata from the GWAS Catalog's own association curation. Let's try not to forget about this, because it might be confusing for people starting to work on the project.
We went through the PR with David and addressed these comments where necessary.
The main feature of this PR is to add logic to manage GWAS Catalog study curation, but it has some ripple effects on various pieces of the infrastructure.
Main bits being touched: