
feat: extract credible sets and studies from all eQTL Catalogue finemapping results #518

Merged: 35 commits into dev from il-eqtl-all-susies on Mar 6, 2024

Conversation

ireneisdoomed (Contributor) commented on Mar 4, 2024

This PR processes all eQTL Catalogue fine-mapping results.

  • We are now ingesting QTLs for all quantification methods: gene level, transcript level, transcript usage, and more. All of them are described here.
  • Having more granular QTLs impacts the studyId definition. We go from publication_tissue_gene to publication_tissue_measuredtrait.
  • The lead variant in the locus is now the variant with the highest posterior probability (see the first sketch below). The FinnGen fine-mapping results have also been changed to accommodate this.
  • New StudyLocus.calculate_credible_set_log10bf method, which aggregates all the single LBFs in a credible set (see the second sketch below).
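
As a rough illustration of the new lead-variant logic, here is a minimal PySpark sketch (not the exact gentropy code; the column names and example rows are hypothetical) that keeps the highest-posterior-probability variant per credible set:

```python
# Illustrative only: pick the lead variant as the row with the highest
# posterior probability within each credible set.
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as f

spark = SparkSession.builder.getOrCreate()
credible_sets = spark.createDataFrame(
    [("cs1", "1_100_A_T", 0.70), ("cs1", "1_200_G_C", 0.25), ("cs2", "2_300_T_G", 0.90)],
    ["credibleSetId", "variantId", "posteriorProbability"],
)
window = Window.partitionBy("credibleSetId").orderBy(f.col("posteriorProbability").desc())
leads = (
    credible_sets.withColumn("rank", f.row_number().over(window))
    .filter(f.col("rank") == 1)
    .drop("rank")
)
```

And a minimal numpy sketch of the log10BF aggregation, assuming the per-variant values are natural-log Bayes factors: the credible-set Bayes factor is their log-sum-exp, converted to base 10.

```python
# Sketch: aggregate per-variant log Bayes factors into one credible-set
# log10 Bayes factor using a numerically stable log-sum-exp.
import numpy as np
from numpy.typing import NDArray


def get_logsum(arr: NDArray[np.float64]) -> float:
    """Stable log(sum(exp(arr))), subtracting the max to avoid overflow."""
    themax = np.max(arr)
    return float(themax + np.log(np.sum(np.exp(arr - themax))))


lbf_per_variant = np.array([2.3, 1.1, 0.4])  # hypothetical natural-log BFs
credible_set_log10bf = get_logsum(lbf_per_variant) / np.log(10)
```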

The processing job takes about 20 minutes; the full DAG will take ~50 minutes.

This is what the DAG looks like:
[image: DAG screenshot]

d0choa and others added 28 commits on February 21, 2024 at 11:30
github-actions bot added the documentation and size-M labels on Mar 4, 2024
ireneisdoomed marked this pull request as ready for review on March 6, 2024 at 09:31
@@ -325,3 +327,24 @@ def parse_efos(efo_uri: Column) -> Column:
"""
colname = efo_uri._jc.toString()
return f.array_sort(f.expr(f"regexp_extract_all(`{colname}`, '([A-Z]+_[0-9]+)')"))


def get_logsum(arr: NDArray[np.float64]) -> float:
d0choa (Collaborator) commented:

I think this function only makes sense in the context of StudyLocus. It can be a method there.

d0choa (Collaborator) added:
and probably make it private

ireneisdoomed (Contributor, Author) replied:

Thank you! As discussed, StudyLocus now incorporates a wrapper around this function that uses it in the specific context of calculating log10BFs for the credible set.
The same function is used differently in COLOC to calculate posterior probabilities for all the hypotheses, so it is not exactly StudyLocus-specific.
We'll leave it like this.
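
To illustrate that second use, here is a minimal sketch (an assumption about the COLOC-style usage, not the exact gentropy code) that normalises per-hypothesis log Bayes factors into posterior probabilities with the same log-sum-exp trick, reusing the get_logsum sketched in the description above:

```python
# Hypothetical COLOC-style use of get_logsum: turn per-hypothesis log
# Bayes factors (H0..H4) into posterior probabilities (a softmax).
import numpy as np

log_bf = np.array([0.0, 1.2, 0.3, 2.5, 6.7])  # illustrative values for H0..H4
posteriors = np.exp(log_bf - get_logsum(log_bf))  # get_logsum as sketched above
# posteriors sums to 1; posteriors[4] is the posterior for H4 (shared signal)
```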

d0choa (Collaborator) left a review:

Looks great. Think about the logsum and we can merge it.


Returns:
DataFrame: Log Bayes Factors DataFrame.
"""
return session.spark.read.csv(
d0choa (Collaborator) commented:

we clearly have different ways of understanding the world 🤣
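
For context, a guess at how the truncated read above might continue. This is an assumption, not the merged code: it presumes the log Bayes factor files are tab-separated text with a header, and uses a plain SparkSession in place of the step's session.spark wrapper.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # stand-in for session.spark
# Hypothetical continuation of the hunk above: read the log Bayes factor
# table as tab-separated values with a header row.
log_bf_df = spark.read.csv("lbf_variable.txt", sep="\t", header=True)
```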

@@ -40,24 +43,45 @@ class EqtlCatalogueStudyIndex:
StructField("quant_method", StringType(), True),
]
)
- raw_studies_metadata_path = "https://raw.githubusercontent.com/eQTL-Catalogue/eQTL-Catalogue-resources/master/data_tables/dataset_metadata.tsv"
+ raw_studies_metadata_path = "https://raw.githubusercontent.com/eQTL-Catalogue/eQTL-Catalogue-resources/19929ff6a99bf402194292a14f96f9615b35f65f/data_tables/dataset_metadata.tsv"
d0choa (Collaborator) commented:

nice
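
A small note on why the pin matters: the study index is built from this remote TSV, so fixing the commit hash keeps the ingestion reproducible even if the upstream master branch changes. Reading it might look like this (pandas shown for brevity; an assumption, not the actual step code):

```python
import pandas as pd

# Pinned to a specific commit so the metadata cannot change underneath us.
raw_studies_metadata_path = (
    "https://raw.githubusercontent.com/eQTL-Catalogue/eQTL-Catalogue-resources/"
    "19929ff6a99bf402194292a14f96f9615b35f65f/data_tables/dataset_metadata.tsv"
)
studies_metadata = pd.read_csv(raw_studies_metadata_path, sep="\t")
```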

ireneisdoomed merged commit c138be4 into dev on Mar 6, 2024
4 checks passed
ireneisdoomed deleted the il-eqtl-all-susies branch on March 6, 2024 at 17:16
DSuveges pushed a commit that referenced this pull request on Mar 8, 2024:
feat: extract credible sets and studies from all eQTL Catalogue finemapping results (#518)

* feat: dataflow decompress prototype (#501)

* chore: commit susie results gist

* feat(study_index): add `tissueFromSourceId` to schema and make `traitFromSource` nullable

* fix: bug and linting fixes in new eqtl ingestion step

* perf: config bugfixes and performance improvements

* perf: remove data persistance to avoid executor failure

* perf: load susie results for studies of interest only

* perf: collect locus for leads only and optimise partitioning cols

* feat: parametrise methods to include

* feat: run full dag

* test: add tests

* fix: reorder test inputs

* docs: update eqtl catalogue docs

* fix: correct typos in tests docstrings

* refactor: change mqtl_quantification_methods to mqtl_quantification_methods_blacklist

* feat: studyId is based on measured trait and not on gene

* feat: credible set lead is the variant with highest pip

* feat(studies): change logic in _identify_study_type to extract qtl type based on quantization method

* refactor: externalise reading logic to source classes

* chore: add mqtl_quantification_methods_blacklist to yaml config

* docs: update docs

* fix(dag): pass bucket name to GCSDeleteBucketOperator

* refactor(coloc): move get_logsum function to common utils

* feat(studylocus): add calculate_credible_set_log10bf and use it for eqtlcat credible sets

* fix: credible sets dataset is too large and cant be broadcasted

* fix(dag): use GCSDeleteObjectsOperator instead of GCSDeleteBucketOperator

* fix: correct typo

* fix: correct typo