feat: extract credible sets and studies from all eQTL Catalogue finemapping results #518
Conversation
@@ -325,3 +327,24 @@ def parse_efos(efo_uri: Column) -> Column:
    """
    colname = efo_uri._jc.toString()
    return f.array_sort(f.expr(f"regexp_extract_all(`{colname}`, '([A-Z]+_[0-9]+)')"))


def get_logsum(arr: NDArray[np.float64]) -> float:
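For readers outside the Spark codebase, the `parse_efos` logic in the hunk above can be sketched in plain Python with the standard `re` module. This is an illustrative equivalent of the Spark `regexp_extract_all` + `array_sort` expression, not the repository's implementation:

```python
import re


def parse_efos(efo_uri: str) -> list[str]:
    # Extract EFO-style ontology identifiers (e.g. "EFO_0000400") from a
    # URI string and return them sorted, mirroring the Spark expression
    # regexp_extract_all(col, '([A-Z]+_[0-9]+)') wrapped in array_sort.
    return sorted(re.findall(r"[A-Z]+_[0-9]+", efo_uri))


print(parse_efos("http://www.ebi.ac.uk/efo/MONDO_0005148,EFO_0000400"))
# → ['EFO_0000400', 'MONDO_0005148']
```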
I think this function only makes sense in the context of StudyLocus. It can be a method there.
and probably make it private
Thank you! As discussed, StudyLocus incorporates a wrapper around this function that uses it in the specific context of calculating log10BFs for the credible set.
The same function is used differently in COLOC, to calculate posterior probabilities for all hypotheses including H4, so it is not exactly StudyLocus specific.
We'll leave it like this.
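For context on the discussion above: a shared `get_logsum` of this kind is typically a numerically stable log-sum-exp. The following NumPy sketch is an assumption about the implementation based on the signature shown in the diff, not the repository's actual code:

```python
import numpy as np
from numpy.typing import NDArray


def get_logsum(arr: NDArray[np.float64]) -> float:
    # Numerically stable log(sum(exp(arr))): subtracting the maximum
    # before exponentiating avoids overflow for large log Bayes factors,
    # then the maximum is added back at the end.
    themax = np.max(arr)
    return float(themax + np.log(np.sum(np.exp(arr - themax))))
```

Both the COLOC posterior calculation and the credible-set wrapper can then call this one helper.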
Looks great. Think about the logsum and we can merge it.
    Returns:
        DataFrame: Log Bayes Factors DataFrame.
    """
    return session.spark.read.csv(
we clearly have different ways of understanding the world 🤣
@@ -40,24 +43,45 @@ class EqtlCatalogueStudyIndex:
        StructField("quant_method", StringType(), True),
    ]
)
-    raw_studies_metadata_path = "https://raw.githubusercontent.com/eQTL-Catalogue/eQTL-Catalogue-resources/master/data_tables/dataset_metadata.tsv"
+    raw_studies_metadata_path = "https://raw.githubusercontent.com/eQTL-Catalogue/eQTL-Catalogue-resources/19929ff6a99bf402194292a14f96f9615b35f65f/data_tables/dataset_metadata.tsv"
nice
…apping results (#518)
* feat: dataflow decompress prototype (#501)
* chore: commit susie results gist
* feat(study_index): add `tissueFromSourceId` to schema and make `traitFromSource` nullable
* fix: bug and linting fixes in new eqtl ingestion step
* perf: config bugfixes and performance improvements
* perf: remove data persistance to avoid executor failure
* perf: load susie results for studies of interest only
* perf: collect locus for leads only and optimise partitioning cols
* feat: parametrise methods to include
* feat: run full dag
* test: add tests
* fix: reorder test inputs
* docs: update eqtl catalogue docs
* fix: correct typos in tests docstrings
* refactor: change mqtl_quantification_methods to mqtl_quantification_methods_blacklist
* feat: studyId is based on measured trait and not on gene
* feat: credible set lead is the variant with highest pip
* feat(studies): change logic in _identify_study_type to extract qtl type based on quantization method
* refactor: externalise reading logic to source classes
* chore: add mqtl_quantification_methods_blacklist to yaml config
* docs: update docs
* fix(dag): pass bucket name to GCSDeleteBucketOperator
* refactor(coloc): move get_logsum function to common utils
* feat(studylocus): add calculate_credible_set_log10bf and use it for eqtlcat credible sets
* fix: credible sets dataset is too large and cant be broadcasted
* fix(dag): use GCSDeleteObjectsOperator instead of GCSDeleteBucketOperator
* fix: correct typo
* fix: correct typo
This PR processes all eQTL Catalogue fine mapping results. Main changes:
* New `studyId` definition: we go from `publication_tissue_gene` to `publication_tissue_measuredtrait`.
* New `StudyLocus.calculate_credible_set_log10bf` method: this aggregates all single LBFs in a credible set.
The processing job takes 20 minutes; the full DAG will take ~50 minutes.
This is what the DAG looks like:
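As a rough sketch of what the new `calculate_credible_set_log10bf` aggregation might do, assuming (per the discussion above) that it combines per-variant natural-log Bayes factors with a stable log-sum-exp and converts the result to log10. Names and details are illustrative, not the repository's PySpark implementation:

```python
import numpy as np
from numpy.typing import NDArray


def calculate_credible_set_log10bf(lbf_values: NDArray[np.float64]) -> float:
    # Hypothetical sketch: aggregate single-variant natural-log Bayes
    # factors (LBFs, e.g. from SuSiE) into one credible-set-level value.
    # Stable log-sum-exp on the natural-log scale...
    themax = np.max(lbf_values)
    logsum = themax + np.log(np.sum(np.exp(lbf_values - themax)))
    # ...then convert from natural log to log10.
    return float(logsum / np.log(10.0))


# A single variant with BF = 10 gives a credible-set log10BF of 1.
print(calculate_credible_set_log10bf(np.array([np.log(10.0)])))
# → 1.0
```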