feat(eqtl_catalogue): study index improvements #369

ireneisdoomed · 2023-12-20T16:54:25Z

This PR includes several changes to the eQTL Catalogue study index parser:

- Added metadata about the LD structure of GTEx.
- Full revamp of how metadata is added on a per-study basis. I have created a configuration dictionary to build the different study attributes based on the eQTL Catalogue project.
  - Right now we only have metadata about GTEx, but more could be added.
  - This fixes the issue where the eQTL Catalogue was only usable for GTEx studies.
- Redefinition of studyId: studies are now also split by gene.
- Fixes in the patterns that parsed studyId to extract the gene.
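
A minimal sketch of the configuration-dictionary idea described above, in plain Python rather than the Spark code in this PR. The names `PROJECT_METADATA` and `build_study`, the `projectId` and `ldPopulationStructure` keys, and the placeholder value are hypothetical. Only the idea of keying study attributes by eQTL Catalogue project (with GTEx as the sole entry so far) and the gene-level studyId come from the description.

```python
from typing import Any

# Hypothetical per-project metadata. The real attribute names and values
# live in the PR's code and are not reproduced here.
PROJECT_METADATA: dict[str, dict[str, Any]] = {
    "GTEx": {"ldPopulationStructure": "<LD structure placeholder>"},
}


def build_study(project: str, qtl_group: str, gene_id: str) -> dict[str, Any]:
    """Build one study record whose studyId is split by gene."""
    study: dict[str, Any] = {
        "studyId": f"{project}_{qtl_group}_{gene_id}",
        "projectId": project,
        "geneId": gene_id,
    }
    # Projects without an entry still produce a record, just without
    # the extra project-specific attributes.
    study.update(PROJECT_METADATA.get(project, {}))
    return study
```

With this shape, supporting another project would only mean adding an entry to the dictionary; studies from projects without an entry are still built.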

ireneisdoomed

Re: df5a074
This pattern didn't work. modified_id is my suggested fix:

import pyspark.sql.functions as f

# df is any existing Spark DataFrame; f.lit() supplies a sample studyId.
(
    df.select(f.lit("PROJECT_QTLGROUP_GENEID").alias("original_id"))
    .withColumn("current_id", f.regexp_extract(f.col("original_id"), r"(.*)_[\_]+", 1))
    .withColumn("modified_id", f.regexp_extract(f.col("original_id"), r"(.*)_[^_]+", 1))
    .show(1)
)
+--------------------+----------+----------------+
|         original_id|current_id|     modified_id|
+--------------------+----------+----------------+
|PROJECT_QTLGROUP_...|          |PROJECT_QTLGROUP|
+--------------------+----------+----------------+
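
The difference between the two patterns can be reproduced without Spark. The sketch below uses Python's `re` module as a stand-in for `regexp_extract`. This is an assumption: Spark uses Java regular expressions, although these two patterns should behave the same in both engines. The variable names are illustrative only.

```python
import re

original_id = "PROJECT_QTLGROUP_GENEID"

# `[\_]+` is a character class containing only "_", so this pattern needs
# two or more consecutive underscores and never matches this ID.
current = re.search(r"(.*)_[\_]+", original_id)

# `[^_]+` matches the trailing run of non-underscore characters (the gene),
# so group 1 keeps everything before the last underscore.
modified = re.search(r"(.*)_[^_]+", original_id)

# Mirror regexp_extract, which returns an empty string when nothing matches.
current_id = current.group(1) if current else ""
modified_id = modified.group(1) if modified else ""

print(repr(current_id), repr(modified_id))
```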


codecov-commenter · 2023-12-21T15:05:56Z

Codecov Report

Attention: 33 lines in your changes are missing coverage. Please review.

Comparison is base (42b366c) 85.67% compared to head (34ed2b2) 86.84%.
Report is 44 commits behind head on dev.

Additional details and impacted files

@@            Coverage Diff             @@
##              dev     #369      +/-   ##
==========================================
+ Coverage   85.67%   86.84%   +1.16%     
==========================================
  Files          89       92       +3     
  Lines        2101     2447     +346     
==========================================
+ Hits         1800     2125     +325     
- Misses        301      322      +21

| Files | Coverage Δ | |
|---|---|---|
| src/airflow/dags/common_airflow.py | `90.38% <ø> (ø)` | |
| src/airflow/dags/finngen_preprocess.py | `100.00% <100.00%> (ø)` | |
| src/otg/dataset/l2g_feature_matrix.py | `82.92% <ø> (+7.31%)` | ⬆️ |
| src/otg/dataset/study_locus.py | `96.20% <100.00%> (+0.04%)` | ⬆️ |
| src/otg/datasource/finngen/study_index.py | `100.00% <100.00%> (ø)` | |
| src/otg/datasource/finngen/summary_stats.py | `100.00% <100.00%> (ø)` | |
| src/otg/datasource/gwas_catalog/study_index.py | `100.00% <ø> (ø)` | |
| src/otg/datasource/ukbiobank/study_index.py | `100.00% <ø> (ø)` | |
| src/otg/l2g.py | `58.06% <ø> (ø)` | |
| src/otg/method/l2g/evaluator.py | `38.70% <100.00%> (ø)` | |

... and 9 more

ireneisdoomed · 2023-12-21T17:05:29Z

I haven't managed to run them yet: the job ran for 42 min but was automatically cancelled.
There doesn't seem to be anything obvious in the logic/schema that would make this crash.
I'll try running it as part of the new eQTLCatalogue DAG I've defined here.

ireneisdoomed · 2024-01-09T08:47:47Z

The study index we had for eQTL Catalogue (preprocess/eqtl_catalogue/study_index) is not compatible with the association datasets, for the reasons I cited above.
I had to generate a new study index with the fixes I propose. Since the full ingestion process was not scaling (PR #366), I ran an ad hoc step that parsed the study and extracted the geneId from the already harmonised summary stats (preprocess/eqtl_catalogue/summary_stats).

The new study index is here: gs://genetics_etl_python_playground/output/python_etl/parquet/XX.XX/study_index/eqtl_catalogue

- Number of studies: 1,207,976
- Linking to 39,832 distinct genes in 49 different tissues

Question (@tskir @DSuveges): in traitFromSource we are capturing the tissue instead of the measured transcript or the gene. Is this what we want?

The ad hoc script can be inspected here: gs://dataproc-staging-europe-west1-234703259993-hrobeqyg/google-cloud-dataproc-metainfo/6b1c3326-e1cf-470c-8e59-ac44d95fa4ac/jobs/1c7022660a0b4ab6b13d5e82069d21d3/staging/eqtl_study_index.py
The job ran in 6 min with the latest changes (https://console.cloud.google.com/dataproc/jobs/1c7022660a0b4ab6b13d5e82069d21d3/configuration?region=europe-west1&project=open-targets-genetics-dev).

DSuveges · 2024-01-09T11:15:36Z

Sorry @ireneisdoomed, I had to jump on other urgent things, so I couldn't spend much time on the PR. However, regarding your questions:

> in traitFromSource we are capturing the tissue instead of the measured transcript or the gene. Is this what we want?

No. The actual measurement was the expression level, so the trait should be the gene. I assume the problem was that this field is mandatory for generating the study index, but without the summary stats you don't know the gene, so no valid study index can be constructed. Given that we already have a geneId column in the schema, the trait can be gene expression measurement/EFO_0600068, so we can generate a valid study index and then explode it with the measured gene identifier. Does that make sense? What do you think? @ireneisdoomed, @tskir
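
A rough sketch of that two-step idea in plain Python, not the Spark implementation. The function names and the `traitEfoId` field are hypothetical. The trait label, the EFO ID, `traitFromSource` and `geneId` are taken from the comment above.

```python
GENE_EXPRESSION_EFO = "EFO_0600068"  # trait ID quoted in the comment above


def base_study(project: str, qtl_group: str) -> dict:
    """Study record that can be built before the summary stats are read."""
    return {
        "studyId": f"{project}_{qtl_group}",
        "traitFromSource": "gene expression measurement",
        "traitEfoId": GENE_EXPRESSION_EFO,  # hypothetical field name
    }


def explode_by_gene(study: dict, gene_ids: list[str]) -> list[dict]:
    """Emit one study per measured gene once the gene IDs are known."""
    return [
        {**study, "studyId": f"{study['studyId']}_{gene_id}", "geneId": gene_id}
        for gene_id in gene_ids
    ]
```

The base record is valid on its own, and the gene-level studies only appear once the measured genes have been read from the summary stats.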

tskir

Thank you Irene, all changes look good to me and it's a good improvement & generalisation of the study index construction

tskir · 2024-01-09T11:33:08Z

@DSuveges @ireneisdoomed Regarding the placement and meaning of study index fields: I went with this particular implementation because it was the least disruptive way to align with the existing schema.

The actual trait being measured is "expression of gene X in tissue Y", which isn't possible to express in a single column in any meaningful way.

So my idea of the least disruptive change was to set trait to tissue (implicitly meaning: trait = "expression in tissue Y"), and to populate gene as a separate column.

We could set the trait to the generic "gene expression measurement", but then we would necessarily have to add the "tissueId" column as well. That isn't a bad thing, but it is something to keep in mind, I think.

If we decide to go with the "gene expression measurement" + tissueId + geneId way, I think it's best to do it as a separate PR to not keep this one waiting, but again that's up for discussion.

DSuveges · 2024-01-09T11:41:38Z

@tskir

> If we decide to go with the "gene expression measurement" + tissueId + geneId way, I think it's best to do it as a separate PR to not keep this one waiting, but again that's up for discussion.

It's a complicated question. I would not store the tissue ID in the trait column, to avoid ambiguity. However, we'll soon have to deal with cell IDs as well, so I'm wondering if we should add a separate column for that too: tissueId, cellId. Let's assume we will not need to deal with temporal metadata. Yes, this question is beyond this PR.

tskir · 2024-01-09T11:45:29Z

@DSuveges This makes sense, and I like the idea with tissueId + cellId + geneId!

ireneisdoomed · 2024-01-09T12:28:31Z

Ty for the discussion @DSuveges @tskir
It's not critical for the work I'm doing, so let's leave it for another PR. @tskir, could you make the changes?

tskir · 2024-01-09T12:33:28Z

@ireneisdoomed Yes, I'm happy to make these changes. But just for clarity we're going ahead with merging this PR as is, right?

ireneisdoomed added 4 commits December 20, 2023 13:20

feat(eqtl): add ld population structure to gtex studies

3c4645c

feat(eqtl): add gtex study metadata conditionally

0c9234a

feat(eqtl): beautifully add gtex study metadata

e4514be

fix(eqtl): correct pattern that extracts studyid in sumstats

df5a074

ireneisdoomed commented Dec 20, 2023


ireneisdoomed added 2 commits December 21, 2023 13:20

feat(eqtl): include gene as part of the studyid

3e372a4

fix(eqtl): correct pattern that extracts geneid in study index

14e9f65

ireneisdoomed mentioned this pull request Dec 21, 2023

chore(study_index): change numeric columns from long to integers #371

Merged

ireneisdoomed and others added 2 commits December 21, 2023 15:49

Merge branch 'dev' of https://github.com/opentargets/genetics_etl_python into il-eqtl-study

8c9d277

Merge branch 'dev' into il-eqtl-study

2818200

ireneisdoomed changed the title ~~feat(eqtl_catalogue): add ld structure to gtex studies~~ feat(eqtl_catalogue): study index improvements Dec 21, 2023

ireneisdoomed marked this pull request as ready for review December 21, 2023 16:13

ireneisdoomed requested review from tskir and DSuveges December 21, 2023 17:03

ireneisdoomed and others added 3 commits January 8, 2024 13:40

Merge branch 'dev' into il-eqtl-study

df8bc57

fix: fix pattern that extracts grouped studyId and optimise statement

1154786

perf: optimise add_gene_to_study_id to explode and not to group

31799c1

ireneisdoomed mentioned this pull request Jan 9, 2024

feat(eqtl): add preprocessing dag #366

Closed

tskir approved these changes Jan 9, 2024


Merge branch 'dev' into il-eqtl-study

34ed2b2

ireneisdoomed merged commit 5d4955e into dev Jan 9, 2024
3 checks passed

ireneisdoomed deleted the il-eqtl-study branch January 9, 2024 12:34

tskir mentioned this pull request Jan 9, 2024

Rethink eQTL Catalogue data model opentargets/issues#3187

Closed
