feat(eqtl_catalogue): study index improvements #369

Merged
merged 12 commits into from
Jan 9, 2024

Conversation

ireneisdoomed
Contributor

@ireneisdoomed ireneisdoomed commented Dec 20, 2023

This PR includes several changes to the eQTL Catalogue study index parser:

  • Added metadata about the LD structure of GTEx
  • Full revamp of how metadata is added on a per-study basis. I have created a configuration dictionary to build the different study attributes based on the eQTL Catalogue project.
    • Right now we only have metadata about GTEx, but more could be added
    • This fixes the issue where the eQTL Catalogue was only usable for GTEx studies
  • Redefinition of studyId: studies are now also split by gene.
  • Fixes to the patterns that parse studyId to extract the gene
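
The configuration-dictionary idea in the list above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the names PROJECT_METADATA and study_attributes, the field names, and the population labels and proportions are all hypothetical placeholders.

```python
from typing import Any, Dict

# Hypothetical per-project configuration. Keys, field names and values are
# illustrative placeholders, not the ones used in the PR.
PROJECT_METADATA: Dict[str, Dict[str, Any]] = {
    "GTEx": {
        "projectId": "GTEx",
        "ldPopulationStructure": [
            {"ldPopulation": "pop_a", "relativeSampleSize": 0.8},
            {"ldPopulation": "pop_b", "relativeSampleSize": 0.2},
        ],
    },
}


def study_attributes(project: str) -> Dict[str, Any]:
    """Return the configured attributes for a project, or a bare default.

    Projects without an entry still yield a usable record, so the lookup
    is not restricted to GTEx.
    """
    default: Dict[str, Any] = {"projectId": project, "ldPopulationStructure": []}
    return {**default, **PROJECT_METADATA.get(project, {})}


gtex = study_attributes("GTEx")
other = study_attributes("SomeOtherProject")
```

Adding metadata for another project would then only require a new entry in the dictionary, with no change to the lookup logic.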

Contributor Author

@ireneisdoomed ireneisdoomed left a comment

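Re: df5a074
This pattern didn't work: [\_]+ matches one or more underscores, so the expression never matches an ID of this form and nothing is extracted. modified_id is my suggested fix, which uses [^_]+ to match the trailing gene segment: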

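from pyspark.sql import functions as f

# df is any existing DataFrame; the literal stands in for a real studyId.
(
    df.select(f.lit("PROJECT_QTLGROUP_GENEID").alias("original_id"))
    # Current pattern: requires "_" followed by more underscores, so no match.
    .withColumn("current_id", f.regexp_extract(f.col("original_id"), r"(.*)_[\_]+", 1))
    # Suggested pattern: "_" followed by the final run of non-underscore characters.
    .withColumn("modified_id", f.regexp_extract(f.col("original_id"), r"(.*)_[^_]+", 1))
    .show(1)
)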
+--------------------+----------+----------------+
|         original_id|current_id|     modified_id|
+--------------------+----------+----------------+
|PROJECT_QTLGROUP_...|          |PROJECT_QTLGROUP|
+--------------------+----------+----------------+
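
The same behaviour can be reproduced without Spark. The sketch below uses Python's re module and a hypothetical helper, extract_prefix, that mimics regexp_extract by returning an empty string when the pattern does not match; it is only meant to illustrate why the two patterns differ.

```python
import re


def extract_prefix(study_id: str, pattern: str) -> str:
    """Return capture group 1, or "" when the pattern does not match."""
    match = re.search(pattern, study_id)
    return match.group(1) if match else ""


study_id = "PROJECT_QTLGROUP_GENEID"

# "_[\_]+" needs an underscore followed by at least one more underscore,
# which never occurs in an ID with single-underscore separators.
current_id = extract_prefix(study_id, r"(.*)_[\_]+")

# "_[^_]+" needs an underscore followed by non-underscore characters; the
# greedy "(.*)" pushes the match to the last underscore in the ID.
modified_id = extract_prefix(study_id, r"(.*)_[^_]+")

print(repr(current_id), repr(modified_id))
```

Note that this relies on the gene identifier itself containing no underscores, as in the example ID.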

@codecov-commenter

codecov-commenter commented Dec 21, 2023

Codecov Report

Attention: 33 lines in your changes are missing coverage. Please review.

Comparison is base (42b366c) 85.67% compared to head (34ed2b2) 86.84%.
Report is 44 commits behind head on dev.

Additional details and impacted files

@@            Coverage Diff             @@
##              dev     #369      +/-   ##
==========================================
+ Coverage   85.67%   86.84%   +1.16%     
==========================================
  Files          89       92       +3     
  Lines        2101     2447     +346     
==========================================
+ Hits         1800     2125     +325     
- Misses        301      322      +21     
Files Coverage Δ
src/airflow/dags/common_airflow.py 90.38% <ø> (ø)
src/airflow/dags/finngen_preprocess.py 100.00% <100.00%> (ø)
src/otg/dataset/l2g_feature_matrix.py 82.92% <ø> (+7.31%) ⬆️
src/otg/dataset/study_locus.py 96.20% <100.00%> (+0.04%) ⬆️
src/otg/datasource/finngen/study_index.py 100.00% <100.00%> (ø)
src/otg/datasource/finngen/summary_stats.py 100.00% <100.00%> (ø)
src/otg/datasource/gwas_catalog/study_index.py 100.00% <ø> (ø)
src/otg/datasource/ukbiobank/study_index.py 100.00% <ø> (ø)
src/otg/l2g.py 58.06% <ø> (ø)
src/otg/method/l2g/evaluator.py 38.70% <100.00%> (ø)
... and 9 more

@ireneisdoomed ireneisdoomed changed the title from "feat(eqtl_catalogue): add ld structure to gtex studies" to "feat(eqtl_catalogue): study index improvements" Dec 21, 2023
@ireneisdoomed ireneisdoomed marked this pull request as ready for review December 21, 2023 16:13
@ireneisdoomed
Contributor Author

I haven't run them yet: the job ran for 42 min but was automatically cancelled.
There doesn't seem to be anything obvious in the logic or schema that would make it crash.
I'll try running it as part of the new eQTLCatalogue DAG I've defined here

@ireneisdoomed
Contributor Author

The study index we had for eQTL Catalogue (preprocess/eqtl_catalogue/study_index) is not compatible with the association datasets, for the reasons I cited above.
I had to generate a new study index with the fixes I propose. Since the full ingestion process was not scaling (PR #366), I wrote an ad hoc step that parsed the study and extracted the geneId from the already harmonised summary stats (preprocess/eqtl_catalogue/summary_stats).

The new study index is here: gs://genetics_etl_python_playground/output/python_etl/parquet/XX.XX/study_index/eqtl_catalogue:

  • Number of studies: 1,207,976
  • Linking to 39,832 distinct genes in 49 different tissues
  • Question (@tskir @DSuveges): in traitFromSource we are capturing the tissue instead of the measured transcript or the gene. Is this what we want?

The ad hoc script can be inspected here: gs://dataproc-staging-europe-west1-234703259993-hrobeqyg/google-cloud-dataproc-metainfo/6b1c3326-e1cf-470c-8e59-ac44d95fa4ac/jobs/1c7022660a0b4ab6b13d5e82069d21d3/staging/eqtl_study_index.py
The job ran in 6 min with the latest changes (https://console.cloud.google.com/dataproc/jobs/1c7022660a0b4ab6b13d5e82069d21d3/configuration?region=europe-west1&project=open-targets-genetics-dev).

@DSuveges
Contributor

DSuveges commented Jan 9, 2024

Sorry @ireneisdoomed, I had to jump on other urgent things, so I couldn't spend much time on the PR. However, regarding your question:

in traitFromSource we are capturing the tissue instead of the measured transcript or the gene. Is this what we want?

No. The actual measurement was the expression level, so it should be the gene. I assume the problem was that this field is mandatory for generating the study index, but without the summary stats you don't know the gene, so no valid study index can be constructed. Given that we already have a geneId column in the schema, the trait can be gene expression measurement/EFO_0600068, so we can generate a valid study index and then explode it with the measured gene identifier. Does that make sense? What do you think? @ireneisdoomed, @tskir
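
A rough sketch of that proposal, in plain Python rather than Spark: start from a gene-agnostic study record whose trait is the generic gene expression measurement, then produce one record per measured gene. The helper explode_by_gene, the field name traitFromSourceMappedId, and the example identifiers are assumptions made for illustration only.

```python
from typing import Dict, Iterable, List

GENE_EXPRESSION_EFO = "EFO_0600068"  # identifier quoted in the comment above


def explode_by_gene(
    base_study: Dict[str, str], gene_ids: Iterable[str]
) -> List[Dict[str, str]]:
    """Create one study record per measured gene from a gene-agnostic record."""
    return [
        {
            **base_study,
            # studyId is extended with the gene, as in PROJECT_QTLGROUP_GENEID.
            "studyId": f"{base_study['studyId']}_{gene_id}",
            "geneId": gene_id,
        }
        for gene_id in gene_ids
    ]


base = {
    "studyId": "PROJECT_QTLGROUP",
    "traitFromSource": "gene expression measurement",
    "traitFromSourceMappedId": GENE_EXPRESSION_EFO,  # hypothetical field name
}

# Placeholder gene identifiers; real ones would come from the summary stats.
studies = explode_by_gene(base, ["GENEID1", "GENEID2"])
```

The base record is valid before any summary statistics are read, and the gene only enters at the explode step.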

Contributor

@tskir tskir left a comment

Thank you Irene, all changes look good to me, and it's a good improvement & generalisation of the study index construction.

@tskir
Contributor

tskir commented Jan 9, 2024

@DSuveges @ireneisdoomed Regarding the placement and meaning of the study index fields: I chose this particular implementation because it was the least disruptive way to align with the existing schema.

The actual trait being measured is "expression of gene X in tissue Y", which isn't possible to express in a single column in any meaningful way.

So my idea of the least disruptive change was to set trait to tissue (implicitly meaning: trait = "expression in tissue Y"), and to populate gene as a separate column.

We could set the trait to the generic "gene expression measurement", but then we would necessarily have to add a "tissueId" column as well. That isn't a bad thing, but it is something to keep in mind, I think.

If we decide to go with the "gene expression measurement" + tissueId + geneId way, I think it's best to do it as a separate PR to not keep this one waiting, but again that's up for discussion.

@DSuveges
Contributor

DSuveges commented Jan 9, 2024

@tskir

If we decide to go with the "gene expression measurement" + tissueId + geneId way, I think it's best to do it as a separate PR to not keep this one waiting, but again that's up for discussion.

It's a complicated question, but I would not store the tissue id in the trait column, to avoid ambiguity. However, we'll soon have to deal with cell ids as well, so I'm wondering if we should add separate columns for both: tissueId and cellId. Let's assume we will not need to deal with temporal metadata. Yes, this question is beyond this PR.

@tskir
Contributor

tskir commented Jan 9, 2024

@DSuveges This makes sense, and I like the idea with tissueId + cellId + geneId!
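
For reference, the record shape being discussed could look something like the sketch below. It is only an illustration of the tissueId + cellId + geneId idea under assumed names and types; the EqtlStudy class and its example values are hypothetical and are not part of the existing schema.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class EqtlStudy:
    """Hypothetical study record with separate context columns."""

    studyId: str
    traitFromSource: str  # generic trait, e.g. "gene expression measurement"
    geneId: str  # the gene whose expression was measured
    tissueId: Optional[str] = None  # set when the sample is a tissue
    cellId: Optional[str] = None  # set when the sample is a cell type


example = EqtlStudy(
    studyId="PROJECT_QTLGROUP_GENEID",
    traitFromSource="gene expression measurement",
    geneId="GENEID",
    tissueId="TISSUE_PLACEHOLDER",  # placeholder, not a real ontology term
)
record = asdict(example)
```

Keeping tissue and cell type in their own optional columns leaves the trait column unambiguous, which is the concern raised above.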

@ireneisdoomed
Contributor Author

Ty for the discussion @DSuveges @tskir
It's not critical for the work I'm doing, so let's leave it for another PR. @tskir, could you make the changes?

@tskir
Contributor

tskir commented Jan 9, 2024

@ireneisdoomed Yes, I'm happy to make these changes. But just for clarity, we're going ahead with merging this PR as is, right?

@ireneisdoomed ireneisdoomed merged commit 5d4955e into dev Jan 9, 2024
3 checks passed