feat: extract credible sets and studies from eQTL Catalogue finemapping results #514

Merged: ireneisdoomed merged 23 commits into dev from il-eqtl-susie on Mar 4, 2024

ireneisdoomed
Contributor

This PR includes:

  • New DAG that generates the credible_set and study_index datasets from the eQTL Catalogue finemapping results for all eQTLs that significantly influence gene expression.
  • The process takes around 40 minutes. It introduces a new approach to preprocessing the input files: instead of reading the compressed TSVs directly with Spark, we use Dataflow to decompress them and store the output in a temporary bucket, so that ingestion can be parallelised.
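
The preprocessing idea can be illustrated locally. A gzip stream is not splittable, so Spark reads each `.tsv.gz` file in a single task; decompressing up front lets it split the plain-text files across tasks. The sketch below is a stand-in for the Dataflow step, not the job used in this PR: it relies only on the Python standard library, and the helper names (`decompress_one`, `decompress_all`) and directory layout are hypothetical.

```python
import gzip
import shutil
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import List


def decompress_one(src: Path, out_dir: Path) -> Path:
    """Decompress a single .gz file into out_dir, dropping the .gz suffix."""
    dst = out_dir / src.with_suffix("").name  # "a.tsv.gz" -> "a.tsv"
    with gzip.open(src, "rb") as fin, open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)  # stream in chunks, never load the whole file
    return dst


def decompress_all(in_dir: Path, out_dir: Path, workers: int = 4) -> List[Path]:
    """Decompress every *.tsv.gz in in_dir in parallel (hypothetical helper)."""
    out_dir.mkdir(parents=True, exist_ok=True)
    sources = sorted(in_dir.glob("*.tsv.gz"))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda src: decompress_one(src, out_dir), sources))
```

The decompressed files can then be read as ordinary TSVs and partitioned freely by the downstream reader.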

Metrics

  • Number of studies: 317 911
  • Number of credible sets: 385 100
  • Stats about the size of each credible set:
+-------+------------------+
|summary|       credSetSize|
+-------+------------------+
|  count|            385100|
|   mean|31.648021293170604|
| stddev|  99.3423119691853|
|    min|                 1|
|    25%|                 3|
|    50%|                11|
|    75%|                32|
|    max|              4090|
+-------+------------------+
  • Stats about the number of credible sets per study:
+----------------+------+
|nCredSetPerStudy| count|
+----------------+------+
|              10|     2|
|               9|     3|
|               8|     6|
|               7|    25|
|               6|    71|
|               5|   313|
|               4|  1232|
|               3|  7555|
|               2| 46542|
|               1|262162|
+----------------+------+
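
As a sanity check, the per-study table above can be reconciled with the headline metrics: the counts should sum to the number of studies, and the counts weighted by `nCredSetPerStudy` should sum to the number of credible sets. A minimal sketch in plain Python, with the numbers copied from the table (the variable names are illustrative):

```python
# nCredSetPerStudy -> number of studies, copied from the table above
cred_sets_per_study = {
    10: 2, 9: 3, 8: 6, 7: 25, 6: 71,
    5: 313, 4: 1232, 3: 7555, 2: 46542, 1: 262162,
}

# Each study appears in exactly one row, so the counts add up to the study total.
n_studies = sum(cred_sets_per_study.values())

# A study with n credible sets contributes n to the credible set total.
n_credible_sets = sum(n * count for n, count in cred_sets_per_study.items())

print(n_studies)        # 317911
print(n_credible_sets)  # 385100
```

Both totals agree with the reported metrics. The table also shows that roughly 82% of studies (262 162 of 317 911) have a single credible set.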

Note
This PR only processes results where the quantification method is ge.
Following a discussion with @DSuveges and @d0choa, we will bring in credible sets from all quantification methods. I suggest making those changes in a subsequent PR, so that the work done here is checkpointed in case we need to revert.
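
The restriction to ge amounts to a filter on the quantification method in the study metadata. A minimal sketch, assuming the metadata is a TSV with a `quant_method` column; the column names, the sample rows and the `select_datasets` helper are assumptions for illustration, not taken from this PR:

```python
import csv
import io

# Hypothetical metadata rows; real values would come from the eQTL Catalogue metadata.
METADATA_TSV = (
    "study_id\tdataset_id\tquant_method\n"
    "S1\tD1\tge\n"
    "S1\tD2\texon\n"
    "S2\tD3\tge\n"
)


def select_datasets(tsv_text: str, quant_methods: set) -> list:
    """Return the dataset_ids whose quant_method is in quant_methods."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [row["dataset_id"] for row in reader if row["quant_method"] in quant_methods]


# Current behaviour: keep gene-expression results only.
ge_only = select_datasets(METADATA_TSV, {"ge"})
```

Under this framing, bringing in the other methods later would mean widening the set passed as `quant_methods` rather than restructuring the pipeline.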

Co-authored with @d0choa

Collaborator

@d0choa d0choa left a comment

Wonderful!

src/gentropy/config.py
StructField("sample_group", StringType(), True),
StructField("tissue_id", StringType(), True),
StructField("tissue_label", StringType(), True),
StructField("condition_label", StringType(), True),
Collaborator

condition_label is still to be added to the study index.

Contributor Author

Yakov mentioned that carrying it over in the study index should be enough.

src/gentropy/eqtl_catalogue.py
@ireneisdoomed ireneisdoomed merged commit ec9d2c7 into dev Mar 4, 2024
3 checks passed
@ireneisdoomed ireneisdoomed deleted the il-eqtl-susie branch July 15, 2024 14:46