feat(l2g): distance features based on weighted score #545

ireneisdoomed · 2024-03-15T13:03:51Z

This PR is based on #544 - please review after it is merged

✨ Context

While revisioning feature extraction for #544, I realised I wasn't factoring in PIPs to extract distance based features. By doing so, we will have a better representation of the distance contribution.

Latest L2G (24.03) reflect this changes.

🛠 What does this PR implement

The previous distance based features consisted in the raw number of minimum or average of bps that separated a variant to the gene's TSS (see column distance)

# variantId: 1_100225345_A_G
+---------------+--------+-----+                                                
|         geneId|distance|score|
+---------------+--------+-----+
|ENSG00000137992|   24489| 0.95|
|ENSG00000137996|   40871| 0.92|
|ENSG00000122477|   47078| 0.91|
|ENSG00000122435|   92195| 0.82|
|ENSG00000156876|   92390| 0.82|
|ENSG00000079335|  119656| 0.76|
|ENSG00000156875|  187250| 0.63|
|ENSG00000283761|  255334| 0.49|
|ENSG00000117620|  255994| 0.49|
|ENSG00000181656|  312794| 0.37|
|ENSG00000162688|  374984| 0.25|
|ENSG00000156869|  458710| 0.08|
|ENSG00000162692|  494397| 0.01|
+---------------+--------+-----+

As you can see, the score is a normalised value where higher scores denote more proximity.

This PR changes the value of the distance related features so that it is based on the score weighted by the variant's PIP, instead of simply the distance values.

🙈 Missing

Update W&B dashboard to understand changes.

🚦 Before submitting

Do these changes cover one single feature (one change at a time)?
Did you read the contributor guideline?
Did you make sure to update the documentation with your changes?
Did you make sure there is no commented out code in this PR?
Did you follow conventional commits standards in PR title and commit messages?
Did you make sure the branch is up-to-date with the dev branch?
Did you write any new necessary tests?
Did you make sure the changes pass local tests (make test)?
Did you make sure the changes pass pre-commit rules (e.g poetry run pre-commit run --all-files)?

…-fm-improvements

…into il-l2g-weight-distance

addramir · 2024-03-18T16:34:46Z

In general the logic is correct, but I don't understand why you use the "score" and not the "distance" to TSS itself?
Like in formula ""f.col("score") * f.col("variantInLocusPosteriorProbability")" if you will use "distance" to tss instead of "score" it will give you the average distance to tss (dist_tss_ave feature).

To note, in the production we had three more features:

Average distance to footprint (dist_foot_ave): the same as above but instead of distance to tss - distance to footprint of the gene.
Neighbourhood score for dist_tss_ave: you calculate the minimal dist_tss_ave across the genes in the locus and for each gene calculate the relative score as: dist_tss_ave_nbh=-log(dist_tss_ave/min(dist_tss_ave)). For the closest gene it should be 0, for the distant genes it should be a negative value. Doesn't work for the cases when the min(dist_tss_ave)=0. Probably a bug in a prod.
Neighbourhood score for dist_foot_ave: same as above but for footprint.

ireneisdoomed · 2024-03-19T13:51:01Z

@addramir Thanks for the review!

As discussed on Slack, using the score column makes sense a priori because it encodes the same information as using just the distance values but scaled. Normalising is a typical operation in feature engineering.

Some metrics about the impact in performace of this run vs the 24.01 release:

0.13% increase in accuracy
0.59% decrease in AUC
0.07% increase in F1

This run had 3 major changes:

more colocalisation results thanks to the eQTLCat credible sets
more colocalisation based features derived from newer molQTLs: tuQTL, sQTL, pQTL
the changes described in this PR

What we can do to make sure the AUC decrease is not due to these changes, is running L2G with the same data but with the previous definition of distance features. It can be done by end of day today. Let me know

addramir

I think it is fine. We probably need to discuss it again in the future after we add susie FM.

ireneisdoomed added 9 commits March 12, 2024 12:31

refactor(l2g): streamline coloc feature factory

7c114d0

Merge branch 'dev' of https://github.com/opentargets/gentropy into il…

7723453

…-fm-improvements

perf(l2g): make joins in _get_vep_features lighter

c31786b

fix(l2g): use weighted scores for _get_vep_features

c0db573

refactor(l2g): minor improvements

ebae33f

perf(l2g): adapt session to set partition number for shuffling to 800

6340e89

Merge branch 'dev' of https://github.com/opentargets/gentropy into il…

f368731

…-fm-improvements

feat(l2g): distance features based on weighted score

28f760a

docs: update docs

d56692d

github-actions bot added size-M Method Dataset Step Feature labels Mar 15, 2024

ireneisdoomed and others added 7 commits March 15, 2024 11:47

refactor(l2g): select features of interest in data outside trainer

73f9913

chore(l2g): log annotated gold standards in w&b

36fb4fe

fix: update test_train

5fe9b4b

Merge branch 'dev' into il-log-l2g

7f84986

Merge branch 'dev' into il-log-l2g

8d2c943

Merge branch 'il-log-l2g' of https://github.com/opentargets/gentropy …

7ccb836

…into il-l2g-weight-distance

chore: merge

6fd3dfa

github-actions bot added size-S and removed size-M Dataset Step labels Mar 18, 2024

fix(l2g): typo in _get_tss_distance_features with no consequences

df7c83b

ireneisdoomed requested a review from addramir March 18, 2024 16:11

addramir approved these changes Mar 19, 2024

View reviewed changes

Merge branch 'dev' into il-l2g-weight-distance

8843396

ireneisdoomed mentioned this pull request Mar 19, 2024

Implement distance to footprint features opentargets/issues#3258

Closed

ireneisdoomed merged commit 160051c into dev Mar 19, 2024
4 checks passed

ireneisdoomed deleted the il-l2g-weight-distance branch March 19, 2024 14:09

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(l2g): distance features based on weighted score #545

feat(l2g): distance features based on weighted score #545

ireneisdoomed commented Mar 15, 2024

addramir commented Mar 18, 2024

ireneisdoomed commented Mar 19, 2024

addramir left a comment

feat(l2g): distance features based on weighted score #545

feat(l2g): distance features based on weighted score #545

Conversation

ireneisdoomed commented Mar 15, 2024

✨ Context

🛠 What does this PR implement

🙈 Missing

🚦 Before submitting

addramir commented Mar 18, 2024

ireneisdoomed commented Mar 19, 2024

addramir left a comment

Choose a reason for hiding this comment