Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(l2g): distance features based on weighted score #545

Merged
merged 18 commits into from
Mar 19, 2024

Conversation

ireneisdoomed
Copy link
Contributor

This PR is based on #544 - please review after it is merged

✨ Context

While revisioning feature extraction for #544, I realised I wasn't factoring in PIPs to extract distance based features. By doing so, we will have a better representation of the distance contribution.

Latest L2G (24.03) reflect this changes.

🛠 What does this PR implement

The previous distance based features consisted in the raw number of minimum or average of bps that separated a variant to the gene's TSS (see column distance)

# variantId: 1_100225345_A_G
+---------------+--------+-----+                                                
|         geneId|distance|score|
+---------------+--------+-----+
|ENSG00000137992|   24489| 0.95|
|ENSG00000137996|   40871| 0.92|
|ENSG00000122477|   47078| 0.91|
|ENSG00000122435|   92195| 0.82|
|ENSG00000156876|   92390| 0.82|
|ENSG00000079335|  119656| 0.76|
|ENSG00000156875|  187250| 0.63|
|ENSG00000283761|  255334| 0.49|
|ENSG00000117620|  255994| 0.49|
|ENSG00000181656|  312794| 0.37|
|ENSG00000162688|  374984| 0.25|
|ENSG00000156869|  458710| 0.08|
|ENSG00000162692|  494397| 0.01|
+---------------+--------+-----+

As you can see, the score is a normalised value where higher scores denote more proximity.

This PR changes the value of the distance related features so that it is based on the score weighted by the variant's PIP, instead of simply the distance values.

🙈 Missing

Update W&B dashboard to understand changes.

🚦 Before submitting

  • Do these changes cover one single feature (one change at a time)?
  • Did you read the contributor guideline?
  • Did you make sure to update the documentation with your changes?
  • Did you make sure there is no commented out code in this PR?
  • Did you follow conventional commits standards in PR title and commit messages?
  • Did you make sure the branch is up-to-date with the dev branch?
  • Did you write any new necessary tests?
  • Did you make sure the changes pass local tests (make test)?
  • Did you make sure the changes pass pre-commit rules (e.g poetry run pre-commit run --all-files)?

@ireneisdoomed ireneisdoomed requested a review from addramir March 18, 2024 16:11
@addramir
Copy link
Contributor

In general the logic is correct, but I don't understand why you use the "score" and not the "distance" to TSS itself?
Like in formula ""f.col("score") * f.col("variantInLocusPosteriorProbability")" if you will use "distance" to tss instead of "score" it will give you the average distance to tss (dist_tss_ave feature).

To note, in the production we had three more features:

  1. Average distance to footprint (dist_foot_ave): the same as above but instead of distance to tss - distance to footprint of the gene.

  2. Neighbourhood score for dist_tss_ave: you calculate the minimal dist_tss_ave across the genes in the locus and for each gene calculate the relative score as: dist_tss_ave_nbh=-log(dist_tss_ave/min(dist_tss_ave)). For the closest gene it should be 0, for the distant genes it should be a negative value. Doesn't work for the cases when the min(dist_tss_ave)=0. Probably a bug in a prod.

  3. Neighbourhood score for dist_foot_ave: same as above but for footprint.

@ireneisdoomed
Copy link
Contributor Author

@addramir Thanks for the review!

As discussed on Slack, using the score column makes sense a priori because it encodes the same information as using just the distance values but scaled. Normalising is a typical operation in feature engineering.

Some metrics about the impact in performace of this run vs the 24.01 release:

  • 0.13% increase in accuracy
  • 0.59% decrease in AUC
  • 0.07% increase in F1

This run had 3 major changes:

  • more colocalisation results thanks to the eQTLCat credible sets
  • more colocalisation based features derived from newer molQTLs: tuQTL, sQTL, pQTL
  • the changes described in this PR

What we can do to make sure the AUC decrease is not due to these changes, is running L2G with the same data but with the previous definition of distance features. It can be done by end of day today. Let me know

Copy link
Contributor

@addramir addramir left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is fine. We probably need to discuss it again in the future after we add susie FM.

@ireneisdoomed ireneisdoomed merged commit 160051c into dev Mar 19, 2024
4 checks passed
@ireneisdoomed ireneisdoomed deleted the il-l2g-weight-distance branch March 19, 2024 14:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants