feat(l2g): distance features based on weighted score #545
…into il-l2g-weight-distance
In general the logic is correct, but I don't understand why you use the "score" rather than the distance to the TSS itself. Note that in production we had three more features:
@addramir Thanks for the review! As discussed on Slack, using the score column makes sense a priori because it encodes the same information as the raw distance values, only scaled. Normalising is a typical operation in feature engineering. Some metrics on the performance impact of this run vs the 24.01 release:
This run had 3 major changes:
To make sure the AUC decrease is not due to these changes, we can run L2G with the same data but with the previous definition of the distance features. It can be done by end of day today. Let me know.
I think it is fine. We probably need to discuss it again in the future, after we add SuSiE FM.
This PR is based on #544 - please review after it is merged
✨ Context
While revising feature extraction for #544, I realised I wasn't factoring in PIPs when extracting distance-based features. Doing so gives a better representation of the distance contribution.
The latest L2G (24.03) reflects these changes.
🛠 What does this PR implement
The previous distance-based features consisted of the raw minimum or average number of base pairs separating a variant from the gene's TSS (see column distance).
As you can see, the score is a normalised value where higher scores denote closer proximity.
This PR changes the distance-related features so that they are based on the score weighted by the variant's PIP, rather than on the raw distance values alone.
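As a rough illustration of the idea, and not the code in this PR, here is a minimal sketch in plain Python. The row layout, the field names (study_locus, gene, score, pip), the function name and the max/mean aggregations are all assumptions made for the example, and the values are made up.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical input: one row per (credible set, variant, gene) triple.
# "score" stands for the normalised distance score (higher = closer to the TSS)
# and "pip" for the variant's posterior inclusion probability.
rows = [
    {"study_locus": "sl1", "variant": "v1", "gene": "g1", "score": 0.9, "pip": 0.5},
    {"study_locus": "sl1", "variant": "v2", "gene": "g1", "score": 0.5, "pip": 0.25},
    {"study_locus": "sl1", "variant": "v1", "gene": "g2", "score": 0.25, "pip": 0.5},
]


def weighted_distance_features(rows):
    """Aggregate PIP-weighted scores per (study_locus, gene) pair.

    The choice of max and mean as aggregations is an assumption for this sketch.
    """
    weighted = defaultdict(list)
    for row in rows:
        key = (row["study_locus"], row["gene"])
        # Weight each variant's distance score by its PIP.
        weighted[key].append(row["score"] * row["pip"])
    return {
        key: {
            "max_weighted_score": max(values),
            "mean_weighted_score": mean(values),
        }
        for key, values in weighted.items()
    }


features = weighted_distance_features(rows)
```

Here the max plays the role that the minimum distance played before, since a higher score denotes closer proximity; a variant with a high score but a low PIP is down-weighted accordingly.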
🙈 Missing
Update W&B dashboard to understand changes.
🚦 Before submitting
… dev branch?
… (make test)?
… (poetry run pre-commit run --all-files)?