feat(l2g): add coloc based features #400
Conversation
```python
)
).persist()

intercept = 0.0001
```
The value of the maximum PP in the neighbourhood is the log of the difference between the maximum PP for the gene and the maximum PP across all genes. I add this intercept so that when those two values match, the difference isn't 0 and the log is still defined.
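Roughly, what I mean is something like this minimal PySpark sketch; the column names and the direction of the difference are hypothetical, not the actual feature code:

```python
import pyspark.sql.functions as f
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

intercept = 0.0001

# Hypothetical columns: maxGenePp is the maximum posterior probability for
# the gene, maxRegionPp the maximum across all genes. For g1 the two maxima
# coincide, so without the intercept the log argument would be 0.
df = spark.createDataFrame(
    [("g1", 0.9, 0.9), ("g2", 0.4, 0.9)],
    ["geneId", "maxGenePp", "maxRegionPp"],
)

neighbourhood = df.withColumn(
    "neighbourhoodFeature",
    f.log(f.col("maxRegionPp") - f.col("maxGenePp") + f.lit(intercept)),
)
```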
I don't fully understand the logic here. The difference can be negative, in which case the log is undefined even with the intercept.
```diff
@@ -145,7 +162,7 @@ def __post_init__(self: LocusToGeneStep) -> None:
     study_locus=credible_set,
     study_index=studies,
     variant_gene=v2g,
-    # colocalisation=coloc,
+    colocalisation=coloc,
 )
```
```python
# Join and fill null values with 0
```
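For context, a minimal sketch of the join-and-fill pattern this comment refers to; the DataFrame and column names are illustrative, not taken from the PR:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Illustrative inputs: one locus/gene pair has a coloc-based feature,
# the other does not.
base = spark.createDataFrame(
    [("sl1", "g1"), ("sl1", "g2")], ["studyLocusId", "geneId"]
)
coloc = spark.createDataFrame(
    [("sl1", "g1", 0.8)], ["studyLocusId", "geneId", "eqtlColocClppMaximum"]
)

# The left join keeps every pair; pairs without a coloc signal come out
# as null, which the fill then replaces with 0.
features = base.join(coloc, ["studyLocusId", "geneId"], "left").na.fill(0)
```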
Do we add 0 instead of null in the features?
That's right. Most models can't handle missing values, so you have to treat them somehow at the preprocessing step, and filling with 0 is what we used to do in production. However, XGBoost has a smart strategy for handling nulls that has been on my "to try" list. XGBoost is a gradient-boosted ensemble of decision trees, and when a split encounters missing values during training it evaluates the loss of sending them down each branch and learns the direction where the loss is smaller.
I haven't experimented with this yet, so I can't estimate the impact. Implementation-wise, it should be as easy as retraining without filling nulls; see the sketch below. Should I give it a go?
A smarter way to get around nulls would be to impute them, but I think we'd need a different imputation strategy per feature.
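If we try it, the change could look something like this minimal sketch (synthetic data, not the L2G training code), where NaNs are left in the matrix instead of being filled:

```python
import numpy as np
from xgboost import XGBClassifier

rng = np.random.default_rng(42)
X = rng.random((200, 4))
X[rng.random((200, 4)) < 0.3] = np.nan  # a large share of missing values
y = rng.integers(0, 2, size=200)

# No imputation step: XGBoost treats NaN as "missing" and routes those
# rows down the branch that minimised the training loss at each split.
model = XGBClassifier(n_estimators=50, max_depth=3)
model.fit(X, y)
```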
My previous experience with gradient boosting was likewise to fill nulls with 0s, so I see why the original L2G had to do the same and why this is our default strategy. Imputation is usually not recommended when the proportion of nulls is too large (as is the case for some of our features).
Exploring the impact of handling the nulls differently is potentially useful, but it goes beyond this PR. We can queue it with all the other L2G improvements that @addramir has in mind and revisit it after the first release.
Codecov Report

```diff
@@           Coverage Diff            @@
##              dev     #400    +/-   ##
=========================================
+ Coverage   85.67%   85.81%   +0.14%
=========================================
  Files          89       96       +7
  Lines        2101     2595     +494
=========================================
+ Hits         1800     2227     +427
- Misses        301      368      +67
```
Getting there...
This PR includes changes to the L2G pipeline to bring in eCaviar-based features.