
feat(l2g): add coloc based features #400

Merged: 13 commits into dev from il-l2g-coloc, Jan 12, 2024
Conversation

@ireneisdoomed (Contributor) commented on Jan 10, 2024:

This PR includes changes to the L2G pipeline to bring in eCaviar-based features.

  • The business logic to extract the feature is practically unchanged, except that the maximum CLPPs in the neighborhood are now on a logarithmic scale. L2G should learn to prioritise genes where this value is more negative.
  • Bug fixes in how features are built into the feature matrix.
  • Removal of the overlap dependency. This adds extra logic to extract overlaps only for the study loci in the curation set, which doesn't look very pretty (see the sketch after this list).
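For illustration, a minimal sketch of what that extra overlap-filtering logic might look like in PySpark. The column names `leftStudyLocusId` and `studyLocusId` are assumptions for the example, not necessarily the pipeline's actual schema:

```python
from pyspark.sql import DataFrame


def overlaps_for_curated_loci(overlaps: DataFrame, curation: DataFrame) -> DataFrame:
    """Restrict the overlaps dataset to rows whose (hypothetical)
    leftStudyLocusId appears in the gold-standard curation set."""
    curated_ids = curation.select("studyLocusId")
    # left_semi keeps only matching rows from `overlaps`, without
    # pulling any curation columns into the result.
    return overlaps.join(
        curated_ids,
        overlaps["leftStudyLocusId"] == curated_ids["studyLocusId"],
        "left_semi",
    )
```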

)
).persist()

intercept = 0.0001
@ireneisdoomed (Contributor, Author) commented:

The value of the maximum PP in the neighborhood is the log of the difference between the maximum PP for the gene and the maximum PP across genes. I add this intercept so that when these two values match I can still calculate the log.
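A minimal sketch of the arithmetic behind the intercept (hypothetical function and argument names; the actual pipeline computes this over the colocalisation dataset in Spark, and this sketch assumes the neighborhood maximum includes the gene itself, so the difference is non-negative):

```python
import numpy as np


def max_clpp_neighbourhood_feature(
    gene_max_clpp: float, neighbourhood_max_clpp: float, intercept: float = 0.0001
) -> float:
    """Log of the gap between the neighborhood-wide maximum CLPP and the
    gene's own maximum CLPP. The intercept keeps the log defined when the
    gene itself carries the neighborhood maximum (gap == 0)."""
    return float(np.log(neighbourhood_max_clpp - gene_max_clpp + intercept))


# The top gene in the neighborhood gets the most negative value:
max_clpp_neighbourhood_feature(0.9, 0.9)  # log(0.0001) ≈ -9.21
max_clpp_neighbourhood_feature(0.2, 0.9)  # log(0.7001) ≈ -0.36
```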

A reviewer (Contributor) replied:

I don't fully understand the logic here. The difference can be negative.

@@ -145,7 +162,7 @@ def __post_init__(self: LocusToGeneStep) -> None:
     study_locus=credible_set,
     study_index=studies,
     variant_gene=v2g,
-    # colocalisation=coloc,
+    colocalisation=coloc,
 )

# Join and fill null values with 0
A reviewer (Contributor) asked:

Do we add 0 instead of Null in features?

@ireneisdoomed (Contributor, Author) replied:

That's right. Most models can't handle missing values, so you need to treat them somehow at the preprocessing step; this is what we used to do in production. However, XGBoost has a smart strategy to handle nulls, and it's been on my "to try" list. XGBoost is essentially an ensemble of decision trees, and in theory, when a node encounters a null it evaluates the loss in each possible split direction and sends the null down the branch where the loss is smaller.

[Screenshot: Screenshot_20240112-093558.png]

I haven't experimented with this yet, so I can't estimate the impact. Implementation-wise, it should be as easy as retraining without filling nulls. Should I give it a go?

A smarter way to deal with nulls is to infer them, but I think we'd need a different imputation strategy per feature.
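A minimal sketch of the two strategies on synthetic data (assuming xgboost's scikit-learn API, which treats `np.nan` as missing by default):

```python
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 4))
X[rng.random(X.shape) < 0.3] = np.nan  # knock out ~30% of values
y = (np.nansum(X, axis=1) > 0).astype(int)

# Current strategy: fill nulls with 0 at preprocessing time.
model_filled = xgb.XGBClassifier(n_estimators=50)
model_filled.fit(np.nan_to_num(X, nan=0.0), y)

# Alternative: leave NaNs in place and let XGBoost route each one
# down whichever branch minimises the loss at every split.
model_native = xgb.XGBClassifier(n_estimators=50)
model_native.fit(X, y)
```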

A collaborator replied:

My previous experience with gradient boosting was also to fill nulls with 0s, so I see why the original L2G had to do the same and why this is our default strategy. Imputation is usually not recommended when the proportion of nulls is too large (as is the case for some of our features).

Exploring the impact of handling nulls differently is potentially useful, but it goes beyond this PR. We can queue it with all the other L2G improvements that @addramir has in mind and revisit it after the first release.

@codecov-commenter commented on Jan 12, 2024:

Codecov Report

Attention: 93 lines in your changes are missing coverage. Please review.

Comparison is base (42b366c) 85.67% compared to head (ae1b7b9) 85.81%.
Report is 50 commits behind head on dev.


@@            Coverage Diff             @@
##              dev     #400      +/-   ##
==========================================
+ Coverage   85.67%   85.81%   +0.14%     
==========================================
  Files          89       96       +7     
  Lines        2101     2595     +494     
==========================================
+ Hits         1800     2227     +427     
- Misses        301      368      +67     
Files Coverage Δ
src/airflow/dags/common_airflow.py 90.38% <100.00%> (ø)
src/airflow/dags/finngen_preprocess.py 100.00% <100.00%> (ø)
src/airflow/dags/gwas_catalog_harmonisation.py 43.47% <ø> (ø)
src/airflow/dags/gwas_curation_update.py 100.00% <100.00%> (ø)
src/otg/common/session.py 87.50% <100.00%> (+0.32%) ⬆️
src/otg/dataset/dataset.py 91.80% <100.00%> (ø)
src/otg/dataset/l2g_feature_matrix.py 82.92% <100.00%> (+7.31%) ⬆️
src/otg/dataset/l2g_prediction.py 90.90% <100.00%> (+0.43%) ⬆️
src/otg/dataset/study_locus.py 96.20% <100.00%> (+0.04%) ⬆️
src/otg/datasource/finngen/study_index.py 100.00% <100.00%> (ø)
... and 25 more

@d0choa (Collaborator) left a comment:
Getting there...

@d0choa merged commit eeb9cd5 into dev on Jan 12, 2024
3 checks passed
@d0choa deleted the il-l2g-coloc branch on January 12, 2024 at 11:45