
feat(l2g): add coloc based features #400

Merged: 13 commits into dev from il-l2g-coloc, Jan 12, 2024
Conversation

@ireneisdoomed (Contributor) commented on Jan 10, 2024:

This PR includes changes to the L2G pipeline to bring in eCaviar-based features.

  • The business logic to extract the feature is practically unchanged, except that the maximum CLPPs in the neighborhood are now on a logarithmic scale. L2G should learn to prioritise genes where this value is more negative.
  • Bug fixes in how features are built into the feature matrix.
  • Removal of the overlap dependency. This adds extra logic to extract overlaps only for the study loci in the curation set, which doesn't look very pretty (see the sketch after this list).
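For illustration, a minimal sketch of what that extra overlap-filtering logic might look like in PySpark. The column names `leftStudyLocusId` and `studyLocusId` are assumptions for the example, not necessarily the pipeline's actual schema:

```python
from pyspark.sql import DataFrame


def overlaps_for_curated_loci(overlaps: DataFrame, curation: DataFrame) -> DataFrame:
    """Restrict the overlaps dataset to rows whose (hypothetical)
    leftStudyLocusId appears in the gold-standard curation set."""
    curated_ids = curation.select("studyLocusId")
    # left_semi keeps only matching rows from `overlaps`, without
    # pulling any curation columns into the result.
    return overlaps.join(
        curated_ids,
        overlaps["leftStudyLocusId"] == curated_ids["studyLocusId"],
        "left_semi",
    )
```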

)
).persist()

intercept = 0.0001
@ireneisdoomed (Contributor, Author) commented:

The value of the maximum PP in the neighborhood is the log of the difference between the maximum PP for the gene and the maximum PP across genes. I add this intercept so that when these two values match I can still calculate the log.
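A minimal sketch of the arithmetic behind the intercept (hypothetical function and argument names; the actual pipeline computes this over the colocalisation dataset in Spark, and this sketch assumes the neighborhood maximum includes the gene itself, so the difference is non-negative):

```python
import numpy as np


def max_clpp_neighbourhood_feature(
    gene_max_clpp: float, neighbourhood_max_clpp: float, intercept: float = 0.0001
) -> float:
    """Log of the gap between the neighborhood-wide maximum CLPP and the
    gene's own maximum CLPP. The intercept keeps the log defined when the
    gene itself carries the neighborhood maximum (gap == 0)."""
    return float(np.log(neighbourhood_max_clpp - gene_max_clpp + intercept))


# The top gene in the neighborhood gets the most negative value:
max_clpp_neighbourhood_feature(0.9, 0.9)  # log(0.0001) ≈ -9.21
max_clpp_neighbourhood_feature(0.2, 0.9)  # log(0.7001) ≈ -0.36
```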

A reviewer (Contributor) replied:

I don't fully understand the logic here. The difference can be negative.

@@ -145,7 +162,7 @@ def __post_init__(self: LocusToGeneStep) -> None:
     study_locus=credible_set,
     study_index=studies,
     variant_gene=v2g,
-    # colocalisation=coloc,
+    colocalisation=coloc,
 )

# Join and fill null values with 0
A reviewer (Contributor) asked:

Do we add 0 instead of Null in features?

@ireneisdoomed (Contributor, Author) replied:

That's right. Most models can't handle missing values, so you need to treat them somehow at the preprocessing step; this is what we used to do in production. However, XGBoost has a smart strategy to handle nulls, and it's been on my "to try" list. XGBoost is essentially an ensemble of decision trees, and in theory, when a node encounters a null it evaluates the loss in each possible split direction and sends the null down the branch where the loss is smaller.

[Screenshot: Screenshot_20240112-093558.png]

I haven't experimented with this yet, so I can't estimate the impact. Implementation-wise, it should be as easy as retraining without filling nulls. Should I give it a go?

A smarter way to deal with nulls is to infer them, but I think we'd need a different imputation strategy per feature.
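A minimal sketch of the two strategies on synthetic data (assuming xgboost's scikit-learn API, which treats `np.nan` as missing by default):

```python
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 4))
X[rng.random(X.shape) < 0.3] = np.nan  # knock out ~30% of values
y = (np.nansum(X, axis=1) > 0).astype(int)

# Current strategy: fill nulls with 0 at preprocessing time.
model_filled = xgb.XGBClassifier(n_estimators=50)
model_filled.fit(np.nan_to_num(X, nan=0.0), y)

# Alternative: leave NaNs in place and let XGBoost route each one
# down whichever branch minimises the loss at every split.
model_native = xgb.XGBClassifier(n_estimators=50)
model_native.fit(X, y)
```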

A collaborator replied:

My previous experience with gradient boosting was also to fill nulls with 0s, so I see why the original L2G had to do the same and why this is our default strategy. Imputation is usually not recommended when the proportion of nulls is too large (as is the case for some of our features).

Exploring the impact of handling nulls differently is potentially useful, but it goes beyond this PR. We can queue it with all the other L2G improvements that @addramir has in mind and revisit it after the first release.

@codecov-commenter commented on Jan 12, 2024:

Codecov Report

Attention: 93 lines in your changes are missing coverage. Please review.

Comparison is base (42b366c) 85.67% compared to head (ae1b7b9) 85.81%.
Report is 50 commits behind head on dev.


@@            Coverage Diff             @@
##              dev     #400      +/-   ##
==========================================
+ Coverage   85.67%   85.81%   +0.14%     
==========================================
  Files          89       96       +7     
  Lines        2101     2595     +494     
==========================================
+ Hits         1800     2227     +427     
- Misses        301      368      +67     
Files Coverage Δ
src/airflow/dags/common_airflow.py 90.38% <100.00%> (ø)
src/airflow/dags/finngen_preprocess.py 100.00% <100.00%> (ø)
src/airflow/dags/gwas_catalog_harmonisation.py 43.47% <ø> (ø)
src/airflow/dags/gwas_curation_update.py 100.00% <100.00%> (ø)
src/otg/common/session.py 87.50% <100.00%> (+0.32%) ⬆️
src/otg/dataset/dataset.py 91.80% <100.00%> (ø)
src/otg/dataset/l2g_feature_matrix.py 82.92% <100.00%> (+7.31%) ⬆️
src/otg/dataset/l2g_prediction.py 90.90% <100.00%> (+0.43%) ⬆️
src/otg/dataset/study_locus.py 96.20% <100.00%> (+0.04%) ⬆️
src/otg/datasource/finngen/study_index.py 100.00% <100.00%> (ø)
... and 25 more

@d0choa (Collaborator) left a comment:
Getting there...

@d0choa merged commit eeb9cd5 into dev on Jan 12, 2024
3 checks passed
@d0choa deleted the il-l2g-coloc branch on January 12, 2024 at 11:45