chore: improvements to generate 2401 data release #436

ireneisdoomed · 2024-01-18T11:59:52Z

This PR changes the ETL process so that it generates all outputs for the data release. The changes are easiest to follow through the commit history, but in summary:

Changes to L2G:
- Separated the train and predict steps in the DAG
- Included vepMeanNeighborhood and vepMean in the features list
- Reverted 4eafaf2, which introduced a bug in the generation of coloc features; I have taken a different approach
- Added specific Spark configuration to allocate driver and executor memory
- Other bugfixes

Other changes:
- Included the colocalisation step in the DAG
- Added gene_index to the data release bucket. I need to write this dataset to extract V2G. Conceptually it's very similar to target_index, but it's just 3 MB.
- Infrastructure changes: n1-highmem-8 was not enough to run the L2G steps, because most of the operations happen on the driver node. create_cluster's default machine is now n1-highmem-16.
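As a rough illustration of the Spark memory configuration mentioned in the L2G changes, here is a minimal sketch that builds the two standard Spark memory properties and renders them as `spark-submit --conf` arguments. The helper names and the memory sizes are hypothetical placeholders, not the values or code used in this PR.

```python
from typing import Dict, List


def l2g_spark_properties(driver_memory: str, executor_memory: str) -> Dict[str, str]:
    """Build Spark memory properties for a memory-hungry step (hypothetical helper)."""
    return {
        "spark.driver.memory": driver_memory,
        "spark.executor.memory": executor_memory,
    }


def to_spark_submit_args(properties: Dict[str, str]) -> List[str]:
    """Render properties as repeated `--conf key=value` arguments, sorted by key."""
    args: List[str] = []
    for key, value in sorted(properties.items()):
        args.extend(["--conf", f"{key}={value}"])
    return args


# Placeholder sizes chosen for illustration only.
props = l2g_spark_properties(driver_memory="48g", executor_memory="48g")
print(to_spark_submit_args(props))
```

Whatever sizes are chosen have to fit within the RAM of the machine type, which is presumably why the memory settings and the move to a larger default machine appear together.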

--- Added later:

Added the interaction dataset to the data release bucket. This is a similar case to target_index: it is an input that L2G needs to train the model.
Rearranged the inputs for V2G and L2G so that they all point to files in static_assets or in the release bucket.


DSuveges · 2024-01-18T12:23:08Z

config/datasets/ot_gcp.yaml

@@ -64,4 +63,5 @@ colocalisation: ${datasets.release_folder}/colocalisation
 study_index: ${datasets.release_folder}/study_index
 variant_index: ${datasets.release_folder}/variant_index
 credible_set: ${datasets.release_folder}/credible_set
+gene_index: ${datasets.release_folder}/gene_index


I still don't understand why the target dataset is in the output. The ETL does nothing with genes: we are not aggregating information for genes, and we are not enriching the gene dataset (as far as I know). All gene-related information is in either the v2g or the variant annotation dataset. Again, if there were an actual downstream product that required gene metadata, I would understand the placement.

I think we are all thinking the same thing from slightly different angles. No right or wrong. Whatever we do now is likely to change again.

It's simply because the gene index is a dependency of V2G.
If it's a big deal, ~~I can delete it after V2G is done~~, we can delete the gene index step and generate it on the fly in the V2G step.

I think intervals are also an input to V2G. Do we want to share that dataset as well?

No. Intervals are a static asset, target is not.
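For readers unfamiliar with the `${datasets.release_folder}/gene_index` entry under discussion, the sketch below shows how such a `${...}` reference could expand into a full path. It is a simplified stand-in written for illustration: the flat dotted keys, the `resolve` helper, and the bucket path are all assumptions, and the project's real configuration library may behave differently.

```python
import re
from typing import Dict

# Matches a single ${key} reference inside a string.
_PATTERN = re.compile(r"\$\{([^}]+)\}")


def resolve(value: str, config: Dict[str, str]) -> str:
    """Replace each ${key} in value with config[key] until no references remain."""
    while True:
        match = _PATTERN.search(value)
        if match is None:
            return value
        value = value[: match.start()] + config[match.group(1)] + value[match.end():]


config = {
    # Hypothetical release location, not the real bucket.
    "datasets.release_folder": "gs://example-bucket/releases/24.01",
    "datasets.gene_index": "${datasets.release_folder}/gene_index",
}
print(resolve(config["datasets.gene_index"], config))
```

The point of the pattern is that every release output hangs off a single `release_folder` value, so registering `gene_index` as an output only takes one extra line in the config.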


codecov-commenter · 2024-01-18T14:29:38Z

Codecov Report

Attention: 178 lines in your changes are missing coverage. Please review.

Comparison is base (42b366c) 85.67% compared to head (08bf447) 85.99%.
Report is 86 commits behind head on dev.

Additional details and impacted files

@@            Coverage Diff             @@
##              dev     #436      +/-   ##
==========================================
+ Coverage   85.67%   85.99%   +0.31%     
==========================================
  Files          89       96       +7     
  Lines        2101     2627     +526     
==========================================
+ Hits         1800     2259     +459     
- Misses        301      368      +67
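The percentages in the coverage diff follow from the hits and lines counts it reports. The short check below recomputes them; the function name is made up for illustration, and how Codecov rounds the delta is not stated in the report, so the last digit of the difference may not match exactly.

```python
def coverage_percent(hits: int, lines: int) -> float:
    """Coverage as the percentage of tracked lines that were hit."""
    return 100.0 * hits / lines


# Counts taken from the coverage diff: base is dev, head is this PR.
base = coverage_percent(hits=1800, lines=2101)
head = coverage_percent(hits=2259, lines=2627)

print(f"base {base:.2f}%  head {head:.2f}%  delta {head - base:+.3f}")
print("misses:", 2101 - 1800, "->", 2627 - 2259)
```

The recomputed delta is roughly 0.318 percentage points, which the report displays as +0.31%.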

Files	Coverage Δ
src/airflow/dags/common_airflow.py	`90.38% <100.00%> (ø)`
src/airflow/dags/dag_preprocess.py	`100.00% <ø> (ø)`
src/airflow/dags/finngen_preprocess.py	`100.00% <100.00%> (ø)`
src/airflow/dags/gwas_curation_update.py	`100.00% <100.00%> (ø)`
src/gentropy/__init__.py	`100.00% <ø> (ø)`
src/gentropy/assets/__init__.py	`100.00% <ø> (ø)`
src/gentropy/assets/data/__init__.py	`100.00% <ø> (ø)`
src/gentropy/assets/schemas/__init__.py	`100.00% <ø> (ø)`
src/gentropy/cli.py	`91.66% <100.00%> (ø)`
src/gentropy/common/Liftover.py	`80.64% <ø> (ø)`
... and 81 more

... and 10 files with indirect coverage changes

ireneisdoomed added 19 commits January 17, 2024 13:19

fix(dag): remove ot_gwas_catalog ot_study_locus_overlap from etl dag steps

f981986

fix(dag): remove ot_gwas_catalog ot_study_locus_overlap from etl dag steps

f5365c4

fix: include gene_index as a release output

15880e5

feat(l2g): split step into train and predict

e4b2781

chore(dag): add colocalisation step

b0cadd8

feat(l2g): split step into train and predict

051e437

chore: rename ot_v2g config to unabbreviated name

ffa3be9

chore(colocalisation): remove coloc parameters from config

52aefc2

fix(colocalisation): update credible set path in config

a44a311

fix(l2g): remove overlaps from config

a62f837

chore(dag): remove ukbiobank and eqtl from preprocess

e0dbdf5

fix(l2g): increase driver and executors memory

4be6241

fix(l2g): make training dependencies optional

d038ec5

fix(l2g): drop studyType before creating gwas_study_locus

3db981f

fix(l2g): convert features_list from config to list

e679b12

fix(l2g): include mean vep features in feature_list

f50e2d2

chore: change default driver node to n1-highmem-16

72e9504

revert(l2g): revert 4eafaf2

59dc894

Merge branch 'dev' of https://github.com/opentargets/genetics_etl_python into il-generate-2401

9438733

DSuveges reviewed Jan 18, 2024


ireneisdoomed added 6 commits January 18, 2024 13:06

chore: fetch etl inputs from new data structure

a6f60bd

feat(study_locus): add and test filter_by_study_type

cec5737

feat(study_locus): add and test filter_by_study_type

9b707d1

feat(l2g): limit l2g predictions to gwas-derived associations

beb6d61

fix: typo in test_filter_by_study_type

ed95815

Merge branch 'dev' of https://github.com/opentargets/genetics_etl_python into il-generate-2401

08bf447

ireneisdoomed marked this pull request as ready for review January 18, 2024 14:28

d0choa approved these changes Jan 18, 2024


d0choa merged commit 84f794d into dev Jan 18, 2024
3 checks passed

ireneisdoomed deleted the il-generate-2401 branch January 18, 2024 17:09
