Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

chore: improvements to generate 2401 data release #436

Merged
merged 25 commits into from
Jan 18, 2024
Merged

Conversation

ireneisdoomed
Copy link
Contributor

@ireneisdoomed ireneisdoomed commented Jan 18, 2024

This PR includes changes in the ETL process to generate all outputs for the data release. It's easier to track changes looking at the commit history, but overall:

  • Changes to L2G:
    • Separation between train and test steps in the DAG
    • Inclusion of vepMeanNeighborhood and vepMean in the features list
    • Revert of 4eafaf2: this introduced a bug in the generation of coloc features. I have changed the approach.
    • Added specific Spark configuration to allocate driver and executor memory
    • Other bugfixes
  • Inclusion of the colocalisation step in the DAG
  • Addition of gene_index to the data release bucket. I need to write this dataset to extract V2G. Conceptually it's very similar to target_index but it's just 3Mb.
  • Infrastructure changes: n1-highmem-8 was not enough to run the L2G steps, because most of the operations happen in the node. create_cluster's default machine is now n1-highmem-16.

--- Added later:

  • Added interaction dataset to the data release bucket. Similar case as target_index. This is an input of L2G to train the model.
  • Rearranged the inputs for V2G and L2G so that they all point to files in static_assets or in the release bucket

@@ -64,4 +63,5 @@ colocalisation: ${datasets.release_folder}/colocalisation
study_index: ${datasets.release_folder}/study_index
variant_index: ${datasets.release_folder}/variant_index
credible_set: ${datasets.release_folder}/credible_set
gene_index: ${datasets.release_folder}/gene_index
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still don't understand why the target dataset is in the output. The ETL does nothing with genes: we are not aggregating information for genes, we are not enriching the gene dataset (as far as I know). All gene related information is either in the v2g or variant annotation dataset. Again, if there would be an actual product downstream that would require gene metadata I would understand the placement.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we are all thinking the same from slightly different angles. No right or wrong. Whatever we do now is likely to change again

Copy link
Contributor Author

@ireneisdoomed ireneisdoomed Jan 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's simply because the gene index is a dependency of V2G.
If it's a big deal, I can delete it after V2G is done, we can delete the gene index step and generate it OTF in the V2G step.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think intervals are also input v2g, do we want to share that dataset as well?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No. Intervals are a static asset, target is not.

@ireneisdoomed ireneisdoomed marked this pull request as ready for review January 18, 2024 14:28
@codecov-commenter
Copy link

Codecov Report

Attention: 178 lines in your changes are missing coverage. Please review.

Comparison is base (42b366c) 85.67% compared to head (08bf447) 85.99%.
Report is 86 commits behind head on dev.

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##              dev     #436      +/-   ##
==========================================
+ Coverage   85.67%   85.99%   +0.31%     
==========================================
  Files          89       96       +7     
  Lines        2101     2627     +526     
==========================================
+ Hits         1800     2259     +459     
- Misses        301      368      +67     
Files Coverage Δ
src/airflow/dags/common_airflow.py 90.38% <100.00%> (ø)
src/airflow/dags/dag_preprocess.py 100.00% <ø> (ø)
src/airflow/dags/finngen_preprocess.py 100.00% <100.00%> (ø)
src/airflow/dags/gwas_curation_update.py 100.00% <100.00%> (ø)
src/gentropy/__init__.py 100.00% <ø> (ø)
src/gentropy/assets/__init__.py 100.00% <ø> (ø)
src/gentropy/assets/data/__init__.py 100.00% <ø> (ø)
src/gentropy/assets/schemas/__init__.py 100.00% <ø> (ø)
src/gentropy/cli.py 91.66% <100.00%> (ø)
src/gentropy/common/Liftover.py 80.64% <ø> (ø)
... and 81 more

... and 10 files with indirect coverage changes

@d0choa d0choa merged commit 84f794d into dev Jan 18, 2024
3 checks passed
@ireneisdoomed ireneisdoomed deleted the il-generate-2401 branch January 18, 2024 17:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants