chore: improvements to generate 2401 data release #436
@@ -64,4 +63,5 @@ colocalisation: ${datasets.release_folder}/colocalisation
study_index: ${datasets.release_folder}/study_index
variant_index: ${datasets.release_folder}/variant_index
credible_set: ${datasets.release_folder}/credible_set
gene_index: ${datasets.release_folder}/gene_index
I still don't understand why the target dataset is in the output. The ETL does nothing with genes: we are not aggregating information for genes, and we are not enriching the gene dataset (as far as I know). All gene-related information is in either the V2G or the variant annotation dataset. Again, if there were an actual downstream product that required gene metadata, I would understand the placement.
I think we are all thinking the same thing from slightly different angles. No right or wrong. Whatever we do now is likely to change again.
It's simply because the gene index is a dependency of V2G.
If it's a big deal, I can delete it after V2G is done, or we can drop the gene index step and generate it on the fly in the V2G step.
I think intervals are also an input to V2G. Do we want to share that dataset as well?
No. Intervals are a static asset, target is not.
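For context on the hunk under discussion: the `${datasets.release_folder}` references are config interpolations, so every dataset path is derived from one release folder, and adding a `gene_index` key there is what places it in the release output. Below is a minimal stdlib sketch of how such references resolve. The `resolve` and `lookup` helpers and the bucket path are hypothetical, and the project's actual config library may behave differently.

```python
import re

# Hypothetical config mirroring the keys in the hunk; the bucket path is made up.
config = {
    "datasets": {
        "release_folder": "gs://example-bucket/release",
        "credible_set": "${datasets.release_folder}/credible_set",
        "gene_index": "${datasets.release_folder}/gene_index",
    }
}

# Matches ${some.dotted.key} references inside a string value.
_REF = re.compile(r"\$\{([^}]+)\}")


def lookup(cfg: dict, dotted_key: str):
    """Walk a dotted key such as 'datasets.release_folder' through nested dicts."""
    node = cfg
    for part in dotted_key.split("."):
        node = node[part]
    return node


def resolve(cfg: dict, value: str) -> str:
    """Replace every ${a.b} reference in value with the (resolved) value at a.b."""
    return _REF.sub(lambda m: resolve(cfg, lookup(cfg, m.group(1))), value)


gene_index_path = resolve(config, config["datasets"]["gene_index"])
print(gene_index_path)
```

Under this layout, moving a dataset between the release output and a static location only requires changing which folder key its path references.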
Codecov Report
Additional details and impacted files

@@            Coverage Diff             @@
##              dev     #436      +/-   ##
==========================================
+ Coverage   85.67%   85.99%   +0.31%
==========================================
  Files          89       96       +7
  Lines        2101     2627     +526
==========================================
+ Hits         1800     2259     +459
- Misses        301      368      +67
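The coverage percentages in the Codecov table follow directly from the hit and line counts. As a quick check, the sketch below recomputes them from the figures in the table; the `coverage_pct` helper is illustrative and not part of Codecov or the project.

```python
def coverage_pct(hits: int, lines: int) -> float:
    """Line coverage as a percentage: covered lines over total tracked lines."""
    return 100.0 * hits / lines


# Figures taken from the Codecov table above.
base = coverage_pct(hits=1800, lines=2101)  # dev
head = coverage_pct(hits=2259, lines=2627)  # this PR
delta = head - base

print(f"base={base:.2f}% head={head:.2f}% delta={delta:+.3f}")

# Net coverage of the additional tracked lines (+459 hits over +526 lines).
net_new = coverage_pct(hits=2259 - 1800, lines=2627 - 2101)
print(f"net coverage of added lines: {net_new:.2f}%")
```

The unrounded difference is about 0.318 percentage points, which the report displays as +0.31. The added lines are covered at a slightly higher rate than the existing code, which is why overall coverage rises even though the number of misses grows by 67.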
This PR changes the ETL process to generate all outputs for the data release. The changes are easier to follow in the commit history, but overall:
- `vepMeanNeighborhood` and `vepMean` in the features list
- `gene_index` to the data release bucket. I need to write this dataset to extract V2G. Conceptually it's very similar to `target_index`, but it's just 3 MB.
- `n1-highmem-8` was not enough to run the L2G steps, because most of the operations happen in the node. `create_cluster`'s default machine is now `n1-highmem-16`.

--- Added later:
- `interaction` dataset to the data release bucket. Similar case as `target_index`. This is an input of L2G to train the model.
- `static_assets` or in the release bucket
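To illustrate the machine-type change described above, here is a rough sketch of a default that can still be overridden per call. `create_cluster_sketch` and `ClusterRequest` are hypothetical stand-ins, since the real `create_cluster` signature is not shown in this PR description; the vCPU and memory figures are the published specs for the two N1 high-memory types.

```python
from dataclasses import dataclass

# Published specs for the two N1 high-memory machine types named above.
MACHINE_SPECS = {
    "n1-highmem-8": {"vcpus": 8, "memory_gb": 52},
    "n1-highmem-16": {"vcpus": 16, "memory_gb": 104},
}


@dataclass(frozen=True)
class ClusterRequest:
    """Hypothetical stand-in for whatever the project's create_cluster builds."""

    cluster_name: str
    machine_type: str


def create_cluster_sketch(
    cluster_name: str, machine_type: str = "n1-highmem-16"
) -> ClusterRequest:
    """Build a cluster request, defaulting to the larger machine type."""
    if machine_type not in MACHINE_SPECS:
        raise ValueError(f"unknown machine type: {machine_type}")
    return ClusterRequest(cluster_name=cluster_name, machine_type=machine_type)


# The default now gives node-heavy steps such as L2G twice the memory.
default = create_cluster_sketch("release-cluster")
# Lighter steps can still opt into the smaller machine explicitly.
small = create_cluster_sketch("adhoc-cluster", machine_type="n1-highmem-8")

print(default.machine_type, MACHINE_SPECS[default.machine_type]["memory_gb"], "GB")
print(small.machine_type, MACHINE_SPECS[small.machine_type]["memory_gb"], "GB")
```

Changing the default rather than each call site keeps existing callers unchanged, at the price of larger machines for steps that do not need them unless they override it.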