Rightsize import batches [VS-486] #7925
Conversation
else if (num_samples <= 100) then 1
else if (num_samples <= 5000) then 5
else if (num_samples <= 20000) then 10
else 20
Perhaps out of scope for this PR, but two suggestions:
- maybe compute this as a rounded fraction of the maximum number of Cromwell VMs we feel comfortable running, rather than a step function (see the sketch below)
- do we want to throw an error, or maybe just a warning, if someone tries to run a callset with > 20k samples without explicitly setting load_data_batch_size?
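A minimal sketch of the fraction-based alternative from the first bullet, in the same WDL style as the PR; max_concurrent_vms and fraction_based_batch_size are hypothetical names standing in for whatever VM budget the team settles on, and the ceiling is done with integer arithmetic since WDL Int division truncates:

# Sketch only: max_concurrent_vms is a hypothetical knob, not part of this PR.
Int max_concurrent_vms = 1000
# Round the fraction up so the number of batches never exceeds the VM budget.
Int fraction_based_batch_size = if (num_samples <= max_concurrent_vms) then 1
                                else (num_samples + max_concurrent_vms - 1) / max_concurrent_vms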
(bwahaha this is why I want to scale test with exomes)
Codecov Report
@@            Coverage Diff             @@
##           ah_var_store    #7925   +/- ##
================================================
  Coverage          ?       86.288%
  Complexity        ?       35189
================================================
  Files             ?       2170
  Lines             ?       164888
  Branches          ?       17786
================================================
  Hits              ?       142278
  Misses            ?       16288
  Partials          ?       6322
Force-pushed from 4cc3c7d to 9636547.
Successful run of GvsJointVariantCalling for this branch. Note the 10 import shards with 1 sample each that were created by default for this sample set of size 10.
Int max_auto_batch_size = 20000

if ((num_samples > max_auto_batch_size) && !(defined(load_data_batch_size))) {
    call Utils.TerminateWorkflow as DieDueToTooManySamplesWithoutExplicitLoadDataBatchSize {
love this
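Since the diff above is truncated mid-call, here is a sketch of how that call body might be completed; the message input name is an assumption and should be checked against Utils.wdl:

    call Utils.TerminateWorkflow as DieDueToTooManySamplesWithoutExplicitLoadDataBatchSize {
        input:
            # Assumed input name; confirm against Utils.wdl.
            message = "Importing more than " + max_auto_batch_size + " samples requires setting load_data_batch_size explicitly."
    }
}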
# At least 1; per the limits above, not more than 20.
Int effective_load_data_batch_size = if (defined(load_data_batch_size)) then select_first([load_data_batch_size])
                                     else if num_samples < 1000 then 1
                                     else num_samples / 1000
do we need to floor / ceil or round this in some way? Or does the Int type do that for us? Could it get us into trouble?
This logic should be flooring for us.
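A small illustration of that flooring behavior, assuming standard WDL semantics where Int / Int truncates and the stdlib floor() is available; the names here are hypothetical:

Int example_num_samples = 2500                                      # hypothetical callset size
Int batch_via_int_division = example_num_samples / 1000             # Int / Int truncates to 2
Int batch_via_explicit_floor = floor(example_num_samples / 1000.0)  # same value, intent spelled out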
Int effective_load_data_maxretries = if (defined(load_data_maxretries_override)) then select_first([load_data_maxretries_override])
                                     else if (effective_load_data_batch_size < 12) then 3
                                     else effective_load_data_batch_size / 4
nit: if you are already changing things, I would appreciate a little more prose on why this is being done this way
hmm ok, I'll try to come up with something... 😅
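One possible shape for that prose, written as comments over the existing expression; the stated rationale (retry budget scaling with batch size) is my reading of the code rather than anything asserted in this PR:

# Honor an explicit load_data_maxretries_override if one was provided. Otherwise scale
# the retry budget with batch size: batches of fewer than 12 samples get a baseline of
# 3 retries, larger batches get roughly one retry per 4 samples (integer division),
# e.g. the maximum auto-selected batch size of 20 yields 5 retries.
Int effective_load_data_maxretries = if (defined(load_data_maxretries_override)) then select_first([load_data_maxretries_override])
                                     else if (effective_load_data_batch_size < 12) then 3
                                     else effective_load_data_batch_size / 4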
Force-pushed from 9636547 to 33ab249.
Automatically choose appropriate import batch size, preemptible, and max-retry values for sample sets of up to 20K samples. This makes 20K samples the effective threshold at which we feel comfortable with the current stability and performance of our import code.
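As a worked example of the defaults these expressions produce (arithmetic only, taken from the snippets above):

num_samples = 10     -> batch size 1 (the 10 single-sample shards seen in the test run above), maxretries 3
num_samples = 5000   -> batch size 5000 / 1000 = 5, maxretries 3
num_samples = 20000  -> batch size 20000 / 1000 = 20, maxretries 20 / 4 = 5
num_samples > 20000  -> the workflow terminates unless load_data_batch_size is set explicitly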