
Tweak extract values based on Echo Runs [VS-1432] #8979

Merged: 3 commits merged into ah_var_store on Sep 13, 2024
Conversation

@rsasch commented Sep 13, 2024

No description provided.
@@ -172,6 +172,7 @@ You can take advantage of our existing sub-cohort WDL, `GvsExtractCohortFromSamp
- Specify the same `call_set_identifier`, `dataset_name`, `project_id`, `extract_table_prefix`, and `interval_list` that were used in the `GvsPrepareRangesCallset` run documented above.
- Specify the `interval_weights_bed` appropriate for the PGEN extraction run you are performing. `gs://gvs_quickstart_storage/weights/gvs_full_vet_weights_1kb_padded_orig.bed` is the interval weights BED used for Quickstart.
- Select the workflow option "Retry with more memory" and choose a "Memory retry factor" of 1.5
- Set the `extract_maxretries_override` input to 5, `split_intervals_disk_size_override` to 1000, `scatter_count` to 25000, and `y_bed_weight_scaling` to 8 to start; you will likely have to adjust one or more of these values in subsequent attempts.
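For concreteness, here is a hypothetical Cromwell-style inputs sketch applying the starting values described above. The workflow namespace (`GvsExtractCohortFromSampleNames`, taken from the surrounding doc text) and the `your-...` placeholder values are assumptions for illustration only, not verified against the WDL; only the override values and the weights BED path come from the text.

```json
{
  "GvsExtractCohortFromSampleNames.call_set_identifier": "your_callset",
  "GvsExtractCohortFromSampleNames.dataset_name": "your_dataset",
  "GvsExtractCohortFromSampleNames.project_id": "your-gcp-project",
  "GvsExtractCohortFromSampleNames.extract_table_prefix": "your_callset",
  "GvsExtractCohortFromSampleNames.interval_list": "gs://your-bucket/your.interval_list",
  "GvsExtractCohortFromSampleNames.interval_weights_bed": "gs://gvs_quickstart_storage/weights/gvs_full_vet_weights_1kb_padded_orig.bed",
  "GvsExtractCohortFromSampleNames.extract_maxretries_override": 5,
  "GvsExtractCohortFromSampleNames.split_intervals_disk_size_override": 1000,
  "GvsExtractCohortFromSampleNames.scatter_count": 25000,
  "GvsExtractCohortFromSampleNames.y_bed_weight_scaling": 8
}
```

Note that "Retry with more memory" and the 1.5 "Memory retry factor" are workflow launch options (e.g. in Terra), not workflow inputs, so they do not appear in an inputs JSON like the one above.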
Contributor commented:
this all looks good--especially since you are using real-world experience from Echo.
Do we have any record of the thought process behind this? I know at one point there was a doc?
Anyway, LGTM, but why these particular numbers are ideal still eludes me.

Collaborator replied:
Off the top of my head:

  • `split_intervals_disk_size_override` to 1000 - because otherwise we run out of disk space 😭
  • `scatter_count` - this one I think may vary depending on the interval list? Pretty sure the default 34K is almost guaranteed to give us Cromwell problems (and maybe that should be changed)
  • `y_bed_weight_scaling` - we were doing both X and Y at a scale factor of 4 and the Y shards were still the laggards. We might even go higher than 8, but we didn't actually try that during Echo.

@koncheto-broad left a comment

LGTM. Seems superficially odd that we ultimately want to reduce the disk size for pgen on larger callsets, but I trust the results of your analysis

@rsasch (Author) commented Sep 13, 2024

> LGTM. Seems superficially odd that we ultimately want to reduce the disk size for pgen on larger callsets, but I trust the results of your analysis

Disk size is less of a concern because (at least with GCP historically) going too low on disk size risks much slower I/O without commensurate savings.

@koncheto-broad commented Sep 13, 2024
> LGTM. Seems superficially odd that we ultimately want to reduce the disk size for pgen on larger callsets, but I trust the results of your analysis

> Disk size is less of a concern because (at least with GCP historically) going too low on disk size risks much slower I/O without commensurate savings.

Yes, that's what I thought as well. That's why it seemed a little odd that this PR includes a change that lowers the default disk size on the VMs, when not otherwise specified, from 500 to 200. I always thought going too low was the worry, and that disk size in general isn't much of a concern. But this PR appears to lower the default disk size by 60%, unless I am reading that completely wrong.

@rsasch (Author) commented Sep 13, 2024

> > LGTM. Seems superficially odd that we ultimately want to reduce the disk size for pgen on larger callsets, but I trust the results of your analysis
>
> > Disk size is less of a concern because (at least with GCP historically) going too low on disk size risks much slower I/O without commensurate savings.
>
> Yes, that's what I thought as well. That's why it seemed a little odd that this PR includes a change that lowers the default disk size on the VMs, when not otherwise specified, from 500 to 200. I always thought going too low was the worry, and that disk size in general isn't much of a concern. But this PR appears to lower the default disk size by 60%, unless I am reading that completely wrong.

The logs showed a maximum of 7% disk space used, so it seemed fair to reduce it somewhat. But your comment reminded me that I also meant to adjust the `effective_extract_memory_gib` calculation, so I will do that now.

@mcovarr (Collaborator) commented Sep 13, 2024

No VCF extract changes?

@rsasch (Author) commented Sep 13, 2024

> No VCF extract changes?

It looked like running it with the 1.5 "retry with more memory" factor did the trick, unless you have other info?

@rsasch merged commit cab8d52 into ah_var_store on Sep 13, 2024 (20 of 21 checks passed).
@rsasch deleted the rsa_vs_1432 branch on September 13, 2024 at 18:06.