For loading VCF data into BigQuery, Variant Transforms uses Cloud Dataflow. Dataflow now provides a flag that can be passed to bring down the cost of a Dataflow job, and it has been demonstrated to work well for Variant Transforms.
Details about Flexible Resource Scheduling (FlexRS) can be found here:
https://cloud.google.com/dataflow/docs/guides/flexrs
Note that users will likely want to read through the doc carefully to understand how FlexRS works, whether they'll need to update any quotas, and what to expect overall (including the likelihood that job start will be delayed).
At a high level:

> FlexRS reduces batch processing costs by using advanced scheduling techniques, the Dataflow Shuffle service, and currently a combination of preemptible virtual machine (VM) instances and regular VMs.
and it can be used with Variant Transforms by updating your COMMAND to pass the FlexRS goal through to Dataflow (a before/after sketch follows).
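A minimal sketch of the change, assuming a typical `vcf_to_bq` invocation: the input/output arguments below are placeholders for illustration, and the only part that matters here is the added `--flexrs_goal COST_OPTIMIZED` flag (FlexRS is only available in regions that support it, per the doc linked above).

```bash
# Before: a typical vcf_to_bq COMMAND (bucket/table names are placeholders).
COMMAND="vcf_to_bq \
  --input_pattern gs://YOUR_BUCKET/*.vcf \
  --output_table YOUR_PROJECT:YOUR_DATASET.YOUR_TABLE \
  --temp_location gs://YOUR_BUCKET/temp \
  --job_name vcf-to-bigquery \
  --runner DataflowRunner"

# After: the same COMMAND with FlexRS requested.
COMMAND="vcf_to_bq \
  --input_pattern gs://YOUR_BUCKET/*.vcf \
  --output_table YOUR_PROJECT:YOUR_DATASET.YOUR_TABLE \
  --temp_location gs://YOUR_BUCKET/temp \
  --job_name vcf-to-bigquery \
  --runner DataflowRunner \
  --flexrs_goal COST_OPTIMIZED"
```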
We used this successfully for loading variants for over 9,000 WGS samples that were joint genotyped. For these particular tests, the cost dropped by about half. A few important things to note:
- We used `n1-standard-2` workers instead of `n1-highmem-16` (though runtime was longer).
- We did not set a `--disk_size_gb` value (Dataflow takes care of disk allocation automatically with FlexRS). Both of these are reflected in the sketch after this list.
- The fix for #658 wasn't strictly necessary, but the use of `COST_OPTIMIZED` introduces a delay to starting the (unnecessary) `merge_headers` Dataflow job. On a few occasions, this added a couple of hours to the overall time to run `vcf_to_bq`.
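To make the first two notes concrete, here is a hedged sketch of the kind of COMMAND used for these FlexRS runs, assuming the standard `--worker_machine_type` option; the bucket/table arguments are again placeholders, and `--disk_size_gb` is deliberately omitted:

```bash
# FlexRS run: smaller workers, and no explicit --disk_size_gb
# (Dataflow manages disk allocation itself under FlexRS).
COMMAND="vcf_to_bq \
  --input_pattern gs://YOUR_BUCKET/*.vcf \
  --output_table YOUR_PROJECT:YOUR_DATASET.YOUR_TABLE \
  --temp_location gs://YOUR_BUCKET/temp \
  --runner DataflowRunner \
  --flexrs_goal COST_OPTIMIZED \
  --worker_machine_type n1-standard-2"
```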