This is an explanation of how to launch DeepVariant-on-Spark in Google Cloud.
Google Cloud Dataproc (Cloud Dataproc) is a cloud-based managed Spark and Hadoop service offered on Google Cloud Platform.
To access Dataproc, please install gsutil first. See the Google Cloud
documentation for the installation guide.
For password-less deployment, an SSH key is required. Please refer to
this link for instructions on acquiring your SSH key.
Then launch a Dataproc cluster with the following command:
gcloud beta dataproc clusters create my-dos \
--subnet default --zone us-west1-b \
--num-workers 2 --worker-machine-type n1-highmem-16 \
--image-version 1.2.59-deb9 \
--initialization-actions gs://seqslab-deepvariant/scripts/ \
--initialization-action-timeout 20m
When you no longer need the cluster, delete it with the following command:
gcloud beta dataproc clusters delete my-dos
DeepVariant-on-Spark leverages Ansible to deploy SeqPiper and related
packages to the Dataproc cluster. For password-less deployment, your SSH
key (i.e. ~/.ssh/google_compute_engine) should first be added with
ssh-add. Once the cluster has launched completely, you can log in to the
terminal of the master node via Google Cloud Platform or with the
following commands:
ssh-add -K ~/.ssh/google_compute_engine
gcloud compute ssh --ssh-flag="-A" my-dos-m --zone="us-west1-b"
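Whether the cluster has "launched completely" can also be checked from the command line before logging in. The following is a minimal sketch, not part of DeepVariant-on-Spark: the function names `cluster_state` and `wait_for_cluster` and the `GCLOUD` override variable are hypothetical, and recent gcloud releases may additionally require a `--region` flag on the `describe` call.

```shell
GCLOUD="${GCLOUD:-gcloud}"   # assumption: overridable so the functions can be exercised with a stub

# Print the state Dataproc reports for a cluster (for example RUNNING or CREATING).
cluster_state() {
  "$GCLOUD" dataproc clusters describe "$1" --format='value(status.state)'
}

# Poll until the cluster is RUNNING; give up after the given number of attempts.
# Arguments: cluster name, attempts (default 30), delay in seconds (default 10).
wait_for_cluster() {
  name="$1"; attempts="${2:-30}"; delay="${3:-10}"
  while [ "$attempts" -gt 0 ]; do
    [ "$(cluster_state "$name")" = "RUNNING" ] && return 0
    attempts=$((attempts - 1))
    sleep "$delay"
  done
  return 1
}
```

For example, `wait_for_cluster my-dos && gcloud compute ssh --ssh-flag="-A" my-dos-m --zone="us-west1-b"` would only attempt the login once the cluster reports RUNNING.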
Note: if ssh-add fails with an error message like "Error connecting to
agent: No such file or directory", run the following command first.
ssh-agent bash
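As an alternative to opening a new shell, an agent can be started inside the current one only when none is reachable. This is a hedged sketch: `agent_available` and `ensure_agent` are hypothetical helper names, and the check assumes the error above comes from a missing or stale `SSH_AUTH_SOCK` socket.

```shell
# Succeed only if SSH_AUTH_SOCK is set and points at an existing socket.
agent_available() {
  [ -n "${SSH_AUTH_SOCK:-}" ] && [ -S "$SSH_AUTH_SOCK" ]
}

# Start an agent in the current shell when none is reachable.
# `ssh-agent -s` prints Bourne-shell commands that export SSH_AUTH_SOCK
# and SSH_AGENT_PID, so its output is evaluated rather than just run.
ensure_agent() {
  agent_available || eval "$(ssh-agent -s)" >/dev/null
}
```

Calling `ensure_agent` before `ssh-add` keeps the environment of the current shell, whereas `ssh-agent bash` starts a nested shell that has to be exited later.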
DeepVariant-on-Spark leverages SeqPiper, a wrapper around Spark Pipe(),
to run DeepVariant in Spark. Please clone the DeepVariant-on-Spark GitHub
repo and use Ansible, an IT automation tool, to install SeqPiper and
related packages with the following commands:
git clone
cd deepvariant-on-spark/ansible
ansible-playbook -i hosts prepare_env.yml
Ansible then installs DeepVariant-on-Spark automatically. Deploying all the necessary packages to the entire cluster takes 10 minutes or more. If the deployment succeeds, every host reports failed=0 and you will see a log like:
... (skipped) ...
PLAY RECAP *********************************************************************
my-dos-m : ok=21 changed=13 unreachable=0 failed=0
my-dos-w-0 : ok=26 changed=22 unreachable=0 failed=0
my-dos-w-1 : ok=26 changed=22 unreachable=0 failed=0
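The recap above can be checked mechanically when the playbook output has been saved to a file. The sketch below is illustrative only: `recap_ok` is a hypothetical helper, and it assumes the recap lines carry `unreachable=` and `failed=` counters in the format shown above.

```shell
# Read ansible-playbook output on stdin; succeed only if the PLAY RECAP
# lists at least one host and every host has unreachable=0 and failed=0.
recap_ok() {
  awk '
    /^PLAY RECAP/ { in_recap = 1; next }
    in_recap && /failed=/ {
      hosts++
      if ($0 !~ /unreachable=0( |$)/ || $0 !~ /failed=0( |$)/) bad++
    }
    END { exit ((hosts > 0 && bad == 0) ? 0 : 1) }
  '
}
```

For example, after `ansible-playbook -i hosts prepare_env.yml | tee deploy.log` (the file name `deploy.log` is an arbitrary choice), running `recap_ok < deploy.log` returns a non-zero status if any host failed or was unreachable.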
Please refer to the Cluster Operation Portal to monitor the health status of YARN and HDFS.
To verify the deployment, run the whole pipeline on a small sample with the following command:
bash ./deepvariant-on-spark/scripts/ gs://deepvariant/case-study-testdata/NA12878_sliced.bam 19 GRCH output
If successful, you will see a log like the following:
19/01/17 07:01:20 INFO GHFS version: 1.6.10-hadoop2
19/01/17 07:01:21 INFO org.apache.hadoop.yarn.client.RMProxy: Connecting to ResourceManager at my-dos-m/
... (skipped) ...
19/01/17 07:10:32 INFO GHFS version: 1.6.10-hadoop2
19/01/17 07:10:33 INFO org.apache.hadoop.yarn.client.RMProxy: Connecting to ResourceManager at my-dos-m/
19/01/17 07:10:35 INFO org.apache.hadoop.yarn.client.api.impl.YarnClientImpl: Submitted application application_1547707864423_0009
[INFO] postprocess_variants completed: 00:02:05
transform_data 00:01:01
select_bam 00:01:00
make_examples 00:05:13
call_variants 00:01:59
postprocess_variants 00:02:05
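The per-stage timings at the end of the log can be summed to get the total pipeline time. This is a small sketch with a hypothetical helper name, `total_stage_time`; it assumes the summary lines have exactly two fields, a stage name and an HH:MM:SS duration, as in the log above.

```shell
# Sum lines of the form "<stage> HH:MM:SS" read from stdin and print HH:MM:SS.
# Lines with more than two fields (such as timestamped INFO lines) are ignored.
total_stage_time() {
  awk '
    NF == 2 && $2 ~ /^[0-9]+:[0-9][0-9]:[0-9][0-9]$/ {
      split($2, t, ":")
      total += t[1] * 3600 + t[2] * 60 + t[3]
    }
    END {
      printf "%02d:%02d:%02d\n", int(total / 3600), int((total % 3600) / 60), total % 60
    }
  '
}
```

Fed the five stage lines shown above, it prints `00:11:18`, of which make_examples accounts for a little under half.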
You can also check the output files:
user@my-dos-m:~$ hadoop fs -du -h /output
5.6 M /output/alignment.bam
36.2 M /output/alignment.parquet
1.9 M /output/examples
13.3 K /output/variants
6.8 K /output/vcf
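The same check can be scripted with `hadoop fs -test -e`, which returns a non-zero status when a path does not exist. The sketch below is an assumption-laden example rather than part of the project: `check_hdfs_outputs` and the `HADOOP` override variable are hypothetical names, and the list of expected entries is taken from the listing above.

```shell
HADOOP="${HADOOP:-hadoop}"   # assumption: overridable so the function can be exercised with a stub

# Succeed only if every expected entry exists under the given HDFS directory.
check_hdfs_outputs() {
  dir="$1"; missing=0
  for name in alignment.bam alignment.parquet examples variants vcf; do
    if ! "$HADOOP" fs -test -e "$dir/$name"; then
      echo "missing: $dir/$name" >&2
      missing=1
    fi
  done
  [ "$missing" -eq 0 ]
}
```

On the master node, `check_hdfs_outputs /output` would report any entry that the pipeline did not produce.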
Congratulations! You are ready to run your first WGS sample.
NOTE: If any failure occurs, please refer to the troubleshooting section to find the root cause. If the problem persists, please submit an issue to our GitHub repo.
Please log in to one of the worker nodes and enter the following commands.
gsutil cp -R "${DATA_BUCKET}" .
mkdir -p "${OUTPUT_DIR}"
/usr/local/seqslab/deepvariant/bazel-bin/deepvariant/make_examples \
--mode calling \
--ref "${REF}" \
--reads "${BAM}" \
--regions "chr20:10,000,000-10,010,000" \
--examples "${OUTPUT_DIR}/examples.tfrecord.gz"
/usr/local/seqslab/deepvariant/bazel-bin/deepvariant/call_variants \
--outfile "${CALL_VARIANTS_OUTPUT}" \
--examples "${OUTPUT_DIR}/examples.tfrecord.gz" \
--execution_hardware="seqslab" \
--checkpoint "${MODEL}"
/usr/local/seqslab/deepvariant/bazel-bin/deepvariant/postprocess_variants \
--ref "${REF}" \
--infile "${CALL_VARIANTS_OUTPUT}" \
--outfile "${FINAL_OUTPUT_VCF}"
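The commands above rely on several shell variables that must be set beforehand. The following sketch shows one way to define them. Every value is a hypothetical placeholder to be replaced with your own bucket and file names, except the two output file names, which match the directory listing in this guide; `WORK_DIR`, `DATA_DIR`, and `require_vars` are names introduced here for illustration.

```shell
# Hypothetical placeholders: replace these values with your own locations.
WORK_DIR="${WORK_DIR:-$PWD}"
DATA_BUCKET="${DATA_BUCKET:-gs://YOUR_BUCKET/YOUR_TESTDATA_DIR}"   # assumption
DATA_DIR="${DATA_DIR:-${WORK_DIR}/YOUR_TESTDATA_DIR}"              # assumption
REF="${REF:-${DATA_DIR}/reference.fasta}"                          # assumption
BAM="${BAM:-${DATA_DIR}/reads.bam}"                                # assumption
MODEL="${MODEL:-${WORK_DIR}/model/model.ckpt}"                     # assumption
OUTPUT_DIR="${OUTPUT_DIR:-${WORK_DIR}/output}"

# Output file names as they appear in the listing of ${OUTPUT_DIR}.
CALL_VARIANTS_OUTPUT="${OUTPUT_DIR}/call_variants_output.tfrecord.gz"
FINAL_OUTPUT_VCF="${OUTPUT_DIR}/output.vcf.gz"

# Fail early if any variable the pipeline needs is empty.
require_vars() {
  for v in DATA_BUCKET OUTPUT_DIR REF BAM MODEL CALL_VARIANTS_OUTPUT FINAL_OUTPUT_VCF; do
    eval "val=\${$v:-}"
    [ -n "$val" ] || { echo "unset: $v" >&2; return 1; }
  done
}
```

Running `require_vars` before the first command catches an empty variable early, before make_examples is started with a missing argument.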
To evaluate the run, check ${OUTPUT_DIR} and verify the output files and their sizes.
user@my-dos-w-0:~$ ls -al ${OUTPUT_DIR}
-rw-r--r-- 1 user user 4132 Jan 4 06:57 call_variants_output.tfrecord.gz
-rw-r--r-- 1 user user 532000 Jan 4 06:56 examples.tfrecord.gz
-rw-r--r-- 1 user user 154742 Jan 4 06:56 examples.tfrecord.gz.run_info.pbtxt
-rw-r--r-- 1 user user 2207 Jan 4 06:57 output.vcf.gz
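Beyond eyeballing the listing, the files can be checked for presence, non-zero size, and gzip integrity. This is a minimal sketch under stated assumptions: `verify_outputs` is a hypothetical helper, the file names are those shown above, and the `.gz` outputs are assumed to be gzip-compatible streams that `gzip -t` can test.

```shell
# Succeed only if each expected file exists, is non-empty, and
# (for names ending in .gz) passes a gzip integrity test.
verify_outputs() {
  dir="$1"; status=0
  for name in examples.tfrecord.gz examples.tfrecord.gz.run_info.pbtxt \
              call_variants_output.tfrecord.gz output.vcf.gz; do
    f="$dir/$name"
    if [ ! -s "$f" ]; then
      echo "missing or empty: $f" >&2
      status=1
    elif [ "${name##*.}" = "gz" ] && ! gzip -t "$f" 2>/dev/null; then
      echo "corrupt gzip: $f" >&2
      status=1
    fi
  done
  return "$status"
}
```

For example, `verify_outputs "${OUTPUT_DIR}"` prints one line per problem file and returns a non-zero status if anything is missing, empty, or corrupt.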
Before launching a Dataproc cluster with GPU hardware, you should check whether you have enough quota for your use case. As your use of Google Cloud Platform expands over time, your quotas may increase accordingly. If you expect a notable upcoming increase in usage, you can proactively request quota adjustments from the Quotas page in the GCP Console.
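The remaining GPU quota can also be read from the command line. The sketch below is an illustration only: `remaining_quota` and the `GCLOUD` override variable are hypothetical names, the region `us-west1` is inferred from the zone used in this guide, and `NVIDIA_P100_GPUS` is assumed to be the quota metric that covers the P100 accelerator.

```shell
GCLOUD="${GCLOUD:-gcloud}"   # assumption: overridable so the function can be exercised with a stub

# Print remaining quota (limit minus usage) for one metric in one region.
# Returns non-zero if the metric is not listed for that region.
remaining_quota() {
  region="$1"; metric="$2"
  "$GCLOUD" compute regions describe "$region" \
    --flatten='quotas[]' \
    --format='value(quotas.metric,quotas.limit,quotas.usage)' |
    awk -v m="$metric" '$1 == m { print $2 - $3; found = 1 } END { exit (found ? 0 : 1) }'
}
```

For example, `remaining_quota us-west1 NVIDIA_P100_GPUS` should print at least the number of GPU workers you plan to launch.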
If you would like to test a GPU environment, you can launch a GPU cluster with the following command:
gcloud beta dataproc clusters create my-dos \
--subnet default --zone us-west1-b \
--worker-accelerator type=nvidia-tesla-p100,count=1 \
--image-version 1.2.59-deb9 \
--initialization-actions gs://seqslab-deepvariant/scripts/ \
--initialization-action-timeout 20m
call_variants is the only DeepVariant step that can benefit from a
GPU. Using the patched DeepVariant version, we can specify the amount of
GPU memory for each DeepVariant process, for example:
/usr/local/seqslab/deepvariant/bazel-bin/deepvariant/call_variants \
--outfile "${CALL_VARIANTS_OUTPUT}" \
--examples "${OUTPUT_DIR}/examples.tfrecord.gz" \
--execution_hardware="seqslab_gpu" \
--percentage_gpu_memory=16 \
--checkpoint "${MODEL}"
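The memory setting bounds how many call_variants processes can share one GPU. The sketch below works through the arithmetic with a hypothetical helper, `max_procs_per_gpu`. It assumes `--percentage_gpu_memory` is a percentage of a single GPU's memory and ignores any per-process overhead, so the real limit may be lower.

```shell
# Print how many processes fit on one GPU when each one reserves
# `pct` percent of its memory (integer division, rounding down).
max_procs_per_gpu() {
  pct="$1"
  if [ "$pct" -le 0 ] || [ "$pct" -gt 100 ]; then
    echo "percentage must be between 1 and 100" >&2
    return 1
  fi
  echo $((100 / pct))
}
```

With the value used above, `max_procs_per_gpu 16` prints `6`, so under these assumptions up to six processes could share each GPU.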