This is not an official Google product.
dsub is a command-line tool that makes it easy to submit and run batch scripts in the cloud. dsub uses Docker, which makes it easy to package up portable code that people can run anywhere Docker is supported.
The dsub user experience is modeled after traditional high-performance computing job schedulers like Grid Engine and Slurm. You write a script and then submit it to a job scheduler from a shell prompt on your local machine.
For now, dsub supports Google Cloud as the backend batch job runner. With help from the community, we'd like to add other backends, such as a local runner, Grid Engine, Slurm, Amazon Batch, and Azure Batch.
If others find dsub useful, our hope is to contribute dsub to an open-source foundation for use by the wider batch computing community.
- Create and activate a Python virtualenv (optional but strongly recommended).
  # (You can do this in a directory of your choosing.)
  virtualenv dsub_libs
  source dsub_libs/bin/activate
- Clone this repository.
  git clone https://github.com/googlegenomics/dsub
  cd dsub
- Install dsub (this will also install the dependencies).
  python setup.py install
- Set up Bash tab completion (optional).
  source bash_tab_complete
- Verify the installation by running:
  dsub --help
- (Optional) Install Docker.
  This is necessary only if you're going to create your own Docker images or use the local provider.
- Sign up for a Google Cloud Platform account and create a project.
- Install the Google Cloud SDK and run
  gcloud init
  This will set up your default project and grant credentials to the Google Cloud SDK. Now provide credentials so dsub can call Google APIs:
  gcloud auth application-default login
- Create a Google Cloud Storage bucket.
  The dsub logs and output files will be written to a bucket. Create a bucket using the storage browser or run the command-line utility gsutil, included in the Cloud SDK.
  gsutil mb gs://my-bucket
  Change my-bucket to a unique name that follows the bucket-naming conventions. (By default, the bucket will be in the US, but you can change or refine the location setting with the -l option, as shown in the example below.)
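  For example, to create the bucket in a specific location (the location value here is only an illustration):
  gsutil mb -l us-central1 gs://my-bucket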
Here's the simplest example:
dsub \
--project my-cloud-project \
--zones "us-central1-*" \
--logging gs://my-bucket/logging \
--command 'echo hello'
Change my-cloud-project to your Google Cloud project, and my-bucket to the bucket you created above.
After running dsub, the output will be a server-generated job id. The output of the script command will be written to the logging folder.
The following sections show how to run more complex jobs.
You can provide a shell command directly in the dsub command-line, as in the hello example above.
You can also save your script to a file, like hello.sh. Then you can run:
dsub \
--project my-cloud-project \
--zones "us-central1-*" \
--logging gs://my-bucket/logging \
--script hello.sh
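For reference, hello.sh can be as simple as the following sketch; any script your Docker image can execute will work:
#!/bin/bash
# Minimal example script run inside the container.
echo "hello from dsub"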
If your script has dependencies that are not stored in your Docker image, you can transfer them to the local disk. See the instructions below for working with input and output files and folders.
By default, dsub uses a stock Ubuntu image. You can change the image by passing the --image flag.
dsub \
--project my-cloud-project \
--zones "us-central1-*" \
--logging gs://my-bucket/logging \
--image ubuntu:16.04 \
--script hello.sh
You can pass environment variables to your script using the --env flag.
dsub \
--project my-cloud-project \
--zones "us-central1-*" \
--logging gs://my-bucket/logging \
--env MESSAGE=hello \
--command 'echo ${MESSAGE}'
The environment variable MESSAGE will be assigned the value hello when your Docker container runs.
Your script or command can reference the variable like any other Linux environment variable, as ${MESSAGE}.
Be sure to enclose your command string in single quotes and not double quotes. If you use double quotes, the command will be expanded in your local shell before being passed to dsub. For more information on using the --command flag, see Scripts, Commands, and Docker.
To set multiple environment variables, you can repeat the flag:
--env VAR1=value1 \
--env VAR2=value2
You can also set multiple variables, space-delimited, with a single flag:
--env VAR1=value1 VAR2=value2
dsub mimics the behavior of a shared file system using cloud storage bucket paths for input and output files and folders. You specify the cloud storage bucket path. Paths can be:
- file paths like gs://my-bucket/my-file
- folder paths like gs://my-bucket/my-folder
- wildcard paths like gs://my-bucket/my-folder/*
See the inputs and outputs documentation for more details.
If your script expects to read local input files that are not already contained within your Docker image, the files must be available in Google Cloud Storage.
If your script has dependent files, you can make them available to your script by:
- Building a private Docker image with the dependent files and publishing the image to a public site, or privately to Google Container Registry
- Uploading the files to Google Cloud Storage
To upload the files to Google Cloud Storage, you can use the storage browser or gsutil. You can also run on data that’s public or shared with your service account, an email address that you can find in the Google Cloud Console.
To specify input and output files, use the --input and --output flags:
dsub \
--project my-cloud-project \
--zones "us-central1-*" \
--logging gs://my-bucket/logging \
--input INPUT_FILE=gs://my-bucket/my-input-file \
--output OUTPUT_FILE=gs://my-bucket/my-output-file \
--command 'cat ${INPUT_FILE} > ${OUTPUT_FILE}'
The input file will be copied from gs://my-bucket/my-input-file to a local path given by the environment variable ${INPUT_FILE}. Inside your script, you can reference the local file path using the environment variable.
The output file will be written to local disk at the location given by ${OUTPUT_FILE}. Inside your script, you can reference the local file path using the environment variable. After the script completes, the output file will be copied to the bucket path gs://my-bucket/my-output-file.
To copy folders rather than files, use the --input-recursive or --output-recursive flags:
dsub \
--project my-cloud-project \
--zones "us-central1-*" \
--logging gs://my-bucket/logging \
--input-recursive FOLDER=gs://my-bucket/my-folder \
--command 'find ${FOLDER} -name "foo*"'
As a getting started convenience, if --input-recursive or --output-recursive are used, dsub will automatically check for and, if needed, install the Google Cloud SDK in the Docker container at runtime (before your script executes).
If you use the recursive copy features, install the Cloud SDK in your Docker image when you build it to avoid the installation at runtime.
If you use a Debian or Ubuntu Docker image, you are encouraged to use the apt-based package installation instructions.
If you use a Red Hat or CentOS Docker image, you are encouraged to use the yum-based package installation instructions.
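For a Debian or Ubuntu based image, the build-time installation might look roughly like the following sketch (it follows the Cloud SDK apt repository instructions; package names and keyring paths can change, so defer to the current official documentation):
# Install prerequisites, add the Cloud SDK apt repository and its key, then install the SDK.
apt-get update && apt-get install -y curl gnupg apt-transport-https ca-certificates
echo "deb [signed-by=/usr/share/keyrings/cloud.google.gpg] https://packages.cloud.google.com/apt cloud-sdk main" \
  > /etc/apt/sources.list.d/google-cloud-sdk.list
curl -fsSL https://packages.cloud.google.com/apt/doc/apt-key.gpg \
  | gpg --dearmor -o /usr/share/keyrings/cloud.google.gpg
apt-get update && apt-get install -y google-cloud-sdk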
By default, dsub launches a VM with a single CPU core, a default number of GB of memory (3.75 GB on Google Compute Engine), and a default disk size (200 GB).
To change the minimum RAM, use the --min-ram flag.
To change the minimum number of CPU cores, use the --min-cores flag.
To change the disk size, use the --disk-size flag.
Before you choose especially large or unusual values, be sure to check the available VM instance types and maximum disk size. On Google Cloud, the machine type will be selected from the best matching predefined machine types.
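For example, to request more resources for each task (a sketch; as noted above, RAM and disk sizes are expressed in GB):
dsub \
--project my-cloud-project \
--zones "us-central1-*" \
--logging gs://my-bucket/logging \
--min-ram 8 \
--min-cores 4 \
--disk-size 500 \
--script hello.sh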
Each of the examples above has demonstrated submitting a single task with a single set of variables, inputs, and outputs. If you have a batch of inputs and you want to run the same operation over them, dsub allows you to create a batch job.
Instead of calling dsub repeatedly, you can create a tab-separated values (TSV) file containing the variables, inputs, and outputs for each task, and then call dsub once. The result will be a single job-id with multiple tasks. The tasks will be scheduled and run independently, but can be monitored and deleted as a group.
The first line of the TSV file specifies the names and types of the parameters. For example:
--env SAMPLE_ID<tab>--input VCF_FILE<tab>--output OUTPUT_PATH
The first line also supports bare-word variables which are treated as the names of environment variables. This example is equivalent to the previous:
SAMPLE_ID<tab>--input VCF_FILE<tab>--output OUTPUT_PATH
Each additional line in the file should provide the variable, input, and output values for each task. Each line represents the values for a separate task.
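For example, a TSV file describing two tasks might look like this (the bucket paths are placeholders):
--env SAMPLE_ID<tab>--input VCF_FILE<tab>--output OUTPUT_PATH
sample1<tab>gs://my-bucket/my-vcfs/sample1.vcf<tab>gs://my-bucket/my-results/sample1.out
sample2<tab>gs://my-bucket/my-vcfs/sample2.vcf<tab>gs://my-bucket/my-results/sample2.out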
Pass the TSV file to dsub using the --tasks parameter. This parameter accepts both the file path and optionally a range of tasks to process.
For example, suppose my-tasks.tsv contains 101 lines: a one-line header and 100 lines of parameters for tasks to run. Then:
dsub ... --tasks ./my-tasks.tsv
will create a job with 100 tasks, while:
dsub ... --tasks ./my-tasks.tsv 1-10
will create a job with 10 tasks, one for each of lines 2 through 11.
The task range values can take any of the following forms:
- m indicates to submit task m (line m+1)
- m- indicates to submit all tasks starting with task m
- m-n indicates to submit all tasks from m to n (inclusive)
The --logging flag points to a location for dsub task log files. For details on how to specify your logging path, see Logging.
It's possible to wait for a job to complete before starting another; see job control with dsub.
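For example, a second submission can be held until the first completes (a sketch; it assumes the --after flag described in the job control documentation and that dsub prints the job-id to stdout):
# Capture the job-id of the first job, then block the second job on it.
JOB_ID=$(dsub \
--project my-cloud-project \
--zones "us-central1-*" \
--logging gs://my-bucket/logging \
--command 'echo step one')

dsub \
--project my-cloud-project \
--zones "us-central1-*" \
--logging gs://my-bucket/logging \
--after "${JOB_ID}" \
--command 'echo step two'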
You can add custom labels to jobs and tasks, which allows you to monitor and
cancel tasks using your own identifiers. In addition, with the google
provider, labeling a task will label associated compute resources such as
virtual machines and disks.
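For example, a label could be attached at submission time (a sketch; it assumes the --label flag, so check the documentation referenced below for the exact syntax):
dsub \
--project my-cloud-project \
--zones "us-central1-*" \
--logging gs://my-bucket/logging \
--label batch=test-batch-1 \
--command 'echo hello'
# The same label can later be used to filter dstat and ddel output.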
For more details, see Checking Status and Troubleshooting Jobs.
The dstat command displays the status of jobs:
dstat --project my-cloud-project
With no additional arguments, dstat will display a list of running jobs for the current USER.
To display the status of a specific job, use the --jobs flag:
dstat --project my-cloud-project --jobs job-id
For a batch job, the output will list all running tasks.
Each job submitted by dsub is given a set of metadata values that can be used for job identification and job control. The metadata associated with each job includes:
- job-name: defaults to the name of your script file or the first word of your script command; it can be explicitly set with the --name parameter.
- user-id: the USER environment variable value.
- job-id: takes the form job-name--userid--timestamp, where the job-name is truncated at 10 characters and the timestamp is of the form YYMMDD-HHMMSS-XX, unique to hundredths of a second.
- task-id: if the job is submitted with the --tasks parameter, each task gets a sequential value of the form "task-n", where n is 1-based.
Metadata can be used to cancel a job or individual tasks within a batch job.
The ddel command will delete running jobs.
By default, only jobs submitted by the current user will be deleted.
Use the --users flag to specify other users, or "*" for all users.
To delete a running job:
ddel --project my-cloud-project --jobs job-id
If the job is a batch job, all running tasks will be deleted.
To delete specific tasks:
ddel \
--project my-cloud-project \
--jobs job-id \
--tasks task-id1 task-id2
To delete all running jobs for the current user:
ddel --project my-cloud-project --jobs "*"
- See the examples:
- See more documentation for: