(Figure: flowchart illustrating how each workflow component and its tools fit together into the overall process.)
Dahak workflows are run from the command line using Snakemake, a Python package that provides capabilities similar to GNU make. Each workflow consists of a set of Snakemake rules, which Snakemake assembles into a task graph and runs. See Installing for instructions on how to install Snakemake.

Dahak workflows benefit directly from Snakemake's rich feature set. Snakemake has an extensive documentation page on executing Snakemake and its command line options, and other projects demonstrate ways of creating snakemake-profiles, which are platform-specific configuration profiles.
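For example, a set of command line options can be bundled into such a profile and applied with Snakemake's `--profile` flag (the profile name below is illustrative):

```
snakemake --profile my-hpc-profile <target>
```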
Generally, Snakemake is called by passing command line flags and the name of a target file or rule:

```
snakemake [FLAGS] <target>
```
Targets for each workflow are listed on the respective "Snakemake Rules" page for that workflow (see left side navigation menu).
There are two types of targets defined:

- Target Files: the user can ask Snakemake to generate a particular file, and Snakemake uses its dependency graph to dynamically determine which rules must run to produce the requested file.
- Build Rules: rules that do no work themselves but trigger all of the rules in a given workflow. (Build rules work by assembling target filenames and passing them to Snakemake.)
Users should use the build rules to trigger workflows.
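For example, either kind of target can be passed on the command line (the file and rule names below are illustrative, not actual Dahak targets):

```
# Request a specific output file; Snakemake infers which rules to run
snakemake [FLAGS] data/sample_trimmed.fq.gz

# Trigger a build rule, which expands to every target in the workflow
snakemake [FLAGS] read_filtering_workflow
```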
The build rules require workflow configuration details to be set using Snakemake's configuration dictionary. See the Snakemake Configuration page for details.
Each workflow has a set of build rules that trigger the rules for that workflow or a portion of it. The available build rules, and the information each requires (such as which read files to run the workflow on), are listed on the respective "Snakemake Rules" page for that workflow (see left side navigation menu).
The Quick Start covers some examples.
Workflow parameters are specified by passing a JSON configuration file to Snakemake. The default workflow parameter values are set in `default_workflowparams.settings`. Any of these values can be overridden using a custom JSON file, as described above and on the Workflow Configuration page.

For example, to override the default version of trimmomatic (0.36) and use 0.38 instead, the following JSON sets the biocontainer version tag to `0.38--5`:
```
{
    "biocontainers" : {
        "trimmomatic" : {
            "use_local" : false,
            "quayurl" : "quay.io/biocontainers/trimmomatic",
            "version" : "0.38--5"
        }
    }
}
```
This can be placed in a JSON file like `config/custom_trimmomatic.json` (in the `workflows/` directory) and passed to Snakemake using the `--configfile` flag:

```
snakemake --configfile=config/custom_trimmomatic.json \
    [FLAGS] <target>
```
Singularity is a containerization technology similar to Docker but without the need for root access, making it possible to use Docker containers in high performance computing environments. Snakefiles in Dahak contain `singularity:` directives, which specify a Singularity image to pull and use to run the given commands. These directives are ignored by Snakemake unless it is run with the `--use-singularity` flag. If the `--use-singularity` flag is passed to Snakemake on the command line, Snakemake will run the commands for each rule in a separate Singularity container:

```
snakemake --use-singularity <target>
```
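For reference, the `singularity:` directive sits alongside a rule's other directives in the Snakefile. The rule below is a minimal sketch rather than an actual Dahak rule; the rule name, file names, image tag, and command are illustrative:

```
# Hypothetical rule showing where the singularity: directive goes
rule trim_reads:
    input:
        "data/sample.fq.gz"
    output:
        "data/sample.trim.fq.gz"
    singularity:
        "docker://quay.io/biocontainers/trimmomatic:0.36--5"
    shell:
        "trimmomatic SE {input} {output} LEADING:2 TRAILING:2"
```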
When Singularity containers are run, a host directory can be bind-mounted inside the container to provide a folder shared between the host filesystem and the container. To specify a directory for Singularity to bind-mount, use the `SINGULARITY_BINDPATH` environment variable:

```
SINGULARITY_BINDPATH="my_data:/data" snakemake --use-singularity <target>
```

This bind-mounts the directory `my_data/` into the Singularity container at `/data/`.
If you are running Singularity on an HPC cluster, the `SINGULARITY_BINDPATH` variable can be set in the cluster configuration file along with the other cluster configuration details. See the HPC Quickstart.
When executing Snakemake using batch job systems, Snakemake requires three things:

- Cluster Configuration (JSON file): the cluster configuration file specifies what cluster parameters (required amount of memory, type of node, etc.) should be used when submitting jobs for each rule. (A download step might require a smaller node for a short amount of time, while an assembly step might require a high-memory node for a longer amount of time.)
- Job Script (Bash script): this is the actual script that is run once the job has finished waiting in the queue and the compute nodes are ready to go. This script should load modules like Python, Conda, Snakemake, and Singularity.
- Job Submission Command (command line argument): Snakemake also requires a templated job submission command, that is, the command that Snakemake uses to actually submit jobs. This command is provided to Snakemake on the command line using the `--cluster "<submit-cmd>"` flag.
Finally, when the workflow is executed, Snakemake will assemble the task graph, and will use the job submission script to request one node per rule. (If rules are in a group, they will be grouped and executed together on a single node.)
Also see the HPC Quickstart.
The Snakemake documentation discusses executing workflows on clusters at Cluster Execution.
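As a sketch of how these three pieces fit together on a SLURM cluster, the file names, resource keys, sbatch flags, and module names below are illustrative assumptions rather than Dahak's actual configuration. A cluster configuration file (here called `cluster.json`) maps rule names to cluster parameters, with `__default__` as the fallback:

```
{
    "__default__" : { "time" : "01:00:00", "mem" : "4G" },
    "assembly"    : { "time" : "24:00:00", "mem" : "128G" }
}
```

A job script template (here `jobscript.sh`) loads the needed modules before running each rule; Snakemake replaces `{exec_job}` with the actual rule command and `{properties}` with the job's metadata:

```
#!/bin/bash
# properties = {properties}
module load python singularity            # module names vary by cluster
export SINGULARITY_BINDPATH="data:/data"  # see the bind path discussion above
{exec_job}
```

These are wired together with the `--cluster-config`, `--jobscript`, and `--cluster` flags; the `{cluster.time}` and `{cluster.mem}` placeholders are filled in per rule from the cluster configuration:

```
snakemake --cluster-config cluster.json \
    --jobscript jobscript.sh \
    --cluster "sbatch -t {cluster.time} --mem={cluster.mem}" \
    [FLAGS] <target>
```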
To set the scratch/working directory for the Snakemake workflows, in which all intermediate and final data files will be placed, set the `data_dir` key in the Snakemake configuration dictionary. If this option is not set, it defaults to `data`. (No trailing slash is needed when specifying the directory name.)
!!! warning "Singularity Bind Path"
If you use a custom directory by setting the `data_dir` key,
you must also adjust the `SINGULARITY_BINDPATH` variable
accordingly.
For example, to put all intermediate files into the `work/` directory instead of the `data/` directory, the following very short JSON file could be used for the Snakemake configuration dictionary (this would use default values for everything except `data_dir`):
```
{
    "data_dir" : "work"
}
```
This JSON file can be used as the Snakemake configuration dictionary by passing its name to Snakemake's `--configfile` flag and updating the Singularity environment variable:

```
SINGULARITY_BINDPATH="work:/work" \
    snakemake --configfile=config/custom_scratch.settings \
    [FLAGS] <target>
```
All together, the command to run a Dahak workflow will look like this:

```
SINGULARITY_BINDPATH="data:/data" snakemake \
    --configfile my_workflow_params.json \
    --use-singularity \
    <target>
```