This package provides a mechanism for Snakemake workflows to explicitly "stage out" the output files from certain rules to a public repository like Zenodo to allow faster re-execution of the workflow, using these previously generated artifacts. This can be especially useful for workflows with computationally expensive rules that don't need to be frequently re-run.
snakemake-staging
is a spin-off of the
showyourwork
project, which
provides a "caching" framework for Snakemake workflows, to transparently avoid
re-execution of rules that have been cached to Zenodo. The
implementation of this logic in showyourwork
is, however, somewhat fragile and
unpredictable. In snakemake-staging
, we take a more explicit approach, where
"staged" rules are always either explicitly executed or restored.
To use snakemake-staging
in your workflow, you can install it using pip
(it's probably best to set up your Snakemake installation following the
Snakemake
docs
first):
python -m pip install snakemake-staging
While testing, it's probably best to use the Zenodo
Sandbox, rather than the main site, since any
archive published to Zenodo is permanent. To use the sandbox, you'll need a
personal access token stored in the SANDBOX_TOKEN
environment variable. You
can generate a new token
here.
Once you've added this token to your environment, you can edit the Snakefile for
your workflow to use snakemake-staging
as follows. First, towards the top of
your Snakefile, add:
import snakemake_staging as staging
stage = staging.ZenodoStage(
"zenodo-stage",
config.get("restore", False)
)
to create a new stage called zenodo-stage
. Note that here we're extracting a
restore
flag from the Snakemake config, which will be used to determine
whether to restore files for the stage. This means that you can control the
behavior of this stage from the command line. By passing --config restore=True
to the snakemake
command line interface, all files staged out by the
zenodo-stage
stage will be restored from the archive rather than generated.
Then, to stage out a rule, you can apply the stage as follows:
rule expensive:
input:
...
output:
stage(
"path/to/output1.txt",
"path/to/output2.txt",
)
shell:
...
Finally, after defining all the rules that you want to stage out, you must
add the following include
which defines all the staging rules:
include: staging.snakefile()
At this point, here's the full Snakefile
:
Full Snakefile
import snakemake_staging as staging
stage = staging.ZenodoStage(
"zenodo-stage",
config.get("restore", False)
)
rule expensive:
input:
...
output:
stage(
"path/to/output1.txt",
"path/to/output2.txt",
)
shell:
...
include: staging.snakefile()
With the Snakefile
defined in the previous section, you can now run your
workflow in 3 ways:
-
Normal execution: If you run something like
snakemake path/to/output1.txt
(where I have omitted the usual--cores
and--conda
arguments) will execute the workflow as normal, without staging out any files. -
Stage upload: If you instead have Snakemake target the
staging__upload
rule, theexpensive
rule will be executed, and the outputs will be uploaded to Zenodo, saving the record information tozenodo-stage.zenodo.json
(this filename can be changed by passing theinfo_file
argument to theZenodoStage
constructor). -
Stage restore: Finally, after these outputs have been uploaded to Zenodo, you can call Snakemake
--config restore=True
to disable theexpensive
rule, and force the outputs to be restored from Zenodo.