-
Notifications
You must be signed in to change notification settings - Fork 2
Setup
This is a fairly in-depth explanation on what needs to be done to execute the workflow. The general overview is:
- Create a conda environment that contains the snakemake and mamba packages
- Create a snakemake profile
- Modify workflow configuration values OR follow the slides here for a slightly alternative approach: https://docs.google.com/presentation/d/1gxlxbIObhxitgrPLp7lByYFwrFhEdvEm4mILmygAATY/edit#slide=id.g11366b6085b_0_0
Unfortunately, Snakemake does not offer a method of closing the terminal while keeping the jobs running.
This makes sense, as the main snakemake --profile slurm
command is tied directly to the main terminal process.
To overcome this, we will simply start a screen session.
This allows us to close the main terminal window, while keeping our SSH connection/instance alive.
Alternatively, you can run snakemake in a bash script submitted to SLURM, as explained in https://docs.google.com/presentation/d/1gxlxbIObhxitgrPLp7lByYFwrFhEdvEm4mILmygAATY/edit#slide=id.g11366b6085b_0_0
First, set a large scrollback for screen, so we can view more lines after we have detached from the terminal. Execute the following:
echo "defscrollback 10000" >> ~/.screenrc
The Execution Section of the Running page will show more about using screen with snakemake
NOTE: These steps can take quite a long time
Choose a name for your conda environment. This tutorial wil be using snakemake
for its environment name
- Create a conda environment:
conda create -n snakemake
. Optionally, you can change the default location of the conda environment by following the prompt - Activate the conda environment:
conda activate snakemake
- Install mamba:
conda install -n snakemake -c conda-forge mamba
. This is recommended by Snakemake, as Mamba is much faster, and Conda occationally has issues installing the most recent Snakemake version.-
conda
: The conda command -
install
: Install a package -
-n snakemake
: Install a package into a specific conda environment -
-c conda-forge
: Use the conda-forge channel to install the package -
mamba
: The package to install
-
- Install snakemake using mamba:
mamba install -c bioconda -c conda-forge snakemake
-
mamba
: The mamba command -
install
: Install a package -
-c bioconda -c conda-forge
: Use the bioconda and conda-forge packages -
snakemake
: The package to install
-
- Install CookieCutter (if it was not already installed):
pip install cookiecutter
- Test snakemake:
snakemake --version
. Ensure no errors occur, and the snakemake version is seen. The most recent version can be seen here
This section will assume you are setting up profiles for SLURM. If you are not, please reference the Snakemake Profile's GitHub Page and select your cluster's scheduler
Snakemake's profiles will allow us to submit jobs to slurm without having to write SLURM scripts.
- Create the following directory with:
mkdir -p ~/.config/snakemake/
- Change to this directory, and execute the following:
cookiecutter https://github.com/Snakemake-Profiles/slurm.git
- You do not have to enter any specific details, simply press enter until this specific setup is complete
- Copy and paste the contents of the "
config.yaml codeblock
" codeblock into theconfig.yaml
file - Change values as you like, but the ones listed are generally good-to-go
- If you do decide to change value, DO NOT modify
use-conda
, orconda-frontend
. If these values are changed away from their current setting, snakemake WILL break - Once this is done, create a new file named
cluster_config.yaml
- Enter the data under the
# cluster_config.yaml codeblock
into it - This will define where output from SLURM will go. It does not need to be changed, however, feel free to change it if you would like
- Enter the data under the
# config.yaml codeblock
# Default Values
restart-times: 3
jobscript: "slurm-jobscript.sh"
cluster: "slurm-submit.py"
cluster-status: "slurm-status.py"
max-status-checks-per-second: 10
local-cores: 1
latency-wait: 60
# User-modified settings
jobs: 100
printshellcmds: True
max-jobs-per-second: 10 # Default is 1 job per second
# DO NOT MODIFY THESE VALUES
# Snakemake will break if these are changed from their current setting
use-conda: True
conda-frontend: mamba
# cluster_config.yaml
__default__ :
job-name : "{rule}.{wildcards}"
ntasks : "1"
cpus-per-task : "{threads}"
nodes : "1"
output : "logs/{rule}/{rule}.{wildcards.tissue_name}.{wildcards.tag}.output"
error : "logs/{rule}/{rule}.{wildcards.tissue_name}.{wildcards.tag}.output"
When the workflow was first downloaded (from The Workflow section), a snakemake_configuration.yaml
file was downloaded. Open this file and change the values to your needs. To make the BED_FILE, RRNA_INTERVAL_LIST, and REF_FLAT_FILE see slides 3 and 4 at https://docs.google.com/presentation/d/1gxlxbIObhxitgrPLp7lByYFwrFhEdvEm4mILmygAATY/edit#slide=id.g111a3589bd3_0_0 with examples for human reference genome.
# Modify these settings to reflect your paths
MASTER_CONTROL: "controls/master_control.csv" # csv with srr, sampleName_SXRYrZ, layout, prep method
ROOTDIR: "results" # redirect output root directory
DUMP_FASTQ_FILES: "results/dump_fastq_input" # downloaded fastq output file location
REF_FLAT_FILE: "genome/refFlat_GRCh38.105.txt" # reflat file location built from gtf, for RNAseq metrics option
RRNA_INTERVAL_LIST: "genome/GRCh38.p5.rRNA.interval_list" # rrna interval list file location, for RNAseq metrics
BED_FILE: "genome/Homo_sapiens.GRCh38.105.bed" # reference bedfile build from GTF for RSEQC option
# Quality Control Options
PERFORM_TRIM: True # True or False -- trim adapters prior to aligning
PERFORM_SCREEN: True # True or False -- screen against different genomes for contamination
PERFORM_GET_RNASEQ_METRICS: True # True or False -- get more information about RNA reads (requires a REF_FLAT_FILE, and RRNA_RNA_INTERVAL_LIST!)
# Additional options
PERFORM_PREFETCH: True # True or False -- prefetch sra files before writing to fastq (True if pulling sra directly from GEO Database)
PERFORM_GET_INSERT_SIZE: False # True or False
PERFORM_GET_FRAGMENT_SIZE: True # True or False (required for zFPKM calculation, requires a BED_FILE!)
GENERATE_GENOME:
# The full path where the genome generation data should be saved
# This step will most likely not need to be performed
# The genome is currently generated, do not change this unless necessary
GENOME_SAVE_DIR: "genome/star"
# The full input path of the genome fasta file and the GTF file
# These files are currently present, and any user can read from these files
# These values should not need to be changed unless necessary
GENOME_FASTA_FILE: "genome/Homo_sapiens.GRCh38.dna.primary_assembly.fa"
GTF_FILE: "genome/Homo_sapiens.GRCh38.105.gtf"
-
MASTER_CONTROL
: The master control file you have. It is a CSV file, consisting of a column with SRR codes, a column with tissue names with study (or batch) number, replicate number, and if applicable, run number, a column with library layouts, and a library preparation column. It is NOT required if you havePERFORM_PREFETCH
set to false ( in which case you would provide the .sra files yourself. An example file is as follows (note the that header should be excluded in your file):
SRR | tissue/tag | Paired End or Single End | Library Preparation (Total or polyA/mRNA enriched) |
---|---|---|---|
SRR7647658 | naiveB_S1R1 | PE | mRNA |
SRR7647700 | naiveB_S1R2 | PE | mRNA |
SRR7647769 | naiveB_S1R3 | PE | mRNA |
SRR7647808 | naiveB_S1R4 | PE | mRNA |
SRR5110334 | naiveB_S2R1 | SE | total |
SRR5110338 | naiveB_S2R2 | SE | total |
SRR5110342 | naiveB_S2R3 | SE | total |
SRR6298332 | nsmB_S1R1 | PE | total |
SRR6298303 | nsmB_S1R2 | PE | total |
SRR6298366 | nsmB_S1R3 | PE | total |
SRR6298274 | nsmB_S1R4 | PE | total |
SRR10408536 | m2Macro_S1R1r1 | SE | total |
SRR10408537 | m2Macro_S1R1r2 | SE | total |
SRR10408538 | m2Macro_S1R1r3 | SE | total |
SRR10408539 | m2Macro_S1R1r4 | SE | total |
SRR10408540 | m2Macro_S1R2r1 | SE | total |
SRR10408541 | m2Macro_S1R2r2 | SE | total |
SRR10408542 | m2Macro_S1R2r3 | SE | total |
SRR10408543 | m2Macro_S1R2r4 | SE | total |
SRR10408544 | m2Macro_S1R3r1 | SE | total |
SRR10408545 | m2Macro_S1R3r2 | SE | total |
SRR10408546 | m2Macro_S1R3r3 | SE | total |
SRR10408547 | m2Macro_S1R3r4 | SE | total |
-
DUMP_FASTQ_FILES
: This option is only required if you have setPERFORM_PREFETCH
toFalse
. It is the location at which your input.fastq.gz
files are located -
ROOTDIR
: The root directory the results should be placed in. This is most likely going to be in your/work
folder -
REF_FLAT_FILE
: path to a reflat file for your reference genome. Can be made using (change depending on genome you are using):
- wget https://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/gtfToGenePred # download gtf to reflat converter
- chmod =rwx,g+s ./gtfToGenePred # ./gtfToGenePred enable execution permission
- ./gtfToGenePred -genePredExt -geneNameAsName2 genome/Homo_sapiens.GRCh38.105.gtf refFlat.tmp.txt # run
- paste <(cut -f 12 refFlat.tmp.txt) <(cut -f 1-10 refFlat.tmp.txt) > genome/refFlat_GRCh38.105.txt # modify for picard to parse correctly
- rm refFlat.tmp.txt # delete temp file
-
RRNA_INTERVAL_LIST
: path to a ribosomal interval list built from GTF file for Picard's GetRNASeqMetrics command to find rRNA trasncript quantities. Can be made using (change depending on genome you are using):
- save https://gist.github.com/ag1805x/55ba0d88f317f63423a4e54adc46eb1f
- edit to look like this: https://drive.google.com/drive/u/0/folders/1h8aWJDaqAysdlJwTF_1jSTujfRjNTmvm
- save as 'riboInt.sh'
- sh riboInt.sh
-
BED_FILE
: path to a bedfile for RSeQC, also built from the GTF_FILE corresponding to your reference genome. Can make using (change depending on genome you are using):
- cd genome
- wget https://sourceforge.net/projects/rseqc/files/BED/Human_Homo_sapiens/hg38_GENCODE.v38.bed.gz/download
- mv download hg38_GENCODE.v38.bed.gz
- gunzip hg38_GENCODE.v38.bed.gz
- sed -i 's/"//g' hg38_GENCODE.v38.bed # remove quotation marks from exon positions
- sed -i 's/chr//g' hg38_GENCODE.v38.bed # remove “chr” from chromosome indices
-
PERFORM_TRIM
: Should trim be performed? True or False -
PERFORM_SCREEN
: Screen against genomes of common contaminants? True or False -
PERFORM_GET_RNASEQ_METRICS
: Use Picard's getRNASeqMetrics? True or False. Requires REF_FLAT and RRNA_INTERVAL_LIST. -
PERFORM_PREFETCH
: If you only have the SRR code (fromMASTER_CONTROL
), then this option will download those.sra
files. -
PERFORM_GET_INSERT_SIZE
: Get interval size metrics using Picard? True or False -
GET_FRAGMENT_SIZE
: Get fragment sizes with RSeQC? True or False -
GENOME_SAVE_DIR
: path to directory the genome directory should be saved/output to. This should be under your/work
folder -
GENOME_FASTA_FILE
: The input genome fasta file that has been previously downloaded. This is most likely located under the/work
folder as well. Can download human genome using:
- cd /path/to/genome
- wget ftp://ftp.ensembl.org/pub/release-105/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
- gunzip Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
-
GTF_FILE
: The input GTF genome file that has also been previously downloaded. Again, this is most likely located under the/work
folder. Can download human genome annotation using:
- cd /path/to/genome
- wget ftp://ftp.ensembl.org/pub/release-105/gtf/homo_sapiens/Homo_sapiens.GRCh38.105.gtf.gz
- gunzip Homo_sapiens.GRCh38.105.gtf.gz
You do not need to edit any further files. Snakemake will pull the configurations you have set up in the snakemake_config.yaml
file.
Once these steps are complete, the workflow should be prepared to execute. Continue to the next page to execute the workflow.
Created by Josh Loecker and Brandt Bessell