Setup

Overview

This is a fairly in-depth explanation on what needs to be done to execute the workflow. The general overview is:

Create a conda environment that contains the snakemake and mamba packages
Create a snakemake profile
Modify workflow configuration values OR follow the slides here for a slightly alternative approach: https://docs.google.com/presentation/d/1gxlxbIObhxitgrPLp7lByYFwrFhEdvEm4mILmygAATY/edit#slide=id.g11366b6085b_0_0

Screen

Unfortunately, Snakemake does not offer a method of closing the terminal while keeping the jobs running. This makes sense, as the main snakemake --profile slurm command is tied directly to the main terminal process. To overcome this, we will simply start a screen session. This allows us to close the main terminal window, while keeping our SSH connection/instance alive.

Alternatively, you can run snakemake in a bash script submitted to SLURM, as explained in https://docs.google.com/presentation/d/1gxlxbIObhxitgrPLp7lByYFwrFhEdvEm4mILmygAATY/edit#slide=id.g11366b6085b_0_0

First, set a large scrollback for screen, so we can view more lines after we have detached from the terminal. Execute the following:
echo "defscrollback 10000" >> ~/.screenrc

The Execution Section of the Running page will show more about using screen with snakemake

Conda Environment

NOTE: These steps can take quite a long time

Choose a name for your conda environment. This tutorial wil be using snakemake for its environment name

Create a conda environment: conda create -n snakemake. Optionally, you can change the default location of the conda environment by following the prompt
Activate the conda environment: conda activate snakemake
Install mamba: conda install -n snakemake -c conda-forge mamba. This is recommended by Snakemake, as Mamba is much faster, and Conda occationally has issues installing the most recent Snakemake version.
1. conda: The conda command
2. install: Install a package
3. -n snakemake: Install a package into a specific conda environment
4. -c conda-forge: Use the conda-forge channel to install the package
5. mamba: The package to install
Install snakemake using mamba: mamba install -c bioconda -c conda-forge snakemake
1. mamba: The mamba command
2. install: Install a package
3. -c bioconda -c conda-forge: Use the bioconda and conda-forge packages
4. snakemake: The package to install
Install CookieCutter (if it was not already installed): pip install cookiecutter
Test snakemake: snakemake --version. Ensure no errors occur, and the snakemake version is seen. The most recent version can be seen here

Snakemake Profile Setup

(Reference)

This section will assume you are setting up profiles for SLURM. If you are not, please reference the Snakemake Profile's GitHub Page and select your cluster's scheduler

Snakemake's profiles will allow us to submit jobs to slurm without having to write SLURM scripts.

Create the following directory with: mkdir -p ~/.config/snakemake/
Change to this directory, and execute the following: cookiecutter https://github.com/Snakemake-Profiles/slurm.git
1. You do not have to enter any specific details, simply press enter until this specific setup is complete
Copy and paste the contents of the "config.yaml codeblock" codeblock into the config.yaml file
Change values as you like, but the ones listed are generally good-to-go
If you do decide to change value, DO NOT modify use-conda, or conda-frontend. If these values are changed away from their current setting, snakemake WILL break
Once this is done, create a new file named cluster_config.yaml
1. Enter the data under the # cluster_config.yaml codeblock into it
2. This will define where output from SLURM will go. It does not need to be changed, however, feel free to change it if you would like

# config.yaml codeblock

# Default Values
restart-times: 3
jobscript: "slurm-jobscript.sh"
cluster: "slurm-submit.py"
cluster-status: "slurm-status.py"
max-status-checks-per-second: 10
local-cores: 1
latency-wait: 60

# User-modified settings
jobs: 100
printshellcmds: True
max-jobs-per-second: 10  # Default is 1 job per second

# DO NOT MODIFY THESE VALUES
# Snakemake will break if these are changed from their current setting
use-conda: True
conda-frontend: mamba

# cluster_config.yaml

__default__ :
   job-name  : "{rule}.{wildcards}"
   ntasks    : "1"
   cpus-per-task : "{threads}"
   nodes     : "1"
   output : "logs/{rule}/{rule}.{wildcards.tissue_name}.{wildcards.tag}.output"
   error  : "logs/{rule}/{rule}.{wildcards.tissue_name}.{wildcards.tag}.output"

Workflow Configuration

When the workflow was first downloaded (from The Workflow section), a snakemake_configuration.yaml file was downloaded. Open this file and change the values to your needs. To make the BED_FILE, RRNA_INTERVAL_LIST, and REF_FLAT_FILE see slides 3 and 4 at https://docs.google.com/presentation/d/1gxlxbIObhxitgrPLp7lByYFwrFhEdvEm4mILmygAATY/edit#slide=id.g111a3589bd3_0_0 with examples for human reference genome.

# Modify these settings to reflect your paths
MASTER_CONTROL: "controls/master_control.csv"  # csv with srr, sampleName_SXRYrZ, layout, prep method
ROOTDIR: "results" # redirect output root directory
DUMP_FASTQ_FILES: "results/dump_fastq_input" # downloaded fastq output file location
REF_FLAT_FILE: "genome/refFlat_GRCh38.105.txt"  # reflat file location built from gtf, for RNAseq metrics option
RRNA_INTERVAL_LIST: "genome/GRCh38.p5.rRNA.interval_list"  # rrna interval list file location, for RNAseq metrics
BED_FILE: "genome/Homo_sapiens.GRCh38.105.bed"  # reference bedfile build from GTF for RSEQC option 

# Quality Control Options
PERFORM_TRIM: True  # True or False -- trim adapters prior to aligning
PERFORM_SCREEN: True # True or False -- screen against different genomes for contamination
PERFORM_GET_RNASEQ_METRICS: True # True or False -- get more information about RNA reads (requires a REF_FLAT_FILE, and RRNA_RNA_INTERVAL_LIST!)

# Additional options
PERFORM_PREFETCH: True  # True or False -- prefetch sra files before writing to fastq (True if pulling sra directly from GEO Database)
PERFORM_GET_INSERT_SIZE: False # True or False 
PERFORM_GET_FRAGMENT_SIZE: True # True or False (required for zFPKM calculation, requires a BED_FILE!)

GENERATE_GENOME:
  # The full path where the genome generation data should be saved
  # This step will most likely not need to be performed
  # The genome is currently generated, do not change this unless necessary
  GENOME_SAVE_DIR: "genome/star"

  # The full input path of the genome fasta file and the GTF file
  # These files are currently present, and any user can read from these files
  # These values should not need to be changed unless necessary
  GENOME_FASTA_FILE: "genome/Homo_sapiens.GRCh38.dna.primary_assembly.fa"
  GTF_FILE: "genome/Homo_sapiens.GRCh38.105.gtf"

MASTER_CONTROL: The master control file you have. It is a CSV file, consisting of a column with SRR codes, a column with tissue names with study (or batch) number, replicate number, and if applicable, run number, a column with library layouts, and a library preparation column. It is NOT required if you have PERFORM_PREFETCH set to false ( in which case you would provide the .sra files yourself. An example file is as follows (note the that header should be excluded in your file):

SRR	tissue/tag	Paired End or Single End	Library Preparation (Total or polyA/mRNA enriched)
SRR7647658	naiveB_S1R1	PE	mRNA
SRR7647700	naiveB_S1R2	PE	mRNA
SRR7647769	naiveB_S1R3	PE	mRNA
SRR7647808	naiveB_S1R4	PE	mRNA
SRR5110334	naiveB_S2R1	SE	total
SRR5110338	naiveB_S2R2	SE	total
SRR5110342	naiveB_S2R3	SE	total
SRR6298332	nsmB_S1R1	PE	total
SRR6298303	nsmB_S1R2	PE	total
SRR6298366	nsmB_S1R3	PE	total
SRR6298274	nsmB_S1R4	PE	total
SRR10408536	m2Macro_S1R1r1	SE	total
SRR10408537	m2Macro_S1R1r2	SE	total
SRR10408538	m2Macro_S1R1r3	SE	total
SRR10408539	m2Macro_S1R1r4	SE	total
SRR10408540	m2Macro_S1R2r1	SE	total
SRR10408541	m2Macro_S1R2r2	SE	total
SRR10408542	m2Macro_S1R2r3	SE	total
SRR10408543	m2Macro_S1R2r4	SE	total
SRR10408544	m2Macro_S1R3r1	SE	total
SRR10408545	m2Macro_S1R3r2	SE	total
SRR10408546	m2Macro_S1R3r3	SE	total
SRR10408547	m2Macro_S1R3r4	SE	total

DUMP_FASTQ_FILES: This option is only required if you have set PERFORM_PREFETCH to False. It is the location at which your input .fastq.gz files are located
ROOTDIR: The root directory the results should be placed in. This is most likely going to be in your /work folder
REF_FLAT_FILE: path to a reflat file for your reference genome. Can be made using (change depending on genome you are using):

wget https://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/gtfToGenePred # download gtf to reflat converter

chmod =rwx,g+s ./gtfToGenePred # ./gtfToGenePred enable execution permission

./gtfToGenePred -genePredExt -geneNameAsName2 genome/Homo_sapiens.GRCh38.105.gtf refFlat.tmp.txt # run

paste <(cut -f 12 refFlat.tmp.txt) <(cut -f 1-10 refFlat.tmp.txt) > genome/refFlat_GRCh38.105.txt # modify for picard to parse correctly

rm refFlat.tmp.txt # delete temp file

RRNA_INTERVAL_LIST: path to a ribosomal interval list built from GTF file for Picard's GetRNASeqMetrics command to find rRNA trasncript quantities. Can be made using (change depending on genome you are using):

save https://gist.github.com/ag1805x/55ba0d88f317f63423a4e54adc46eb1f

edit to look like this: https://drive.google.com/drive/u/0/folders/1h8aWJDaqAysdlJwTF_1jSTujfRjNTmvm

save as 'riboInt.sh'

sh riboInt.sh

BED_FILE: path to a bedfile for RSeQC, also built from the GTF_FILE corresponding to your reference genome. Can make using (change depending on genome you are using):

cd genome

wget https://sourceforge.net/projects/rseqc/files/BED/Human_Homo_sapiens/hg38_GENCODE.v38.bed.gz/download

mv download hg38_GENCODE.v38.bed.gz

gunzip hg38_GENCODE.v38.bed.gz

sed -i 's/"//g' hg38_GENCODE.v38.bed # remove quotation marks from exon positions

sed -i 's/chr//g' hg38_GENCODE.v38.bed # remove “chr” from chromosome indices

PERFORM_TRIM: Should trim be performed? True or False
PERFORM_SCREEN: Screen against genomes of common contaminants? True or False
PERFORM_GET_RNASEQ_METRICS: Use Picard's getRNASeqMetrics? True or False. Requires REF_FLAT and RRNA_INTERVAL_LIST.
PERFORM_PREFETCH: If you only have the SRR code (from MASTER_CONTROL), then this option will download those .sra files.
PERFORM_GET_INSERT_SIZE: Get interval size metrics using Picard? True or False
GET_FRAGMENT_SIZE: Get fragment sizes with RSeQC? True or False
GENOME_SAVE_DIR: path to directory the genome directory should be saved/output to. This should be under your /work folder
GENOME_FASTA_FILE: The input genome fasta file that has been previously downloaded. This is most likely located under the /work folder as well. Can download human genome using:

cd /path/to/genome

wget ftp://ftp.ensembl.org/pub/release-105/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz

gunzip Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz

GTF_FILE: The input GTF genome file that has also been previously downloaded. Again, this is most likely located under the /work folder. Can download human genome annotation using:

cd /path/to/genome

wget ftp://ftp.ensembl.org/pub/release-105/gtf/homo_sapiens/Homo_sapiens.GRCh38.105.gtf.gz

gunzip Homo_sapiens.GRCh38.105.gtf.gz

You do not need to edit any further files. Snakemake will pull the configurations you have set up in the snakemake_config.yaml file.

Once these steps are complete, the workflow should be prepared to execute. Continue to the next page to execute the workflow.

Go back to Download
Go forward to Running

Created by Josh Loecker and Brandt Bessell

Provide feedback

Saved searches