Josh Loecker edited this page Mar 29, 2022 · 14 revisions

Overview

This is a fairly in-depth explanation of what needs to be done to execute the workflow. The general overview is:

  1. Create a conda environment that contains the snakemake and mamba packages
  2. Create a snakemake profile
  3. Modify the workflow configuration values

For a slightly different approach, follow the slides here: https://docs.google.com/presentation/d/1gxlxbIObhxitgrPLp7lByYFwrFhEdvEm4mILmygAATY/edit#slide=id.g11366b6085b_0_0

Screen

Unfortunately, Snakemake does not offer a way to close the terminal while keeping its jobs running. This makes sense, as the main snakemake --profile slurm command is tied directly to the terminal process that started it. To overcome this, we will simply start a screen session. This allows us to close the terminal window (or lose the SSH connection) while the session, and the snakemake command inside it, keeps running on the server.

Alternatively, you can run snakemake in a bash script submitted to SLURM, as explained in https://docs.google.com/presentation/d/1gxlxbIObhxitgrPLp7lByYFwrFhEdvEm4mILmygAATY/edit#slide=id.g11366b6085b_0_0

First, set a large scrollback for screen, so we can view more lines after we have detached from the terminal. Execute the following:
echo "defscrollback 10000" >> ~/.screenrc

The Execution section of the Running page shows more about using screen with snakemake.
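Because the echo command above appends to ~/.screenrc every time it runs, repeating it leaves duplicate lines. A minimal sketch of an idempotent version follows. ensure_scrollback is a hypothetical helper name, and the file argument exists only so the function can be tried on a scratch file first.

```shell
# Append a defscrollback line to a screenrc file only if none is present yet,
# so running this more than once does not duplicate the setting.
# ensure_scrollback is a hypothetical helper, not part of screen itself.
ensure_scrollback() {
    local rc_file="${1:-$HOME/.screenrc}"   # file to edit (defaults to ~/.screenrc)
    local lines="${2:-10000}"               # scrollback size used in this tutorial
    if [ -f "$rc_file" ] && grep -q '^defscrollback ' "$rc_file"; then
        echo "defscrollback already set in $rc_file"
    else
        echo "defscrollback $lines" >> "$rc_file"
        echo "added defscrollback $lines to $rc_file"
    fi
}
```

Calling ensure_scrollback with no arguments targets ~/.screenrc with the same 10000-line value used above.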

Conda Environment

NOTE: These steps can take quite a long time

Choose a name for your conda environment. This tutorial will use snakemake as its environment name.

  1. Create a conda environment: conda create -n snakemake. Optionally, you can change the default location of the conda environment by following the prompt
  2. Activate the conda environment: conda activate snakemake
  3. Install mamba: conda install -n snakemake -c conda-forge mamba. This is recommended by Snakemake, as Mamba is much faster, and Conda occasionally has issues installing the most recent Snakemake version.
    1. conda: The conda command
    2. install: Install a package
    3. -n snakemake: Install a package into a specific conda environment
    4. -c conda-forge: Use the conda-forge channel to install the package
    5. mamba: The package to install
  4. Install snakemake using mamba: mamba install -c bioconda -c conda-forge snakemake
    1. mamba: The mamba command
    2. install: Install a package
    3. -c bioconda -c conda-forge: Use the bioconda and conda-forge channels
    4. snakemake: The package to install
  5. Install CookieCutter (if it was not already installed): pip install cookiecutter
  6. Test snakemake: snakemake --version. Ensure no errors occur, and the snakemake version is seen. The most recent version can be seen here
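Before moving on, it can help to confirm that every tool installed above is reachable from the active environment. The sketch below is one way to do that. check_tools is a hypothetical helper, not something shipped with conda or Snakemake.

```shell
# Report which of the given commands can be found on PATH.
# Returns 0 when all are found and 1 when at least one is missing.
# check_tools is a hypothetical helper written for this tutorial.
check_tools() {
    local missing=0
    local tool
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}
```

With the snakemake environment active, running check_tools conda mamba snakemake cookiecutter should report all four as found.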

Snakemake Profile Setup

(Reference)

This section assumes you are setting up a profile for SLURM. If you are not, please refer to the Snakemake-Profiles GitHub page and select your cluster's scheduler.

Snakemake's profiles allow us to submit jobs to SLURM without having to write SLURM scripts.

  1. Create the following directory with: mkdir -p ~/.config/snakemake/
  2. Change to this directory, and execute the following: cookiecutter https://github.com/Snakemake-Profiles/slurm.git
    1. You do not have to enter any specific details, simply press enter until this specific setup is complete
  3. Copy and paste the contents of the # config.yaml codeblock below into the config.yaml file
  4. Change values as you like, but the ones listed are generally good-to-go
  5. If you do decide to change values, DO NOT modify use-conda or conda-frontend. If these values are changed from their current settings, snakemake WILL break
  6. Once this is done, create a new file named cluster_config.yaml
    1. Enter the data under the # cluster_config.yaml codeblock into it
    2. This will define where output from SLURM will go. It does not need to be changed, however, feel free to change it if you would like
# config.yaml codeblock

# Default Values
restart-times: 3
jobscript: "slurm-jobscript.sh"
cluster: "slurm-submit.py"
cluster-status: "slurm-status.py"
max-status-checks-per-second: 10
local-cores: 1
latency-wait: 60

# User-modified settings
jobs: 100
printshellcmds: True
max-jobs-per-second: 10  # Default is 1 job per second

# DO NOT MODIFY THESE VALUES
# Snakemake will break if these are changed from their current setting
use-conda: True
conda-frontend: mamba

# cluster_config.yaml codeblock

__default__ :
   job-name  : "{rule}.{wildcards}"
   ntasks    : "1"
   cpus-per-task : "{threads}"
   nodes     : "1"
   output : "logs/{rule}/{rule}.{wildcards.tissue_name}.{wildcards.tag}.output"
   error  : "logs/{rule}/{rule}.{wildcards.tissue_name}.{wildcards.tag}.output"
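
Once both files are in place, a quick check that the profile directory is complete can catch a missed step. This sketch assumes the profile was created with cookiecutter's default name (slurm), so that it lives in ~/.config/snakemake/slurm, and that cluster_config.yaml was saved alongside config.yaml. check_profile is a hypothetical helper. The script names are the ones referenced in the config.yaml codeblock above.

```shell
# Verify that a Snakemake SLURM profile directory holds the files this page relies on.
# Assumption: the profile lives in ~/.config/snakemake/slurm unless a path is given.
# check_profile is a hypothetical helper, not part of Snakemake or the profile.
check_profile() {
    local profile_dir="${1:-$HOME/.config/snakemake/slurm}"
    local status=0
    local f
    for f in config.yaml cluster_config.yaml \
             slurm-jobscript.sh slurm-submit.py slurm-status.py; do
        if [ -f "$profile_dir/$f" ]; then
            echo "ok:      $f"
        else
            echo "missing: $f"
            status=1
        fi
    done
    return "$status"
}
```

Run check_profile with no arguments to test the default location, or pass the directory explicitly if you chose a different profile name.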

Workflow Configuration

When the workflow was first downloaded (from The Workflow section), a snakemake_configuration.yaml file was included. Open this file and change the values to suit your needs. To create the BED_FILE, RRNA_INTERVAL_LIST, and REF_FLAT_FILE, see slides 3 and 4 at https://docs.google.com/presentation/d/1gxlxbIObhxitgrPLp7lByYFwrFhEdvEm4mILmygAATY/edit#slide=id.g111a3589bd3_0_0, which give examples for the human reference genome.

# Modify these settings to reflect your paths
MASTER_CONTROL: "controls/master_control.csv"  # csv with srr, sampleName_SXRYrZ, layout, prep method
ROOTDIR: "results" # redirect output root directory
DUMP_FASTQ_FILES: "results/dump_fastq_input" # downloaded fastq output file location
REF_FLAT_FILE: "genome/refFlat_GRCh38.105.txt"  # refFlat file location built from gtf, for RNAseq metrics option
RRNA_INTERVAL_LIST: "genome/GRCh38.p5.rRNA.interval_list"  # rrna interval list file location, for RNAseq metrics
BED_FILE: "genome/Homo_sapiens.GRCh38.105.bed"  # reference bedfile built from GTF for RSEQC option

# Quality Control Options
PERFORM_TRIM: True  # True or False -- trim adapters prior to aligning
PERFORM_SCREEN: True # True or False -- screen against different genomes for contamination
PERFORM_GET_RNASEQ_METRICS: True # True or False -- get more information about RNA reads (requires a REF_FLAT_FILE and RRNA_INTERVAL_LIST!)

# Additional options
PERFORM_PREFETCH: True  # True or False -- prefetch sra files before writing to fastq (True if pulling sra directly from GEO Database)
PERFORM_GET_INSERT_SIZE: False # True or False 
PERFORM_GET_FRAGMENT_SIZE: True # True or False (required for zFPKM calculation, requires a BED_FILE!)

GENERATE_GENOME:
  # The full path where the genome generation data should be saved
  # This step will most likely not need to be performed
  # The genome is currently generated, do not change this unless necessary
  GENOME_SAVE_DIR: "genome/star"

  # The full input path of the genome fasta file and the GTF file
  # These files are currently present, and any user can read from these files
  # These values should not need to be changed unless necessary
  GENOME_FASTA_FILE: "genome/Homo_sapiens.GRCh38.dna.primary_assembly.fa"
  GTF_FILE: "genome/Homo_sapiens.GRCh38.105.gtf"
  1. MASTER_CONTROL: The path to your master control file. It is a CSV file with four columns: SRR codes; tissue names tagged with study (or batch) number, replicate number, and, if applicable, run number; library layouts; and library preparation methods. It is NOT required if you have PERFORM_PREFETCH set to False (in which case you would provide the .sra files yourself). An example file is as follows (note that the header should be excluded from your file):
SRR          tissue/tag      Paired End or Single End  Library Preparation (Total or polyA/mRNA enriched)
SRR7647658   naiveB_S1R1     PE                        mRNA
SRR7647700   naiveB_S1R2     PE                        mRNA
SRR7647769   naiveB_S1R3     PE                        mRNA
SRR7647808   naiveB_S1R4     PE                        mRNA
SRR5110334   naiveB_S2R1     SE                        total
SRR5110338   naiveB_S2R2     SE                        total
SRR5110342   naiveB_S2R3     SE                        total
SRR6298332   nsmB_S1R1       PE                        total
SRR6298303   nsmB_S1R2       PE                        total
SRR6298366   nsmB_S1R3       PE                        total
SRR6298274   nsmB_S1R4       PE                        total
SRR10408536  m2Macro_S1R1r1  SE                        total
SRR10408537  m2Macro_S1R1r2  SE                        total
SRR10408538  m2Macro_S1R1r3  SE                        total
SRR10408539  m2Macro_S1R1r4  SE                        total
SRR10408540  m2Macro_S1R2r1  SE                        total
SRR10408541  m2Macro_S1R2r2  SE                        total
SRR10408542  m2Macro_S1R2r3  SE                        total
SRR10408543  m2Macro_S1R2r4  SE                        total
SRR10408544  m2Macro_S1R3r1  SE                        total
SRR10408545  m2Macro_S1R3r2  SE                        total
SRR10408546  m2Macro_S1R3r3  SE                        total
SRR10408547  m2Macro_S1R3r4  SE                        total
  2. DUMP_FASTQ_FILES: This option is only required if you have set PERFORM_PREFETCH to False. It is the directory in which your input .fastq.gz files are located
  3. ROOTDIR: The root directory the results should be placed in. This is most likely going to be in your /work folder
  4. REF_FLAT_FILE: Path to a refFlat file for your reference genome. It can be made using the following commands (change depending on the genome you are using):
  • wget https://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/gtfToGenePred # download the GTF converter used to build the refFlat file
  • chmod +x ./gtfToGenePred # enable execution permission
  • ./gtfToGenePred -genePredExt -geneNameAsName2 genome/Homo_sapiens.GRCh38.105.gtf refFlat.tmp.txt # run the converter
  • paste <(cut -f 12 refFlat.tmp.txt) <(cut -f 1-10 refFlat.tmp.txt) > genome/refFlat_GRCh38.105.txt # put the gene name (column 12) first so Picard can parse the file
  • rm refFlat.tmp.txt # delete the temporary file
  5. RRNA_INTERVAL_LIST: Path to a ribosomal interval list built from the GTF file, used by Picard's CollectRnaSeqMetrics command to find rRNA transcript quantities. For how to build it, see slides 3 and 4 of the presentation linked at the top of this section (adjust for the genome you are using)
  6. BED_FILE: Path to a BED file for RSeQC, also built from the GTF_FILE corresponding to your reference genome. For how to build it, see slides 3 and 4 of the presentation linked at the top of this section (adjust for the genome you are using)
  7. PERFORM_TRIM: Should trim be performed? True or False
  8. PERFORM_SCREEN: Screen against genomes of common contaminants? True or False
  9. PERFORM_GET_RNASEQ_METRICS: Use Picard's CollectRnaSeqMetrics? True or False. Requires REF_FLAT_FILE and RRNA_INTERVAL_LIST.
  10. PERFORM_PREFETCH: If you only have the SRR code (from MASTER_CONTROL), then this option will download those .sra files. True or False
  11. PERFORM_GET_INSERT_SIZE: Get insert size metrics using Picard? True or False
  12. PERFORM_GET_FRAGMENT_SIZE: Get fragment sizes with RSeQC? True or False. Requires BED_FILE.
  13. GENOME_SAVE_DIR: Path to the directory the generated genome should be saved to. This should be under your /work folder
  14. GENOME_FASTA_FILE: The input genome fasta file that has been previously downloaded. This is most likely located under the /work folder as well. Can download human genome using:
  15. GTF_FILE: The input GTF genome file that has also been previously downloaded. Again, this is most likely located under the /work folder. Can download human genome annotation using:
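
The MASTER_CONTROL layout described above can be checked before starting a run. The following is a minimal sketch, assuming the file is comma-separated, has no header, and uses only the values seen in the example (PE or SE for layout, total or mRNA for library preparation). validate_master_control is a hypothetical helper and is not part of the workflow.

```shell
# Check every row of a header-less, comma-separated master control file.
# Assumptions (taken from the example table, not from the workflow's code):
#   column 1: SRR code            column 3: PE or SE
#   column 2: name_SXRY[rZ] tag   column 4: total or mRNA
# Prints one message per problem and returns non-zero if any row is invalid.
validate_master_control() {
    awk -F',' '
        NF != 4 {
            printf "line %d: expected 4 columns, found %d\n", NR, NF
            bad = 1
            next
        }
        $1 !~ /^SRR[0-9]+$/ {
            printf "line %d: unexpected SRR code: %s\n", NR, $1; bad = 1
        }
        $2 !~ /^[A-Za-z0-9]+_S[0-9]+R[0-9]+(r[0-9]+)?$/ {
            printf "line %d: tag is not name_SXRY or name_SXRYrZ: %s\n", NR, $2; bad = 1
        }
        $3 != "PE" && $3 != "SE" {
            printf "line %d: layout must be PE or SE: %s\n", NR, $3; bad = 1
        }
        $4 != "total" && $4 != "mRNA" {
            printf "line %d: preparation must be total or mRNA: %s\n", NR, $4; bad = 1
        }
        END { exit bad + 0 }
    ' "$1"
}
```

For example, validate_master_control controls/master_control.csv prints nothing and succeeds when every row matches the expected layout.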

You do not need to edit any further files. Snakemake will pull the configurations you have set up in the snakemake_config.yaml file.
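
As a final sanity check, the file paths in the configuration can be tested for existence. The sketch below assumes each value is written on one line in double quotes, as in the example configuration above, and that it is run from the directory the relative paths refer to. config_value and check_config_paths are hypothetical helpers. They check all six paths regardless of which PERFORM_ options are enabled, so a missing MASTER_CONTROL may be acceptable when PERFORM_PREFETCH is False.

```shell
# config_value KEY FILE: print the double-quoted value that follows "KEY:".
# This is a simple text match, not a YAML parser (hypothetical helper).
config_value() {
    sed -n "s/^[[:space:]]*$1:[[:space:]]*\"\([^\"]*\)\".*/\1/p" "$2" | head -n 1
}

# check_config_paths FILE: report whether each configured input path exists.
# Returns 0 when all six paths exist and 1 otherwise (hypothetical helper).
check_config_paths() {
    local config="$1"
    local status=0
    local key path
    for key in MASTER_CONTROL REF_FLAT_FILE RRNA_INTERVAL_LIST BED_FILE \
               GENOME_FASTA_FILE GTF_FILE; do
        path="$(config_value "$key" "$config")"
        if [ -n "$path" ] && [ -e "$path" ]; then
            echo "ok:      $key -> $path"
        else
            echo "missing: $key -> ${path:-not set}"
            status=1
        fi
    done
    return "$status"
}
```

Pass the path of your workflow configuration file as the only argument to check_config_paths.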


Once these steps are complete, the workflow should be prepared to execute. Continue to the next page to execute the workflow.



Go back to Download
Go forward to Running
