This repo contains files that will run `NLRtracker.R` and its dependencies on the TSL HPC as a snakemake pipeline. It replaces (and never runs) the previous wrapper `NLRtracker.sh`.
Snakemake is a workflow manager. Crucially, it can restart where it left off, so if one step of the pipeline fails, the earlier steps do not all need to be re-run. It works out and manages the steps that need to run for you. This makes it very useful for bioinformatics pipelines, especially this one, where we want to divide some steps into many similar small ones.
The pipeline takes multiple sets of protein sequences and then, in parallel for each set:
1. Runs `interproscan.sh`
2. Runs `hmmer`
3. Runs `fimo`
Once these are done, `NLRtracker.R` is run and a results folder is created for each sample.
The pipeline begins by removing asterisks, splitting the protein file into chunks of a specified size and running each chunk as a separate job on the HPC. Once all the jobs are complete, the results are compiled into one large output.
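For illustration, here is a minimal sketch of roughly what the clean-and-split step does, using standard shell tools; the file names and chunk naming are hypothetical, and the pipeline's actual implementation may differ:

```bash
# Remove asterisks (stop-codon characters), which downstream tools can reject
sed 's/\*//g' seqs1.fa > seqs1_clean.fa

# Split the cleaned FASTA into chunks of at most 6000 sequences each
# (chunk_0.fa, chunk_1.fa, ...), ready to run as separate HPC jobs
awk -v size=6000 '
  /^>/ && (n++ % size == 0) { if (out) close(out); out = sprintf("chunk_%d.fa", c++) }
  { print > out }
' seqs1_clean.fa
```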
All other steps are run individually. Steps run as soon as they are able (that is, as soon as the files they rely on are ready), so `fimo` and `hmmer` run alongside the `interproscan` jobs, but `NLRtracker.R` must wait until all of its inputs are ready. Snakemake manages this for you.
The pipeline is just a collection of scripts: pull this repo to a folder in your home area and `cd` into that folder.
The file `config.yaml` contains job info and needs to be filled in. The pipeline looks for it automatically. It is brief:

```yaml
scratch: "/tsl/scratch/user_name/nlrsnake/"
sample_fasta_file: "/hpc-home/user_name/samples_to_fasta.csv"
seqs_per_file: 6000
```
* `scratch` is the path of a temporary working directory for all steps of the pipeline. A subdirectory will be created for each sample.
* `sample_fasta_file` is the name of a CSV file linking input sequences to sample names.
* `seqs_per_file` is the number of sequences per chunk for `interproscan`; 6000 is a good number.
Try to use absolute (full-length) paths in the `config.yaml` file.
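If you are unsure of a file's absolute path, `realpath` (standard on most Linux systems) will print it for you:

```bash
# Print the absolute path of a file, ready to paste into config.yaml
realpath samples_to_fasta.csv
```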
The format is `sample_name,path`, one per line. No blank lines at the end of the file. No spaces. Alphanumeric characters and underscores only.

```
sample_1,/hpc-home/user_name/seqs1.fa
```
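As an optional sanity check (not part of the pipeline), something like the following will flag lines that break those rules:

```bash
# Report lines containing spaces and blank lines, with line numbers;
# no output means the file is clean
grep -n -e ' ' -e '^$' /hpc-home/user_name/samples_to_fasta.csv
```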
Submitting the snakemake pipeline to the cluster is done with the script `scripts/do_nlrtrack.sh`.
1. See the help - `bash scripts/do_nlrtrack.sh -h`
2. Do a dryrun (check what remains of the pipeline to run and that everything is in place) - `bash scripts/do_nlrtrack.sh dryrun`
3. Submit the pipeline - `bash scripts/do_nlrtrack.sh`
The pipeline creates a master job called `nlrtrack`, which in turn creates subjobs; the master job persists for as long as the pipeline runs. Output from the main process goes into a file `nlrtrack.log`. Other job output goes into job-id-specific `slurm**.out` files.
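To keep an eye on progress, the usual SLURM and shell tools work; for example (substituting your own username):

```bash
squeue -u user_name      # list your running jobs, including the nlrtrack master job
tail -f nlrtrack.log     # follow the master process output as it arrives
```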
In this pipeline, all `NLRtracker.R` output goes into the folder `nlrtracker_out`. The other folders (`results`, `ipro_fa` and `ipro_gff`) hold interim data files and can be discarded.
If the pipeline fails (usually because some file wasn't created somewhere), then once a fix is made, it can be restarted from the point it failed. Just redo the submission step:

```bash
bash scripts/do_nlrtrack.sh
```

The pipeline will carry on from as far as it got previously, so if you need to redo earlier steps, you need to delete their output files.
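For example, to force one step to re-run, you can delete its output before resubmitting. A sketch, with a hypothetical output path (the real one depends on your `scratch` setting):

```bash
# Delete the output of the step to redo (path is illustrative only)
rm /tsl/scratch/user_name/nlrsnake/sample_1/hmmer_out.txt

# Check what snakemake now plans to run, then resubmit
bash scripts/do_nlrtrack.sh dryrun
bash scripts/do_nlrtrack.sh
```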