- Information on how to install NextFlow can be found here.
- Information on how to install Singularity cna be found here
git clone https://github.com/UK-SBCoA-EbbertLab/cDNA_pipeline
- Alter slurm job manager (or other job manager) parameters to suit your local environment. I don't recommend changing the memory/cpu/time allocated for the job manager.
5) Make sure you have all the sequencing files and reference genomes/assemblies files and annotation files you will need to run the pipeline.
-
".fastq" -- Nanopore cDNA sequencing files or ".bam" alignment files. Could also be the small ".fastq.gz" files for each sample, see the
--path
option below if you are processing files as they are outputted from basecalling in an Oxford Nanopore Sequencer. Could also be ".pod5" or ".fast5" files if you are starting from the basecalling step. -
"sequencing_summary.txt" -- These files are not necessary for execution, but if not available the PycoQC quality control step will be skipped.
-
refecence/assembly ".fa" file.
-
annotation ".gtf" file is preffered. Only use ".gff3" if using CHM13. Pipeline has an option to handle this, see
Pipeline parameters for STEP 2
.
Many of the parameters for this step are based on dorado basecaller, see their documentation to understand it better.
PS: I took care to pass the base modification information to the FASTQ headers so that it is still contained even in the fastq files. I prefer outputting fastq files because it allows more flexibility in downstream alignment and processing.
--step <"1". Performs step 1>
--basecall_path <path to base directory containing all fast5 and/or pod5 files you want to basecall.
It will automatically separate samples based on naming conventions and
directory structure. example: /sequencing_run/">
--basecall_speed <"fast", "hac", "sup". Default = "hac">
--basecall_mods <Comma separated list of base modifications you want to basecall. See dorado docs
for more information. Example: "5mCG_5hmCG,5mC_5hmC,6mA". Default: "False">
--basecall_compute <"gpu", "cpu". Default: "gpu". Allows users to choose to basecall with
CPU or GPU. CPU basecalling is super slow and should only be used for small
test datasets when GPUs are not available. Also, use --basecall_speed "fast"
when basecalling with CPU to make it less slow. Default: "gpu">
--basecall_config <configuration name for basecalling setting. This is not necessary since dorado
is able to automatically determine the appropriate configuration. When set to "false"
the basecaller will automatically pick the basecall configuration.
Example: "dna_r9.4.1_e8_hac@v3.3". Default: "false">
--basecall_trim <"all", "primers", "adapters", "none". Default: "none". Note that as of December 2023
dorado always trims the adapters during basecalling for directRNA data>
--qscore_thresh <Mean quality score threshold for basecalled reads. Default: 9>
--demux <"True", "False". Whether you want the data to be demultiplexed
setting it to "True" will perform demultiplexing>
--trim_barcodes <"True", "False". Only relevant is --demux is set to "True".
if set to "True" barcodes will be trimmed during demultiplexing
and will not be present in output "fastq" files>
--prefix <Will add a prefix to the beggining of your filenames, good
when wanting to keep track of batches of data.
Example: "Batch_1". Default value is "None" which does not add any prefixes>
--out_dir <Name of output directory. Output files/directories will be output to
"./results/<out_dir>/". Default: "output_directory">
--step <"2". Performs step 2>
--ont_reads_fq <path to fastq sequencing data, can submit multiple at once using pattern "/path/*.fastq". If you don't specify this
parameter the pipeline will not run. An exception is if you specify a ".bam" file input, see Pipeline parameters for
STEP 2 from BAM below. Default: "None">
--ont_reads_txt <path to sequencing summary files, can submit multiple at once. Make sure they follow same naming pattern as fastq
files. If a fastq file is named "sample_1.fastq" then its sequencing summary file should be "sample_1.txt"
or "sample_1_sequencing_summary.txt". If fastq files and sequencing summary files don't follow the same naming pattern
they might be paired erroneously, which will lead to erronous results from quality control steps. If you don't specify
this parameter pipeline will run, but skip PycoQC quality control step. Default: "None">
--path <path to ONT files as exported by the basecaller from the PromethION, GridION, of MinION devices. It takes a path to one or
multiple samples. For example if you had two samples in "/dir1/dir2/sample_1/" and "/dir1/dir2/sample_2/" you would pass
"/dir1/dir2/" to have both samples be processed. Using the "path" parameter will make the pipeline unzip all the smaller
".fastq.gz" files for each sample and concatenate them into one ".fastq" file for each sample. This parametes will also
automatically search for the "sequencing_summary.txt" files for each sample, this file must be present for all samples
for this parameter to work properly. If you pass this path parameter, you don't have to pass the "ont_reads_txt" and the
"ont_reads_fastq" parameters, in fact those will get ignored even if you pass them. The goal of this parameter is to
make it easier on the user so that you don't have to manually unzip and concatenate files. It also automatically
matches the naming pattern for the fastq and the sequencing summary file. This parameter assumes that a directory with
the name of the sample is two directories below the directory, where the sequencing summary file is, for example:
"/dir1/dir2/sample_1/unique_flowcell_id/seq_summary.txt". The pipeline will still work if this is not the case and each
sample file will still have a unique ID, but the unique sample ID will be preappended with the whatever name the
directory two steps below the sequencing_summary.txt file is named. For example, if your directory structure is:
"/dir1/dir2/dir3/unique_flowcell_id/sequencing_summary.txt", your files will be named "dir3_unique_flowcell_id.fastq"
and "dir3_unique_flowcell_id.txt" (sequencing summary file). The ".fastq.gz" files must be in a subdirectory located in the
same directory as the sequencing_summary.txt file, for example: "/dir1/dir2/sample_1/unique_flowcell_id/seq_summary.txt",
"/dir1/dir2/sample_1/unique_flowcell_id/fastq_pass/", "/dir1/dir2/sample_1/unique_flowcell_id/fastq_fail/". Default: "None">
--ref <path to reference/assembly ".fa" file. if using ERCC make sure to concatenate it to the end of the file.
Default: "None">
--annotation <path to reference annotation ".gtf" file for GRCh38 or ".gff3" for CHM13. If using GRCh38 and ERCC concatenate the
two ".gtf" files prior to running the pipeline. If using ERCC with CHM13 make sure to set --is_chm13 to "True" and
set --ercc to the path of your ERCC ".gtf" file. Default: "None">
--out_dir <name of output directory for pipeline submission. Will appear under "results/<out_dir>" on the directory the pipeline
was submitted from Default: "output_directory/">
--ercc <path to ERCC annotation file. Only needed if using CHM13 reference + GFF3 annotation file and adding ERCC. Otherwise
do not specify this parameter. Default: "None">
--cdna_kit <option for pychopper trimming using adapters from the specific cDNA library version, current options are "PCS109",
"PCS110", "PCS111", "PCS114", "LSK114". Default: "PCS111">
--is_chm13 <logical, set to "True" if using CHM13 and "False" if not. Fixes CHM13 annotation for compatibility with Bambu and
converts to ".gtf". Default: "False">
--housekeeping <path to bed file containing coordinates for housekeeping genes used by RSeQC quality control tool. Currently only
supported if --is_chm13 is set to "False". You can find the bed file with housekeeping gene coordinates for GRCh38
here: https://sourceforge.net/projects/rseqc/files/BED/Human_Homo_sapiens/hg38.HouseKeepingGenes.bed.gz/download
Default: "None">
--mapq <Integer, set it to the number you want to be used to filter ".bam" file by mapq. --mapq 10 filters out reads with MAPQ
< 10. set it to 0 if don't want to filter out any reads. Default: 0>
--is_dRNA <Logical, set to "True" if you want to run the minimap2 step with the parameters for "noisy" Nanopore directRNAseq reads.
Set to "False" if you want to run it with minimap2 step for PCR amplified Nanopore cDNA reads. If you set this parameter
to "True" you should omit the "--cdna_kit" parameter as that is only used for the Pychopper step of the pipeline and that
step is skipped for directRNAseq. Note that this is the only step in the pipeline that is specific for
DirectRNAseq, Step 2 from BAM and Step 3 are not modified for dRNA vs cDNA. Step 1 will need the
specific basecalling configuration for dRNA instead of cDNA if you are running dRNA. Default: "False">
--trim_dRNA <Logical, set to "True" if you want to trim dRNA adapters using porechop_ABI.
It will automatically find the adapters in your dRNA adapters, see this link:
https://github.com/bonsai-team/Porechop_ABI/issues/19 for caveats.
Default: "False">
--contamination_ref <Path to a ".fasta" reference containing any contaminants you wish to align the unmapped reads against. For a
reference containing the contaminants we use in house got to the "contamination_reference_doc" folder in
this GitHub or to this link on Zenodo: https://zenodo.org/deposit/8350277. If you pass a contamination
reference a report will be generated with multiQC showing what percent of reads mapped to a contaminant.
For further detail go into the "contamination_report/" folder under your pipeline execution results
to find out which were the most abundant contaminants in each of your samples.>
--qscore_thresh <Minimum mean base quality from basecalled sequence. For example, when set at 9 all reads with
mean base quality below 9 will be filtered out from the fastq file. IMPORTANT NOTE: There is a good chance
that the fastq file you are using only contains "pass" reads, which are usually Qscore >= 9, therefore filtering by
a Qscore < 9 will still result in only having reads with Qscore >= 9. If you want to filter by a lower Qscore make sure
your fastq file has not been previously filtered. Default: 9>
--track_reads <logical, set to "True" if you want Bambu to keep track of read assignments to transcripts in the output ".RDS" file
from Bambu. Set to "False" if you don't need to keep track of read assignments (smaller files). Default: "False">
--prefix <Will add a prefix to the beggining of your filenames, good
when wanting to keep track of batches of data.
Example: "Batch_1". Default value is "None" which does not add any prefixes>
--rseqc_tin <logical, set to "True" if you want to perform RSEQC TIN analysis. This is a time consuming
step. Setting it to "True" will only work if you also have passed a reference to the
"housekeeping" parameter. Default: "False">
--step <"2". Performs step 2>
--bam <path to ".bam" alignment files, using this will supersed any ".fastq" input>
--bai <path to ".bai" alignment index files, using this will supersed any ".fastq" input>
--ref <path to reference/assembly ".fa" file. if using ERCC make sure to concatenate it to the end of the file.
Default: "None">
--annotation <path to reference annotation ".gtf" file for GRCh38 or ".gff3" for CHM13. If using GRCh38 and ERCC concatenate the two
".gtf" files prior to running the pipeline. If using ERCC with CHM13 make sure to set --is_chm13 to "True" and set
--ercc to the path of your ERCC ".gtf" file. Default: "None">
--out_dir <name of output directory for pipeline submission. Will appear under "results/<out_dir>" on the directory the pipeline
was submitted from. Default: "output_directory/">
--ercc <path to ERCC annotation file. Only needed if using CHM13 reference + GFF3 annotation file and adding ERCC.
Otherwise do not specify this parameter. Default: "None">
--is_chm13 <logical, set to "True" if using CHM13 and "False" if not. Fixes CHM13 annotation for compatibility with Bambu and
converts to ".gtf". Default: "False">
--track_reads <logical, set to "True" if you want Bambu to keep track of read assignments to transcripts in the output ".RDS" file
from Bambu. Set to "False" if you don't need to keep track of read assignments (smaller files). Default: "False">
--prefix <Will add a prefix to the beggining of your filenames, good
when wanting to keep track of batches of data.
Example: "Batch_1". Default value is "None" which does not add any prefixes>
--mapq <integer, set it to the number you want to be used to filter ".bam" file by mapq. --mapq 10 filters out reads with
MAPQ < 10. Set it to 0 if don't want to filter out any reads. Default: 0>
--bambu_rds <path to individually pre-processed bambu RDS objects (output from step 1). Default: "None"
--ref <path to reference/assembly ".fa" file. if using ERCC make sure to concatenate it to the end of the file.
Default:"None">
--annotation <path to reference annotation ".gtf" file for GRCh38 or ".gff3" for CHM13. If using GRCh38 and ERCC concatenate the two
".gtf" files prior to running the pipeline. If using ERCC with CHM13 make sure to set --is_chm13 to "True" and set
--ercc to the path of your ERCC ".gtf" file. Default: "None">
--is_chm13 <logical, set to "True" if using CHM13 and "False" if not. Fixes CHM13 annotation for compatibility with Bambu and
converts to ".gtf". Default: "False">
--fai <path to reference index ".fai" file>
--out_dir <name of output directory for pipeline submission. Will appear under "results/<out_dir>" on the directory the pipeline
was submitted from. Default: "output_directory/">
--is_discovery <Logical, if "True" perform transcript discovery and quantification with Bambu, else if "False" perform only
quantification based on the given GTF annotation>
--NDR <NDR parameter for performing transcript discovery with Bambu. Values can range from 0-1, the closer to 0 the more
stringent the transcript discovery is. Meaning you will discover less transcripts but be more certain that they are
real. "auto" option lets Bambu determine which NDR threshold to use based on the data. Authors of the tool recommend
"auto" for most situations. Default: "auto">
--track_reads <logical, set to "True" if you want Bambu to keep track of read assignments to transcripts in the output ".RDS" file
from Bambu. Set to "False" if you don't need to keep track of read assignments (smaller files). Default: "False">
--intermediate_qc <path to directory containing intermediate quality control output from pipeline step 2.
Example: "/path/intermediate_qc_reports/" Default: "None">
--multiqc_input <path to directory containing multiqc input data from pipeline step 2. Use path and add **.
Example: /path/multiqc_input/** Default: "None">
--multiqc_config <path to multiqc ".yaml" config file. Default: "None". PS: You can find an example of a multiqc config file that works
for this pipeline under /cDNA_pipeline/workflow/bin/multiqc_config.yaml>
nextflow ../main.nf --step 1 --basecall_path ../../test_dRNA_data/basecalling_2/" \
--basecall_speed "fast" \
--basecall_mods "False" \
--basecall_config "False" \
--basecall_trim "none" \
--basecall_compute "cpu" \
--basecall_demux "True" \
--outdir "test_basecall_dRNA_1"
nextflow ../main.nf --step 2 \
--ont_reads_fq "../ont_data/test_data/*.fastq" \
--ont_reads_txt "../ont_data/test_data/*.txt" \
--ref "../../references/chm13v2.0.fa" \
--annotation "../../references/CHM13.v2.0.gff3" \
--ercc "None" \
--out_dir "./CHM13_test/" \
--cdna_kit "PCS111" \
--is_chm13 "True" \
--track_reads "False" \
--mapq "0"
nextflow ../main.nf --step 2 \
--ont_reads_fq "../ont_data/test_data/*.fastq" \
--ont_reads_txt "../ont_data/test_data/*.txt" \
--ref "../../references/chm13v2.0_ERCC.fa" \
--annotation "../../references/CHM13.v2.0.gff3" \
--ercc "../../references/ERCC92.gtf" \
--out_dir "./CHM13_ERCC_test/" \
--cdna_kit "PCS111" \
--is_chm13 "True" \
--track_reads "False" \
--mapq "0"
Notice that for CHM13 you need to concatenate CHM13 and the ERCC reference prior to submitting the pipeline, but the annotations are entered separately and concatenated by the program itself after converting CHM13 annotation to ".gtf" format.
nextflow ../main.nf --step 2 \
--ont_reads_fq "../ont_data/test_data/*.fastq" \
--ont_reads_txt "../ont_data/test_data/*.txt" \
--ref "../../references/Homo_sapiens.GRCh38.dna.primary_assembly.fa" \
--annotation "../../references/Homo_sapiens.GRCh38.106.gtf" \
--out_dir "./GRCh38_test/" \
--cdna_kit "PCS111" \
--is_chm13 "False" \
--track_reads "False" \
--mapq "0" \
--housekeeping "../../references/hg38.HouseKeepingGenes.bed"
nextflow ../main.nf --step 2 \
--ont_reads_fq "../ont_data/test_data/*.fastq" \
--ont_reads_txt "../ont_data/test_data/*.txt" \
--ref "../../references/Homo_sapiens.GRCh38_ERCC.fa" \
--annotation "../../references/Homo_sapiens.GRCh38.106_ERCC.gtf" \
--out_dir "./GRCh38_ERCC_test/" \
--cdna_kit "PCS111" \
--is_chm13 "False" \
--track_reads "False" \
--mapq "0" \
--housekeeping "../../references/hg38.HouseKeepingGenes.bed"
nextflow ../main.nf --step 2 \
--ont_reads_fq "../ont_data/test_data/*.fastq" \
--ont_reads_txt "../ont_data/test_data/*.txt" \
--ref "../../references/Homo_sapiens.GRCh38_ERCC.fa" \
--annotation "../../references/Homo_sapiens.GRCh38.106_ERCC.gtf" \
--out_dir "./GRCh38_ERCC_test/" \
--cdna_kit "PCS111" \
--is_chm13 "False" \
--track_reads "False" \
--mapq "0" \
--housekeeping "../../references/hg38.HouseKeepingGenes.bed" \
--contamination_ref "../../references/contamination_ref.fasta"
Example for step 2 (cDNA): GRCh38 with ERCCs using path parameter instead of --ont_reads_fq and --ont_reads_txt
nextflow ../main.nf --step 2 \
--path "../ont_data/" \
--ref "../../references/Homo_sapiens.GRCh38_ERCC.fa" \
--annotation "../../references/Homo_sapiens.GRCh38.106_ERCC.gtf" \
--out_dir "./GRCh38_ERCC_test/" \
--cdna_kit "PCS111" \
--is_chm13 "False" \
--track_reads "False" \
--mapq "0" \
--housekeeping "../../references/hg38.HouseKeepingGenes.bed"
nextflow ../main.nf --ont_reads_fq "../../data/ont_data/2023-06-06_brain_directRNA_intronic_reads/*.fastq" \
--ont_reads_txt "../..//data/ont_data/2023-06-06_brain_directRNA_intronic_reads/*.txt" \
--ref "/../../references/Homo_sapiens.GRCh38_ERCC.fa" \
--annotation "../../references/Homo_sapiens.GRCh38.107_ERCC.gtf" \
--housekeeping "../../references/hg38.HouseKeepingGenes.bed" \
--out_dir "./GRCh38_dRNA_EERCC_test/" \
--bambu_track_reads "True" \
--is_dRNA "True" \
--trim_dRNA "True" \
--mapq "0" \
--step "2" \
--is_chm13 "False"
Notice that for GRCh38 the --ercc
is not needed as the user can easily concatenate both the GRCh38 reference and the annotation to the ERCC reference and annotation prior to running the analysis.
nextflow ../main.nf --step 2 \
--bam "./results/GRCh38_ERCC_test/mapping_cDNA/*.bam" \
--bai "./results/GRCh38_ERCC_test/mapping_cDNA/*.bai" \
--ref "../../references/Homo_sapiens.GRCh38_ERCC.fa" \
--annotation "../../references/Homo_sapiens.GRCh38.106_ERCC.gtf" \
--out_dir "./GRCh38_ERCC_test_mapq10_track_reads/" \
--cdna_kit "PCS111" \
--is_chm13 "False" \
--track_reads "True" \
--mapq "10"
nextflow ../main.nf --step 3 \
--bambu_rds "./results/GRCh38_ERCC_test/bambu_prep/*.rds" \
--ref "../../references/Homo_sapiens.GRCh38_ERCC.fa" \
--fai "./results/GRCh38_ERCC_test/fai/*.fai" \
--annotation "../../references/Homo_sapiens.GRCh38.106_ERCC.gtf" \
--is_discovery "True" \
--track_reads "False" \
--NDR "auto" \
--multiqc_input "./results/GRCh38_ERCC_test/multiQC_input/**" \
--multiqc_config "../../references/multiqc_config.yaml" \
--out_dir "./GRCh38_ERCC_test/" \
--intermediate_qc "./results/GRCh38_ERCC_test/intermediate_qc_reports/" \
--is_chm13 "False"
nextflow ../main.nf --step 3 \
--bambu_rds "./results/GRCh38_ERCC_test/bambu_prep/*.rds" \
--ref "../../references/Homo_sapiens.GRCh38_ERCC.fa" \
--fai "./results/GRCh38_ERCC_test/fai/*.fai \
--annotation "../../references/Homo_sapiens.GRCh38.106_ERCC.gtf" \
--is_discovery "True" \
--track_reads "False" \
--NDR "auto" \
--multiqc_input "./results/GRCh38_ERCC_test/multiQC_input/**" \
--multiqc_config "../../references/multiqc_config.yaml" \
--out_dir "./GRCh38_ERCC_test/" \
--is_chm13 "False" -resume