Vignette #1 : A "complete" run

Mike Axtell edited this page Sep 12, 2024 · 7 revisions

A "full" run of ShortStack uses all of the possible steps and produces all output files except for bigwig (bw) files (see figure below).

[Figure: ShortStack workflow diagram (shortstack_workflow_grey_bw)]

Inputs

  • Genome sequence in FASTA format, not yet indexed for either bowtie or samtools.
  • One or more untrimmed FASTQ files of small RNA-seq data.
    • sRNA-seq FASTQ must be single-end, stranded, forward read, such that the first nucleotide of each read is the first nucleotide of the biological small RNA. Most reads are expected to be longer than the small RNA such that the sequencing adapters need to be trimmed from the 3' ends of the reads.
  • A set of known mature microRNA sequences that the user wishes to search for (FASTA format, one sequence line per miRNA).

Example data

This can be obtained from many places, including my server here:

Arabidopsis thaliana genome

curl https://plantsmallrnagenes.science.psu.edu/ath-b10/Arabidopsis_thalianaTAIR10.fa -ko Arabidopsis_thalianaTAIR10.fa

Example small RNA-seq data from Arabidopsis

First install NCBI's sra-tools. Assuming that conda is installed and the Bioconda channel is properly set up (see here), you can create a new environment for sra-tools like so:

conda create --name sra-tools sra-tools

Then, the three files can be retrieved from SRA like so:

conda activate sra-tools
prefetch SRR3222443 SRR3222444 SRR3222445
fasterq-dump SRR3222443 SRR3222444 SRR3222445
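
Before moving on, it can be worth confirming that the downloads produced usable FASTQ files. The following is a minimal sketch, not part of sra-tools or ShortStack: `check_fastq` is a hypothetical helper, and it assumes that fasterq-dump wrote uncompressed, four-lines-per-read FASTQ files named after the accessions into the current directory (the same file names used in the ShortStack command further down this page).

```shell
# check_fastq: hypothetical helper (not part of sra-tools or ShortStack).
# Reports the read count of a FASTQ file, assuming four lines per read.
check_fastq() {
    fq=$1
    if [ ! -s "$fq" ]; then
        echo "$fq: missing or empty"
        return 1
    fi
    # Arithmetic expansion strips any padding that wc adds on some systems.
    n_lines=$(( $(wc -l < "$fq") ))
    if [ $(( n_lines % 4 )) -ne 0 ]; then
        echo "$fq: $n_lines lines, not a multiple of 4"
        return 1
    fi
    echo "$fq: $(( n_lines / 4 )) reads"
}

# Check each expected output file; print a hint if one looks wrong.
for f in SRR3222443.fastq SRR3222444.fastq SRR3222445.fastq; do
    check_fastq "$f" || echo "  -> consider re-running prefetch / fasterq-dump for this accession"
done
```

This only checks that each file exists and has a plausible line count. It does not validate read names, sequences, or quality strings.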

Be sure to switch back to your conda ShortStack environment. For instance, if you named your ShortStack environment "ShortStack4" as suggested in the installation instructions, then:

conda activate ShortStack4

Known microRNAs

To get a list of known miRNAs, we will use all of the mature miRNAs annotated in miRBase. First, download the mature.fa file from miRBase at https://mirbase.org/download/. Then filter it to keep only the ath entries (i.e., the ones from A. thaliana).

grep -A 1 '>ath' mature.fa | grep -v '\-\-' > ath_known_miRNAs.fasta
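
To see why the second grep is needed, here is a small sketch that runs the same kind of filter on a toy file. The records in `toy_mature.fa` are made up for illustration and are not real miRBase entries. The separator pattern is written as `'^--$'`, which is assumed to behave the same as `'\-\-'` for this purpose. When matching records are not adjacent, `grep -A 1` prints a `--` line between groups, and that line must be removed to leave valid FASTA.

```shell
# Build a tiny stand-in for mature.fa (hypothetical records, not real miRBase data).
# The two ath records are deliberately non-adjacent so that grep -A 1 emits "--".
workdir=$(mktemp -d)
cat > "$workdir/toy_mature.fa" <<'EOF'
>ath-toy-1 hypothetical record
UGACAGAAGAGAGUGAGCAC
>osa-toy-1 hypothetical record
UGGAGAAGCAGGGCACGUGCA
>ath-toy-2 hypothetical record
UUGACAGAAGAUAGAGAGCAC
EOF

# Keep each ath header plus the sequence line after it, then drop "--" separators.
grep -A 1 '>ath' "$workdir/toy_mature.fa" | grep -v '^--$' > "$workdir/toy_ath.fasta"

# With one sequence line per miRNA, header and sequence counts should be equal.
n_headers=$(grep -c '^>' "$workdir/toy_ath.fasta")
n_seqs=$(grep -vc '^>' "$workdir/toy_ath.fasta")
echo "headers=$n_headers sequences=$n_seqs"
```

The same two counts can be computed on the real ath_known_miRNAs.fasta as a quick check that the file has one sequence line per miRNA.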

Execute the full run

In this example 6 threads are requested. Adjust this based on your system.

ShortStack --genomefile Arabidopsis_thalianaTAIR10.fa --readfile SRR3222443.fastq SRR3222444.fastq SRR3222445.fastq --autotrim --threads 6 --outdir ExampleShortStackRun --known_miRNAs ath_known_miRNAs.fasta --dn_mirna
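
To pick a thread count without hard-coding it, one option is to ask the operating system how many cores are available. This is a sketch under assumptions: `nproc` (GNU coreutils) and `getconf _NPROCESSORS_ONLN` are general system utilities, not part of ShortStack, and leaving one core free is only a common convention, not a ShortStack recommendation.

```shell
# Detect the number of available CPU cores; fall back to 1 if detection fails.
cores=$(nproc 2>/dev/null || getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)

# Leave one core free for the rest of the system, but never go below 1 thread.
threads=$(( cores > 1 ? cores - 1 : 1 ))

echo "Would pass: --threads $threads"
```

The resulting value can then be supplied as `--threads "$threads"` in place of the 6 used in the command above.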

Command details

  • --autotrim tells ShortStack that the reads from --readfile need to be adapter trimmed, and that ShortStack should infer the adapter sequence. This is the recommended way to trim the adapters off of sRNA-seq data.
  • --known_miRNAs ath_known_miRNAs.fasta guides ShortStack in separating MIRNA loci from other types of small RNA clusters. The sequences in this file are first mapped to the reference genome, keeping all exact matches. For each aligned position, ShortStack first checks whether any sRNA-seq reads align at that genomic position. If not, there is no evidence for microRNA expression and the search stops there. If expressed reads match the known microRNA, ShortStack examines the predicted secondary structure of a potential precursor transcript. If the secondary structure conforms to accepted patterns (see, for instance, Axtell and Meyers, 2018), ShortStack then looks for microRNA / microRNA* accumulation patterns in the aligned sRNA-seq data. If those patterns also satisfy the annotation criteria (Axtell and Meyers, 2018), the locus is annotated as a MIRNA.
  • --dn_mirna tells ShortStack to also search for de novo MIRNA loci, that is, loci that do not correspond to any of the sequences provided via --known_miRNAs. The same secondary structure and expression criteria apply to these annotations.