This analysis pipeline takes de-multiplexed, single- or paired-end, directional RNA-seq reads, tagged with the PCR duplicates’ marker, and produces a map of where these reads originated from in the genome.
STEP1
Fastq files, produced from an Illumina run with the our custom-made adaptors, demoultiplexed are read in, and any residual rRNA contained in this digital sample is removed. The report of this alignment can be found at /scratch/NetID/NNNNNNN.dRNAseq1end.joberr
(or /scratch/NetID/NNNNNNN.dRNAseq2end.joberr
for paired-end)
STEP2
The cleaned-up reads are now aligned with optimized parameters for yeast RNA-seq experiments. Spliced reads are not taken into account in this step. The report of this alignment can be found at /scratch/NetID/NNNNNNN.dRNAseq1end.joberr
(or /scratch/NetID/NNNNNNN.dRNAseq2end.joberr
for paired-end)
STEP3
The few (comparatively) reads that failed to align in the previous step is now tested for splice intron-exon junctions and aligned taking splicing into account. Whatever read still fails to align after this step, is reported in {name}_unaligned_1.fastq
and {name}_unaligned_2.fastq
.
STEP4
The aligned reads from the two processes are concatenated into a single alignment file wich is further trimmed by removing reads with poor mapping quality (MAPQ<15, can not be changed by the user at the moment. If you have concerns you could slightly modify the relevant samtools script within the RNAseq_1end_looseRibo.pbs
or RNAseq_2end_looseRibo.pbs
and save it in a private folder.
STEP5
The last step is the removal of possible PCR duplicates from the report and has three outputs:
a. Possible PCR duplicates are not removed. These alignment report are named {name}_MAPQ15.sam
(the .bam version is also available) and can imported into cufflinks and/or Rsubread/EdgeR for further analysis.
b. Possible PCR duplicates removed with the long-published “coordinates” method. These alignment files are {name}_MAPQ15.noDUP.sam
(the .bam version is also available)
c. [Recommended] Possible PCR duplicates removed with our lab’s new method with custom-made adaptors incorporating PCR-indexes. These alignment files are {name}.rmdup.sam
(the .bam version is also available) .If the “sorted” version is required for certain downstream analysis software use the {name}.rmdup.sort
sam (the .bam version is also available) file.
-
The read files must be de-multiplexed (DGseq demultiplexing) and have the
.fastq
suffix. Everything that precedes the fastq suffix is the name that will identify all the output files produced. -
Log in your NYU HPC account and create a folder containing all the demultiplexed read files you want to analyze. This folder will be the
/source/folder
. -
(Optional) Make a second folder that will become the
/destination/folder
. Alternatively, you can use the same directory every time you are asked for either the/source/folder
or/destination/folder
. -
Go to the folder /data/cgsb/gresham/LABSHARE/SHARED/CODE/. You need to be listed as a member of the Gresham lab, registered with the NYU HPC team managing /data/cgsb/, to be able to read, write or execute the programs in this folder.
-
To submit the job and start the analysis of your results type:
- Single-end reads
qsub –M {your NetID}@nyu.edu RNAseq_1end.pbs –v source={source/folder},destination={ destination/folder}, ref={reference genome}, name={name}
-M [Optional] Enables automatic emails, upon initiation, termination or if the run is aborted (an exit status=0 indicates the job was terminated without problems).
-v There should not be any spaces between n the commas
{source/folder} This is the folder you created in step 2).
{destination/folder} [Optional] This is the folder you created in step 3).
{reference genome} Currently only yeast is supported. The possible values are Scer3 (w/o spike-ins), ScERCC (w/ ERCC spike-ins).
{name} This is everything that precedes the .fastq suffix. This name will follow and identify all the results concerning this sample.
- Paired-end reads
qsub –M {your NetID}@nyu.edu RNAseq_2end.pbs –v source={source/folder},destination={ destination/folder}, ref={reference genome}, name={name}
-M [Optional] Enables automatic emails, upon initiation, termination or if the run is aborted (an exit status=0 indicates the job was terminated without problems).
-v There should not be any spaces between the commas
{source/folder} This is the folder you created in step 2).
{destination/folder} [Optional] This is the folder you created in step 3).
{reference genome} Currently only yeast is supported. The possible values are Scer3 (w/o spike-ins), ScERCC (w/ ERCC spike-ins).
{name} This is everything that precedes the _R1.fastq or _R2.fastq suffix. This name will follow and identify all the results concerning this sample.
-
Two folders within the
/scratch/NetID
of the logged-in user, in the following format:NNNNNNN.dRNAseq_1end.jobout
andNNNNNNN.dRNAseq_1end.joberr
for single-end reads orNNNNNNN.dRNAseq_2end.jobout
andNNNNNNN.dRNAseq_2end.joberr
for paired-end reads. In the previous example, NNNNNNN is the unique job ID allocated automatically by the scheduler. These files contain some useful records of the analysis process and possible alerts. -
At the assigned
/destination/folder/
:
-
The reads that failed to align in
{name}_unaligned_1.fastq
and{name}_unaligned_2.fastq
for each read direction. -
The alignment files containing any PCR duplicate cases that may have occurred in the process. Files
{name}_MAPQ15.bam and {name}_MAPQ15.sam
-
The alignment files after removing PCR duplicates in the “coordinate method” alone
{name}_MAPQ15.noDUP.bam
and{name}_MAPQ15.noDUP.sam
-
[Recommended for further use] The alignment files after removing PCR duplicates in the “index method”
{name}.rmdup.bam
and{name}.rmdup.sam
-
The subfolder
metrics/
that contains useful information about the various stages of the process you should expect tho see the folders{name}.rmdp_stat
,{name}.ALN_Metrics
and{name}.RNA_metrics