updated vignette w all the new arguments and additions

noush-joglekar · Dec 11, 2021 · ece04ca · ece04ca
1 parent e1a5fd2
commit ece04ca
Show file tree

Hide file tree

Showing 5 changed files with 84 additions and 758 deletions.
diff --git a/vignettes/.DS_Store b/vignettes/.DS_Store
diff --git a/vignettes/.gitignore b/vignettes/.gitignore
@@ -0,0 +1,2 @@
+*.html
+*.R
diff --git a/vignettes/scisorseqr-StandardWorkflow.Rmd b/vignettes/scisorseqr-StandardWorkflow.Rmd
@@ -1,8 +1,10 @@
 ---
-title: "scisorseqr-StandardWorkflow"
-output: pdf_document
-vignette: |
-  %\VignetteIndexEntry{scisorseqr-StandardWorkflow} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8}
+title: "scisorseqr: Standard Workflow"
+output: rmarkdown::html_vignette
+vignette: >
+  %\VignetteIndexEntry{Vignette Title}
+  %\VignetteEngine{knitr::rmarkdown}
+  \usepackage[utf8]{inputenc}
 ---
 
 ```{r setup, include = FALSE}
@@ -14,7 +16,9 @@ knitr::opts_chunk$set(
 
 The *scisorseqr* package allows you to go from short-read, single cell assignments and a set of fastq files to a full-blown cell-type specific analysis of alternative splicing patterns. 
 
-## Software required
+## Setup
+
+### Software required
 
  * [STARlong](https://github.com/alexdobin/STAR/) (and corresponding genome index file)
  * [Minimap2](https://github.com/lh3/minimap2) if not using STARlong
@@ -23,17 +27,27 @@ The *scisorseqr* package allows you to go from short-read, single cell assignmen
  * python version >= 3.7 with the following libraries
    * pandas
    * multiprocessing
-
+
+
+### Quick cheat for viewing documentation
+All the possible arguments, usage, and description for each function can be viewed interactively
+by putting a "?" in front of the function name in the R console like so:
+
+```{r demonstrateMan, include = TRUE, eval = FALSE}
+?GetBarcodes
+```
+
+
 ## Step 1: Barcode deconvolution 
 The first step is to deconvolve barcodes from either the PacBio or ONT fastq output
 For that the user needs to specify
 
  - Tab separated Barcode-Celltype file
  - Directory containing fastq.gz files
 
-### Create a barcode - celltype file assuming analysis using Seurat
+### Create a barcode - celltype file 
+Assuming that short read single-cell analysis has been performed using Seurat:
 ```{r makeBCfile, eval=FALSE, echo=TRUE}
-
 library(Seurat)
 library(tidyr)
 library(dplyr)
@@ -79,13 +93,17 @@ You will see two output directories
 - OutputFiltered
     - file.CSV (One line per barcoded read with the following structure:)
 
-|ReadName |T9_Status |Strand|T9_position|Barcodes|
-|:----------|:-----|:---|:--|:----------------|
-|ReadX|poly_T_found|rev|47|GTCGGGTTCCAAAGTC|
+|ReadName |T9_Status |Strand|T9_pos|Barcodes|BC_pos|Cluster|UMI|TSO_Status|TSO_pos|Length|
+|:----------|:-----|:---|:--|:----------------|:--|:-----------|:----------|:---------|:--|:----|
+|ReadX|poly_T_found|rev|47|GTCGGGTTCCAAAGTC|22|Hipp_GranuleNB|ATGGCGGGAT|TSO_found|26|913|
+|ReadY|poly_T_found|fwd|48|AGTGCCGATAGTTATG|22|Hipp_Astro|GGGATATGGC|TSO_found|26|1312|
 
-|Barcode_position|Cluster|UMI|TSO_Status|TSO_position|Length|
-|:--|:-----------|:----------|:---------|:--|:----|
-|22|Hipp_GranuleNB|ATGGCGGGAT|TSO_found|26|583|
+<br>
+
+**NOTE 1** If you want to limit the downstream analysis to barcoded reads only, set the 'filterReads' parameter to TRUE. This will output a fastq.gz file in the OutputFiltered directory.
+<br>
+
+**NOTE 2** Setting 'concatenate' to FALSE ensures that your input files are treated as separate samples.
 
 ## Step 2: Align your fastq files to the reference
 
@@ -97,54 +115,78 @@ We have made two aligners available and compatible with our package
 ```{r STARalign, eval = FALSE, echo = TRUE}
 
 ## Load example data
-fastqFolder <- system.file("extdata/", "userInput/FastqFiles", 
+fastqFolder <- system.file("extdata/", "userInput/FastqFiles/", 
                            package = "scisorseqr")
 
 STARalign(fqFolder = fastqFolder, starProgPath = '~/bin/Linux_x86_64/STARlong',
-          refGenomeIndex = '~/starIndex_gencode10_sequins/',
+          refGenomeIndex = '~/starIndex_gencode10/',
           numThreads = 24)
 
 MMalign(fqFolder = fastqFolder, mmProgPath = '~/minimap2/minimap2',
         refGenome='~/genomes/mm10.fa',
         numThreads = 16)
 ```
 
-This process will convert all the aligned sam files to bam files, and dump the output into *STARoutput*
+This process will convert all the aligned sam files to bam files, and dump the output along with run reports into *STARoutput* or *MMoutput*
 
 ## Step 3: Map and filter for full-length, spliced reads
 We used annotated [CAGE peaks](https://fantom.gsc.riken.jp/5/datafiles/reprocessed/mm10_latest/extra/CAGE_peaks/) and [PolyA sites](https://polyasite.unibas.ch/atlas) for our study
-```{r MapAndFilter, eval = FALSE, echo = TRUE}
 
+```{r MapAndFilter, eval = FALSE, echo = TRUE}
 MapAndFilter(annoGZ='[PATH_TO_annotation.gtf.gz]',numThreads=24, 
-             seqDir = '[PATH_TO_DIR/chr*.fa.gz]',
+             seqDir = '[PATH_TO_GENOME_DIR/chr*.fa.gz]',
              filterFullLength=TRUE,
              cageBed = '[PATH_TO_Cage.bed.gz]',
              polyABed = '[PATH_TO_PolyA.bed.gz]',
-             cp_distance=50)
+             cp_distance=50, genomeVersion="mm10")
 
 ```
-**NOTE**  the parameter seqDir takes in a reference genome broken down by chromosome so
+
+**NOTE 1:**  the parameter seqDir takes in a reference genome broken down by chromosome so
 as to parallelize the process.
 This function has not yet been made generalizable to loading in the full reference
+<br>
 
-This will result in multiple files being stored in yet another output directory: *LRProcessingOutput*
+**NOTE 2:** The genome and version is not auto-detected and hence defaults to 'mm10'. It does not affect any processing but might lead to confusion if processing a different species.
+<br>
+
+This will result in multiple files being stored in yet another output directory: *LRProcessingOutput*. A lot of these files
+are not super relevant and there is an option to remove them during Step 4.
 
 ## Step 4: Concatenate all the information collected so far
 
 ```{r InfoPerLongRead, eval=FALSE, echo=TRUE}
- 
 InfoPerLongRead(barcodeOutputFile = 
                   'OutputFiltered/FilteredDeconvBC_AllFiles.csv',
                 mapAndFilterOut = 'LRProcessingOutput/',
-                minTimesIsoObserve = 3)
+                minTimesIsoObserve = 5, rmTmpFolder = TRUE)
 
 ```
 
-This will yield a directory *LongReadInfo* with file **AllInfo** which has the following structure
+This will yield a directory *LongReadInfo*. Along with some files containing basic statistics on a per-barcode basis, this folder contains two main files: 
 
-|ReadName|Gene|Celltype|Barcode|UMI|IntronChain|Status|# Introns|
-|:-------|:---|:-------|:------|:--|:----------|:-----|:--------|
+* *AllInfo.gz*: Concatenated information for all poly-exonic, spliced, barcoded, mapped, and 5' **and** 3' complete reads
+* *AllInfo_Incomplete.gz*: Same as above, except with a relaxed criterion for the 5' and 3' completeness
+<br>
 
+If annotated CAGE and PolyA peaks are not provided, then only an AllInfo file is generated, and the reads obviously do not contain information on assigned peaks
+<br>
+<br>
+
+The *AllInfo.gz* file has the following structure, where Intron Chain contains the assigned TSS and PolyA site:
+
+|ReadName|Gene|Celltype|Barcode|UMI|IntronChain|ExonChain|Status|# Introns|
+|:-------|:---|:-------|:------|:--|:-----------|:-------|:-----|:--------|
+<br>
+
+The *AllInfo_Incomplete.gz* file has the following structure, where the intron chain purely contains a string of introns. If a CAGE peak or PolyA site are not assigned, then the status is indicated as "NoTSS" or "NoPolyA" in the 7th and 8th column respectively
+
+|ReadName|Gene|Celltype|Barcode|UMI|IntronChain|TSS|PolyA|ExonChain|Status|# Introns|
+|:-------|:---|:-------|:------|:--|:-----------|:--|:---|:-------|:-----|:--------|
+
+<br>
+
+**NOTE:** In case you want to retain all the temporary files and / or re-run this step, set the argument 'rmTmpFolder' to FALSE.
 
 ## Step 5a: Quantify the various isoforms  observed in your data
 If you provided CAGE and PolyA peaks at the MapAndFilter() stage, you can choose to quantify unique 
@@ -156,7 +198,7 @@ IsoQuant(AllInfoFile = 'LongReadInfo/AllInfo',
 
 ```
 
-Depending on which modality you chose as *TRUE*, you will see the following files in the  *IsoQuantOutput* folder:
+Depending on which modality you chose as *TRUE*, you will see the following files in the *IsoQuantOutput* folder:
 
 - *Modality*-*Modality*ID.csv
 - Num*Modality*PerCluster
@@ -165,7 +207,7 @@ These will be used for the differential splicing analysis step
 
 ## Step 5b: [OPTIONAL]  Quantify the various exons observed in your data
 For ONT reads we don't have too much confidence in this method and recommend using 
-our sister package IsoQuant. 
+our sister package [IsoQuant](https://github.com/ablab/isoquant). 
 
 This function however, works for annotated as well as novel exons
 ```{r ExonQuant, echo = TRUE, eval= FALSE}
@@ -247,14 +289,18 @@ condensedCellTypes = system.file("extdata/", "userInput/condensedCellTypes",
                       package = "scisorseqr")
 
 triHeatmap(treeDir = 'TreeTraversal_Iso/',
-           comparisonList =condensedCellTypes, 
-           outName =condensedCellTypes)
+           comparisonList = condensedCellTypes, 
+           typeOfTest = "Iso",
+           outName = condensedCellTypes)
 ```
 
-**NOTE** Cell-type labels should correspond to the barcode-celltype list. If you input a cell-type for which pairwise DIE has not been calculated, it will output all zeros
+This was not quite designed for visualizing differential exon analysis output and thus may give some errors. Please do not hesitate to reach out or open an issue if there is difficulty with this stuff.
+
+**NOTE 1:** Cell-type labels should correspond to the barcode-celltype list i.e. the way they appear in the AllInfo files. If you input a cell-type for which pairwise DIE has not been calculated, it will output all zeros
 (e.g. DivOligo in this case)
 <br>
-**NOTE** The example data included in this package does not yield the plot below. This is just an example plot generated from the full dataset.
+
+**NOTE 2:** The example data included in this package does not yield the plot below. This is just an example plot generated from the full dataset.
 
 ```{r heatmap, out.width = '60%', echo=F}
 knitr::include_graphics("../man/figures/condensedCellTypes.png")
@@ -273,9 +319,10 @@ knitr::include_graphics("../man/figures/examplePie.png")
 
 Each number in the legend corresponds to the number of isoforms needed to cross the deltaPI = 0.1 threshold
 <br>
-**NOTE** Percentage labels do not always arrange themselves neatly in the figure (requires editing by hand)
+**NOTE 1:** Percentage labels do not always arrange themselves neatly in the figure (requires editing by hand)
 <br>
-**NOTE** The example data included in this package does not yield the plot above. This is just an example plot generated from the full dataset.
+
+**NOTE 2:** The example data included in this package does not yield the plot above. This is just an example plot generated from the full dataset.
 
 ## Done!