Skip to content

Commit

Permalink
updated vignette w all the new arguments and additions
Browse files Browse the repository at this point in the history
  • Loading branch information
Anoushka committed Dec 11, 2021
1 parent e1a5fd2 commit ece04ca
Show file tree
Hide file tree
Showing 5 changed files with 84 additions and 758 deletions.
Binary file modified vignettes/.DS_Store
Binary file not shown.
2 changes: 2 additions & 0 deletions vignettes/.gitignore
Original file line number Diff line number Diff line change
@@ -0,0 +1,2 @@
*.html
*.R
117 changes: 82 additions & 35 deletions vignettes/scisorseqr-StandardWorkflow.Rmd
Original file line number Diff line number Diff line change
@@ -1,8 +1,10 @@
---
title: "scisorseqr-StandardWorkflow"
output: pdf_document
vignette: |
%\VignetteIndexEntry{scisorseqr-StandardWorkflow} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8}
title: "scisorseqr: Standard Workflow"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{Vignette Title}
%\VignetteEngine{knitr::rmarkdown}
\usepackage[utf8]{inputenc}
---

```{r setup, include = FALSE}
Expand All @@ -14,7 +16,9 @@ knitr::opts_chunk$set(

The *scisorseqr* package allows you to go from short-read, single cell assignments and a set of fastq files to a full-blown cell-type specific analysis of alternative splicing patterns.

## Software required
## Setup

### Software required

* [STARlong](https://github.com/alexdobin/STAR/) (and corresponding genome index file)
* [Minimap2](https://github.com/lh3/minimap2) if not using STARlong
Expand All @@ -23,17 +27,27 @@ The *scisorseqr* package allows you to go from short-read, single cell assignmen
* python version >= 3.7 with the following libraries
* pandas
* multiprocessing



### Quick cheat for viewing documentation
All the possible arguments, usage, and description for each function can be viewed interactively
by putting a "?" in front of the function name in the R console like so:

```{r demonstrateMan, include = TRUE, eval = FALSE}
?GetBarcodes
```


## Step 1: Barcode deconvolution
The first step is to deconvolve barcodes from either the PacBio or ONT fastq output
For that the user needs to specify

- Tab separated Barcode-Celltype file
- Directory containing fastq.gz files

### Create a barcode - celltype file assuming analysis using Seurat
### Create a barcode - celltype file
Assuming that short read single-cell analysis has been performed using Seurat:
```{r makeBCfile, eval=FALSE, echo=TRUE}
library(Seurat)
library(tidyr)
library(dplyr)
Expand Down Expand Up @@ -79,13 +93,17 @@ You will see two output directories
- OutputFiltered
- file.CSV (One line per barcoded read with the following structure:)

|ReadName |T9_Status |Strand|T9_position|Barcodes|
|:----------|:-----|:---|:--|:----------------|
|ReadX|poly_T_found|rev|47|GTCGGGTTCCAAAGTC|
|ReadName |T9_Status |Strand|T9_pos|Barcodes|BC_pos|Cluster|UMI|TSO_Status|TSO_pos|Length|
|:----------|:-----|:---|:--|:----------------|:--|:-----------|:----------|:---------|:--|:----|
|ReadX|poly_T_found|rev|47|GTCGGGTTCCAAAGTC|22|Hipp_GranuleNB|ATGGCGGGAT|TSO_found|26|913|
|ReadY|poly_T_found|fwd|48|AGTGCCGATAGTTATG|22|Hipp_Astro|GGGATATGGC|TSO_found|26|1312|

|Barcode_position|Cluster|UMI|TSO_Status|TSO_position|Length|
|:--|:-----------|:----------|:---------|:--|:----|
|22|Hipp_GranuleNB|ATGGCGGGAT|TSO_found|26|583|
<br>

**NOTE 1** If you want to limit the downstream analysis to barcoded reads only, set the 'filterReads' parameter to TRUE. This will output a fastq.gz file in the OutputFiltered directory.
<br>

**NOTE 2** Setting 'concatenate' to FALSE ensures that your input files are treated as separate samples.

## Step 2: Align your fastq files to the reference

Expand All @@ -97,54 +115,78 @@ We have made two aligners available and compatible with our package
```{r STARalign, eval = FALSE, echo = TRUE}
## Load example data
fastqFolder <- system.file("extdata/", "userInput/FastqFiles",
fastqFolder <- system.file("extdata/", "userInput/FastqFiles/",
package = "scisorseqr")
STARalign(fqFolder = fastqFolder, starProgPath = '~/bin/Linux_x86_64/STARlong',
refGenomeIndex = '~/starIndex_gencode10_sequins/',
refGenomeIndex = '~/starIndex_gencode10/',
numThreads = 24)
MMalign(fqFolder = fastqFolder, mmProgPath = '~/minimap2/minimap2',
refGenome='~/genomes/mm10.fa',
numThreads = 16)
```

This process will convert all the aligned sam files to bam files, and dump the output into *STARoutput*
This process will convert all the aligned sam files to bam files, and dump the output along with run reports into *STARoutput* or *MMoutput*

## Step 3: Map and filter for full-length, spliced reads
We used annotated [CAGE peaks](https://fantom.gsc.riken.jp/5/datafiles/reprocessed/mm10_latest/extra/CAGE_peaks/) and [PolyA sites](https://polyasite.unibas.ch/atlas) for our study
```{r MapAndFilter, eval = FALSE, echo = TRUE}

```{r MapAndFilter, eval = FALSE, echo = TRUE}
MapAndFilter(annoGZ='[PATH_TO_annotation.gtf.gz]',numThreads=24,
seqDir = '[PATH_TO_DIR/chr*.fa.gz]',
seqDir = '[PATH_TO_GENOME_DIR/chr*.fa.gz]',
filterFullLength=TRUE,
cageBed = '[PATH_TO_Cage.bed.gz]',
polyABed = '[PATH_TO_PolyA.bed.gz]',
cp_distance=50)
cp_distance=50, genomeVersion="mm10")
```
**NOTE** the parameter seqDir takes in a reference genome broken down by chromosome so

**NOTE 1:** the parameter seqDir takes in a reference genome broken down by chromosome so
as to parallelize the process.
This function has not yet been made generalizable to loading in the full reference
<br>

This will result in multiple files being stored in yet another output directory: *LRProcessingOutput*
**NOTE 2:** The genome and version is not auto-detected and hence defaults to 'mm10'. It does not affect any processing but might lead to confusion if processing a different species.
<br>

This will result in multiple files being stored in yet another output directory: *LRProcessingOutput*. A lot of these files
are not super relevant and there is an option to remove them during Step 4.

## Step 4: Concatenate all the information collected so far

```{r InfoPerLongRead, eval=FALSE, echo=TRUE}
InfoPerLongRead(barcodeOutputFile =
'OutputFiltered/FilteredDeconvBC_AllFiles.csv',
mapAndFilterOut = 'LRProcessingOutput/',
minTimesIsoObserve = 3)
minTimesIsoObserve = 5, rmTmpFolder = TRUE)
```

This will yield a directory *LongReadInfo* with file **AllInfo** which has the following structure
This will yield a directory *LongReadInfo*. Along with some files containing basic statistics on a per-barcode basis, this folder contains two main files:

|ReadName|Gene|Celltype|Barcode|UMI|IntronChain|Status|# Introns|
|:-------|:---|:-------|:------|:--|:----------|:-----|:--------|
* *AllInfo.gz*: Concatenated information for all poly-exonic, spliced, barcoded, mapped, and 5' **and** 3' complete reads
* *AllInfo_Incomplete.gz*: Same as above, except with a relaxed criterion for the 5' and 3' completeness
<br>

If annotated CAGE and PolyA peaks are not provided, then only an AllInfo file is generated, and the reads obviously do not contain information on assigned peaks
<br>
<br>

The *AllInfo.gz* file has the following structure, where Intron Chain contains the assigned TSS and PolyA site:

|ReadName|Gene|Celltype|Barcode|UMI|IntronChain|ExonChain|Status|# Introns|
|:-------|:---|:-------|:------|:--|:-----------|:-------|:-----|:--------|
<br>

The *AllInfo_Incomplete.gz* file has the following structure, where the intron chain purely contains a string of introns. If a CAGE peak or PolyA site are not assigned, then the status is indicated as "NoTSS" or "NoPolyA" in the 7th and 8th column respectively

|ReadName|Gene|Celltype|Barcode|UMI|IntronChain|TSS|PolyA|ExonChain|Status|# Introns|
|:-------|:---|:-------|:------|:--|:-----------|:--|:---|:-------|:-----|:--------|

<br>

**NOTE:** In case you want to retain all the temporary files and / or re-run this step, set the argument 'rmTmpFolder' to FALSE.

## Step 5a: Quantify the various isoforms observed in your data
If you provided CAGE and PolyA peaks at the MapAndFilter() stage, you can choose to quantify unique
Expand All @@ -156,7 +198,7 @@ IsoQuant(AllInfoFile = 'LongReadInfo/AllInfo',
```

Depending on which modality you chose as *TRUE*, you will see the following files in the *IsoQuantOutput* folder:
Depending on which modality you chose as *TRUE*, you will see the following files in the *IsoQuantOutput* folder:

- *Modality*-*Modality*ID.csv
- Num*Modality*PerCluster
Expand All @@ -165,7 +207,7 @@ These will be used for the differential splicing analysis step

## Step 5b: [OPTIONAL] Quantify the various exons observed in your data
For ONT reads we don't have too much confidence in this method and recommend using
our sister package IsoQuant.
our sister package [IsoQuant](https://github.com/ablab/isoquant).

This function however, works for annotated as well as novel exons
```{r ExonQuant, echo = TRUE, eval= FALSE}
Expand Down Expand Up @@ -247,14 +289,18 @@ condensedCellTypes = system.file("extdata/", "userInput/condensedCellTypes",
package = "scisorseqr")
triHeatmap(treeDir = 'TreeTraversal_Iso/',
comparisonList =condensedCellTypes,
outName =condensedCellTypes)
comparisonList = condensedCellTypes,
typeOfTest = "Iso",
outName = condensedCellTypes)
```

**NOTE** Cell-type labels should correspond to the barcode-celltype list. If you input a cell-type for which pairwise DIE has not been calculated, it will output all zeros
This was not quite designed for visualizing differential exon analysis output and thus may give some errors. Please do not hesitate to reach out or open an issue if there is difficulty with this stuff.

**NOTE 1:** Cell-type labels should correspond to the barcode-celltype list i.e. the way they appear in the AllInfo files. If you input a cell-type for which pairwise DIE has not been calculated, it will output all zeros
(e.g. DivOligo in this case)
<br>
**NOTE** The example data included in this package does not yield the plot below. This is just an example plot generated from the full dataset.

**NOTE 2:** The example data included in this package does not yield the plot below. This is just an example plot generated from the full dataset.

```{r heatmap, out.width = '60%', echo=F}
knitr::include_graphics("../man/figures/condensedCellTypes.png")
Expand All @@ -273,9 +319,10 @@ knitr::include_graphics("../man/figures/examplePie.png")

Each number in the legend corresponds to the number of isoforms needed to cross the deltaPI = 0.1 threshold
<br>
**NOTE** Percentage labels do not always arrange themselves neatly in the figure (requires editing by hand)
**NOTE 1:** Percentage labels do not always arrange themselves neatly in the figure (requires editing by hand)
<br>
**NOTE** The example data included in this package does not yield the plot above. This is just an example plot generated from the full dataset.

**NOTE 2:** The example data included in this package does not yield the plot above. This is just an example plot generated from the full dataset.

## Done!

Loading

0 comments on commit ece04ca

Please sign in to comment.