Add cB file information to documentation #156

Open
wants to merge 5 commits into
base: gh-pages
44 changes: 43 additions & 1 deletion doc/source/Basics.rst
@@ -113,4 +113,46 @@ Here is an example:
chr2 53217389 53218446 ENSMUSG00000026960 1057 + 0.0268910814471 23.9783295677 407 5169 139 353 118 0
chr3 95495394 95495567 ENSMUSG00000015522 173 + 0.0290697674419 1.08683646766 53 172 5 16 5 0
chr3 95495394 95497237 ENSMUSG00000015522 1843 + 0.0253636702723 11.6834920273 584 2681 68 172 55 4
chr6 113388777 113389056 ENSMUSG00000079426 279 + 0.0168514412417 16.7780379694 71 2255 38 247 38 3
chr6 113388777 113389056 ENSMUSG00000079426 279 + 0.0168514412417 16.7780379694 71 2255 38 247 38 3


.. _cB-file:

cB file format
------------------

The *cB* file is a new, optional output of SLAMDUNK, introduced in version 0.5.0. cB stands for "counts Binomial". It is a tidy table
designed to support mixture modeling, a statistically rigorous strategy for estimating the fraction of reads from a given UTR that
originated from metabolically labeled RNA. This analysis strategy was originally proposed in `Schofield et al., 2018 <https://www.nature.com/articles/nmeth.4582>`_
and is implemented in software such as `GRAND-SLAM <https://academic.oup.com/bioinformatics/article/34/13/i218/5045735?login=true>`_ and,
later, `bakR <https://rnajournal.cshlp.org/content/29/7/958.abstract>`_. Mixture modeling overcomes the limitations of classifying reads as labeled
or unlabeled with a single T>C conversion cutoff, which can misclassify reads because of RT/sequencing errors in reads from unlabeled RNA or
low metabolic label incorporation rates. bakR accepts a *cB* file as input and performs the mixture modeling for you.
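
To make the idea concrete, here is a minimal sketch of a two-component binomial mixture fitted by expectation-maximization for a single UTR. It is an illustration only, not the method implemented in bakR or GRAND-SLAM. The function names and the fixed conversion rates ``p_new`` and ``p_old`` are hypothetical choices made for this example; real tools estimate these rates from the data.

```python
from math import comb


def binom_pmf(k, n, p):
    """Probability of k conversions among n Ts at per-T conversion rate p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def estimate_fraction_new(rows, p_new=0.05, p_old=0.001, n_iter=200):
    """EM estimate of the fraction of reads from labeled RNA for one UTR.

    rows: iterable of (TC, nT, n) tuples, i.e. the last three cB columns,
          with TC <= nT in every row.
    p_new, p_old: assumed, fixed T>C rates for labeled and unlabeled RNA
                  (hypothetical values chosen for illustration).
    """
    rows = list(rows)
    total = sum(n for _, _, n in rows)
    f = 0.5  # initial guess for the labeled fraction
    for _ in range(n_iter):
        labeled = 0.0
        for tc, nt, n in rows:
            # E-step: posterior probability that reads in this group are labeled
            new = f * binom_pmf(tc, nt, p_new)
            old = (1 - f) * binom_pmf(tc, nt, p_old)
            labeled += n * new / (new + old)
        # M-step: the updated fraction is the expected share of labeled reads
        f = labeled / total
    return f


# (TC, nT, n) triples for a single UTR
rows = [(0, 25, 10), (0, 26, 28), (0, 28, 5), (1, 20, 3), (1, 25, 15)]
fraction = estimate_fraction_new(rows)
print(f"estimated fraction labeled: {fraction:.3f}")
```

Because each row already carries a read count ``n``, the likelihood is evaluated once per group of identical reads rather than once per read, which is what makes the grouped table convenient for this kind of model.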

*cB* files are comma-separated text files containing one line per group of reads with identical "information content".
Information content refers to the UTR from which the read originated, the number of T>C conversions in the read, and the number of
reference Ts covered by the read. The file contains the following columns:

=============== ======== ===================================================================================
Column          Datatype Description
=============== ======== ===================================================================================
chromosome      String   Chromosome on which the 3' UTR resides
start           Integer  Start position of the 3' UTR
end             Integer  End position of the 3' UTR
name            String   Name or ID of the 3' UTR supplied by the user
strand          String   Strand of the 3' UTR
TC              Integer  Number of T>C conversions in the read
nT              Integer  Number of reference Ts covered by the read
n               Integer  Number of reads that share all of the information described in the other columns
=============== ======== ===================================================================================

Here is an example:

.. code:: bash

    chr13 14197734 14199362 ENSMUSG00000039219 + 0 25 10
    chr13 14197734 14199362 ENSMUSG00000039219 + 0 26 28
    chr13 14197734 14199362 ENSMUSG00000039219 + 0 28 5
    chr13 14197734 14199362 ENSMUSG00000039219 + 1 20 3
    chr13 14197734 14199362 ENSMUSG00000039219 + 1 25 15
    chr6 113388777 113389056 ENSMUSG00000079426 + 0 30 2
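
As a rough illustration of how these rows can be consumed, the following sketch totals reads per UTR and counts those with at least ``threshold`` T>C conversions. The function name ``summarize_cb`` and its parameters are hypothetical. The sketch assumes no header row and splits on whitespace to match the rows as displayed above; pass ``delimiter=","`` for a comma-separated file.

```python
from collections import defaultdict


def summarize_cb(lines, threshold=1, delimiter=None):
    """Return {UTR name: (reads with TC >= threshold, total reads)}.

    delimiter=None splits on any whitespace; use "," for comma-separated
    input. A file without a header row is assumed.
    """
    counts = defaultdict(lambda: [0, 0])
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        chrom, start, end, name, strand, tc, n_t, n = line.split(delimiter)
        n = int(n)
        counts[name][1] += n  # every read group adds to the total
        if int(tc) >= threshold:
            counts[name][0] += n  # read group meets the conversion threshold
    return {name: tuple(pair) for name, pair in counts.items()}


example = """\
chr13 14197734 14199362 ENSMUSG00000039219 + 0 25 10
chr13 14197734 14199362 ENSMUSG00000039219 + 0 26 28
chr13 14197734 14199362 ENSMUSG00000039219 + 0 28 5
chr13 14197734 14199362 ENSMUSG00000039219 + 1 20 3
chr13 14197734 14199362 ENSMUSG00000039219 + 1 25 15
chr6 113388777 113389056 ENSMUSG00000079426 + 0 30 2
"""

summary = summarize_cb(example.splitlines())
for name, (converted, total) in summary.items():
    print(name, converted, total)
```

A hard cutoff like this is the simple classification that mixture modeling is meant to improve on; the sketch is included only to show how the ``TC``, ``nT`` and ``n`` columns fit together.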
21 changes: 12 additions & 9 deletions doc/source/Dunks.rst
@@ -165,8 +165,8 @@ The *count* dunk calculates all relevant numbers on statistics of SLAMSeq reads

.. code:: bash

slamdunk count [-h] -o <output directory> [-s <SNP directory>] -r <reference fasta> -b <bed file> [-m]
[-l <maximum read length>] [-q <minimum base quality>] [-t <threads] bam [bam ...]
slamdunk count [-h] -o <output directory> [-s <SNP directory>] -r <reference fasta> -b <bed file> [-c <conversion threshold>]
[-m] [-l <maximum read length>] [-q <minimum base quality>] [-t <threads>] bam [bam ...]

**Note:** Since QuantSeq is a strand-specific assay, only sense reads will be considered for the final analysis!

@@ -182,12 +182,13 @@ File Description

Output
^^^^^^
================== =======================================================================================================
File               Description
================== =======================================================================================================
**Tcount file**    A tab-separated *tcount* file per sample containing the SLAMSeq statistics (see :ref:`tcount-file`).
**Bedgraph file**  A bedgraph file per sample showing the T->C conversion rate on each covered reference T nucleotide.
================== =======================================================================================================
====================== ==============================================================================================================
File                   Description
====================== ==============================================================================================================
**Tcount file**        A tab-separated *tcount* file per sample containing the SLAMSeq statistics (see :ref:`tcount-file`).
**Bedgraph file**      A bedgraph file per sample showing the T->C conversion rate on each covered reference T nucleotide.
**cB file (optional)** A comma-separated *cB* file per sample containing all of the T->C mutational information (see :ref:`cB-file`).
====================== ==============================================================================================================

Output files have the same name as the input files with the suffix "_tcount".
For example::
@@ -206,7 +207,8 @@ Parameter Required Description
**-r** x The reference fasta file.
**-b** x BED-file containing coordinates for 3' UTRs.
**-l** Maximum read length (will be automatically estimated if not set).
**-m** Flag to activate the multiple T->C conversion stringency: Only T->C conversions in reads with more than 1 T->C conversion will be counted.
**-c** Number of T->C conversions in a read required to count it as a "TC" read.
**-m** Flag to additionally create a cB.csv file, compatible with mixture modeling.
**-q** Minimum base quality for T->C conversions to be counted.
**-t** The number of threads to use for this dunk. This dunk runs single-threaded so the number of threads should be equal to the number of available samples.
**bam** x BAM file(s) containing the final filtered reads (wildcard \* accepted).
@@ -273,6 +275,7 @@ Parameter Required Description
**-nm** Maximum number of mismatches allowed in a read **[filter]**.
**-mc** Minimum coverage to call a variant **[snp]**.
**-mv** Minimum variant fraction to call a variant **[snp]**.
**-cb** Flag to additionally create a cB.csv file, compatible with mixture modeling **[count]**.
**-mts** Flag to activate the multiple T->C conversion stringency: Only T->C conversions in reads with more than 1 T->C conversion will be counted. **[count]**.
**-rl** Maximum read length (will be automatically estimated if not set) **[count]**.
**-mbq** Minimum base quality for T->C conversions to be counted **[count]**.
3 changes: 3 additions & 0 deletions doc/source/Usage.rst
@@ -73,6 +73,9 @@ Calling a module with --help shows all possible parameters text:
slamdunk analysis on a cluster (index is 1-based).
-ss, --skip-sam Output BAM while mapping. Slower but, uses less hard
disk.
-cb, --makecB Output cB.csv file while counting mutations. This file
provides convenient, compressed access to mutational data
compatible with mixture modeling.

The flow of *slamdunk* is to first map your reads, filter your alignments, call variants on your final alignments and use these to calculate conversion rates, counts and various
statistics for your 3'UTRs.