t-neumann · isaacvock · May 21, 2024 · May 21, 2024 · May 22, 2024 · May 22, 2024
diff --git a/doc/source/Basics.rst b/doc/source/Basics.rst
@@ -113,4 +113,46 @@ Here is an example:
    chr2    53217389        53218446        ENSMUSG00000026960      1057    +       0.0268910814471 23.9783295677   407     5169    139     353     118     0
    chr3    95495394        95495567        ENSMUSG00000015522      173     +       0.0290697674419 1.08683646766   53      172     5       16      5       0
    chr3    95495394        95497237        ENSMUSG00000015522      1843    +       0.0253636702723 11.6834920273   584     2681    68      172     55      4
-   chr6    113388777       113389056       ENSMUSG00000079426      279     +       0.0168514412417 16.7780379694   71      2255    38      247     38      3
+   chr6    113388777       113389056       ENSMUSG00000079426      279     +       0.0168514412417 16.7780379694   71      2255    38      247     38      3
+
+
+.. _cB-file:
+
+cB file format
+------------------
+
+The *cB* file is a new, optional output of SLAMDUNK, introduced in version 0.5.0. cB stands for "counts Binomial"; it is a tidy table
+designed to support mixture modeling, a statistically rigorous strategy for estimating the fraction of reads from a given UTR that
+originated from metabolically labeled RNA. This analysis strategy was originally proposed in `Schofield et al., 2018 <https://www.nature.com/articles/nmeth.4582>`_
+and implemented in software such as `GRAND-SLAM <https://academic.oup.com/bioinformatics/article/34/13/i218/5045735?login=true>`_ and
+later `bakR <https://rnajournal.cshlp.org/content/29/7/958.abstract>`_. Mixture modeling overcomes the limitations of using a single T>C conversion
+cutoff to classify reads as labeled vs. unlabeled, which is confounded by factors such as RT/sequencing errors in reads from unlabeled RNA and
+low metabolic label incorporation rates. bakR can take a *cB* file as input and perform the mixture modeling for you.
+
+*cB* files are comma-separated text files containing one line per group of reads with identical "information content".
+Information content refers to the UTR from which the read originated, the number of T>C conversions in the read (TC), and the number of
+reference Ts covered by the read (nT). Thus, the columns contained in this file are as follows:
+
+===============  ========  ===================================================================================
+Column           Datatype  Description
+===============  ========  ===================================================================================
+chromosome       String    Chromosome on which the 3' UTR resides
+start            Integer   Start position of the 3' UTR
+end              Integer   End position of the 3' UTR
+name             String    Name or ID of the 3' UTR supplied by the user
+strand           String    Strand of the 3' UTR
+TC               Integer   Number of T>C conversions in the read
+nT               Integer   Number of reference Ts covered by the read
+n                Integer   Number of reads that share all of the information described in the other columns
+===============  ========  ===================================================================================
+
+Here is an example:
+
+.. code:: bash
+
+   chr13   14197734        14199362        ENSMUSG00000039219      +       0  25 10
+   chr13   14197734        14199362        ENSMUSG00000039219      +       0  26 28
+   chr13   14197734        14199362        ENSMUSG00000039219      +       0  28 5
+   chr13   14197734        14199362        ENSMUSG00000039219      +       1  20 3
+   chr13   14197734        14199362        ENSMUSG00000039219      +       1  25 15
+   chr6    113388777       113389056       ENSMUSG00000079426      +       0  30 2
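As an illustration of how the cB columns fit together, here is a minimal Python sketch that aggregates a cB file per UTR. It is not part of SLAMDUNK or bakR. It assumes the file is comma-separated as described above, has no header row, and uses UTR names that are unique. The names `CB_COLUMNS`, `SAMPLE`, and `summarize_cb` are hypothetical, and the sample rows simply restate the documented example with commas as separators.

```python
import csv
import io
from collections import defaultdict

# Column order taken from the cB format table; assumes the file has no header row.
CB_COLUMNS = ["chromosome", "start", "end", "name", "strand", "TC", "nT", "n"]

# Hypothetical sample: the documented example rows, written comma-separated.
SAMPLE = """\
chr13,14197734,14199362,ENSMUSG00000039219,+,0,25,10
chr13,14197734,14199362,ENSMUSG00000039219,+,0,26,28
chr13,14197734,14199362,ENSMUSG00000039219,+,0,28,5
chr13,14197734,14199362,ENSMUSG00000039219,+,1,20,3
chr13,14197734,14199362,ENSMUSG00000039219,+,1,25,15
chr6,113388777,113389056,ENSMUSG00000079426,+,0,30,2
"""


def summarize_cb(handle, threshold=1):
    """Aggregate cB rows per UTR name (assumes names are unique per UTR).

    Each row stands for `n` reads, so every per-read quantity is weighted by `n`.
    `threshold` is the number of T>C conversions needed to call a read a "TC" read.
    """
    totals = defaultdict(
        lambda: {"reads": 0, "tc_reads": 0, "conversions": 0, "ts": 0}
    )
    for row in csv.DictReader(handle, fieldnames=CB_COLUMNS):
        tc, n_t, n = int(row["TC"]), int(row["nT"]), int(row["n"])
        utr = totals[row["name"]]
        utr["reads"] += n              # total reads assigned to this UTR
        utr["conversions"] += tc * n   # total T>C conversions observed
        utr["ts"] += n_t * n           # total reference Ts covered
        if tc >= threshold:
            utr["tc_reads"] += n       # reads passing the simple cutoff
    return dict(totals)


summary = summarize_cb(io.StringIO(SAMPLE))
for name, stats in summary.items():
    rate = stats["conversions"] / stats["ts"] if stats["ts"] else 0.0
    print(name, stats["reads"], stats["tc_reads"], f"{rate:.4f}")
```

The `tc_reads` count is the single-cutoff classification that mixture modeling is meant to improve on; the sketch only shows that the per-read detail needed for either approach can be recovered from the `TC`, `nT`, and `n` columns.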
diff --git a/doc/source/Dunks.rst b/doc/source/Dunks.rst
@@ -165,8 +165,8 @@ The *count* dunk calculates all relevant numbers on statistics of SLAMSeq reads
 
 .. code:: bash
 
-     slamdunk count [-h] -o <output directory> [-s <SNP directory>] -r <reference fasta> -b <bed file> [-m]
-                     [-l <maximum read length>] [-q <minimum base quality>] [-t <threads] bam [bam ...]
+     slamdunk count [-h] -o <output directory> [-s <SNP directory>] -r <reference fasta> -b <bed file> [-c <conversion threshold>]
+                     [-m] [-l <maximum read length>] [-q <minimum base quality>] [-t <threads>] bam [bam ...]
 
 **Note:** Since QuantSeq is a strand-specific assay, only sense reads will be considered for the final analysis!
 
@@ -182,12 +182,13 @@ File     Description
 
 Output
 ^^^^^^
-==================  =======================================================================================================
-File                Description
-==================  =======================================================================================================
-**Tcount file**     A tab-separated *tcount* file per sample containing the SLAMSeq statistics (see :ref:`tcount-file`).
-**Bedgraph file**   A bedgraph file per sample showing the T->C conversion rate on each covered reference T nucleotide.
-==================  =======================================================================================================
+======================  ==============================================================================================================
+File                    Description
+======================  ==============================================================================================================
+**Tcount file**         A tab-separated *tcount* file per sample containing the SLAMSeq statistics (see :ref:`tcount-file`).
+**Bedgraph file**       A bedgraph file per sample showing the T->C conversion rate on each covered reference T nucleotide.
+**cB file (optional)**  A comma-separated *cB* file per sample containing all of the T->C mutational information (see :ref:`cB-file`).
+======================  ==============================================================================================================
 
 Output files have the same name as the input files with the prefix "_tcount".
 For example::
@@ -206,7 +207,8 @@ Parameter  Required  Description
 **-r**     x         The reference fasta file.
 **-b**     x         BED-file containing coordinates for 3' UTRs.
 **-l**               Maximum read length (will be automatically estimated if not set).
-**-m**               Flag to activate the multiple T->C conversion stringency: Only T->C conversions in reads with more than 1 T->C conversion will be counted.
+**-c**               Minimum number of T->C conversions a read must contain to be counted as a "TC" read.
+**-m**               Flag to additionally create a cB.csv file, compatible with mixture modeling.
 **-q**               Minimum base quality for T->C conversions to be counted.
 **-t**               The number of threads to use for this dunk. This dunk runs single-threaded so the number of threads should be equal to the number of available samples.
 **bam**    x         BAM file(s) containing the final filtered reads (wildcard \* accepted).
@@ -273,6 +275,7 @@ Parameter  Required  Description
 **-nm**              Maximum number of mismatches allowed in a read **[filter]**.
 **-mc**              Minimum coverage to call a variant **[snp]**.
 **-mv**              Minimum variant fraction to call a variant **[snp]**.
+**-cb**              Flag to additionally create a cB.csv file, compatible with mixture modeling **[count]**.
 **-mts**             Flag to activate the multiple T->C conversion stringency: Only T->C conversions in reads with more than 1 T->C conversion will be counted. **[count]**.
 **-rl**              Maximum read length (will be automatically estimated if not set) **[count]**.
 **-mbq**             Minimum base quality for T->C conversions to be counted **[count]**.

diff --git a/doc/source/Usage.rst b/doc/source/Usage.rst
@@ -73,6 +73,9 @@ Calling a module with --help shows all possible parameters text:
                             slamdunk analysis on a cluster (index is 1-based).
       -ss, --skip-sam       Output BAM while mapping. Slower but, uses less hard
                             disk.
+      -cb, --makecB         Output cB.csv file while counting mutations. This file
+                            provides convenient, compressed access to mutational data
+                            compatible with mixture modeling.
 
 The flow of *slamdunk* is to first map your reads, filter your alignments, call variants on your final alignments and use these to calculate conversion rates, counts and various
 statistics for your 3'UTRs.