-
Notifications
You must be signed in to change notification settings - Fork 1
GRAND SLAM
Globally refined analysis of newly transcribed RNA and decay rates using SLAM-seq (GRAND-SLAM) is a computational approach to infer the proportion and the corresponding posterior distribution of new and old RNA for each gene from SLAM-seq experiments (or similar approaches based on metabolic labeling and chemical nucleoside conversion,e.g. this or this ).
To use GRAND-SLAM, you must do the following
- Install Gedi (see here)
- Prepare a genome (once per genome)
- Run it (once per data set)
Assuming you are able to call the gedi
script:
gedi -e IndexGenome -organism mus_musculus -version 90 -p -nobowtie -nostar -nokallisto
This will download the genomic sequence (fasta file) as well as the corresponding annotation (gtf file) from ensembl and do some steps to prepare it for usage in GRAND-SLAM. Its name for referencing it will be mus_musculus.90. For details about preparing genomes, see here
Assuming you are able to call the gedi
script and have mapped reads from a SLAM-seq experiment (in 24h.cit, which you can download from here):
gedi -e Slam -genomic mus_musculus.90 -prefix test/24h -progress -plot -D -reads 24h.cit
Always make sure, that the given genome is the same as used to map reads.
Here, read mappings are in CIT format. You can also use a bam
file or a bamlist
file instead. However, for performance reasons, we strongly recommend converting to CIT first! For more documentation on how mapped reads can be specified, see here.
Important: CIT is generally much more efficient than several bam files. We strongly recommend to convert your bam files to CIT first! This can be done by using the bash script bamlist2cit (bamlist2cit [-p] xyz.bamlist
) that is included in the GRAND-SLAM distribution.
For more options (e.g. discard 5' end of reads), type:
gedi -e Slam -hh
You can find more example data set and detailed instruction how they were processed here:
- GRAND-SLAM analysis of mESCs treated with 4sU for 24h from Herzog et al., Nature Methods 2017 (https://doi.org/10.1038/nmeth.4435)
- GRAND-SLAM analysis of acute BANP depletion from Grand et al., Nature 2021 (https://doi.org/10.1038/s41586-021-03689-8)
- GRAND-SLAM analysis of NXF1 knockout cells from Zuckerman et al., Mol Cell 2020 (https://doi.org/10.1016/j.molcel.2020.05.013)
- GRAND-SLAM analysis of SARS-CoV-2 data from Finkel et al., Nature 2021 (https://www.nature.com/articles/s41586-021-03610-3)
In ${prefix}.tsv you will find the main output table, with all gene-wise information extracted from the data. The two central parameters GRAND-SLAM provides are the read count and the new-to-total RNA ratio (NTR; we introduced this abbreviation in the scSLAM-seq Nature paper, it was called πg in the original GRAND-SLAM Bioinformatics paper). Both are based on reads that are compatible with some transcript of the gene (i.e. the read could have been generated from an mRNA). Specifically, this means that intronic reads are not considered (as they would bias the NTR).
Column | Description |
---|---|
Gene | The Ensembl gene id |
Symbol | The gene symbol |
XXX Readcount | The total number of reads mapped to this gene in condition XXX; these are not normalized, and multimapping reads are counted fractionally (i.e. a read with two mapping locations is counted as 0.5 in both) |
XXX 0.05 quantile | The lower bound of the posterior distribution for the NTR |
XXX Mean | The mean of the posterior distribution for the NTR |
XXX MAP | The mode of the posterior distribution for the NTR (this should usually be used as the NTR) |
XXX 0.95 quantile | The upper bound of the posterior distribution for the NTR |
XXX alpha | The alpha parameter of the beta approximation of the posterior distribution for the NTR |
XXX beta | The beta parameter of the beta approximation of the posterior distribution for the NTR |
XXX Conversions * | The total number of conversions in this gene |
XXX Coverage * | The total number of U covered by any reads (if all U were converted, the Conversions=Coverage) |
XXX Double-Hits * | The same as Conversions, but only counting in doubly sequenced bases (paired-end) |
XXX Double-Hit Coverage * | The same as Coverage, but only counting in doubly sequenced bases (paired-end) |
XXX min2 * | The number of reads with at least two characteristic mismatches |
Length | The length of the longest transcript for this gene |
- this is only available if
gedi -e Slam
has been called with the-full
parameter!
We recommend reading the section More explanation on GRAND-SLAM!
- ${prefix}.mismatches.pdf: The pages entitled Exonic first and Exonic second contains mismatch statistics for all first and second reads mapped to exons, respectively (whether they may be sense or antisense reads). The next pages ("ExonicSense First"/"ExonicSense Second"/"ExonicAntisense First"/"ExonicAntisense Second") contain the same statistics distinguishing sense and antisense reads. The next few pages show the same info for intronic reads. The remaining pages show the same information but now ordered by mismatch type. Depending on the library preparation protocol / sequencing mode used, some pages may either be omitted or show no data.
- ${prefix}.double.pdf: The same info but here only for doubly sequenced base pairs. The mismatch types w.r.t to the sense strand sequence are shown.
-
${prefix}.mismatchpos.pdf: Here the mismatches are shown for each position in a read or read pair. For 2x 100bp read pairs, the x axis spans 0 through 200, the first 100 correspond to the first read, 100-200 to the second read (where 200 is the 5' end). The bases indicated on top of the plots is the genomic nucleotide, the color encodes the sequenced nucleotide. Opposite=0 means sense, opposite=1 means antisense. Overlap=0 means outside of any doubly sequenced part, overlap=1 means within the doubly sequenced part. This file in particular is useful to identify artifacts at the beginning or end of the reads (to be excluded from analysis by the
-trim5p
and-trim3p
parameters). - ${prefix}.mismatchposzoomed.pdf: The same as ${prefix}.mismatchpos.pdf but the y axis is scaled to more relevant values.
- ${prefix}.binomRates.png: The probabilities of observing T-C mismatches in old and new RNA for each sample. This is the direct result of the EM algorithm for estimating pc described in the paper. The status shown indicates whether the rates could be reliably estimated.
- ${prefix}.binomNewProportion.png: The estimated total percentage of new RNA in each sample. Note that this is fundamentally different from the data shown in binomRates.png: The binomRates only depend on the efficiency of 4sU incorporation and conversion (i.e. it is a technical parameter). The new proportion has nothing to do with 4sU incorporation but only indicates how much of the RNA is new (i.e. it is a biological parameter).
- ${prefix}.single_new.png and ${prefix}.single_old.png: The probabilities of T-C mismatches in new and old RNA, respectively (is the same as .binomRates.png if not overridden by command line parameter).
- ${prefix}.double_new.png and ${prefix}.single_old.png: The probabilities of doubly sequenced T-C mismatches in new and old RNA, respectively.
The most important thing for the mapped reads is that the MD
attribute is reported (e.g. for STAR: use --outSAMattributes MD
), as this is how mismatches in read mapping are encoded in SAM. We also highly recommend to include the NH
attribute --outSAMattributes MD NH
, which tells GRAND-SLAM about the number of mapping positions for each read. In addition, we prefer reads mapped without softclipping (e.g. for STAR: --alignEndsType EndToEnd
) and using the -trim5p
and -trim3p
parameters to generally remove artifactual patterns of mismatches at the ends of the reads.
Especially if reads have been mapped in end to end mode, there may be artifactual patterns of mismatches at the very 5' end. It is wise to inspect the .mismatchpos.pdf
files generated by GRAND-SLAM, and to use the -trim5p
parameter to remove the first bases, if necessary!
As T>C SNPs would influence on the estimated new/total ratios, they are first filtered out. Thus, it makes perfect sense to run GRAND-SLAM on all samples from the experiment (on the same cells). This can be done by either putting them into a CIT file, or by creating a bamlist file (see here] for more information).
Make sure that you either have samples without 4sU to let GRAND-SLAM estimate the regression model from these data, or to have the regression model for you protocol/sequencer. For the former, make sure that the names of the conditions (i.e. given in the metadata of the CIT file, or by the name of the BAM file) include "no4sU" (or, alternatively, use the -no4sUpattern
parameter to tell GRAND-SLAM how to identify them). For the latter, specify it via the -errlm
parameter.
For experiments based on metabolic labeling and chemical nucleoside conversion, several considerations about the observed reads are important:
- A (single-end) read can be sense or antisense. It is sense, if it has the same orientation as the mRNA, and it is antisense, if it is the reverse-complement of the mRNA sequence. A sequencing protocol may preserve strand specificity and produce (more or less) only sense reads (e.g. Quant-seq) or antisense reads (current TruSeq protocols). It may also lose strand-specificity (e.g. SMART-seq), producing both sense and antisense reads.
- In paired-end sequencing, there is a first and a second read. The first read is the one where the sequencing reaction is initiated at the Read1 primer, the second read the one initiated at the Read2 primer. In general, one read is sense, the other one is antisense. Here, we define a read pair to be sense, if the first read is sense, and antisense if the first read is antisense.
- Libraries from any protocol can be sequencing either in single-end mode, or in paired-end mode. Thus, there are four kinds of reads produced by a strand-unspecific protocol in paired-end mode: a) First read, sense; b) Second read, antisense; c) First read, antisense; d) Second read, sense. Normally, a and b occur in a read pair, and c and d occur in a read pair.
- If the insert (the part of the DNA library corresponding to RNA fragments) is shorter than 2x the read length, there are doubly sequenced base pairs. As sequencing errors are random events with probabilities in the order of 10-4, the probability of observing the corresponding mismatch twice due to sequencing errors at the same position is negligibly small. This helps GRAND-SLAM to more precisely distinguish old and new RNA (but be aware that sequencing errors and 4sU conversions are not the only cause of mismatches!).
After mapping reads to the genome (or transcriptome), mismatches can be observed: For instance, if a read containing a C at some position maps to a location on the genome, where there is a T instead, we speak of a T-C mismatch (this could correspond to a 4sU conversion, or a sequencing error, or an error during reverse transcription or PCR and a few other causes).
The goal of experiments based on metabolic labeling and chemical nucleoside conversion is to distinguish RNA that has been synthesized during the time of labeling (new RNA) or before that (old RNA). We also speak of new reads (reads originating from new RNA) and old reads (reads from old RNA). We advice against calling it labeled and unlabeled, because, technically, an RNA (or read) can be new without being labeled (if there are only few T in the corresponding genomic locus, and even if 4sU was present, no incorporation took place).
The first thing you may want to know about your SLAM-seq (Timelapse-Seq, TUC-seq,...) experiments is the labeling efficiency: How much RNA has been labeled? This is what could be determined with sufficiently long reads without sequencing errors (PCR errors, transcription errors, RNA editing, ...): If there is no other cause for T-C mismatches, and each fragment of new RNA that is sequenced indeed contained a 4sU, one could easily determine for each read, whether is is from new or old RNA, and, by counting and dividing them, the labeling efficiency.
However, due to short reads, and non-negligible other sources for mismatches, this is not possible. Indeed, the labeling efficiency is in fact not even the most interesting parameter of the experiment, as it depends on two things: The percentage of new RNA in the sample and the rate of 4sU incorporation and conversion (refered to as substitution rate). These two parameters constitute the most fundamental principle that GRAND-SLAM is based on: The probability referred to as pc in the paper basically is the substitution rate (to be more precise, it is the probability of observing a T-C mismatch in a new read, and therefore also includes sequencing errors etc.), and the new-to-total ratio (πg) is the percentage of new RNA for a gene g.
Imagine two experiments, both with labeling for 2 hours in the same cell line. In experiment A a low dosage of 4sU was used (e.g. 50μM), in experiment B a high dosage (e.g. 500μM). For both, the amount of new RNA for each gene is the same, but the substitutions rate will be significantly lower in experiment A. Conversely, if the same 4sU dosage was used in both experiments, but A was done on slowly dividing cells, and B on quickly dividing cells, the subsitution rate will be the same (assuming a similar 4sU uptake in both cell types), but B will have much more new RNA. In both cases, the labeling efficiency will be higher in B than in A, but for very different reasons.
If you report any errors, please always use the -D
parameter to run GRAND-SLAM (gedi -e Slam -D ...
). This provides a much more verbose error message!
For GRAND-SLAM you can savely ignore these. The reason for getting these is that you don't have the JavaFX (or openfx) package installed on your system. GRAND-SLAM does not use it.
By default, GRAND-SLAM will only process genes that have biotype protein_coding. Use the parameter -allGenes
to circumvent that!
If you process an excessive amount of samples, you need to convert them to cit first (see here)