PolyTailor is a tool for studying RNA tails. It predicts:
- per-read poly-A tail length estimates
- per-read tail heterogeneity (i.e. non-A features)
This software is meant to be used with Nano3P-seq cDNA libraries, and can work with Nano3P-seq libraries sequenced using R9 and R10 flowcells.
You'll need:
- samtools v1.19+
- minimap2 v2.28+
- IsoQuant 3.5+
- Python 3.8+ with the following packages installed: matplotlib numpy parasail pybedtools pysam pandas scipy seaborn htseq
All of the above can be installed using conda and pip:
conda create -c conda-forge -c bioconda -n polyTailor python=3.10 samtools minimap2 isoquant
conda activate polyTailor
pip install matplotlib numpy parasail pybedtools pysam pandas scipy seaborn htseq
mkdir -p ~/src && cd ~/src
git clone https://github.com/novoalab/polyTailor.git
- dorado v0.7.2+ and basecalling models. Here, we assume a Linux x64 system. For other systems, please see the dorado page.
mkdir -p ~/src && cd ~/src
wget https://cdn.oxfordnanoportal.com/software/analysis/dorado-0.7.2-linux-x64.tar.gz
tar xpfz dorado-0.7.2-linux-x64.tar.gz
echo 'export PATH=~/src/dorado-0.7.2-linux-x64/bin:$PATH' >> ~/.bashrc
source ~/.bashrc
dorado download --directory ~/src/dorado/models --model dna_r9.4.1_e8_hac@v3.3
dorado download --directory ~/src/dorado/models --model dna_r9.4.1_e8_sup@v3.3
dorado download --directory ~/src/dorado/models --model dna_r10.4.1_e8.2_400bps_hac@v5.0.0
dorado download --directory ~/src/dorado/models --model dna_r10.4.1_e8.2_400bps_sup@v5.0.0
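Before running the pipeline, it can help to confirm that the external tools installed above are reachable on PATH. The following is a minimal Python sketch and not part of PolyTailor; the helper name `missing_tools` is hypothetical, and the executable names are simply those used in the commands of this README.

```python
import shutil

# Executables called by the commands in this README.
REQUIRED_TOOLS = ("samtools", "minimap2", "isoquant.py", "dorado")


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the names from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All required tools found.")
```

If anything is reported as missing, re-check that the conda environment is activated and that the dorado `bin` directory was added to PATH.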
PolyTailor can be executed as follows (steps 3-4 are optional):
dorado basecaller -x cuda:all --emit-moves -r MODEL [--kit-name BARCODING_KIT] pod5_dir > reads.bam
For the most accurate poly-T composition calling, we recommend using the latest sup model.
If a barcoding --kit-name is provided, the barcode will be reported in the barcode column.
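The basecalling call above can also be assembled programmatically, for example when processing R9 and R10 runs in one script. This is a hedged sketch rather than part of PolyTailor: the function name `basecaller_command` and the `SUP_MODELS` mapping are hypothetical, the model names are the sup models from the installation step, and the argument order mirrors the command shown in this README. Depending on where the models were downloaded, a full path to the model directory may be needed instead of the bare name.

```python
import shlex

# sup models downloaded in the installation step of this README.
SUP_MODELS = {
    "R9": "dna_r9.4.1_e8_sup@v3.3",
    "R10": "dna_r10.4.1_e8.2_400bps_sup@v5.0.0",
}


def basecaller_command(flowcell, pod5_dir, kit_name=None):
    """Build the dorado basecaller call from this README as an argument list."""
    cmd = ["dorado", "basecaller", "-x", "cuda:all", "--emit-moves", "-r",
           SUP_MODELS[flowcell]]
    # The barcoding kit is optional, as in the README command.
    if kit_name is not None:
        cmd += ["--kit-name", kit_name]
    cmd.append(pod5_dir)
    return cmd


if __name__ == "__main__":
    # Print a shell-ready command; the output would be redirected to reads.bam.
    print(shlex.join(basecaller_command("R10", "pod5_dir")) + " > reads.bam")
```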
samtools fastq -F2304 -T mv,ts,BC reads.bam|minimap2 -y -ax splice:hq genome.fa -|samtools sort --write-index -o algs.bam
src/get_transcript_ends.py --firststrand -q0 -o transcript_ends.tsv.gz -a genome.gtf -b algs.bam [algs2.bam ... algsN.bam]
isoquant.py --complete_genedb --data_type nanopore -o isoquant -r genome.fa -g genome.gtf --stranded reverse --bam algs.bam
src/get_pT.py -o pT.tsv.gz -b algs.bam \
-e transcript_ends.flt.tsv.gz.bed \
-i <(zgrep -v '^#' isoquant/OUT/OUT.read_assignments.tsv.gz | cut -f1,4,6,9)
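The `-i` argument above is fed by a shell pipeline that drops lines starting with `#` and keeps columns 1, 4, 6 and 9 of the IsoQuant read assignments. For readers less familiar with `zgrep` and `cut`, the same selection can be sketched in Python. The function name `select_columns` is hypothetical and this is not how PolyTailor itself reads the file; it only illustrates what the pipeline passes on.

```python
import gzip


def select_columns(path, columns=(1, 4, 6, 9)):
    """Yield the chosen 1-based TAB-separated columns from a gzipped TSV,
    skipping lines that start with '#' (like zgrep -v '^#' | cut -f1,4,6,9)."""
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if line.startswith("#"):
                continue
            fields = line.rstrip("\n").split("\t")
            # Like cut, silently skip requested columns that a line lacks.
            yield [fields[i - 1] for i in columns if i <= len(fields)]


if __name__ == "__main__":
    for row in select_columns("isoquant/OUT/OUT.read_assignments.tsv.gz"):
        print("\t".join(row))
```

The first selected column is presumably the read identifier; the remaining three show up as extra comments columns in the PolyTailor output.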
PolyTailor will produce a TAB-delimited file with the following columns:
- read_id
- barcode - detected barcode (unknown if no barcode detected)
- mapq - mapping quality
- filter - OK means that the following filters were passed:
  - read has a 5' clipped part corresponding to: adapter, N3PS primer and pT (otherwise no_clip)
  - N3PS primer was detected in the 5' clipped part (otherwise no_primer)
  - the N3PS primer aligned end-to-end (otherwise not_complete)
  - pT sequence was detected between primer and aligned transcript (otherwise no_pT)
  - the pT immediately follows the primer (otherwise not_continuous)
- pt_len - estimated poly-T length.
- per_base - helicase speed estimated from mv table (mean number of chunks per base)
- primer_end - position of the N3PS primer end in the read sequence
- pt_start - position of the poly-T start in the read sequence
- before_pt - sequence before detected poly-T (terminal bases of poly-A)
- pt_seq - poly-T sequence composition
- transcript_end - present if read end is associated with any of predicted transcript ends
- distance - distance from the transcript end
- comments - additional fields passed from the -i / --readids file
Note, there may be multiple comments columns, depending on the provided -i / --readids file. For the isoquant example above, you'll see:
- isoform_id
- assignment_type
- additional_info
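As a starting point for working with the output table described above, the sketch below counts reads per filter value and reports the median pt_len of reads that passed all filters. It is an illustration under assumptions, not part of PolyTailor: the function name `summarise_pt` is hypothetical, and the code assumes the file starts with a header line holding the column names listed above and that pt_len is numeric for OK reads.

```python
import csv
import gzip
import statistics
from collections import Counter


def summarise_pt(path):
    """Count reads per filter value and return the median pt_len of OK reads."""
    filter_counts = Counter()
    ok_lengths = []
    with gzip.open(path, "rt", newline="") as handle:
        # Assumption: the first line is a header naming the columns.
        reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            filter_counts[row["filter"]] += 1
            if row["filter"] == "OK":
                ok_lengths.append(float(row["pt_len"]))
    median_len = statistics.median(ok_lengths) if ok_lengths else None
    return filter_counts, median_len


if __name__ == "__main__":
    counts, median_len = summarise_pt("pT.tsv.gz")
    for name, n in counts.most_common():
        print(name, n, sep="\t")
    print("median pt_len of OK reads:", median_len)
```

Looking at the filter counts first is a quick way to see why reads were excluded (for example, many no_primer reads) before interpreting tail lengths.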
You can find test data and example outputs in the test directory.
If you find this work useful, please cite: Begik O*, Pryszcz LP*, Niazi AM, Valen E, Novoa EM. Nano3P-seq: a protocol to chart the coding and non-coding transcriptome at single molecule resolution. (in preparation)
If you have an issue running this code, please open a new GitHub issue. Before doing so, please take a look at previous issues, including closed ones. Thanks!