Scripts for 5' Seq data analysis in Python 3 (ver 3.6)are split in two groups based where reads were mapped: a) to genome, b) to cDNA.
Scripts are in development stage so don't use for publications.
In addition to basic Python libraries it assumes pysam
, argparse
, pandas
, collections
, glob
to be installed.
# data files in the folder 0-References/
Genome.fa # is link to Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.fasta
Genome.gtf # is link to Saccharomyces_cerevisiae.R64-1-1.95.gtf
#
# reformated and indexed annotation for tabix (part of pysam)
genome.gtf.gz
genome.gtf.gz.tbi
fivePSeqMap.py
maps reads 5' ends. Output is in compressed HDF5 format. HDF file have keys for each strand chromosome; for example: "For_rpm/I"
and "Rev_rpm/I"
for the first chromosome.
# Run in batch
for f in 5-Aligned/*.bam; do ./fivePSeqMap.py -i $f; done
- Get coverage around stop
hdf_2_metagene_tables_stop_5PSeq.py
Creates metagene tables for stop codon. Input is output of previous script (fivePSeqMap.py
) files ending with*_idx_iv.h5
. Output is table with relative coordinates from -120 (default value in nt inside gene) up to 120. 0 position corresponds to 5' position of stop codon.
# command line parameters
./hdf_2_metagene_tables_stop_5PSeq.py
-i input *.h5: None
-prefix part of oufile : None
-annot annotation GTF: 0-References/genome.gtf.gz
-genome genome FastA: 0-References/Genome.fa
-col columns: rpm
-th region threshold: 15
-span included region: 120
-norm to gene coverage: Yes
-subsets by stop codon: None
-glist subsets by list: None
usage:
./hdf_2_metagene_tables_1_dev.py -i WTS1_5-End_21-33_idx_assign_rpm.h5 -prefix WTS1_stop_metagene -norm Yes
# batch run
for f in 6-AssignRaw/*.h5; do fo=${f##*/}; ./hdf_2_metagene_tables_stop_5PSeq.py -i $f -prefix ${fo/_idx_iv.h5} -col rpm -th 18 -span 180 -subsets NO ; done | tee hdf_2_metagene.log
- Merge coverage of different samples
merge_metagene_tbl_5PSeq.py
./merge_metagene_tbl_5PSeq.py -files "8-MetagTbl/*sum*" -o meta-Stop_th18-Span180.csv
- Plot metagene
Plot-STOP-MetaG-5PSeq.ipynb
compute_queuing_5PSeq.py
Computes weighted queuing for each gene with a coverage above threshold.
./compute_queuing_5PSeq.py -h
Computes ribosomes queing score at stop
-h, --help show this help message and exit
-i I input table of gene coverage in *.hd5 format
-o O output file name *.csv
-annot ANNOT GTF annotation file
-th1 TH1 Summary gene coverage 150 nt before stop - 10(rpm) default
-th2 TH2 Background coverage - codon mean from -115 up to Span
-span SPAN Positions before - stop recommended 150 or bigger
-bkg BKG region for bakground con be 1 or 2: 1 - before and 2 - between
peaks
-col COL column for values: "sum"; "rpm"; "counts"
for f in 6-AssignRaw/*.h5; do fo=${f##*/}; ./compute_queuing_5PSeq.py -i $f -th1 15 -th2 0.15 -span 180 -col rpm -o ${fo/_idx_iv.h5/_queuingScores.csv}; done
extract_subsequence_stop.py
./extract_subsequence_stop.py -test Yes
-annot annotation GTF: 0-References/genome.gtf.gz
-genome genome FastA: 0-References/Genome.fa
-b before stop codon: 30
-a after stop codon: 9
-translate translate: No
-list subsets by list: None
-outpf subsets by list: plain
usage:
./extract_subsequence_stop.py -annot 0-References/genome.gtf.gz -genome 0-References/Genome.fa -b 30 -a 9 > outfile.seq
# last amino acid for all genes
./extract_subsequence_stop.py -b 3 -a -3 -translate Yes -outpf Table > last_aa_tbl.txt
# last coding codon
./extract_subsequence_stop.py -b 3 -a -3 -outpf Table > last_coding-codon_tbl.txt
# stop codon
./extract_subsequence_stop.py -b 0 -a 0 -outpf Table > stop-codon_tbl.txt
# stop codon extended by one towards 3 prime
./extract_subsequence_stop.py -b 0 -a 1 -outpf Table > stop-codon-extended-3pr_tbl.txt
# stop codon extended by one towards 5 prime
./extract_subsequence_stop.py -b 1 -a 0 -outpf Table > stop-codon-extended-5pr_tbl.txt
# data file
Saccharomyces_cerevisiae.R64-1-1.cdna.all.fa
Main thing to consider is that cDNA sequences in most cases are starting with AUG (start codon) and ending with stop codon. Therefore nothing before start codon is aligned and region before stop is also poorly covered. Reads are usually 60 nt long and if half of read don't align (because there is no sequence towards 3' end in this cDNA sequence) read is not aligned. This setup complements to genome.
fivePSeqMap2cdnawithOffset.py
Input is:
- cDNA aligned reads in sorted indexed BAM and
- fastA file of cDNAs Output: offset (17 A-Site) corrected coverage for each cDNA
- pickled dictionary with cDNA IDs as keys and pandas table as value Tables index corresponds to codon positions. Columns are "codon"; "amino acid" and "counts"
usage: fivePSeqMap2cdnawithOffset.py [-h] [-i I] [-f F] [-of OF] [-o O]
five prime assinment (offset 17) of cDNA aligned 5'P-Seq reads
optional arguments:
-h, --help show this help message and exit
-i I aligned_to_cDNA_sorted.bam
-f F Saccharomyces_cerevisiae.R64-1-1.cdna.all.fa
-of OF offset for correcting 5prim mapping; default 17 (A-Site)
-o O output pickle file
#P-site offset 15
fivePSeqMap2cdnawithOffset.py -i 5-cdna-Aligned/5PSeq/KO20C_V_5PSeq_dedup.bam -of 15 -o KO20C_V_5PSeq_cdna_off15.pkl
When looking for a given codon (for example AAA) then 60 nt from 5' - and 30 nt from 3' end are excluded.
fivePSeqQSfromcDNA4PSiteCodon.py -codon AAA -i KO20C_V_5PSeq_cdna_off15.pkl -o KO20C_V_QS-coding-region_AAA_P-site.csv
-i input pickled dict : KO20C_V_5PSeq_cdna_off15.pkl
-codon P-Site codon : AAA
-per periodicity in codons: 10
-o output_table.csv : KO20C_V_QS-coding-region_AAA_P-site.csv
4,905 cDNAs from input file
QS for P-Site AAA A-Site NNN pairs
10,000 ...
20,000 ...
30,000 ...
40,000 ...
50,000 ...
60,000 ...
70,000 ...
80,000 ...
83,534 total QS for codon pairs AAA[NNN]!
KO20C_V_QS-coding-region_AAA_P-site.csv