-
Notifications
You must be signed in to change notification settings - Fork 26
Home
Dynamics analysis of Alternative PolyAdenylation from multiple RNA-seq data
The dynamic usage of the 3’untranslated region (3’UTR) resulting from alternative polyadenylation (APA) is emerging as a pervasive mechanism for regulating mRNA diversity, stability and translation. Though RNA-seq provides the whole-transcriptome information and a lot of tools for analyzing gene/isoform expression are available, very few tool focus on the analysis of 3’UTR from standard RNA-seq. DaPars v2 is the next generation of DaPars that directly infers the dynamic alternative polyadenylation (APA) usage by comparing standard RNA-seq from multiple samples. Given the annotated gene model, DaPars v2 can infer the de novo proximal APA sites as well as the long and short 3’UTR expression levels. Finally, the dynamic APA usages of each samples will be identified.
Dapars2 requires the following file formats as input:
Format | Description | Example |
---|---|---|
BED | a BED12 file of gene models. It can be downloaded from UCSC | hg38_wholeGene_annotation.bed |
Bedgraph | files in this format store the reads alignment information, which can be converted from BAM files by bedtools (e.g.:"bedtools genomecov -ibam *.bam -bga -split -trackline") | sample_1.wig |
plain text | a tab-delimited two columns file contains Refseq ID mapping with gene name | hg38_refseq_IDmapping.txt |
1.1 Download reference gene annotation from UCSC Table browser
Gene 3'UTR regions can be extracted from reference gene annotation, any reference gene annotations available at UCSC table browser are supported by DaPars2. Follow the guide shown in the figure below to get the gene annotation and the ID mapping between RefSeq gene id and gene symbol.
NOTE: DaPars2 is not limited to human reference genome hg38, but also open to other human reference genomes and reference genomes of other species. If users choose other version of genomes, please replace the choosed species in red box 1 and assembly version in red box 2. The important thing is, the reference genome annotation used should be consistent with the one used in generating RNA-seq bam files.
![image](https://user-images.githubusercontent.com/30954983/174078252-5bcfbf69-4c65-48bc-91c6-5c053e6e27c0.png)
- choose species in box 1
- choose reference genome version in box 2
- select "Genes and Gene Predictions", NCBI RefSeq", and "RefSeq All" in box 3
- choose "BED" in box 4
- check "name" in box 5 and "name2" in box 6
DaPars2 will use the extracted distal polyadenylation sites to infer the proximal polyadenylation sites based on the alignment wiggle files of all samples. The output in this step will be used by step 3.
Run the following commands to generate 3‘UTR region annotation:
python DaPars_Extract_Anno.py -b hg38_wholeGene_annotation.bed -s hg38_refseq_IDmapping.txt -o hg38_3UTR_annotation.bed
-h, --help Show this help message and exit.
-b GENE_BED_FILE, --bed=GENE_BED_FILE The gene model in BED format. The BED file can be downloaded from UCSC Table Browser.
-s ID_mapping_FILE, --gene_symbol_map=ID_mapping_FILE The ID mapping between RefSeq transcript and gene symbol.
-o OUTPUT_FILE, --out-preifx=OUTPUT_FILE The extracted annotation region will be stored into this file.
DaPars2 will use number of mapping reads for each samples to normalize library sizes. The output in this step will be used by step 3.
The format of the mapping reads file is:
wig | depth |
---|---|
wig/sample1.wig | 76485651 |
wig/sample2.wig | 57005615 |
wig/sample3.wig | 68960763 |
The first column is the wig file location and the second column is the corresponding mapped reads. The order of the samples in this file must be the same as the order of wig files in step 3.
Dapars2 uses multiple threads, Please run the following commands to perform DaPar2 analysis.
python DaPars2_Multi_Sample_Multi_Chr.py Dapars2_configure_file chrList.txt
The "Dapars2_configure_file" contains most of the parameter settings for running DaPars2.
The format of a configure file is:
# Specify the reference of 3'UTR region
Annotated_3UTR=RefSeq_hg38_3UTR_annotation.bed
# A comma separated list of wig files of all samples
Aligned_Wig_files=sample1.wig,sample2.wig
Output_directory=Dapars2_test/
Output_result_file=Dapars2
# Specify Coverage threshold
Coverage_threshold=10
# Specify the number of threads to process the analysis
Num_Threads=4
# Provide sequencing depth file for normalization
sequencing_depth_file=mapping_wig_location_with_depth.txt
The "chrList.txt" contains all chromosomes need to be analyzed. It can be extracted from "hg38_3UTR_annotation.bed" generated in Step 2 using command below:
> cat hg38_3UTR_annotation.bed |cut -f1|sort|uniq > chrList.txt
If you want to analyze each chromosome separately to save time and memory, Run the following commands to perform DaPars2 analysis:
python Dapars2_Multi_Sample.py Dapars2_configure_file current_processing_chromosome
Dapars2 will output a tab-delimited table contains at least 5 columns (as shown below). The first four columns are: Gene, fit_value, Predicted_Proximal_APA, Loci. And the additional columns represent PDUI values across samples.
"Predicted_Proximal_APA" means predicted proximal Poly(A) site; "Loci" means the 3'UTR region of the transcript.
We have built a docker image in name of "3aqtl_pipe" which includes the source codes of a pipeline for 3'aQTL mapping. The source codes of Dapars2 are also involved in this image. The "3aqtl_pipe" has been pushed to Docker Hub.
Users can use the 3'aQTL pipeline after creating a container from "3utr/3aqtl_pipe" image.
DaPars2
- DaPars2 v2.1 is released
- Lei Li: lei.li.bioinfo@gmail.com