A pipeline to identify A-to-I RNA editing sites using RNA-seq data. The method is adapted from Ramaswami et al. (2013) and follows GATK's current best practices.
- Run 2-pass mapping using STAR (VCmapSTAR.sh, then 2pass_VCmapSTAR.sh).
NOTE: Check BAM file with Picard's ValidateSamFile (validbam.sh) each time a BAM is generated.
- Add read group using Picard's AddOrReplaceReadGroups (addReadGroup.sh).
- Identify and remove duplicate reads with Picard's MarkDuplicates (picardup.sh).
- Filter reads with low MAPQ (<20) with samtools (filtersam.sh).
- Index BAM file from the previous step (index.sh).
- Split reads that contain N operators in their CIGAR strings with GATK's SplitNCigarReads (splitncigar.sh).
- Build the base quality score recalibration (BQSR) model with GATK's BaseRecalibrator (base_recalibrator.sh).
- Apply the recalibration with GATK's ApplyBQSR (applybqst.sh), then call variants with GATK's HaplotypeCaller (gvcf_haplotypeCaller.sh).
- Merge GVCF files into a single VCF file with GATK's GenotypeGVCFs (genotypegvcfs.sh).
NOTE: Check VCF file with Picard's ValidateVCF (validvcf.sh).
- Run variant quality score recalibration (VQSR) with GATK's VariantRecalibrator (variantRecalib.sh), then ApplyVQSR (applyvqsr.sh) to generate a recalibrated VCF.
- Select only SNPs from the VCF (snponly.sh), then filter out known SNPs [avsnp138] and variants at splicing junctions [dbscsnv11] with ANNOVAR (inputannovar.sh, dbsnp_annovar.sh and spl_annovar.sh).
- Filter only A-to-I editing sites (AtoIFilter.sh).
- Separate variants in Alu and non-Alu regions (alufilter.sh).
- For Alu variants, annotate directly to UCSC's knownGene (knownGene.sh).
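
The A-to-I filtering step above can be illustrated with a minimal Python sketch. It is not taken from AtoIFilter.sh: the function names are hypothetical, and it assumes tab-separated VCF-style records with REF in column 4 and ALT in column 5. Because inosine is read as guanosine during sequencing, an A-to-I edit appears as A>G when the transcript is on the plus strand and as T>C when it is on the minus strand, so with unstranded data both substitutions are kept.

```python
# Hypothetical helpers for illustration; not the contents of AtoIFilter.sh.
# Inosine is read as G, so A-to-I edits appear as A>G (plus-strand
# transcripts) or T>C (minus-strand transcripts) relative to the reference.
A_TO_I_SUBSTITUTIONS = {("A", "G"), ("T", "C")}


def is_a_to_i_candidate(ref, alt):
    """Return True if a single REF/ALT pair is consistent with A-to-I editing."""
    return (ref.upper(), alt.upper()) in A_TO_I_SUBSTITUTIONS


def filter_a_to_i(vcf_lines):
    """Keep header lines and records whose REF>ALT is A>G or T>C.

    Assumes tab-separated VCF-style records (REF = column 4, ALT = column 5).
    Multi-allelic records such as ALT = "G,C" do not match and are dropped.
    """
    kept = []
    for line in vcf_lines:
        if line.startswith("#"):
            kept.append(line)  # pass header lines through unchanged
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            continue  # skip malformed records
        if is_a_to_i_candidate(fields[3], fields[4]):
            kept.append(line)
    return kept
```

With strand-specific libraries the check could be narrowed to the substitution that matches the transcript strand; that refinement is left out here.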
The remaining steps apply to non-Alu variants only.
- Remove variants in simple repeats, using UCSC's RepeatMasker annotation (bedfilter.sh).
- Remove variants in homopolymer regions (homopolymer.sh).
- Ensure unique mapping using BLAT (BLAT.sh).
- Separate the non-Alu variants into repetitive and non-repetitive sets (repeatorno.sh).
- Annotate to UCSC's knownGene (nonALU_knownGene.sh).
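
As an illustration of the homopolymer step in the list above, the following Python sketch measures the run of identical bases around a candidate site and flags the site when the run is long enough. It is not the logic of homopolymer.sh: the function names, the 0-based indexing, and the default threshold of 5 bases are assumptions made for this example.

```python
# Hypothetical helpers for illustration; not the contents of homopolymer.sh.

def homopolymer_run_length(sequence, index):
    """Length of the run of identical bases containing sequence[index] (0-based)."""
    base = sequence[index].upper()
    start = index
    # Extend the run to the left while the neighbouring base matches.
    while start > 0 and sequence[start - 1].upper() == base:
        start -= 1
    end = index
    # Extend the run to the right while the neighbouring base matches.
    while end < len(sequence) - 1 and sequence[end + 1].upper() == base:
        end += 1
    return end - start + 1


def in_homopolymer(sequence, index, min_run=5):
    """Return True if the site sits in a run of at least min_run identical bases.

    min_run=5 is an assumed default, not a value stated in this README.
    """
    return homopolymer_run_length(sequence, index) >= min_run
```

For example, in the sequence `ACGTAAAAAGCT` the site at index 6 sits in a run of five A bases and would be flagged, while the site at index 2 would not.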