-
Notifications
You must be signed in to change notification settings - Fork 78
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Same gene id for multiple genes #270
Comments
Hi, I had the exact same problem - different genes being tagged with the same unique MSTRG identifier from stringtie --merge command. My work-around was to use this python script (https://gist.github.com/e2b3e89fc0220e4363765fadc8c1dd73), which modifies the merged.gtf output file by adding a suffix onto MSTRG identifiers for each different gene_name, and if there is no gene_name, then it adds a suffix onto MSTRG identifiers for each different transcript_id. Then I used this modified gtf file in stringtie -e command. |
Hi Jessica,
I hope you are doing well. I believe this script will be of much help, can you please tell me how I need to use this script, means do I just need to provide merged gtf as input or also need to provide merged gtf and reference gtf as input, sorry I am not good in python so any help will be highly appreciated.
Best Regards,
Muhammad Waqas Khokhar
…________________________________
From: Jessica Hayward <notifications@github.com>
Sent: 30 September 2020 17:32
To: gpertea/stringtie <stringtie@noreply.github.com>
Cc: Khokhar, Muhammad (w.w.khokhar40@canterbury.ac.uk) <w.w.khokhar40@canterbury.ac.uk>; Author <author@noreply.github.com>
Subject: Re: [gpertea/stringtie] Same gene id for multiple genes (#270)
Hi, I had the exact same problem - different genes being tagged with the same unique MSTRG identifier from stringtie --merge command. My work-around was to use this python script (https://gist.github.com/e2b3e89fc0220e4363765fadc8c1dd73), which modifies the merged.gtf output file by adding a suffix onto MSTRG identifiers for each different gene_name, and if there is no gene_name, then it adds a suffix onto MSTRG identifiers for each different transcript_id. Then I used this modified gtf file in stringtie -e command.
This isn't a solution to the problem for novel isoforms but thought I'd put it here in case it is helpful to others.
—
You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub<#270 (comment)>, or unsubscribe<https://github.com/notifications/unsubscribe-auth/ACQRPYOWTVME3YBBKLCPGVLSINMS7ANCNFSM4MVVP7FA>.
|
You just need your merged gtf file and then to run the script just use: |
Many thanks Jessica
Best Regards,
Muhammad Waqas Khokhar
PhD (Student)
School of Human and Life Sciences
Canterbury Christ Church University (CCCU)
Canterbury, Kent, UK.
…________________________________
From: Jess_Hay <notifications@github.com>
Sent: 02 October 2020 15:05
To: gpertea/stringtie <stringtie@noreply.github.com>
Cc: Khokhar, Muhammad (w.w.khokhar40@canterbury.ac.uk) <w.w.khokhar40@canterbury.ac.uk>; Author <author@noreply.github.com>
Subject: Re: [gpertea/stringtie] Same gene id for multiple genes (#270)
You just need your merged gtf file and then to run the script just use:
python unique_gene_id.py merged_file.gtf
The output just has ".unique_gene_id.gtf" on the end.
Good luck!
—
You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub<#270 (comment)>, or unsubscribe<https://github.com/notifications/unsubscribe-auth/ACQRPYKJT2WVDNXBQLCVRCLSIXM35ANCNFSM4MVVP7FA>.
|
I am interested to perform the differential gene expression (DE) and alternative splicing analysis of my RNA-seq dataset using Texudo2 approach as recommended in Pertea et al., 2016 (doi:10.1038/nprot.2016.095). I also want to use Deseq2 for DE analysis.
we have built a genome index using the command:
hisat2-build
--ss Brassica_napus.AST_PRJEB5043_v1.43.ss
--exon Brassica_napus.AST_PRJEB5043_v1.43.exons
-f Brassica_napus.AST_PRJEB5043_v1.dna.toplevel.fa
-p 4
napus
Then we aligned the reads using the following command:
hisat2 -x genome/napus -1 fastq/${R1[$i]} -2 fastq/${R2[$i]} -S S${sample[$i]}.sam -p 4
Here I forget to use --dta parameter
Next, we sort and index the files:
samtools sort -@ 4 -o S${sample[$i]}.bam S${sample[$i]}.sam
samtools index S${sample[$i]}.bam S${sample[$i]}.bai
The bam files were assembled using command:
stringtie S${sample[$i]}.bam -G genome/Brassica_napus.AST_PRJEB5043_v1.43.gtf -o S${sample[$i]}.bam -l S${sample[$i]} -p 4
Next, We merged the assembles using stringtie --merge:
stringtie --merge -G genome/Brassica_napus.AST_PRJEB5043_v1.43.gtf -o Bnapus.gtf -p 4 merge_Bnapus_gtf.names
Merged assembly was compared using gffcomapre:
gffcompare -r genome/Brassica_napus.AST_PRJEB5043_v1.43.gtf -o Bnapus_compare Bnapus.gtf
This merged assembly was used to quantify expression using the command:
stringtie -e -B -G Bnapus.gtf -o ballgown/$sampleName/$sampleName.gtf -p 4 bam/$sampleName.bam
Now the problem is that I am getting the same gene id for different genes, I have checked different threads and noticed that this issue is raised several times (#159, #118, #153, #190, #217). It addressed well in some threads and is quite clear from this statement:
StringTie makes assembly decisions based primarily on the read alignment data. Reference annotation is often imperfect and lacking, and in order to allow for the discovery of novel isoforms, StringTie always uses the read alignments as the basis of transcript assembly. Unfortunately read alignments can also be wrong/imperfect and may actually "bridge" neighboring genes, as it seems to happen in our case where we are getting the same gene id for different genes.
I believe this issue can be sorted by three main approaches as mentioned in previous threads:
Using a better or more stringent read alignment strategy may help with this problem. Or some post-alignment filtering can be applied to the alignment data in order to eliminate large, low scoring alignments which seem to spuriously "connect" neighboring genes. This could be very challenging and time-consuming.
We can try to use stringent parameters for assembly. E.g. increasing value for -f (fraction of isoforms) -j (junction coverage) -c (coverage allowed for the predicted transcripts). This approach will help us up to some extent as some merged ref will be separated but there will still remain some merged together.
An alternate, faster differential expression analysis workflow can be pursued if there is no interest in novel isoforms (i.e. assembled transcripts present in the samples but missing from the reference annotation), or if only a well known set of transcripts of interest are targeted by the analysis. This simplified protocol has only 3 steps as it bypasses the individual assembly of each RNA-Seq sample and the "transcript merge" step to avoid the case that One internally generated locus IDs like MSTRG with several gene names.
Keeping in mind all these points, I still need your suggestions for some points to proceed further with downstream analysis.
I didn't use --dta/--downstream-transcriptome-assembly (Report alignments tailored for transcript assemblers including StringTie. With this option, HISAT2 requires longer anchor lengths for de novo discovery of splice sites. This leads to fewer alignments with short-anchors, which helps transcript assemblers improve significantly in computation and memory usage.) which can be helpful for the downstream analysis. So my question is that do I really need to use it and have to do the alignment again or am I good to proceed for downstream analysis?
Is it a good idea to proceed with just uniquely mapped reads, maybe this help to reduce the false-positive and assist to sort out the problem of same gene id for multiple genes
If I used the alternate approach (which has only 3 steps) can I still detect the alternative splicing using ASTALAVISTA?
If I keep my focus on just known genes and perform DE analysis based on it, can we still publish good quality stuff?
How about performing analysis using following two complementary approaches to quantify expression and then perform DE analysis based on both approaches:
a) Stringtie
b) Salmon
Looking forward to hearing from you. please let me know if you need any further details.
The text was updated successfully, but these errors were encountered: