Multiple gene names mapping to a single gene ID #217

Open
crj32 opened this issue Mar 29, 2019 · 3 comments
@crj32

crj32 commented Mar 29, 2019

Hi

I ran the genome-guided assembly on my data, following the exact steps in the Nature Protocols paper, and I often see multiple gene_names per single gene ID in the merged .gtf file. Is this common? It looks incorrect to me, because the tool has basically merged several genes together to make its own gene.

Thanks,

Chris

chr20 StringTie exon 63734154 63734824 1000 + . gene_id "MSTRG.40027"; transcript_id "ENST00000484569.1"; exon_number "1"; gene_name "ZGPAT"; ref_gene_id "ENSG00000197114.11";
chr20 StringTie exon 63735159 63735236 1000 + . gene_id "MSTRG.40027"; transcript_id "ENST00000484569.1"; exon_number "2"; gene_name "ZGPAT"; ref_gene_id "ENSG00000197114.11";
chr20 StringTie transcript 63735463 63738441 1000 + . gene_id "MSTRG.40027"; transcript_id "ENST00000496820.2"; gene_name "RP4-583P15.15"; ref_gene_id "ENSG00000273154.3";
chr20 StringTie exon 63735463 63735564 1000 + . gene_id "MSTRG.40027"; transcript_id "ENST00000496820.2"; exon_number "1"; gene_name "RP4-583P15.15"; ref_gene_id "ENSG00000273154.3";
chr20 StringTie exon 63737845 63737902 1000 + . gene_id "MSTRG.40027"; transcript_id "ENST00000496820.2"; exon_number "2"; gene_name "RP4-583P15.15"; ref_gene_id "ENSG00000273154.3";
chr20 StringTie exon 63737973 63738060 1000 + . gene_id "MSTRG.40027"; transcript_id "ENST00000496820.2"; exon_number "3"; gene_name "RP4-583P15.15"; ref_gene_id "ENSG00000273154.3";
chr20 StringTie exon 63738183 63738441 1000 + . gene_id "MSTRG.40027"; transcript_id "ENST00000496820.2"; exon_number "4"; gene_name "RP4-583P15.15"; ref_gene_id "ENSG00000273154.3";
chr20 StringTie transcript 63736283 63738234 1000 + . gene_id "MSTRG.40027"; transcript_id "ENST00000444951.5"; gene_name "LIME1"; ref_gene_id "ENSG00000203896.9";
chr20 StringTie exon 63736283 63736396 1000 + . gene_id "MSTRG.40027"; transcript_id "ENST00000444951.5"; exon_number "1"; gene_name "LIME1"; ref_gene_id "ENSG00000203896.9";
chr20 StringTie exon 63737533 63737647 1000 + . gene_id "MSTRG.40027"; transcript_id "ENST00000444951.5"; exon_number "2"; gene_name "LIME1"; ref_gene_id "ENSG00000203896.9";
chr20 StringTie exon 63737821 63737902 1000 + . gene_id "MSTRG.40027"; transcript_id "ENST00000444951.5"; exon_number "3"; gene_name "LIME1"; ref_gene_id "ENSG00000203896.9";
chr20 StringTie exon 63737973 63738060 1000 + . gene_id "MSTRG.40027"; transcript_id "ENST00000444951.5"; exon_number "4"; gene_name "LIME1"; ref_gene_id "ENSG00000203896.9";
chr20 StringTie exon 63738183 63738234 1000 + . gene_id "MSTRG.40027"; transcript_id "ENST00000444951.5"; exon_number "5"; gene_name "LIME1"; ref_gene_id "ENSG00000203896.9";
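
To see how widespread this is in a merged file, the records can be grouped by gene_id and the distinct gene_name values collected for each. The following is a minimal sketch, not part of StringTie: the function names are hypothetical, and it assumes attributes are written as `key "value";` pairs as in the lines above. The regex works whether the columns are separated by tabs (as in a real GTF) or by spaces (as pasted here).

```python
import re
from collections import defaultdict

# Matches attribute pairs such as: gene_id "MSTRG.40027";
ATTR_RE = re.compile(r'(\w+) "([^"]*)"')


def gene_names_per_gene_id(gtf_lines):
    """Map each gene_id to the set of gene_name values seen on its records."""
    names = defaultdict(set)
    for line in gtf_lines:
        if line.startswith("#"):  # skip GTF comment/header lines
            continue
        attrs = dict(ATTR_RE.findall(line))
        if "gene_id" in attrs and "gene_name" in attrs:
            names[attrs["gene_id"]].add(attrs["gene_name"])
    return names


def multi_name_gene_ids(gtf_lines):
    """Return only the gene_ids that carry more than one gene_name."""
    grouped = gene_names_per_gene_id(gtf_lines)
    return {gid: sorted(n) for gid, n in grouped.items() if len(n) > 1}


# Three of the records quoted above, one per reference gene.
sample = [
    'chr20 StringTie exon 63734154 63734824 1000 + . gene_id "MSTRG.40027"; transcript_id "ENST00000484569.1"; exon_number "1"; gene_name "ZGPAT"; ref_gene_id "ENSG00000197114.11";',
    'chr20 StringTie transcript 63735463 63738441 1000 + . gene_id "MSTRG.40027"; transcript_id "ENST00000496820.2"; gene_name "RP4-583P15.15"; ref_gene_id "ENSG00000273154.3";',
    'chr20 StringTie transcript 63736283 63738234 1000 + . gene_id "MSTRG.40027"; transcript_id "ENST00000444951.5"; gene_name "LIME1"; ref_gene_id "ENSG00000203896.9";',
]
print(multi_name_gene_ids(sample))
# {'MSTRG.40027': ['LIME1', 'RP4-583P15.15', 'ZGPAT']}
```

On a full file, the same function can be given an open file handle, for example `multi_name_gene_ids(open("merged.gtf"))`, where `merged.gtf` is a placeholder file name.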

@mrijnkels

Hi, we have the same issue. We see it for genes that are next to each other and for genes that are several hundred kb apart.
We would really like to find out how to prevent this, as it makes the merged StringTie file not very useful.

@gpertea
Owner

gpertea commented May 2, 2019

This is a difficult issue to solve within StringTie, which makes assembly decisions based primarily on the read alignment data. Reference annotation is often imperfect and incomplete, and in order to allow for the discovery of novel isoforms, StringTie always uses the read alignments as the basis of transcript assembly. Unfortunately, read alignments can also be wrong or imperfect and may actually "bridge" neighboring genes, as seems to be the case in the situations you are reporting here.

Using a better or more stringent read alignment strategy may help with this problem. Alternatively, some post-alignment filtering can be applied to the alignment data in order to eliminate large, low-scoring alignments that seem to spuriously "connect" neighboring genes.
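
One possible reading of the post-alignment filtering suggested above is sketched here. It is not a StringTie feature and not a recommended setting. It assumes that "large" means a long skipped region (an `N` operation in the CIGAR string) and that "low scoring" means a low MAPQ value; both thresholds (`max_skip`, `min_mapq`) and all function names are hypothetical and would need tuning for a given aligner and genome.

```python
import re

# One CIGAR operation: a length followed by an operation code.
CIGAR_OP_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def longest_skip(cigar):
    """Length of the longest N (skipped reference region) in a CIGAR string."""
    skips = (int(n) for n, op in CIGAR_OP_RE.findall(cigar) if op == "N")
    return max(skips, default=0)  # 0 for unspliced reads or a "*" CIGAR


def keep_alignment(sam_line, max_skip=100_000, min_mapq=20):
    """Drop a record only if it spans a long gap AND has a low MAPQ."""
    fields = sam_line.rstrip("\n").split("\t")
    mapq = int(fields[4])   # SAM column 5: mapping quality
    cigar = fields[5]       # SAM column 6: CIGAR string
    return not (longest_skip(cigar) > max_skip and mapq < min_mapq)


def filter_sam(lines, max_skip=100_000, min_mapq=20):
    """Yield header lines unchanged and only the alignments that pass."""
    for line in lines:
        if line.startswith("@") or keep_alignment(line, max_skip, min_mapq):
            yield line
```

The generator works on SAM text, so it could be applied to the output of `samtools view -h` and the result converted back to BAM before rerunning StringTie. Whether this removes the bridging reads in a particular dataset would have to be checked, for instance by comparing the number of multi-name gene IDs before and after filtering.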

@mrijnkels

So, any suggestions on how to set up a better, more stringent alignment strategy?
