stringtie merge merging multiple real genes #190

Open
jokelley opened this issue Jul 28, 2018 · 6 comments

@jokelley

I am using stringtie merge to find additional genes for a species. However, when I run the reference-guided merge, the program combines real genes that are close together into a single locus (two examples below). Is there a way to force the merge to keep the existing reference genes and annotate new genes? Or is there another program that would do this? I need to keep the existing gene set while also identifying possible novel genes.

MSTRG.5380 | rna7849 | KIF7
MSTRG.5380 | rna7851 | PLIN1
MSTRG.5380 | rna7850 | PLIN1
MSTRG.5380 | MSTRG.5380.1

MSTRG.5552 | rna8105 | DVL2
MSTRG.5552 | rna8106 | PHF23
MSTRG.5552 | rna8107 | GABARAP
MSTRG.5552 | MSTRG.5552.1
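
To see how many loci are affected, the listing above can be generated for the whole merged annotation. The following is a minimal sketch, assuming the merged GTF keeps a `gene_name` (or `ref_gene_id`) attribute on transcripts that came from the reference; the function name `merged_loci` and the coordinates in the usage example are made up for illustration.

```python
import re
from collections import defaultdict

# Matches GTF attribute pairs of the form: key "value";
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def merged_loci(gtf_lines):
    """Return {merged gene_id: sorted reference gene names} for every
    merged locus that contains more than one distinct reference gene."""
    names = defaultdict(set)
    for line in gtf_lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        # Only transcript records are needed; exons repeat the same attributes.
        if len(fields) < 9 or fields[2] != "transcript":
            continue
        attrs = dict(ATTR_RE.findall(fields[8]))
        gene_id = attrs.get("gene_id")
        # Assumption: reference-derived transcripts carry gene_name or ref_gene_id.
        ref = attrs.get("gene_name") or attrs.get("ref_gene_id")
        if gene_id and ref:
            names[gene_id].add(ref)
    return {g: sorted(n) for g, n in names.items() if len(n) > 1}


if __name__ == "__main__":
    # Toy records: IDs and names follow the first example above,
    # but the contig and coordinates are placeholders.
    demo = [
        'chr1\tStringTie\ttranscript\t100\t200\t1000\t+\t.\t'
        'gene_id "MSTRG.5380"; transcript_id "rna7849"; gene_name "KIF7";',
        'chr1\tStringTie\ttranscript\t300\t400\t1000\t+\t.\t'
        'gene_id "MSTRG.5380"; transcript_id "rna7851"; gene_name "PLIN1";',
    ]
    for gene_id, refs in merged_loci(demo).items():
        print(gene_id, ", ".join(refs))
```

For a real file, the same function can be fed an open file handle, for example `merged_loci(open("merged.gtf"))`, where `merged.gtf` is a placeholder path.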

@Kennyluo4

I have the same issue. Did you figure out how to solve it? I tried using the -g parameter to limit the gap allowed when merging transcripts, but it didn't work. It's impossible to continue with DEG and other downstream analyses when a "gene" includes several reference genes. I looked into one assembled "gene" that contains 5 reference genes. There is no overlap between these genes, and some are even separated by a 10 kb gap. Yet they are merged together as a new "gene".

@jokelley
Author

jokelley commented Aug 2, 2019 via email

I was not able to solve this within stringtie. Perhaps there have been updates? Stringtie developers, any insight?

On Fri, Aug 2, 2019 at 2:25 PM Ziliang Luo wrote:
> I have the same issue with it. Do you figure out how to solve this now? I tried to use -g parameter to limit the gap for merging transcripts but it didn't work. It's impossible to continue with the DEG and other downstream analysis when you have a "gene" that included several reference genes. I look into one assembled "gene" that contain 5 reference genes. There is no overlap between these genes, some even have 10 kb gap. Yet, they are merged together as a new "gene".

@Kennyluo4

So, did you use another assembler for your study? I saw some suggestions to use the simplified protocol, which skips novel transcripts and the merging step, or to make changes in the alignment step. But I don't want to play with the alignment settings, because it takes too much time. I also don't want to lose novel transcripts by using the simplified method, because I'm trying to identify lncRNAs and study alternative splicing.
I double-checked the genes that incorporate multiple reference genes: they are merged because some novel transcripts/isoforms span across the reference genes. That's why they are merged even though the reference genes are distant from each other. I tried using more stringent assembly parameters, e.g. increasing the values of -f (minimum isoform fraction), -j (minimum junction coverage), and -c (minimum coverage allowed for the predicted transcripts). The result is better: some merged reference genes are separated, but others are still merged together.
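
The check described here (finding novel transcripts that span more than one reference gene) could be scripted roughly as follows. This is only a sketch under stated assumptions: intervals are 1-based and inclusive as in GTF, the function name `spanning_transcripts` and the tuple layout are hypothetical, and the gene names and coordinates in the usage example are invented.

```python
def spanning_transcripts(ref_genes, novel_transcripts):
    """Find novel transcripts that overlap two or more reference genes.

    ref_genes:         iterable of (chrom, strand, start, end, gene_name)
    novel_transcripts: iterable of (chrom, strand, start, end, transcript_id)
    Returns {transcript_id: sorted list of overlapped gene names}.
    """
    ref_genes = list(ref_genes)
    hits = {}
    for chrom, strand, t_start, t_end, tid in novel_transcripts:
        overlapped = {
            name
            for g_chrom, g_strand, g_start, g_end, name in ref_genes
            # Same contig and strand, and the closed intervals intersect.
            if g_chrom == chrom and g_strand == strand
            and t_start <= g_end and g_start <= t_end
        }
        if len(overlapped) >= 2:
            hits[tid] = sorted(overlapped)
    return hits


if __name__ == "__main__":
    # Invented toy data: two reference genes 100 bp apart and two novel transcripts.
    refs = [("chr1", "+", 100, 200, "geneA"), ("chr1", "+", 300, 400, "geneB")]
    novel = [("chr1", "+", 150, 350, "novel.1"), ("chr1", "+", 500, 600, "novel.2")]
    for tid, genes in spanning_transcripts(refs, novel).items():
        print(tid, "spans", ", ".join(genes))
```

The pairwise loop is quadratic, which is fine for inspecting a handful of loci; for a whole annotation an interval index would be a better fit. Transcripts flagged this way are candidates for manual inspection rather than automatic removal, since they may reflect real read-through or fusion events.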

@angarb

angarb commented Apr 3, 2020

Hi, @Kennyluo4 @jokelley @gpertea!

Was a solution ever found? We are having the same problem! We would love to incorporate novel splice sites into our analysis, but removing the -e option seems to result in these long merged transcripts spanning multiple genes.

Thanks for any input!

@jokelley
Author

jokelley commented Apr 3, 2020 via email

@Kennyluo4

@angarb, you can only alleviate it by adjusting the assembly parameters, or by increasing the alignment stringency to improve alignment accuracy. If there really are many read pairs linking the two "gene models", you should probably trust your data. Not all genome annotations are perfect, and the evidence used to build them is not complete. There is a chance that the transcripts assembled from your specific tissue/cell line are real despite the difference. If you really trust the annotation and the alignment is very good, this issue may be related to alternative splicing or gene fusion events.
You can also try other assemblers, such as Trinity, for the analysis.
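
For anyone who wants to try the parameter route, here is a rough sketch of assembling one sample with stricter thresholds. The `-G`, `-o`, `-f`, `-j` and `-c` options are the stringtie flags mentioned in this thread; the helper name `stringtie_cmd`, the file names, and the threshold values are placeholders chosen for illustration, not recommended settings.

```python
import shlex


def stringtie_cmd(bam, ref_gtf, out_gtf,
                  min_iso_frac=0.3, min_junction_cov=3, min_cov=5):
    """Build a stringtie assembly command with stricter-than-default thresholds.

    The default values here are arbitrary illustrations and need tuning per dataset.
    """
    return [
        "stringtie", bam,
        "-G", ref_gtf,                  # reference annotation used as a guide
        "-o", out_gtf,                  # output GTF for this sample
        "-f", str(min_iso_frac),        # minimum isoform fraction
        "-j", str(min_junction_cov),    # minimum junction coverage
        "-c", str(min_cov),             # minimum coverage for predicted transcripts
    ]


if __name__ == "__main__":
    cmd = stringtie_cmd("sample1.bam", "reference.gtf", "sample1.gtf")
    # Print the command for review; pass `cmd` to subprocess.run(cmd, check=True)
    # to actually execute it once stringtie is installed and the files exist.
    print(shlex.join(cmd))
```

Building the command as a list keeps file names with spaces safe and makes it easy to sweep several threshold combinations, then compare how many merged loci remain after each run.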
