Overly large "transposases" #45
Comments
Hi clb21565, I don't know what your question is. Could you give more details about what the picture shows? How did you get the search results in the picture? Xie
Thanks! This image is the result of searching one of the entries in the orf.faa file against NCBI NR. The protein is conspicuously large, see here:
This is what was returned using ISEScan (default settings). Searching NCBI for this sequence using blastp, I had alignments to multiple proteins, suggesting that this was somehow a fused open reading frame spanning multiple coding regions (as in the picture above). Many of the orf.faa files have proteins like this which are definitely erroneous. Rerunning Prodigal on the IS fna files produces multiple ORFs. Hope this clears things up, but let me know if I can provide additional information. I appreciate the fast response. Connor
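One rough way to find such entries in bulk is to scan a protein FASTA file such as orf.faa and list the records above a length cutoff. The following is a minimal sketch, not part of ISEScan: the function names and the 600 aa default cutoff are hypothetical placeholders, and the cutoff should be chosen for the IS families you expect.

```python
from typing import Iterable, Iterator, List, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from the lines of a FASTA file."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def flag_long_proteins(lines: Iterable[str], max_aa: int = 600) -> List[Tuple[str, int]]:
    """Return (record_id, length) for proteins longer than max_aa residues.

    max_aa is an arbitrary illustrative cutoff, not an ISEScan setting.
    """
    flagged = []
    for header, seq in read_fasta(lines):
        length = len(seq.rstrip("*"))  # ignore a trailing stop-codon marker
        if length > max_aa:
            record_id = (header.split() or [""])[0]
            flagged.append((record_id, length))
    return flagged


if __name__ == "__main__":
    # Tiny in-memory example; for a real file use: with open(path) as fh: ...
    demo = [">orf1 example", "MKT", "AAA*", ">orf2", "M" * 10]
    for record_id, length in flag_long_proteins(demo, max_aa=6):
        print(record_id, length)
```

Records flagged this way are only candidates for fused ORFs; checking them with blastp or a second gene caller, as described above, is still needed.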
You are right. FragGeneScan, which ISEScan uses to predict genes/proteins, is a good tool for dealing with frameshift issues, but its predictions sometimes differ substantially from those reported by other gene prediction tools (and are sometimes incorrect, as in your case). ISEScan refines the boundaries of the predicted IS element copies when searching for them, especially for multi-copy IS elements, but it does not change the boundaries of the predicted transposases. So there might be a small number of cases where the transposase reported by ISEScan is larger than the corresponding IS element copy reported by ISEScan. The best solution is to feed ISEScan accurate gene/protein sequences instead of relying on FragGeneScan or any other single gene prediction tool. ISEScan does not yet let users supply gene/protein sequences for their genomes, but I will probably add this feature in the future. Xie
Hi there, another follow-up here. We have noticed that many of the predicted IS elements are also much too large, for instance some on the order of ~50 kbp. Is there a reason why this is happening, and would you have advice on how to fix it?
One reason might be that FragGeneScan, which ISEScan uses, predicts an inaccurate gene (later classified as a transposase by ISEScan because it is hit by a transposase model) that either fully covers a real IS element or largely overlaps with one. ISEScan always tries to extend the predicted transposase until it locates the TIR sequences at the left and right ends. In such a case, ISEScan might not be able to find the real TIR sequences of the real IS element, because they lie within the predicted gene/transposase. Instead, it could find fake TIR sequences outside the predicted gene/transposase, giving a much larger IS element with incorrect boundaries, because it is relatively easy to find two SHORT inverted repeat sequences in a larger space (a longer sequence) along the DNA sequence. There is no perfect (automated) fix until ISEScan is upgraded with a new feature that accepts accurate user-provided gene sequences (actually the translated protein sequences). The only way to fix it is to replace the ISEScan-predicted gene/protein sequences with your correct gene/protein sequences and re-run ISEScan:
The files in Hope this helps. Xie |
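The point about short inverted repeats being easy to find in a longer sequence can be illustrated with a small simulation on random DNA. This is only a sketch under simplifying assumptions: it looks for exact k-mer matches, whereas real TIRs tolerate mismatches, and it is not how ISEScan searches for TIRs. The function names, the k-mer length of 10, and the window sizes are all hypothetical choices for illustration.

```python
import random

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def has_inverted_repeat(left: str, right: str, k: int) -> bool:
    """True if some k-mer in `left` has its exact reverse complement in `right`."""
    right_kmers = {right[i:i + k] for i in range(len(right) - k + 1)}
    return any(
        reverse_complement(left[i:i + k]) in right_kmers
        for i in range(len(left) - k + 1)
    )


def chance_match_rate(window: int, k: int = 10, trials: int = 50, seed: int = 0) -> float:
    """Fraction of random flank pairs of length `window` sharing an inverted k-mer.

    Both flanks are random DNA, so every match counted here is spurious.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        left = "".join(rng.choice("ACGT") for _ in range(window))
        right = "".join(rng.choice("ACGT") for _ in range(window))
        hits += has_inverted_repeat(left, right, k)
    return hits / trials


if __name__ == "__main__":
    for window in (50, 500, 2000):
        print(window, chance_match_rate(window))
```

With these settings the spurious match rate should rise steeply as the search window grows, which is consistent with the explanation above for why an oversized predicted transposase can end up bounded by fake TIRs far from the real element.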
Xie, thanks for the detailed solution here. I will try this out.
Hi there, this is a lovely tool. I am noticing however that the .faa file produced appears to be translating more than just transposases - including sometimes up to 4+ genes (e.g., see below). I guess I am wondering how to interpret this -- any guidance would be much appreciated.