Overly large "transposases" #45

Closed
clb21565 opened this issue Apr 12, 2022 · 6 comments

@clb21565

Hi there, this is a lovely tool. I am noticing however that the .faa file produced appears to be translating more than just transposases - including sometimes up to 4+ genes (e.g., see below). I guess I am wondering how to interpret this -- any guidance would be much appreciated.

image

@xiezhq
Owner

xiezhq commented Apr 13, 2022

Hi clb21565,

I'm not sure what your question is. Could you give more details about what the picture shows? How did you get the search results in the picture?

Xie

@clb21565
Author

Thanks! This image is the result of searching one of the entries in the orf.faa file against NCBI NR. The protein is conspicuously large, see here:

unnamed protein product
KKLLIKWATGFLNRCGYVVEPKAKPMDNVPAFIRQMEGNKMSREKLNTTQRALRILKALKGRSLTGLTNK
ELCEAIGETPVNVTRAIAFLEAEGFXATVKHWGVWFELSDFADCSKPRDGNAESLGTIGTGASKGASGCF
LIKICNQENRTMSELTLSQEQNAVALAAKAMTQDLAEAHEAMGMIKAFTFVGKLATVATLKKLAEVKEAR
NYKGLQYVNADGELATVASWEEFCTACGTSRRKVDEDLQNLNQFGEEFMETSQRLGLGYREMRKLRQLPE
EARAEIVDADYSETTDKEDLIEKIEDLTAKHAKEKESLTKQLESVKANYDAQAKVIANKDERLNKLDKEL
AKKTLLIETQTPDQRGGMLREEAAQISYKAEAILRGQVFQAFEALQTHQEEHGIDHRQFMSGVLAEYQLI
LSELKERFNLTDEPTGDNLPEWAKPEYADKPSVEPSIAAILDEVSDAQIWSKDDAMAILPSVLSHWANRV
ETAKFGETEKVIDEGCKQTGLSRATFLRQIKPYRPKSNRKVRSDKGKHQLEKAELDLISAAWLHLRQKNG
KTMATLERVLDILRANHRIKAEFIDENTGEVRPYSATSVERALRNANLHPDQLLRPAPVVQLQSKHPNHV
WQIDRHCVLYYLKETGKGNGLCIMEEGEFYKNKPANVAKVEPQRVWRYVITDHTSGVIYVEYVYGGETAE
NVSQCFINAIQPKANKAEPFFGVPKILMFDRGTANTSQMFSHLLHQLDVKVEIPKAKNARAKGQVEKGND
IVERQFESGLRFMNVSGLDELNQLAHQWMRYFNGKMVHSRHGRTRYQMWQFIRPEQLIMPPSREICQELM
ITALSERVVSDKLEISFESRRYDVRDVPDAKVGEKITVGKNPYRPECVQVQCFERVVDEDGSENLKPYWV
VVEPVEVNEYGFRVDAAMIGEEYKAHKKTEFETHKEQAEQLAYGVTNEDDLKRAKKVNKPLFNGEINPYK
HIEETNLNWFVPKKGQDHELTTNARRVEQKPVNLVECAKQLKERFPEWNGKHYKNLAKHFSEGVPITTLE
DWLQGNKLPEVLNPETKILQLNAPNFDKWRFYVLKLKQVLIDKGVSLRQLAQQMNVSPATVSQLINHNQR
VKQWVEFEKNLGSALQSLGIIEPLASLLEMEGTGESLATEPVPSAPKTTDEIKDEIMLLAKQALFPATKK
HFGLFRDPFAEDVRSADDVFSSADVRYVREALFQTAKHGGFMAVVGESGAGKSTLRRDLIDRINQENAPI
TVIEPYIIAMEDNDVKGKTLKAAHIAEAIISTLSPLESVKRSPEARFRQLHKVLKESVKSGYSNVLIIEE
AHALPIPTLKHLKRFFELEDGFKKLLSIVLIGQPELKIKLSERNTEVREVVQRCEIVELAPLDAELERYV
EHKLERVGKKLSDIFEEDAFVAVRQRLTAVGRNKTSQSLLYPLAVGNLLTAAMNLAESLGIPKVNGQVVM
MCKKVLGSITRANEEAFANFCYDFIKLVINSPEVIVSALIYGIEQFDLVEDENGKKSIEVKLYDKEKENG
EESSTIKADVHELNLQTADDVALAIKEIGDLERERVRLATLQADEKAVIDEKYTAKLTALKDKVKPLQKA
VQAYCESRRDVLTNGGKQKTAYFPTGEVQWRVKPPAVVAKGLESILDSLRKLGLFRFIRTKEELDKEAML
KEPEIARSISGISIREGVEEFVIKPNDXGGAKMTPSAKTERQFMYKEKAEAAARCEQLGNYQQAYNLWCE
AMKLATTEKQKNGVALEQIIVILGKASGACEMIDSLEQLKMQLQQAVRQLEQAEKAIDENELPLAQCYVF
TAKNLIMKLGLKMT

This is what was returned by ISEScan (default settings). When I searched NCBI for this sequence using blastp, it aligned to multiple proteins, which suggests that it is a fused open reading frame spanning multiple coding regions (as in the picture above). Many of the orf.faa files contain proteins like this, which are definitely erroneous. Rerunning prodigal on the IS fna files produces multiple ORFs.
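
One way to find such entries in bulk is to flag proteins that are far longer than a single transposase would be expected to be. A minimal sketch, assuming a plain FASTA file; the function names and the length cutoff are illustrative choices, not anything ISEScan defines:

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def flag_long_proteins(lines, max_len=1000):
    """Return (header, length) for proteins longer than max_len residues.

    max_len is an arbitrary cutoff chosen for illustration.
    """
    return [(h, len(s)) for h, s in read_fasta(lines) if len(s) > max_len]


# Tiny in-memory example (hypothetical records, short cutoff for the demo).
example = [">orf_a_1_30_+", "MKT", "LLV", ">orf_b_40_90_-", "M" * 12]
flagged = flag_long_proteins(example, max_len=10)
print(flagged)
```

For a real file, an open file handle can be passed in place of the list, since the function only needs an iterable of lines.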

Hope this clears things up, but let me know if I can provide additional information. I appreciate the fast response.

Connor

@xiezhq
Owner

xiezhq commented Apr 13, 2022

You are right. FragGeneScan, which ISEScan uses to predict genes/proteins, handles frameshifts well, but its predictions sometimes differ considerably from those of other gene prediction tools and are sometimes incorrect, as in your case. ISEScan refines the boundaries of the predicted IS element copies when searching for them, especially for multi-copy IS elements, but it does not change the boundaries of the predicted transposases. So there may be a small number of cases where the transposase reported by ISEScan is larger than the corresponding IS element copy reported by ISEScan. The best solution is to feed ISEScan accurate gene/protein sequences instead of relying on FragGeneScan or any other single gene prediction tool. ISEScan does not support this yet, but I will probably add a feature in the future that lets users supply gene/protein sequences for their genomes.

Xie

@xiezhq xiezhq closed this as completed Apr 16, 2022
@clb21565
Author

Hi there, another follow-up here. We have noticed that many of the predicted IS elements are also much too large, for instance some that are on the order of ~50 kbp. Is there a reason why this is happening, and would you have advice on how to fix it?

@xiezhq
Owner

xiezhq commented May 26, 2022

One reason might be:

FragGeneScan, which ISEScan uses, predicts an inaccurate gene (later classified as a transposase by ISEScan because it is hit by a transposase model) that either fully covers a real IS element or largely overlaps with one. ISEScan always tries to extend the predicted transposase until it locates TIR sequences at the left and right ends. In such a case, ISEScan might not be able to find the real TIR sequences of the real IS element, because they lie within the predicted gene/transposase. Instead, it could find fake TIR sequences outside the predicted gene/transposase, producing a much larger IS element with incorrect boundaries, because it is relatively easy to find two SHORT inverted repeat sequences in a larger space (a longer stretch of the DNA sequence).
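
The last point can be illustrated with a rough estimate. Under simplifying assumptions that are mine and not part of ISEScan (independent, uniformly distributed bases; exact matches only), the expected number of chance inverted-repeat pairs of length k between two flanking windows of w bases each is (w - k + 1)^2 / 4^k, so it grows roughly with the square of the window size. The function name below is hypothetical:

```python
def expected_chance_ir_pairs(window, k):
    """Expected number of exact length-k inverted-repeat pairs that occur by
    chance between two flanking windows of `window` bases each, assuming
    independent, uniformly distributed bases (a simplification)."""
    if k <= 0 or window < k:
        return 0.0
    positions = window - k + 1      # possible k-mer start positions in one window
    # Each (left, right) pair of positions matches with probability 4**-k.
    return positions ** 2 / 4 ** k


for window in (100, 1000, 10000):
    print(window, expected_chance_ir_pairs(window, k=12))
```

Allowing mismatches in the repeats, as a real TIR search typically would, only makes chance matches more likely than this estimate suggests.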

There is no perfect (automated) solution until ISEScan is upgraded with a new feature that accepts accurate, user-provided gene sequences (actually the translated protein sequences). For now, the only way to fix it is to replace the ISEScan-predicted gene/protein sequences with your correct gene/protein sequences and re-run ISEScan:

  1. Find the incorrectly predicted IS element (e.g. an IS element that you think is far too large) in the ISEScan predictions.
  2. Copy the nucleic acid sequence of the predicted IS element, and then use other tools or a BLAST search to obtain the correct (or at least what you believe to be correct) gene sequence and protein sequence.
  3. Once you have the correct gene/protein sequence from step 2, find the proteome predicted/translated from the DNA sequence(s) in your fasta file (your genome). It is the *.faa file in directory results/proteome, e.g. NC_012624.fna.faa. In the .faa file, find all protein sequences whose gene sequences overlap with the incorrectly predicted IS element, and replace those incorrectly predicted protein sequences with the correct protein sequences. You also need to update the start and end positions of the corresponding genes, which form the last part of each gene/protein description line (the lines starting with '>'). For example, in the description line >gi|228288719|ref|NC_012624.1|_995_2377_+, the suffix 995_2377_+ shows that the gene is on strand + and that its start and end positions are 995 and 2377, respectively.
  4. After you have replaced all incorrectly predicted genes/proteins in the .faa file, delete the corresponding HMM hits (two files for each .faa) in directory results/hmm, e.g. clusters.faa.hmm.NC_012624.fna.faa and clusters.single.faa.NC_012624.fna.faa.
  5. Re-run ISEScan as you did last time. ISEScan will skip translating your genome into a proteome and will instead search for transposases and IS elements using the protein sequences and gene positions in the updated .faa file.
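
The bookkeeping in step 3 can be sketched in a few lines. This is a minimal illustration based only on the description line format shown above; the helper names are hypothetical, and it assumes that start is never greater than end regardless of strand:

```python
def parse_gene_header(header):
    """Split a description line such as
    '>gi|228288719|ref|NC_012624.1|_995_2377_+' into
    (sequence id, start, end, strand)."""
    # Split from the right so underscores inside the sequence id are kept.
    seq_id, start, end, strand = header.lstrip(">").strip().rsplit("_", 3)
    return seq_id, int(start), int(end), strand


def overlaps(gene_start, gene_end, is_start, is_end):
    """True if the closed intervals [gene_start, gene_end] and
    [is_start, is_end] share at least one position."""
    return gene_start <= is_end and is_start <= gene_end


def format_gene_header(seq_id, start, end, strand):
    """Rebuild a description line with (possibly updated) coordinates."""
    return f">{seq_id}_{start}_{end}_{strand}"


header = ">gi|228288719|ref|NC_012624.1|_995_2377_+"
seq_id, start, end, strand = parse_gene_header(header)
print(seq_id, start, end, strand)
# Hypothetical oversized IS element spanning positions 900..52000.
print(overlaps(start, end, 900, 52000))
```

Any protein whose header overlaps the suspect IS element is a candidate for replacement; `format_gene_header` writes the corrected coordinates back in the same layout.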

The files in results/proteome are generated by FragGeneScan. The files in results/hmm are generated by HMMER.

Hope this helps.

Xie

@clb21565
Author

Xie, thanks for the detailed solution here. I will try this out.

@xiezhq xiezhq mentioned this issue Sep 3, 2022