Pseudogenes detection #125
-
Hi!
Replies: 1 comment
-
We are also considering using larger genomic regions without any annotated features and feeding them into the pipeline described below as seed sequences. So yes, there is still much to implement and improve, but this is a large piece of new code, and we only started implementing it in this release.
Hi @YiJessePi,
Sure, and thanks for reaching out! The currently implemented methodology is far from comprehensive; it is rather a solid starting point for catching more cases in upcoming releases.
Currently, Bakta uses CDSs that remain as hypothetical proteins as seed sequences. Bakta then searches for reference proteins in its PSC database using relaxed thresholds for sequence identity and subject coverage of 80% and 40%, respectively. PSC references with sufficient hits are then aligned against the 6-frame translations of the CDS sequences, extended by 300 bp in both the upstream and downstream directions. These alignments are then analyzed to detect the causes of pseudogenization, e.g. is…
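To make the steps described so far more concrete, here is a minimal Python sketch of the three mechanical parts: filtering hits with the relaxed thresholds, extending a CDS window by 300 bp on each side, and producing the six-frame translation. This is not Bakta's actual code. All function names (`passes_relaxed_thresholds`, `extend_window`, `six_frame_translations`) are hypothetical, the coordinates are assumed to be 0-based and half-open, and the standard bacterial codon assignments are assumed.

```python
from itertools import product

# Codon table built from the NCBI base order "TCAG" (assumed: standard/bacterial code).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def passes_relaxed_thresholds(identity, subject_coverage,
                              min_identity=0.8, min_coverage=0.4):
    """Hypothetical filter using the 80% identity / 40% subject coverage values."""
    return identity >= min_identity and subject_coverage >= min_coverage


def extend_window(start, end, contig_length, flank=300):
    """Extend a 0-based half-open CDS interval by `flank` bp, clipped to the contig."""
    return max(0, start - flank), min(contig_length, end + flank)


def reverse_complement(seq):
    return seq.upper().translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons only; unknown codons (e.g. with N) become 'X'."""
    seq = seq.upper()
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))


def six_frame_translations(seq):
    """Return translations for frames +1..+3 and -1..-3, keyed by frame label."""
    frames = {}
    for strand, strand_seq in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(strand_seq[offset:])
    return frames


if __name__ == "__main__":
    window = extend_window(500, 800, contig_length=1000)
    print(window)
    print(six_frame_translations("ATGAAATAG"))
```

The translated frames would then be the targets for aligning the PSC reference proteins. The alignment step itself is left out here because the reply does not name the aligner at this point.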