Support alignment of protein sequences containing "*" #1

Open
matchy233 opened this issue Sep 7, 2021 · 1 comment

Comments

@matchy233

I'm using the C API of block-aligner to align protein sequences from the UniProt database. Some of the protein sequences contain *. Currently, aligning a sequence that contains * with block-aligner causes a segmentation fault. Although users can work around this by mapping * to another supported character, it would be nice if * were supported internally! :)

@Daniel-Liu-c0deb0t
Owner

Daniel-Liu-c0deb0t commented Sep 8, 2021

I'm not sure if * will ever be directly supported internally. It will always have to be mapped to some character that fits within the scoring matrix, so that SIMD lookups can be done. Right now, the amino acid matrix supports the alphabetical characters A-Z.
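Until something changes on the library side, one way to avoid the crash is to check sequences on the caller side before handing them to the aligner. Here is a minimal sketch, assuming the supported range is exactly uppercase ASCII A-Z; `is_valid_protein_seq` is a hypothetical helper, not part of the block-aligner C API:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper (not part of block-aligner): returns true only if
 * every byte in seq[0..len) is an uppercase ASCII letter 'A'..'Z'.
 * Assumption: this is the range the amino acid matrix can look up. */
bool is_valid_protein_seq(const char *seq, size_t len) {
    for (size_t i = 0; i < len; i++) {
        if (seq[i] < 'A' || seq[i] > 'Z') {
            return false; /* e.g. '*' or a lowercase letter */
        }
    }
    return true;
}
```

A sequence that fails this check can then be rejected or remapped before alignment.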

There are a couple of ways this could be solved:

  1. An unused letter like J could be used to represent *, as you suggested. On the Rust side, the scores in the amino acid matrix can be cloned and changed, but this is not yet exposed in the C API. Without changing the scores, both matches and mismatches involving J incur a score of -128.
  2. A letter that is not one of the 20 standard amino acids but still has predefined scores can be used. For example, * can be translated to X.
  3. Require letters to be mapped to numerical values 0-20, then allow block aligner to align numerical strings.
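Options 1 and 2 both come down to rewriting the sequence on the caller side before it reaches the aligner. A rough sketch of that step follows; `map_stop_symbol` is a hypothetical helper name, not a function from the block-aligner C API, and the choice of replacement letter is left to the caller:

```c
#include <stddef.h>

/* Hypothetical helper (not part of block-aligner): replaces every '*' in
 * seq[0..len) with `replacement`, in place, and returns how many
 * characters were replaced. */
size_t map_stop_symbol(char *seq, size_t len, char replacement) {
    size_t replaced = 0;
    for (size_t i = 0; i < len; i++) {
        if (seq[i] == '*') {
            seq[i] = replacement;
            replaced++;
        }
    }
    return replaced;
}
```

Passing 'X' as the replacement corresponds to option 2 and uses the scores already defined for X. Passing 'J' corresponds to option 1, where every comparison involving the mapped position scores -128 unless the matrix is changed. Because the mapping is done in place, work on a copy if the original sequence is needed later.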
