Support alignment of protein sequences containing "*" #1

matchy233 · 2021-09-07T09:23:25Z

I'm using the C API of block-aligner to align protein sequences from the UniProt database. Some of these sequences contain *. Currently, aligning a sequence that contains * with block-aligner causes a segmentation fault. Users can work around this by mapping * to another supported character, but it would be nice if * were supported internally! :)


Daniel-Liu-c0deb0t · 2021-09-08T03:24:43Z

I'm not sure if * will ever be directly supported internally. It will always have to be mapped to some character that fits within the scoring matrix so that SIMD lookups can be done. Right now, the amino acid matrix supports the alphabetical characters A-Z.

There are a couple of ways this could be solved:

- An unused letter like J could be used to represent *, as you suggested. On the Rust side, the scores in the amino acid matrix can be cloned and changed, but this is not yet exposed in the C API. Without changing the scores, matches and mismatches with J incur a score of -128.
- A letter that is not one of the original 20 amino acids but still has predefined scores could be used. For example, * can be translated to X.
- Require letters to be mapped to numerical values 0-20, then allow block aligner to align numerical strings.
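
As a rough illustration of the second option, the caller could rewrite each sequence before handing it to the aligner. The sketch below is only an assumption about how that might look on the C side: `sanitize_protein` is a hypothetical helper, not part of block-aligner, and it deliberately makes no block-aligner calls. It replaces * with X in place and reports any other byte outside A-Z, since the matrix currently covers only those letters.

```c
#include <stddef.h>

/* Hypothetical helper (not part of block-aligner).
 * Scans seq[0..len) and replaces every '*' with 'X' in place.
 * Returns the number of '*' bytes replaced, or -1 as soon as it meets a
 * byte that is neither '*' nor an uppercase letter A-Z. On -1, bytes
 * before the offending position may already have been rewritten. */
long sanitize_protein(char *seq, size_t len) {
    long replaced = 0;
    for (size_t i = 0; i < len; i++) {
        char c = seq[i];
        if (c == '*') {
            seq[i] = 'X';   /* X already has predefined scores */
            replaced++;
        } else if (c < 'A' || c > 'Z') {
            return -1;      /* not representable in the A-Z matrix */
        }
    }
    return replaced;
}
```

Calling this on a writable copy of each sequence, and only aligning when the return value is non-negative, would avoid passing unsupported bytes to the aligner. Whether scoring * like X is acceptable depends on the application.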
