Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Question: Syntax for modifying the nucleotide scoring matrix #27

Open
nrminor opened this issue Feb 7, 2024 · 3 comments
Open

Question: Syntax for modifying the nucleotide scoring matrix #27

nrminor opened this issue Feb 7, 2024 · 3 comments

Comments

@nrminor
Copy link

nrminor commented Feb 7, 2024

Hi Daniel,

I confess I haven't quite figured out how to customize scoring for each nucleotide character, but I think I'm getting close. As you'll recall from my issue on triple_accel, I'm looking to compute nucleotide distances for pairs of sequences that ignore characters other than A, T, G, and C. Do I have it right that you can control that in block-aligner with the Matrix trait's set method? For example:

let mut scoring_matrix = NucMatrix::new_simple(1, -1);
scoring_matrix.set(b'N', b'A', 0);
scoring_matrix.set(b'N', b'T', 0);
scoring_matrix.set(b'N', b'G', 0);
scoring_matrix.set(b'N', b'C', 0);

If so, could you provide some guidance on how to use an updated NucMatrix in this portion of your readme example?

// Note that PaddedBytes, Block, and Cigar can be initialized with sequence length
// and block size upper bounds and be reused later for shorter sequences, to avoid
// repeated allocations.
let r = PaddedBytes::from_bytes::<NucMatrix>(b"TTAAAAAAATTTTTTTTTTTT", max_block_size);
let q = PaddedBytes::from_bytes::<NucMatrix>(b"TTTTTTTTAAAAAAATTTTTTTTT", max_block_size);

Thanks for the advice and for the excellent crate!
--Nick

@Daniel-Liu-c0deb0t
Copy link
Owner

Daniel-Liu-c0deb0t commented Feb 7, 2024

Thanks for trying this crate out!

You do not need to modify the PaddedBytes construction lines because they only need to know the type of the matrix (NucMatrix, AAMatrix, etc.). This affects how the alphabet is encoded internally in the PaddedBytes.

To use your newly created scoring_matrix, you need to change the following line:

a.align(&q, &r, &NW1, gaps, min_block_size..=max_block_size, 0);

to

a.align(&q, &r, &scoring_matrix, gaps, min_block_size..=max_block_size, 0);

This is where you actually run the aligner, so you need to provide a concrete instance of a scoring matrix. The NW1 scoring matrix is a predefined matrix with just +1 for matches and -1 for mismatches.

@nrminor
Copy link
Author

nrminor commented Feb 9, 2024

Oh perfect. Thanks for this, Daniel! That makes a lot of sense.

I've written block-aligner into my tool, but I've run into some trouble with my original idea of scoring mismatches with N as zero, and I wondered if I could ask for some of your insight as someone who has thought about this stuff much more.

I'll give a small example to get at what I'm seeing. Let's say we're comparing sequence A and sequence B, as follows:

>Sequence A
AAATGCCATCGCGTTGAA
>Sequence B
AAATGNNATCGNNNNNNNTAAA

With a simple NW1-style scoring matrix, where gap open is -2 and extend is -1, I imagine the sequences would be aligned something like this:

AAATGCCATCGCGT----TGAA
AAATGNNATCGNNNNNNNTAAA

Does that look right to you? If so, I'd expect the score to reflect 6 mismatches (-6), a gap open (-2), 3 gaps (-3), and 11 matches (+11), for a final score of 0.

However, if we want to update the scoring matrix such that mismatches with N's score to zero, would that mean that the aligner will always preference mismatching with Ns? With the above example, would it then do something like:

AAATGCCATCGCGTTGAA----
AAATGNNATCGNNNNNNNTAAA

According to the updating scoring matrix, this would give 8 matches, a gap open, and 3 gap extends, for an improved final score of 3. For much longer sequences, I suspect this would inflate the apparent score for sequences with N's substantially. In essence, I've achieved exactly what I wanted to avoid: a distance matrix that reflects more about the presence of N's than it does about the similarity between two sequences.

That said, I still think you've built an API that could help solve this! Is there a way you might suggest handling this situation in block-aligner? For example, is it possible to make the gap penalty lower on N's, but not on normal ATGC bases? Or would you just increase the upside for matches?

Thanks for taking the time to help!
--Nick

@Daniel-Liu-c0deb0t
Copy link
Owner

Daniel-Liu-c0deb0t commented Feb 9, 2024

Interesting scenario! I would probably do something like: increasing the match score (eg., +3) and decreasing the mismatch score (eg., -3) for ATCG and then still penalize mismatches with N slightly (eg., -1)?

Position-specific gap penalties is supported and it offers maximum flexibility, but its a little more involved to use.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants