Question: Syntax for modifying the nucleotide scoring matrix #27
Thanks for trying this crate out! You do not need to modify the To use your newly created `a.align(&q, &r, &NW1, gaps, min_block_size..=max_block_size, 0);` to `a.align(&q, &r, &scoring_matrix, gaps, min_block_size..=max_block_size, 0);` This is where you actually run the aligner, so you need to provide a concrete instance of a scoring matrix. The
Oh perfect. Thanks for this, Daniel! That makes a lot of sense. I've written I'll give a small example to get at what I'm seeing. Let's say we're comparing sequence A and sequence B, as follows:
With a simple NW1-style scoring matrix, where gap open is -2 and extend is -1, I imagine the sequences would be aligned something like this:
Does that look right to you? If so, I'd expect the score to reflect 6 mismatches (-6), a gap open (-2), 3 gap extends (-3), and 11 matches (+11), for a final score of 0. However, if we update the scoring matrix so that mismatches with N score zero, would that mean the aligner will always prefer mismatching with N's? With the above example, would it then do something like:
According to the updated scoring matrix, this would give 8 matches, a gap open, and 3 gap extends, for an improved final score of 3. For much longer sequences, I suspect this would substantially inflate the apparent score for sequences with N's. In essence, I've achieved exactly what I wanted to avoid: a distance matrix that reflects more about the presence of N's than it does about the similarity between two sequences. That said, I still think you've built an API that could help solve this! Is there a way you might suggest handling this situation in Thanks for taking the time to help!
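The arithmetic in the two scenarios above can be checked with a small sketch. It is plain Rust and does not use `block-aligner`; the `Scheme` and `AlignmentCounts` types and the `total_score` function are hypothetical names introduced only for illustration. It assumes the gap model used in the comment, where a gap pays the open penalty once plus the extend penalty for every gapped column, which may differ from the convention the crate itself uses.

```rust
/// Per-event scores (hypothetical type, not part of block-aligner).
struct Scheme {
    match_score: i32,
    mismatch: i32,
    n_mismatch: i32, // score for a column where one base is N
    gap_open: i32,   // paid once per gap (assumed convention)
    gap_extend: i32, // paid for every gapped column (assumed convention)
}

/// How often each event occurs in a finished alignment.
struct AlignmentCounts {
    matches: i32,
    mismatches: i32,
    n_mismatches: i32,
    gap_opens: i32,
    gap_extends: i32,
}

/// Sum of (count * score) over every kind of event.
fn total_score(c: &AlignmentCounts, s: &Scheme) -> i32 {
    c.matches * s.match_score
        + c.mismatches * s.mismatch
        + c.n_mismatches * s.n_mismatch
        + c.gap_opens * s.gap_open
        + c.gap_extends * s.gap_extend
}
```

With an NW1-style scheme (+1, -1, open -2, extend -1), 11 matches, 6 mismatches, one gap open, and 3 gap extends sum to 0. With N mismatches set to 0, 8 matches plus the same gap sum to 3 no matter how many N columns there are, which is the inflation the comment describes.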
Interesting scenario! I would probably do something like this: increase the match score (e.g., +3) and decrease the mismatch score (e.g., -3) for A, T, C, and G, while still penalizing mismatches with N slightly (e.g., -1). Position-specific gap penalties are also supported and offer maximum flexibility, but they're a little more involved to use.
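As a rough illustration of the scheme suggested above (+3 match, -3 mismatch among A, C, G, T, and -1 for any column involving N), here is a minimal sketch in plain Rust. It does not call `block-aligner`; `pair_score` and `score_aligned` are hypothetical helpers that score two already-aligned rows, with `-` marking a gap. Treating N paired with N as -1 and any other unknown byte as a full mismatch are assumptions, as is the gap convention (open paid once per gap, extend paid per gapped column).

```rust
/// Substitution score for one aligned column (hypothetical helper).
/// Assumption: any pair involving N, including N vs N, scores -1.
fn pair_score(a: u8, b: u8) -> i32 {
    let a = a.to_ascii_uppercase();
    let b = b.to_ascii_uppercase();
    let is_acgt = |x: u8| matches!(x, b'A' | b'C' | b'G' | b'T');
    if a == b'N' || b == b'N' {
        -1
    } else if is_acgt(a) && is_acgt(b) && a == b {
        3
    } else {
        -3 // ordinary mismatch, or a byte outside A/C/G/T/N (assumption)
    }
}

/// Score two pre-aligned rows of equal length, where b'-' marks a gap.
/// Returns None if the rows differ in length or a column is gap-vs-gap.
fn score_aligned(row_a: &[u8], row_b: &[u8], gap_open: i32, gap_extend: i32) -> Option<i32> {
    if row_a.len() != row_b.len() {
        return None;
    }
    let mut total = 0;
    // 0 = previous column had no gap, 1 = gap in row_a, 2 = gap in row_b
    let mut prev_gap = 0u8;
    for (&a, &b) in row_a.iter().zip(row_b) {
        let gap = match (a == b'-', b == b'-') {
            (true, true) => return None,
            (true, false) => 1,
            (false, true) => 2,
            (false, false) => 0,
        };
        if gap == 0 {
            total += pair_score(a, b);
        } else {
            if gap != prev_gap {
                total += gap_open; // a new gap starts in this column
            }
            total += gap_extend;
        }
        prev_gap = gap;
    }
    Some(total)
}
```

Because a column with N still costs a point while a true match earns three, pairing a base with N is never better than a real match, so N-rich sequences should be less able to inflate the score than under a zero-cost N scheme. Whether these particular weights suit a given dataset is a judgment call.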
Hi Daniel,
I confess I haven't quite figured out how to customize scoring for each nucleotide character, but I think I'm getting close. As you'll recall from my issue on `triple_accel`, I'm looking to compute nucleotide distances for pairs of sequences that ignore characters other than A, T, G, and C. Do I have it right that you can control that in `block-aligner` with the `Matrix` trait's `set` method? For example:

If so, could you provide some guidance on how to use an updated `NucMatrix` in this portion of your readme example?

Thanks for the advice and for the excellent crate!
--Nick