Very subtle differences from minimap2 #75

rob-p · 2024-08-17T23:17:50Z

I'm just trying to track down the very subtle differences in aligning a dataset between the minimap2-rs bindings and what is output by the reference minimap2 implementation.

Mostly everything is the same, but I noticed some small differences (mostly in the number of supplementary alignments so far, so it's unlikely to affect anything important). However, it raises a broader issue about the degree to which reproducibility might be expected.

For example, one thing I noticed is that the current codebase has the ability to map a read against the index but (1) it doesn't take as input (and therefore doesn't consider) the quality values of the read. What would be required to have a map_with_qual function that also takes the quality string? Also I'm not sure it's related to my issue but (2) the map function also ignores the read name. However, the --seed parameter of minimap2 mentions this is used in tie breaking procedures:

--seed INT | Integer seed for randomizing equally best hits. Minimap2 hashes INT and read name when choosing between equally best hits. [11]

is this something that is done at the level of an ffi function currently exposed, and if so, what would be required to have another map_with_name_and_qual function that takes all 3 of these slices as input?

Anyway, these are very minor details, and so far, using minimap2-rs has been a breeze! Check out the dev branch of oarfish if you want to see how we're currently using it; specifically here.

--Rob

The text was updated successfully, but these errors were encountered:

jguhlin · 2024-08-20T02:28:22Z

I don't think quality scores are used in minimap2 for mapping.

The mapping functions are here:
https://github.com/lh3/minimap2/blob/8170693de39b667d11c8931d343c94a23a7690d2/minimap.h#L353

The second function is mm_map_frag which has qlens argument, but I believe that is query lengths, rather than quality lengths.

Happy to be wrong, but I don't see it.

As for passing in query name, I can fix that. I think I was just trying to minimize string copies everywhere, and conversion to/from CStr.

rob-p · 2024-08-20T20:07:53Z

Hi @jguhlin,

Thanks for the feedback on this. As I said, I’m not certain what the exact causes of the small discrepancies are. Essentially, what I observe is that out of 10M (simulated) long reads, both native minimap2 and minimap2-rs (with compatible settings) return the same number of mapped reads. The difference I observe is that standard minimap2 reports approximately 6 more supplementary alignments than similarly-configured minimap2-rs. This discrepancy is so small (and the assessed results using these mappings for quantification so similar), that I do not have any major concern. Yet, it’s strange to me that there should be any difference once the seed is set to be the same, as I thought the algorithm was deterministic (and, indeed, I observe exactly the same numbers under multiple runs of either standard minimap2 or using the minimap2-rs library - they consistently differ by this tiny amount, but are perfectly replicable over many runs).

I also agree with your design choice around the map function. Perhaps the best thing to do here would be to add map_with_name or some such, which would be a separate function you could call if you wanted to invoke the corresponding function in minimap2. The minimap2 manpage mentions that when certain chains are tied, the seed is hashed with the query name to do tie breaking. It’s a long shot, but perhaps this could explain the small differences we observe? The specific set of reads can be downloaded from this link if you’re curious what they are.

—Rob

rob-p · 2024-08-25T01:11:11Z

Hi @jguhlin,

Ok, so I've figured this out and it is indeed because of cases where there are chain ties and the hashing of the name breaks them in a way that makes certain alignments supplementary and others not. I've been able to achieve exact concordance with command-line minimap2. To do this, I've added a new function map_with_name to minimap2-rs that is the same as map but which also includes the query name. Right now, the relevant code is duplicated, but you may wish to refactor to minimize such duplication. I'll make a PR shortly and mention this issue. Once it's merged, feel free to close this! Also, though the Aligner clone issue still needs some love, I'd actually really appreciate a tagged release soon, as my makeshift solution for that works well right now and I'd like to tag a new release of oarfish with the raw read alignment capability.

Thanks!
Rob

jguhlin · 2024-11-27T21:33:32Z

So this should be fixed now, but if not, let's work on a test for it and I'll try to figure out what went wrong. Thanks for all your work on this!

rob-p mentioned this issue Aug 24, 2024

Question: How to determine supplementary alignments` lh3/minimap2#1241

Closed

rob-p mentioned this issue Aug 25, 2024

Alignment score #80

Merged

jguhlin closed this as completed Nov 27, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Very subtle differences from minimap2 #75

Very subtle differences from minimap2 #75

rob-p commented Aug 17, 2024

jguhlin commented Aug 20, 2024

rob-p commented Aug 20, 2024

rob-p commented Aug 25, 2024

jguhlin commented Nov 27, 2024

Very subtle differences from minimap2 #75

Very subtle differences from minimap2 #75

Comments

rob-p commented Aug 17, 2024

jguhlin commented Aug 20, 2024

rob-p commented Aug 20, 2024

rob-p commented Aug 25, 2024

jguhlin commented Nov 27, 2024