Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor calculate_gather_stats to avoid calculating intersections twice #3196

Closed
ctb opened this issue Jun 9, 2024 · 3 comments · Fixed by #3352
Closed

refactor calculate_gather_stats to avoid calculating intersections twice #3196

ctb opened this issue Jun 9, 2024 · 3 comments · Fixed by #3352
Labels

Comments

@ctb
Copy link
Contributor

ctb commented Jun 9, 2024

Punted from #3193 -

Right now we are calculating the intersection twice, once in disk_revindex.rs and once in calculate_gather_stats in revindex/mod.rs. This is unnecessary, of course. But the function signature for calculate_gather_stats would need to change to take the intersection as an argument. We could:

  • keep calculating it twice, just for simplicity;

  • change calculate_gather_stats to take an optional intersection, and calculate it if not provided;

  • change calculate_gather_stats to require an intersection.

We opted to calculate it twice in #3193, just to get the bug fixed; but this is something to consider.

@ctb ctb added the rust label Jun 9, 2024
@ctb
Copy link
Contributor Author

ctb commented Oct 12, 2024

with #3342, we are also downsampling twice. So this is could have some performance implications.

@ctb
Copy link
Contributor Author

ctb commented Oct 12, 2024

simplest idea here would probably be to optionally pass in the intersect as a Some(...).

@ctb
Copy link
Contributor Author

ctb commented Oct 14, 2024

see #473, a PR into this 'un, for a solution that returns the intersect, and also requires that query be the correct scaled.

ctb added a commit that referenced this issue Oct 15, 2024
…ather` bug around `scaled`. (#3342)

This PR does five things:

First, it swaps the implementation of `KmerMinHash::downsample_max_hash`
with `KmerMinHash::downsample_scaled`, and the same for
`KmerMinHashBTree`. Previously a call to `downsample_scaled` calculated
the right `max_hash` from `scaled`, then called `downsample_max_hash`,
which then converted `max_hash` back to `scaled`. This reverses the
logic so that (slightly) less work is done and, more importantly, the
code is a bit more straightforward.

Second, it changes the `downsample_*` functions so that they do not
downsample when no downsampling is needed. As part of this the method
signatures are changed to take an object, rather than a reference. This
lets the functions return an unmodified `KmerMinHash` when no
downsampling is needed.

Third, it turns out the `downsample_*` functions didn't check to make
sure that the new `scaled` value was larger than the old one, i.e. they
didn't prevent upsampling. That check was added and a new error,
`CannotUpsampleScaled`, was added to sourmash core.

Fourth, this uncovered a bug in `RevIndex::gather` where the query was
downsampled to the match, even when the match was lower scaled. This PR
rejiggers the code so that downsampling is done appropriately in the
`gather` and `calculate_gather_stats`. Since `RevIndex::gather` isn't
used in the the sourmash CLI, the bug only presented in the test suite
and in the branchwater plugin; see
sourmash-bio/sourmash_plugin_branchwater#468
and
sourmash-bio/sourmash_plugin_branchwater#467,
where a fastmultigather test had to be fixed because of the incorrect
scaled values output by `RevIndex::gather`.

Fifth, it includes #3348
from @luizirber, which adds a `Signature::try_into()` to `KmerMinHash`
to support the elimination of some clones.

Because of the method signature change for the `downsample_*` functions,
the sourmash-core version needs to be bumped to a new major version,
0.16.0.

It's been a fun journey! 😅 

Fixes #3343

Some notes on further changes and performance implications:

As a consequence of the `RevIndex::gather` changes, redundant
downsampling has to be done in `RevIndex::gather` and
`calculate_gather_stats`, unless we want to change the method signature
of `calculate_gather_stats`. I decided the PR was big enough that I
didn't want to do that in addition. It should not affect most use cases
where `scaled` is the same, and we will see if it results in any
slowdowns over in the branchwater plugin. See
#3196 for an issue on all
of this.

We could also just insist that the query scaled is the one to pay
attention to, per #2951.
This would simplify the code in Python-land as well.

Overall, the performance implications of this PR are not clear.
Previously downsampling was being done even when it wasn't needed, so
this may speed things up quite a lot for our typical use case! On the
other hand, redundant downsampling will happen in cases where there are
scaled mismatches. We just need to benchmark it, I think.

Some preliminary benchmarking reported in
sourmash-bio/sourmash_plugin_branchwater#430 (comment)
suggests that fastgather is now much more memory effficient 🎉 so that's
good!

TODO:
- [x] resolve the scaled mismatch stuff. do we return an `Err` or what
if the downsampling can't be performed?
- [x] update PR description
- [x] add more tests for downsampling, and maybe for gather
- [x] play with this code over in the branchwater plugin too!
sourmash-bio/sourmash_plugin_branchwater#467

---------

Co-authored-by: Luiz Irber <luizirber@users.noreply.github.com>
@ctb ctb closed this as completed in #3352 Oct 15, 2024
ctb added a commit that referenced this issue Oct 15, 2024
…ing (#3352)

This PR builds on the refactoring in #3342 to do less downsampling and
also avoids doing intersections twice (per #3196).

Benchmarks in
sourmash-bio/sourmash_plugin_branchwater#471 are
pretty astonishing...

Fixes #3196

---------

Co-authored-by: Luiz Irber <luizirber@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant