Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

update GatherResult to no longer store MinHashes #2950

Open
ctb opened this issue Jan 27, 2024 · 5 comments
Open

update GatherResult to no longer store MinHashes #2950

ctb opened this issue Jan 27, 2024 · 5 comments

Comments

@ctb
Copy link
Contributor

ctb commented Jan 27, 2024

over in https://github.com/ctb/2024-calc-full-gather/ I have implemented a simple script that takes fastgather output (from https://github.com/sourmash-bio/sourmash_plugin_branchwater/) and turns it into full gather output without redoing the searches - it literally just trusts the rank and match information from fastgather completely, and calculates all the stats.

this was easier than I expected because of the very nice GatherResult refactoring that @bluegenes did a while back in #1955!

however it also revealed that #1955 probably added significantly to the memory footprint of gather, because the GatherResult dataclasses keep sketches in memory and they are retained throughout the full gather process.

I figured this out when I noticed that my calc-full-gather script was running out of memory in the same way that gather was running out of memory, and in ctb/2024-calc-full-gather@a09215e I fixed it by discarding the GatherResult objects after each result. It's now nice and low memory (if not exactly fast ;) - see #2943.

I am also wondering if perhaps PrefetchResult has the same problem in prefetch?

We should fix the gather code in sourmash to be lower memory.

We probably need to do some kind of regression testing that tracks memory usage and the like, too.

viz sourmash-bio/sourmash_plugin_branchwater#187, #2943.

@ctb
Copy link
Contributor Author

ctb commented Jan 28, 2024

note connection to suggestions at bottom of #416

@ctb
Copy link
Contributor Author

ctb commented Jan 30, 2024

I have questions - are the dataclasses caching results, or recomputing them multiple times?? -- and the people demand answers!

@ctb
Copy link
Contributor Author

ctb commented Jan 30, 2024

OK, #2962 tackles this for just sourmash gather and multigather

@ctb
Copy link
Contributor Author

ctb commented Jan 30, 2024

PrefetchResult is immediately released, so sourmash prefetch doesn't suffer from this problem.

looks like SearchResult may run afoul of this, however. But it's used in relatively minimal ways so far.

ctb added a commit that referenced this issue Jan 31, 2024
…#2962)

This is kind of a patch-fix for
#2950 for `sourmash
gather` specifically.

This PR changes `sourmash gather` and `sourmash multigather` so that
they no longer store any `GatherResult` objects, thus decreasing memory
usage substantially.

The solution is hacky at several levels, including storing a CSV file in
memory rather than writing it progressively. But I think it's an
important fix to get in, since `gather` is one of our main use cases and
it's causing people some problems (including me) :(.

The PR also changes `--save-matches` so that it writes out sketches as
they are encountered. This breaks semantic versioning a little bit
because the target file for `--save-matches` is opened before any
matches are found, and thus may be empty and may also overwrite files
unnecessarily.

Ultimately, a better fix is needed - probably one that changes up the
dataclasses so that they don't store MinHashes - but such a fix is
beyond me at the moment.

## benchmarking

with latest @ e2c199f: 645 MB

```
        Command being timed: "sourmash gather /home/ctbrown/transfer/SRR606249.trim.k31.sig.gz /home/ctbrown/transfer/podar-ref.zip -o xxx.csv"
        User time (seconds): 48.51
        System time (seconds): 1.15
        Percent of CPU this job got: 99%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:49.91
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 644900
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 156
        Minor (reclaiming a frame) page faults: 254494
        Voluntary context switches: 2412
        Involuntary context switches: 2749
        Swaps: 0
        File system inputs: 31488
        File system outputs: 64
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0
```
with this branch: 215 MB
```
        Command being timed: "sourmash gather /home/ctbrown/transfer/SRR606249.trim.k31.sig.gz /home/ctbrown/transfer/podar-ref.zip -o xxx.csv"
        User time (seconds): 43.38
        System time (seconds): 0.89
        Percent of CPU this job got: 97%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:45.58
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 215560
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 773
        Minor (reclaiming a frame) page faults: 148722
        Voluntary context switches: 3884
        Involuntary context switches: 6174
        Swaps: 0
        File system inputs: 151648
        File system outputs: 160
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0
```
@ctb
Copy link
Contributor Author

ctb commented Jan 31, 2024

#2962 addresses the memory usage, but not the underlying problem. From the PR:

Ultimately, a better fix is needed - probably one that changes up the dataclasses so that they don't store MinHashes - but such a fix is beyond me at the moment.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant