Improve sketching performance for DNA #865
The commit a07dd85 erroneously skips some valid k-mers. This commit adds a test for that regression.
Let's consider a DNA sequence of length L with V valid characters [ACGT] and I invalid characters. For each k-mer we need to check whether it consists solely of valid characters. Done naïvely, that takes O(L * k) time. With a07dd85 that time was reduced to O(V * k + I). (However, that commit also introduced a regression, as it skipped some valid k-mers.) For any real-world sequence we can assume V ≫ I. This commit implements an O(V + I * k) algorithm. While O(V + I) should be possible, it is more complex, and the current method is fast enough for now.
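To make the complexity argument concrete, here is a minimal standalone sketch of one way to reach O(V + I * k). It is not the code from this pull request: the function names `is_valid_base` and `valid_kmer_starts` are hypothetical, and it assumes the input is already upper-case, so only `A`, `C`, `G` and `T` count as valid. The idea is that a window is checked in full only at the start and after an invalid character; otherwise only the newly added last character is inspected.

```rust
/// Hypothetical helper: assumes upper-case input, so only ACGT are valid.
fn is_valid_base(b: u8) -> bool {
    matches!(b, b'A' | b'C' | b'G' | b'T')
}

/// Returns the start index of every k-mer made up solely of valid bases.
/// Illustrative sketch only; not the sourmash implementation.
fn valid_kmer_starts(seq: &[u8], k: usize) -> Vec<usize> {
    let mut starts = Vec::new();
    if k == 0 || seq.len() < k {
        return starts;
    }

    let mut i = 0;
    // True when the first k-1 bases of the current window are known to be valid.
    let mut prefix_checked = false;

    while i + k <= seq.len() {
        if !prefix_checked {
            // Full O(k) check. This happens once at the start and then only
            // after an invalid character, which is where the I * k term comes from.
            match seq[i..i + k].iter().rposition(|&b| !is_valid_base(b)) {
                // Jump past the last invalid base in the window.
                Some(bad) => i += bad + 1,
                None => {
                    starts.push(i);
                    prefix_checked = true;
                    i += 1;
                }
            }
        } else {
            // The previous window was valid, so only the new last base
            // needs checking: O(1) per valid character.
            let last = i + k - 1;
            if is_valid_base(seq[last]) {
                starts.push(i);
                i += 1;
            } else {
                // No window containing `last` can be valid; restart after it.
                i = last + 1;
                prefix_checked = false;
            }
        }
    }
    starts
}
```

For example, calling `valid_kmer_starts(b"ACGTNACGTAC", 3)` skips every window that overlaps the `N` at index 4 and reports the remaining windows, including the ones directly after the invalid character, which is the case the regression test guards against.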
Previously, this code had a fast and a slow path. However, with the last commit the slow path became so much faster that the distinction is unnecessary now. Removing the fast path simplifies the code and makes it faster by a few more percentage points.
Codecov Report
@@ Coverage Diff @@
## master #865 +/- ##
===========================================
+ Coverage 78.39% 91.26% +12.87%
===========================================
Files 94 69 -25
Lines 7294 4959 -2335
===========================================
- Hits 5718 4526 -1192
+ Misses 1576 433 -1143
Continue to review full report at Codecov.
wow, thanks @kloetzl! Just the performance improvements would already be perfect, but fixing my broken code and adding a test are the cherry on top =]
There are some formatting fixes and one clippy check that I added as suggestions (I can't commit to your branch), but other than that LGTM!
(and thanks for keeping the comment, took me some time to draw it in ASCII 😅 )
At some point during the refactoring one of the unit tests failed, and I was like "what, why?". I had to consult the diagram to see that I got the boundaries wrong.
Thanks for merging, btw. 👍
Since #856 is about to be merged too I thought it was easier to fix it there =]
Oh, BTW: I'm adding contributors in #837, so if you're OK with being an author in the sourmash 3.x paper can you paste your ORCID ID here (so I can add it there)?
Sure, I would be happy about that. But I don't have an ORCID ID.
Thank you for all your help! You can get one at https://orcid.org/register - I think our journal of choice (JOSS) requires them, unfortunately :(
* refactor downsampling code to no longer introspect exception text
* clean up _similarity_downsample
* [WIP] oxidize downsampling in various similarity functions (#863)
* added downsample bool option to count_common and compare in rust code
* implement downsample functionality in rust
* refactoring sbtmh, step 1
* refactoring sbtmh, step 2
* update 'sourmash watch' to use SBT search fn
* refactoring out unnecessary functionality
* tackle downsampling in similarity with abundance; fix various rust issues
* simplify containment functions
* add some tests for the basic bulk behavior
* another basic test
* fix rust tests
* fix clippy lints
* bump core version
* switch mismatch error to reference scaled, not max_hash
* add docstrings
* slightly more understandable if statement :)
* fix cosine similarity calculation, AND fix bug in new rust downsampling code for compare
* checks that downsampling is doing the right thing
* update to make it clear we're calculating the angular similarity
* fmt and clippy fixes for #865

Co-authored-by: Luiz Irber <luizirber@users.noreply.github.com>
In #860 I described a performance issue in the sketching of DNA sequences: characters were repeatedly rechecked for their validity. The approach by @luizirber vastly speeds up the checking of characters (#861). However, most characters were still checked up to k times. Also, some valid k-mers were now missed due to a small logic error.
This pull request comes in three parts.
Here are the benchmarks:
This code is not particularly Rusty, as I don't know the idioms. Feel free to request changes. Also, I don't have all the Python dependencies installed, so please run the checks again.
Best,
Fabian