Improve sketching performance for DNA #865

kloetzl · 2020-01-26T12:37:25Z

In #860 I described a performance issue in the sketching of DNA sequences: characters were repeatedly rechecked for their validity. The approach by @luizirber vastly speeds up the checking of characters (#861). However, most characters were still checked up to k times. Also, some valid k-mers were now missed due to a small logic error.

This pull request comes in three parts.

1. An initially failing unit test, showcasing the regression introduced in a07dd85.
2. Avoid rechecking valid characters.
3. Remove the fast path, as it was slowing us down.

Here are the benchmarks:

add_sequence/valid      time:   [4.9079 ms 4.9765 ms 5.0571 ms]
                        change: [-7.5285% -6.2820% -4.8922%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high severe
add_sequence/lowercase  time:   [4.9820 ms 5.1157 ms 5.3481 ms]
                        change: [-9.3013% -6.2065% -2.4192%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high severe
add_sequence/invalid kmers
                        time:   [3.2999 ms 3.3375 ms 3.3611 ms]
                        change: [-32.280% -31.194% -29.966%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 10 measurements (20.00%)
  1 (10.00%) low mild
  1 (10.00%) high mild
add_sequence/force with valid kmers
                        time:   [4.8650 ms 4.9078 ms 4.9668 ms]
                        change: [-11.240% -7.9411% -4.2589%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 10 measurements (30.00%)
  1 (10.00%) low mild
  1 (10.00%) high mild
  1 (10.00%) high severe

This code is not particularly Rusty, as I don't know the idioms. Feel free to request changes. Also, I don't have all the Python dependencies installed, so please run the checks again.

Best,
Fabian

* Is it mergeable?
* `make test` Did it pass the tests?
* `make coverage` Is the new code covered?
* Did it change the command-line interface? Only additions are allowed without a major version increment. Changing file formats also requires a major version number increment.
* Was a spellchecker run on the source code and documentation after changes were made?

The commit a07dd85 erroneously skips some valid k-mers. This commit adds a test for that regression.

Let's consider a DNA sequence of length L with V valid characters [ACGT] and I invalid characters. For each k-mer we need to check whether it consists solely of valid characters. Done naïvely, that would take time O(L * k). With a07dd85 that time was reduced to O(V * k + I). (However, that commit also introduced a regression, as it skipped some valid k-mers.) For any real-world sequence we can assume V ≫ I. This commit implements an O(V + I * k) algorithm. While O(V + I) should be possible, it is more complex, and the current method is fast enough for now.
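The complexity argument above can be illustrated with a minimal sketch. This is not the code from this pull request: the function names `is_valid_base`, `valid_kmer_starts_naive`, and `valid_kmer_starts` are hypothetical, and the sketch assumes that only uppercase `A`, `C`, `G`, `T` count as valid. It returns the start positions of all valid k-mers, once with the naive O(L * k) scan and once with a "watermark" that remembers how far the sequence has already been verified, so each valid character is inspected once and each invalid character at most k times.

```rust
/// Returns true for the four uppercase DNA bases.
/// Assumption for this sketch: lowercase input is normalized elsewhere.
fn is_valid_base(b: u8) -> bool {
    matches!(b, b'A' | b'C' | b'G' | b'T')
}

/// Naive baseline: re-check the characters of every window, O(L * k) worst case.
fn valid_kmer_starts_naive(seq: &[u8], k: usize) -> Vec<usize> {
    if k == 0 || seq.len() < k {
        return Vec::new();
    }
    (0..=seq.len() - k)
        .filter(|&i| seq[i..i + k].iter().all(|&b| is_valid_base(b)))
        .collect()
}

/// Watermark variant: O(V + I * k).
fn valid_kmer_starts(seq: &[u8], k: usize) -> Vec<usize> {
    let mut starts = Vec::new();
    if k == 0 || seq.len() < k {
        return starts;
    }
    // Invariant: every position in [i, checked) is known to be valid,
    // where i is the start of the window currently being examined.
    let mut checked = 0;
    for i in 0..=seq.len() - k {
        let end = i + k;
        // Resume after the verified prefix instead of restarting at i.
        let mut j = checked.max(i);
        while j < end && is_valid_base(seq[j]) {
            j += 1;
        }
        // j stops either at `end` or at the first invalid character.
        checked = j;
        if j == end {
            starts.push(i);
        }
    }
    starts
}
```

For example, `valid_kmer_starts(b"ACGTNACGT", 3)` reports the windows on both sides of the `N`, including the ones immediately after it. An invalid character is re-inspected once for every window that overlaps it, which is where the I * k term comes from.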

Previously, this code had a fast and a slow path. However, with the last commit the slow path became so much faster that the distinction is unnecessary now. Removing the fast path simplifies the code and makes it faster by a few more percentage points.

codecov · 2020-01-26T12:43:47Z

Codecov Report

Merging #865 into master will increase coverage by 12.87%.
The diff coverage is n/a.

@@             Coverage Diff             @@
##           master     #865       +/-   ##
===========================================
+ Coverage   78.39%   91.26%   +12.87%     
===========================================
  Files          94       69       -25     
  Lines        7294     4959     -2335     
===========================================
- Hits         5718     4526     -1192     
+ Misses       1576      433     -1143

Flag	Coverage Δ
#rusttests	`?`

Impacted Files	Coverage Δ
src/core/src/index/sbt/mhbt.rs
src/core/src/index/bigsi.rs
src/core/src/errors.rs
src/core/src/ffi/utils.rs
src/core/tests/signature.rs
src/core/src/lib.rs
src/core/src/ffi/signature.rs
src/core/src/index/linear.rs
src/core/src/sketch/ukhs.rs
src/core/src/wasm.rs
... and 13 more

Last update 7791878...10bec82.

luizirber

wow, thanks @kloetzl! Just the performance improvements would already be perfect, but fixing my broken code and adding a test are the cherry on top =]

There are some formatting and one code check from clippy that I added as suggestions (I can't commit to your branch), but other than that LGTM!

(and thanks for keeping the comment, took me some time to draw it in ASCII 😅 )

kloetzl · 2020-01-26T19:19:15Z

> (and thanks for keeping the comment, took me some time to draw it in ASCII 😅 )

At some point during the refactoring one of the unit tests failed, and I was like, "what, why?" I had to consult the diagram to see that I had gotten the boundaries wrong.

kloetzl · 2020-01-26T19:24:27Z

~~Seeing as you already merged the PR, it might be easiest if you would create a commit with the recommended changes yourself.~~ And you were quicker than I could type.

kloetzl · 2020-01-26T19:25:02Z

Thanks for merging, btw. 👍

luizirber · 2020-01-26T19:40:39Z

> ~~Seeing as you already merged the PR, it might be easiest if you would create a commit with the recommended changes yourself.~~ And you were quicker than I could type.

Since #856 is about to be merged too I thought it was easier to fix it there =]

> Thanks for merging, btw. +1

Oh, BTW: I'm adding contributors in #837, so if you're OK with being an author in the sourmash 3.x paper can you paste your ORCID ID here (so I can add it there)?

kloetzl · 2020-01-26T19:43:44Z

> Oh, BTW: I'm adding contributors in #837, so if you're OK with being an author in the sourmash 3.0 paper can you paste your ORCID ID here (so I can add it there)?

Sure, I would be happy about that. But I don't have an ORCID id.

ctb · 2020-01-26T19:51:40Z

Thank you for all your help! You can get one at https://orcid.org/register - I think our journal of choice (JOSS) requires them, unfortunately :(

* refactor downsampling code to no longer introspect exception text
* clean up _similarity_downsample
* [WIP] oxidize downsampling in various similarity functions (#863)
* added downsample bool option to count_common and compare in rust code
* implement downsample functionality in rust
* refactoring sbtmh, step 1
* refactoring sbtmh, step 2
* update 'sourmash watch' to use SBT search fn
* refactoring out unnecessary functionality
* tackle downsampling in similarity with abundance; fix various rust issues
* simplify containment functions
* add some tests for the basic bulk behavior
* another basic test
* fix rust tests
* fix clippy lints
* bump core version
* switch mismatch error to reference scaled, not max_hash
* add docstrings
* slightly more understandable if statement :)
* fix cosine similarity calculation, AND fix bug in new rust downsampling code for compare
* checks that downsampling is doing the right thing
* update to make it clear we're calculating the angular similarity
* fmt and clippy fixes for #865

Co-authored-by: Luiz Irber <luizirber@users.noreply.github.com>

kloetzl · 2020-01-27T07:10:42Z

Done: https://orcid.org/0000-0002-6930-0592

kloetzl added 3 commits January 26, 2020 12:27

Add difficult test case for dirty DNA

f292d2d

The commit a07dd85 erroneously skips some valid k-mers. This commit adds a test for that regression.

luizirber reviewed Jan 26, 2020

src/core/src/sketch/minhash.rs

luizirber reviewed Jan 26, 2020

src/core/tests/minhash.rs

luizirber reviewed Jan 26, 2020

src/core/tests/minhash.rs

luizirber approved these changes Jan 26, 2020

luizirber merged commit f9de09b into sourmash-bio:master Jan 26, 2020

luizirber added a commit that referenced this pull request Jan 26, 2020

fmt and clippy fixes for #865

6ebd166

luizirber mentioned this pull request Feb 18, 2020

How to use low level functions? #489

Closed
