Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve sketching performance for DNA #865

Merged
merged 3 commits into from
Jan 26, 2020
Merged

Conversation

kloetzl
Copy link
Contributor

@kloetzl kloetzl commented Jan 26, 2020

In #860 I described a performance issue in the sketching of DNA sequences: characters were repeatedly rechecked for their validity. The approach by @luizirber vastly speeds up the checking of characters (#861). However, most characters were still checked up to k times. Also, some valid k-mers were now missed due to a small logic error.

This pull request comes in three parts.

  1. An initially failing unit test, showcasing the regression introduced in a07dd85.
  2. Avoid rechecking valid characters.
  3. Remove fast path as it was slowing us down.

Here are the benchmarks:

 add_sequence/valid      time:   [4.9079 ms 4.9765 ms 5.0571 ms]                               
                        change: [-7.5285% -6.2820% -4.8922%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high severe
Benchmarking add_sequence/lowercase: Collecting 10 samples in estimated 5.1661 s (1045 iterati                                                                                              add_sequence/lowercase  time:   [4.9820 ms 5.1157 ms 5.3481 ms]
                        change: [-9.3013% -6.2065% -2.4192%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high severe
Benchmarking add_sequence/invalid kmers: Collecting 10 samples in estimated 5.1394 s (1540 ite                                                                                              add_sequence/invalid kmers                        
                        time:   [3.2999 ms 3.3375 ms 3.3611 ms]
                        change: [-32.280% -31.194% -29.966%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 10 measurements (20.00%)
  1 (10.00%) low mild
  1 (10.00%) high mild
Benchmarking add_sequence/force with valid kmers: Collecting 10 samples in estimated 5.1789 s                                                                                               add_sequence/force with valid kmers                        
                        time:   [4.8650 ms 4.9078 ms 4.9668 ms]
                        change: [-11.240% -7.9411% -4.2589%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 10 measurements (30.00%)
  1 (10.00%) low mild
  1 (10.00%) high mild
  1 (10.00%) high severe

This code is not particularly Rusty as I don't know the idioms. Feel free to request changes. Also I don't have all python dependencies installed, so please run the checks again.

Best,
Fabian

  • Is it mergeable?
  • make test Did it pass the tests?
  • make coverage Is the new code covered?
  • Did it change the command-line interface? Only additions are allowed
    without a major version increment. Changing file formats also requires a
    major version number increment.
  • Was a spellchecker run on the source code and documentation after
    changes were made?

The commit a07dd85 erroneously skips
erroneous kmers. This commit adds a test for that regression.
Let's consider a DNA sequence of length L with V valid characters [ACGT]
and I invalid chars. For each k-mer we need to check if it consists
of solely valid characters. Done naïvely that would take time O(L * k).
With a07dd85 that time was reduced
to O(V * k + I). (However, that commit also introduced a regression as it
skipped some valid k-mers.) For any real-world sequence We can assume
V ≫ I. This commit implements a O(V + I * k) algorithm. While O(V + I)
should be possible, it is more complex and the current method is fast
enough, for now.
Previously, this code had a fast and a slow path. However, with the
last commit the slow path became so much faster that the distinction is
unnecessary now. Removing the fast path simplifies the code and makes
it faster by a few more percentage points.
@codecov
Copy link

codecov bot commented Jan 26, 2020

Codecov Report

Merging #865 into master will increase coverage by 12.87%.
The diff coverage is n/a.

Impacted file tree graph

@@             Coverage Diff             @@
##           master     #865       +/-   ##
===========================================
+ Coverage   78.39%   91.26%   +12.87%     
===========================================
  Files          94       69       -25     
  Lines        7294     4959     -2335     
===========================================
- Hits         5718     4526     -1192     
+ Misses       1576      433     -1143
Flag Coverage Δ
#rusttests ?
Impacted Files Coverage Δ
src/core/src/index/sbt/mhbt.rs
src/core/src/index/bigsi.rs
src/core/src/errors.rs
src/core/src/ffi/utils.rs
src/core/tests/signature.rs
src/core/src/lib.rs
src/core/src/ffi/signature.rs
src/core/src/index/linear.rs
src/core/src/sketch/ukhs.rs
src/core/src/wasm.rs
... and 13 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7791878...10bec82. Read the comment docs.

Copy link
Member

@luizirber luizirber left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wow, thanks @kloetzl! Just the performance improvements would already be perfect, but fixing my broken code and adding a test are the cherry on top =]

There are some formatting and one code check from clippy that I added as suggestions (I can't commit to your branch), but other than that LGTM!

(and thanks for keeping the comment, took me some time to draw it in ASCII 😅 )

@luizirber luizirber merged commit f9de09b into sourmash-bio:master Jan 26, 2020
@kloetzl
Copy link
Contributor Author

kloetzl commented Jan 26, 2020

(and thanks for keeping the comment, took me some time to draw it in ASCII 😅 )

At some point during the refactoring one of the unit tests failed, and I was like what, why?. I had to consult the diagram to see I got the boundaries wrong.

luizirber added a commit that referenced this pull request Jan 26, 2020
@kloetzl
Copy link
Contributor Author

kloetzl commented Jan 26, 2020

Seeing as you already merged the PR, it might be easiest if you would create a commit with the recommended changes yourself. And you were quicker than I could type.

@kloetzl
Copy link
Contributor Author

kloetzl commented Jan 26, 2020

Thanks for merging, btw. 👍

@luizirber
Copy link
Member

luizirber commented Jan 26, 2020

Seeing as you already merged the PR, it might be easiest if you would create a commit with the recommended changes yourself. And you were quicker than I could type.

Since #856 is about to be merged too I thought it was easier to fix it there =]

Thanks for merging, btw. +1

Oh, BTW: I'm adding contributors in #837, so if you're OK with being an author in the sourmash 3.x paper can you paste your ORCID ID here (so I can add it there)?

@kloetzl
Copy link
Contributor Author

kloetzl commented Jan 26, 2020

Oh, BTW: I'm adding contributors in #837, so if you're OK with being an author in the sourmash 3.0 paper can you paste your ORCID ID here (so I can add it there)?

Sure, I would be happy about that. But I don't have an ORCID id.

@ctb
Copy link
Contributor

ctb commented Jan 26, 2020

Thank you for all your help! You can get one at https://orcid.org/register - I think our journal of choice (JOSS) requires them, unfortunately :(

ctb added a commit that referenced this pull request Jan 26, 2020
* refactor downsampling code to no longer introspect exception text

* clean up _similarity_downsample

* [WIP] oxidize downsampling in various similarity functions (#863)

* added downsample bool option to count_common and compare in rust code

* implement downsample functionality in rust

* refactoring sbtmh, step 1

* refactoring sbtmh, step 2

* update 'sourmash watch' to use SBT search fn

* refactoring out unnecessary functionality

* tackle downsampling in similarity with abundance; fix various rust issues

* simplify containment functions

* add some tests for the basic bulk behavior

* another basic test

* fix rust tests

* fix clippy lints

* bump core version

* switch mismatch error to reference scaled, not max_hash

* add docstrings

* slightly more understandble if statement :)

* fix cosine similarity calculation, AND fix bug in new rust downsampling code for compare

* checks that downsampling is doing the right thing

* update to make it clear we're calculating the angular similarity

* fmt and clippy fixes for #865

Co-authored-by: Luiz Irber <luizirber@users.noreply.github.com>
@kloetzl
Copy link
Contributor Author

kloetzl commented Jan 27, 2020

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants