Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG] fix bug in LCA_Database.downsample #2117

Merged
merged 1 commit into from
Jul 11, 2022
Merged

[MRG] fix bug in LCA_Database.downsample #2117

merged 1 commit into from
Jul 11, 2022

Conversation

ctb
Copy link
Contributor

@ctb ctb commented Jul 11, 2022

I discovered this bug while working on other things -

In the base LCA_Database implementation, a dictionary _hashval_to_idx is used to track which hash values (keys) belong to which set of signatures (individual signatures are referenced by integers, idx). The dictionary is created as a collections.defaultdict(set) ref so that when adding a new key, it automatically comes with a new set as a value - this permits the call

self._hashval_to_idx[hashval].add(idx)

in LCA_Database.insert(...).

The bug is that after LCA_Database.downsample(...) is called, _hashval_to_idx is recreated as a regular dictionary, which causes insert to fail.

This PR adds a new test, test_lca.py:test_api_create_insert_two_then_scale_then_add, that does a downsample and then does an insert. This test exposes the bug above, which the PR also fixes.

@ctb
Copy link
Contributor Author

ctb commented Jul 11, 2022

@ccbaumler can you take on reviewing this? Should be enough to verify that the description of the PR matches the code changes, and that the tests pass. No hurry. Thanks!

@codecov
Copy link

codecov bot commented Jul 11, 2022

Codecov Report

Merging #2117 (8ce3e3d) into latest (ebe14dd) will increase coverage by 7.38%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           latest    #2117      +/-   ##
==========================================
+ Coverage   84.30%   91.69%   +7.38%     
==========================================
  Files         130       99      -31     
  Lines       15280    11004    -4276     
  Branches     2171     2171              
==========================================
- Hits        12882    10090    -2792     
+ Misses       2095      611    -1484     
  Partials      303      303              
Flag Coverage Δ
python 91.69% <100.00%> (ø)
rust ?

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
src/sourmash/lca/lca_db.py 93.28% <100.00%> (ø)
src/core/src/lib.rs
src/core/src/index/mod.rs
src/core/src/sketch/hyperloglog/estimators.rs
src/core/tests/storage.rs
src/core/tests/test.rs
src/core/src/storage.rs
src/core/src/ffi/mod.rs
src/core/src/index/revindex.rs
src/core/src/errors.rs
... and 22 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ebe14dd...8ce3e3d. Read the comment docs.

@ccbaumler
Copy link
Contributor

I'm on it!

@ccbaumler
Copy link
Contributor

@ctb Two questions:

  1. Why use new_hashvals = defaultdict(set) instead of new_hashvals = defaultdict(list)? Would the list data structure provide the indices for the hash values?
  2. In the function test_api_create_insert_two_then_scale_then_add(), should the other ksize variables and changes in scale be accounted for?

@ctb
Copy link
Contributor Author

ctb commented Jul 11, 2022

@ctb Two questions:

1. Why use `new_hashvals = defaultdict(set)` instead of `new_hashvals = defaultdict(list)`? Would the `list` data structure provide the indices for the hash values?

set is a good data structure to use for situations where you have a container of element and:

  • you don't care about ordering of elements;
  • you are only asking about presence/absence, not multiplicity (number of times something is seen);

and they are optimized for membership checks.

list, by contrast,

  • maintain order of elements
  • will store multiple identical elements

and they are not optimized for a membership check, but rather for iteration (for item in listx:) and indexing (item = listx[i]).

2. In the function `test_api_create_insert_two_then_scale_then_add()`, should the other `ksize` variables and changes in `scale` be accounted for?

this is hard to respond to comprehensively, because it enters a whole world of testing philosophy, but here are a few thoughts -

(1) a good rule for a FIRST test in a bugfix PR is one that replicates the bug precisely.
(2) I strongly believe that there is no interaction in this particular aspect of the code with ksize and scaled.
(3) generally speaking I usually want a specific reason to add a test - to increase code statement coverage, or branch coverage;

that all having been said, if you have reason to believe that I'm wrong - especially about point (2) - the ideal path forward would be to describe why you think we need additional tests, or -- even better! -- to write a test that breaks something that should work :).

tl;dr we can always write more tests, but it's good to have a specific reason for doing so, however nebulous.

@ctb
Copy link
Contributor Author

ctb commented Jul 11, 2022

(great questions!)

Copy link
Contributor

@ccbaumler ccbaumler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the insight! LGTM.

@ctb ctb merged commit 84f9f65 into latest Jul 11, 2022
@ctb ctb deleted the fix/lca_db_downsample branch July 11, 2022 15:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants