[MRG] fix bug in `LCA_Database.downsample` #2117

ctb · 2022-07-11T13:15:06Z

I discovered this bug while working on other things -

In the base LCA_Database implementation, a dictionary _hashval_to_idx is used to track which hash values (keys) belong to which set of signatures (individual signatures are referenced by integers, idx). The dictionary is created as a collections.defaultdict(set) ref so that when adding a new key, it automatically comes with a new set as a value - this permits the call

self._hashval_to_idx[hashval].add(idx)

in LCA_Database.insert(...).

The bug is that after LCA_Database.downsample(...) is called, _hashval_to_idx is recreated as a regular dictionary, which causes insert to fail.

This PR adds a new test, test_lca.py:test_api_create_insert_two_then_scale_then_add, that does a downsample and then does an insert. This test exposes the bug above, which the PR also fixes.

ctb · 2022-07-11T13:15:44Z

@ccbaumler can you take on reviewing this? Should be enough to verify that the description of the PR matches the code changes, and that the tests pass. No hurry. Thanks!

codecov · 2022-07-11T13:23:01Z

Codecov Report

Merging #2117 (8ce3e3d) into latest (ebe14dd) will increase coverage by 7.38%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           latest    #2117      +/-   ##
==========================================
+ Coverage   84.30%   91.69%   +7.38%     
==========================================
  Files         130       99      -31     
  Lines       15280    11004    -4276     
  Branches     2171     2171              
==========================================
- Hits        12882    10090    -2792     
+ Misses       2095      611    -1484     
  Partials      303      303

Flag	Coverage Δ
python	`91.69% <100.00%> (ø)`
rust	`?`

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files	Coverage Δ
src/sourmash/lca/lca_db.py	`93.28% <100.00%> (ø)`
src/core/src/lib.rs
src/core/src/index/mod.rs
src/core/src/sketch/hyperloglog/estimators.rs
src/core/tests/storage.rs
src/core/tests/test.rs
src/core/src/storage.rs
src/core/src/ffi/mod.rs
src/core/src/index/revindex.rs
src/core/src/errors.rs
... and 22 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ebe14dd...8ce3e3d. Read the comment docs.

ccbaumler · 2022-07-11T13:23:50Z

I'm on it!

ccbaumler · 2022-07-11T15:27:33Z

@ctb Two questions:

Why use new_hashvals = defaultdict(set) instead of new_hashvals = defaultdict(list)? Would the list data structure provide the indices for the hash values?
In the function test_api_create_insert_two_then_scale_then_add(), should the other ksize variables and changes in scale be accounted for?

ctb · 2022-07-11T15:36:28Z

@ctb Two questions:

1. Why use `new_hashvals = defaultdict(set)` instead of `new_hashvals = defaultdict(list)`? Would the `list` data structure provide the indices for the hash values?

set is a good data structure to use for situations where you have a container of element and:

you don't care about ordering of elements;
you are only asking about presence/absence, not multiplicity (number of times something is seen);

and they are optimized for membership checks.

list, by contrast,

maintain order of elements
will store multiple identical elements

and they are not optimized for a membership check, but rather for iteration (for item in listx:) and indexing (item = listx[i]).

2. In the function `test_api_create_insert_two_then_scale_then_add()`, should the other `ksize` variables and changes in `scale` be accounted for?

this is hard to respond to comprehensively, because it enters a whole world of testing philosophy, but here are a few thoughts -

(1) a good rule for a FIRST test in a bugfix PR is one that replicates the bug precisely.
(2) I strongly believe that there is no interaction in this particular aspect of the code with ksize and scaled.
(3) generally speaking I usually want a specific reason to add a test - to increase code statement coverage, or branch coverage;

that all having been said, if you have reason to believe that I'm wrong - especially about point (2) - the ideal path forward would be to describe why you think we need additional tests, or -- even better! -- to write a test that breaks something that should work :).

tl;dr we can always write more tests, but it's good to have a specific reason for doing so, however nebulous.

ctb · 2022-07-11T15:36:54Z

(great questions!)

ccbaumler

Thanks for the insight! LGTM.

test insert after downsample for LCA_Database

8ce3e3d

ccbaumler approved these changes Jul 11, 2022

View reviewed changes

ctb merged commit 84f9f65 into latest Jul 11, 2022

ctb deleted the fix/lca_db_downsample branch July 11, 2022 15:47

ctb mentioned this pull request Jul 19, 2022

Draft release notes for v4.4.2 #2129

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[MRG] fix bug in `LCA_Database.downsample` #2117

[MRG] fix bug in `LCA_Database.downsample` #2117

ctb commented Jul 11, 2022

ctb commented Jul 11, 2022

codecov bot commented Jul 11, 2022 •

edited

Loading

ccbaumler commented Jul 11, 2022

ccbaumler commented Jul 11, 2022

ctb commented Jul 11, 2022

ctb commented Jul 11, 2022

ccbaumler left a comment

[MRG] fix bug in LCA_Database.downsample #2117

[MRG] fix bug in LCA_Database.downsample #2117

Conversation

ctb commented Jul 11, 2022

ctb commented Jul 11, 2022

codecov bot commented Jul 11, 2022 • edited Loading

Codecov Report

ccbaumler commented Jul 11, 2022

ccbaumler commented Jul 11, 2022

ctb commented Jul 11, 2022

ctb commented Jul 11, 2022

ccbaumler left a comment

Choose a reason for hiding this comment

[MRG] fix bug in `LCA_Database.downsample` #2117

[MRG] fix bug in `LCA_Database.downsample` #2117

codecov bot commented Jul 11, 2022 •

edited

Loading