Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Extract storage into separate class; split Hashtable into Hashtable and Hashgraph. #1495

Merged
merged 28 commits into from
Nov 3, 2016

Conversation

ctb
Copy link
Member

@ctb ctb commented Oct 30, 2016

This PR separates the storage details of hashes from the actual hash function. It should support both more types of storage (4-bit and two byte CountMin sketches? maybe even exact storage?) and multiple hash functions.

It also separates out table and graph operations, making it easier to split class defn's between functionality that needs reversible hashing (graph operations) and functions that don't (table functions). Ultimately this can lead to new CPython objects (maybe Nodetable and Counttable?) that support k > 32 while the Nodegraph and Countgraph CPython objects would remain limited to k <= 32.

I tried to use templates instead of composition but couldn't get the various friend & inheritance declarations to work. Suggestions welcome.

No discernable performance impact was observed in some basic benchmarking.

Built on #1494. Might be a way to do #1490 (irreversible hashing for k > 32).

  • Is it mergeable?
  • make test Did it pass the tests?
  • make clean diff-cover If it introduces new functionality in
    scripts/ is it tested?
  • make format diff_pylint_report cppcheck doc pydocstyle Is it well
    formatted?
  • Did it change the command-line interface? Only additions are allowed
    without a major version increment. Changing file formats also requires a
    major version number increment.
  • Is it documented in the ChangeLog?
    http://en.wikipedia.org/wiki/Changelog#Format
  • Was a spellchecker run on the source code and documentation after
    changes were made?
  • Do the changes respect streaming IO? (Are they
    tested for streaming IO?)
  • Is the Copyright year up to date?

@ctb
Copy link
Member Author

ctb commented Oct 30, 2016

@betatim @luizirber I am having linker symbol resolution issues. Any suggestions?

@ctb
Copy link
Member Author

ctb commented Oct 30, 2016

Got it to compile!

Comments would be appreciated. Is it worth following through on this and fixing up the load and save code?

@ctb ctb changed the base branch from refactor/extract_hash to master October 31, 2016 14:42
@ctb
Copy link
Member Author

ctb commented Nov 1, 2016

All the tests pass!

@betatim @luizirber @standage ready for an initial review!

@codecov-io
Copy link

codecov-io commented Nov 1, 2016

Current coverage is 95.80% (diff: 100%)

No coverage report found for master at 4e5a234.

Powered by Codecov. Last update 4e5a234...25179ed

@ctb
Copy link
Member Author

ctb commented Nov 3, 2016

Ready for review! @betatim @standage

@betatim
Copy link
Member

betatim commented Nov 3, 2016

Mostly good 😄

One pedantic comment: all the variables that used to be of type Hashtable but now are Hashgraphs are still called hashtable :-/ It'll start getting confusing when Hashtables are used more why the variable called hashtable actually contains a graph...

I only spotted one used of Hashtable directly, everyone else goes through Hashgraph. Makes me wonder if the other stop_bf related things should continue to be Hashtables as well. This is where I'll resume the reviewing.

Copy link
Member

@standage standage left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made a few minor (also pedantic) comments about the class design. But I must say that being able to make changes of this scale confidently is a HUGE testament to the robustness of the khmer test suite!

std::vector<uint64_t> _tablesizes;
size_t _n_tables;
uint64_t _occupied_bins;
uint64_t _n_unique_kmers;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why duplicate these class members? Isn't the idea of inheritance that the base class stores the common members and functions?

friend class Hashbits;
protected:
std::vector<uint64_t> _tablesizes;
size_t _n_tables;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this redundant with _tablesizes.size()? Not a big deal, but why risk the two variables getting out of sync?

@ctb
Copy link
Member Author

ctb commented Nov 3, 2016

@standage, I'll fix your points on the next PR, which is forthcoming.

@betatim, the inheritance hierarchy will expand and I will do a more thorough review then.

Thanks all :).

@ctb ctb merged commit 34711f6 into master Nov 3, 2016
@ctb ctb deleted the refactor/storage branch November 3, 2016 19:39
@ctb
Copy link
Member Author

ctb commented Nov 14, 2016

On Thu, Nov 03, 2016 at 11:22:01AM -0700, Tim Head wrote:

One pedantic comment: all the variables that used to be of type Hashtable but now are Hashgraphs are still called hashtable :-/ It'll start getting confusing when Hashtables are used more why the variable called hashtable actually contains a graph...

this should all have been fixed in #1504.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants