
implement google's sparsehash #689

Open
mr-c opened this issue Dec 12, 2014 · 14 comments

Comments

@mr-c
Contributor

mr-c commented Dec 12, 2014

  • mature the hash API

Maybe @brtaylor92 would be interested

@ctb
Member

ctb commented Dec 14, 2014

To clarify @mr-c's rather terse message :).

Google Sparsehash (https://code.google.com/p/sparsehash/) could be implemented as a dynamically sized replacement for the core count()/get_count() behavior in Hashtable, or, more specifically, in at least one of its instantiations in CountingHash and/or Hashbits. This could usefully be done as a hack/slash at first, just getting it in there; it could even be done as a subclass of CountingHash, since most or even all of the cruft in CountingHash either uses count()/get_count() directly or ignores it completely.

@ctb
Member

ctb commented Dec 14, 2014

My guess is that this is a fun 1 hr exercise for someone who knows C++ well /cc @brtaylor92, plus a fair amount of cursing about our C++ object hierarchy afterwards :)

@luizirber
Member

sparsehash documentation is a bit sparse (HAH!), but I used the sparse_hash_set to try to do exact cardinality counting. It's not the same API as the sparse_hash_map, but it is pretty close. The code is in a gist for now, and I based it on the example in the docs.

For testing, I used a file with:

  • 144 MB
  • 149,943,923 bp
  • 44,336 seqs
  • 3,382.0 average length

It took about 10 minutes to run, consumed 12 GB (and it was single threaded).
The reported cardinality for a 32-mer size was 129,121,210.

For comparison, the Python version consumed more than 30 GB (I don't remember how long it took to run) and reported a 32-mer cardinality of 129,196,601. Yes, they don't match; I need to check why. I'm inclined to consider the Python version correct, because I don't know how sparse_hash_set is implemented or how it deals with collisions.

Finally, the HLL counter version reported 129,388,424 32-mers and took 53s to run (single threaded). Using 16 threads it took 4s. This is inside the 1% acceptable error I used to initialize it.
It consumed 21 MB of RAM (C++ version) or 140 MB (through the Python interpreter).

@ctb
Member

ctb commented Dec 14, 2014

Nice, @luizirber!

@qingpeng
Contributor

Luiz, the HLL counter performs pretty well. Is there a newer version of the script or something that I can use to integrate into my IGS analysis pipeline?


@luizirber
Member

@qingpeng , the feature/hll-counter branch will be merged soon. I need to improve the sandbox script, but it shows how to use HLL to read from a file.

@luizirber
Member

BTW, to whoever tackles this, sparsehash doesn't compile with clang. Check this patch for a solution.

@ctb
Member

ctb commented May 11, 2015

@mr-c suggested doing this with a cascading bloom filter/cascading count min sketch instead: http://arxiv.org/abs/1302.7278

@ctb
Member

ctb commented May 26, 2015

Hrrrrm, rethinking the suggestion of using a cascading bloom filter.

There are three different use cases:

  • a structure that can grow dynamically, so we don't need to preallocate;
  • a structure that can store exact graph structures, to alleviate concerns about inexactness;
  • a simple implementation to mature the API;

My guess is that sparsehash is a good simple implementation to mature the API, and would be the place to start. (Or, heck, just an stl map, although that probably wouldn't be very performant.)

cc @mr-c

@luizirber
Member

I'm maintaining the exact cardinality counting with sparsehash implementation at https://github.com/luizirber/2014-hll-counter/blob/b11fb3f536ab654021bb8ce914fa07a9404b3b3e/src/unique_kmers.cc
(there is also a very crude Makefile for compiling it using libkhmer.a)

@mr-c mr-c added this to the 2.0+ milestone Jul 30, 2015
@ctb
Member

ctb commented Jul 31, 2015

Forgetful bloom filters (#1198) may also be a way to implement dynamic memory allocation (see comment above, #689 (comment)).

@luizirber
Member

A quick hack to check how sparsetable from sparsehash behaves:

https://github.com/dib-lab/khmer/compare/feature/sparsehash

Notebook with the results (and plot):
https://github.com/dib-lab/khmer/blob/792d58c061e69adade744832ebc2dbd84350af51/bench/Benchmark.ipynb

Command for generating the output files (using pidstat):

cd bench
python counting_cmp.py SRR1304364_1.fastq & pidstat -p $! -r 1 | gzip > output_nosh.pidstat.gz

So... the overhead is quite big: it ends up using almost double the memory of the current solution, and it also takes much longer to run. If someone wants to try sparse_hash_map instead, it might work better. Good luck =]

@ctb
Member

ctb commented May 16, 2016

Hi @luizirber, this is sort of expected, right? Our existing approach is optimized for memory (first), and should be quite fast; it's going to take a lot of specific engineering to compete. But the API can be valuable.

@brtaylor92
Contributor

sparse++ may be interesting as an alternative to/derivative of sparsehash

@standage standage modified the milestones: unscheduled, 2.0+ Feb 23, 2017