
implement google's sparsehash #689

Open
mr-c opened this issue Dec 12, 2014 · 14 comments

Comments

@mr-c
Contributor

mr-c commented Dec 12, 2014

  • mature the hash API

Maybe @brtaylor92 would be interested

@ctb
Member

ctb commented Dec 14, 2014

To clarify @mr-c's rather terse message :).

Google Sparsehash (https://code.google.com/p/sparsehash/) could be implemented as a dynamically sized replacement for the core count()/get_count() behavior in Hashtable, or, more specifically, in at least one of its instantiations in CountingHash and/or Hashbits. This could usefully be done as a hack/slash at first, just getting it in there; it could even be done as a subclass of CountingHash, since most or even all of the cruft in CountingHash either uses count()/get_count() directly or ignores it completely.

@ctb
Member

ctb commented Dec 14, 2014

My guess is that this is a fun 1 hr exercise for someone who knows C++ well /cc @brtaylor92, plus a fair amount of cursing about our C++ object hierarchy afterwards :)

@luizirber
Member

sparsehash documentation is a bit sparse (HAH!), but I used the sparse_hash_set to try to do exact cardinality counting. It's not the same API as the sparse_hash_map, but it is pretty close. The code is in a gist for now, and I based it on the example in the docs.

For testing, I used a file with:

  • 144 MB
  • 149,943,923 bp
  • 44,336 seqs
  • 3,382.0 average length

It took about 10 minutes to run, consumed 12 GB (and it was single threaded).
The reported cardinality for a 32-mer size was 129,121,210.

For comparison, the Python version consumed more than 30 GB (I don't remember how long it took to run) and reported a 32-mer cardinality of 129,196,601. Yes, they don't match; I need to check why. I'm inclined to consider the Python version correct, because I don't know how sparse_hash_set is implemented or how it deals with collisions.

Finally, the HLL counter version reported 129,388,424 32-mers and took 53s to run (single threaded). Using 16 threads it took 4s. This is inside the 1% acceptable error I used to initialize it.
It consumed 21 MB of RAM (C++ version) or 140 MB (through the Python interpreter).

@ctb
Member

ctb commented Dec 14, 2014

Nice, @luizirber!

@qingpeng
Contributor

Luiz, the HLL counter performs pretty well. Is there a newer version of the script or something that I can use to integrate into my IGS analysis pipeline?


@luizirber
Member

@qingpeng , the feature/hll-counter branch will be merged soon. I need to improve the sandbox script, but it shows how to use HLL to read from a file.

@luizirber
Member

BTW, to whoever tackles this, sparsehash doesn't compile with clang. Check this patch for a solution.

@ctb
Member

ctb commented May 11, 2015

@mr-c suggested doing this with a cascading bloom filter/cascading count min sketch instead: http://arxiv.org/abs/1302.7278

@ctb
Member

ctb commented May 26, 2015

Hrrrrm, rethinking the suggestion of using a cascading bloom filter.

There are three different use cases:

  • a structure that can grow dynamically, so we don't need to preallocate;
  • a structure that can store exact graph structures, to alleviate concerns about inexactness;
  • a simple implementation to mature the API;

My guess is that sparsehash is a good simple implementation to mature the API, and would be the place to start. (Or, heck, just an stl map, although that probably wouldn't be very performant.)

cc @mr-c

@luizirber
Member

I'm maintaining the exact cardinality counting with sparsehash implementation at https://github.com/luizirber/2014-hll-counter/blob/b11fb3f536ab654021bb8ce914fa07a9404b3b3e/src/unique_kmers.cc
(there is also a very crude Makefile for compiling it using libkhmer.a)

@mr-c mr-c added this to the 2.0+ milestone Jul 30, 2015
@ctb
Member

ctb commented Jul 31, 2015

Forgetful bloom filters (#1198) may also be a way to implement dynamic memory allocation (see comment above, #689 (comment)).

@luizirber
Member

A quick hack to check how sparsetable from sparsehash behaves:

https://github.com/dib-lab/khmer/compare/feature/sparsehash

Notebook with the results (and plot):
https://github.com/dib-lab/khmer/blob/792d58c061e69adade744832ebc2dbd84350af51/bench/Benchmark.ipynb

Command for generating the output files (using pidstat):

cd bench
python counting_cmp.py SRR1304364_1.fastq & pidstat -p $! -r 1 | gzip > output_nosh.pidstat.gz

So... the overhead is quite big: it ends up using almost double the memory of the current solution, and it also takes much longer to run. If someone wants to try sparse_hash_map instead, it might work better. Good luck =]

@ctb
Member

ctb commented May 16, 2016

Hi @luizirber, this is sort of expected, right? Our existing approach is optimized for memory (first), and should be quite fast; it's going to take a lot of specific engineering to compete. But the API can be valuable.

@brtaylor92
Contributor

sparse++ may be interesting as an alternative to/derivative of sparsehash

@standage standage modified the milestones: unscheduled, 2.0+ Feb 23, 2017