implement google's sparsehash #689
To clarify @mr-c's rather terse message :). Google Sparsehash (https://code.google.com/p/sparsehash/) could be implemented as a dynamically sized replacement for the core count()/get_count() behavior in Hashtable or, more specifically, in at least one of its instantiations in CountingHash and/or Hashbits. This could usefully be done as a hack/slash at first, just getting it in there; it could even be done as a subclass of CountingHash, since most or even all of the cruft in CountingHash either uses count()/get_count() directly or ignores it completely.
My guess is that this is a fun 1 hr exercise for someone who knows C++ well /cc @brtaylor92, plus a fair amount of cursing about our C++ object hierarchy afterwards :)
sparsehash documentation is a bit sparse (HAH!), but I used the sparse_hash_set to try to do exact cardinality counting. It's not the same API as the sparse_hash_map, but it is pretty close. The code is in a gist for now, and I based it on the example in the docs. For testing, I used a file with:
It took about 10 minutes to run and consumed 12 GB (and it was single threaded). For comparison, the Python version consumed more than 30 GB (I don't remember how long it took to run) and reported a 32-mer cardinality of 129,196,601. Yes, they are not the same; I need to check why. I'm inclined to consider the Python version correct, because I don't know how sparse_hash_set is implemented or how it deals with collisions. Finally, the HLL counter version reported 129,388,424 32-mers and took 53 s to run (single threaded); using 16 threads it took 4 s. This is within the 1% acceptable error I used to initialize it.
Nice, @luizirber!
Luiz, the HLL counter performs pretty well. Is there a newer version?
@qingpeng , the feature/hll-counter branch will be merged soon. I need to improve the sandbox script, but it shows how to use HLL to read from a file. |
BTW, to whoever tackles this, sparsehash doesn't compile with clang. Check this patch for a solution. |
@mr-c suggested doing this with a cascading bloom filter/cascading count min sketch instead: http://arxiv.org/abs/1302.7278 |
Hrrrrm, rethinking the suggestion of using a cascading bloom filter. There are two different use cases:
My guess is that sparsehash is a good simple implementation to mature the API, and would be the place to start. (Or, heck, just an stl map, although that probably wouldn't be very performant.) cc @mr-c |
I'm maintaining the exact cardinality counting with sparsehash implementation at https://github.com/luizirber/2014-hll-counter/blob/b11fb3f536ab654021bb8ce914fa07a9404b3b3e/src/unique_kmers.cc
Forgetful bloom filters (#1198) may also be a way to implement dynamic memory allocation (see comment above, #689 (comment)).
A quick hack to check how it performs: https://github.com/dib-lab/khmer/compare/feature/sparsehash

Notebook with the results (and plot):

Command for generating the output files (using pidstat):

    cd bench
    python counting_cmp.py SRR1304364_1.fastq & pidstat -p $! -r 1 | gzip > output_nosh.pidstat.gz

So... the overhead is quite big: it ends up using almost double the memory of the current solution, and it also takes way longer to run. If someone wants to try to use
Hi @luizirber, this is sort of expected, right? Our existing approach is optimized for memory (first), and should be quite fast; it's going to take a lot of specific engineering to compete. But the API can be valuable. |
sparse++ may be interesting as an alternative to/derivative of sparsehash |
Maybe @brtaylor92 would be interested