sketch is a generic, header-only library providing implementations of a variety of sketch data structures for scalable and streaming applications.
All have been accelerated with SIMD parallelism where possible, most are composable, and many are threadsafe unless -DNOT_THREADSAFE
is passed as a compilation flag.
Documentation for the Python interface is available here.
We directly include blaze-lib, libpopcnt, compact_vector, ska::flat_hash_map, and xxHash for various utilities. We also have two submodules:
- pybind11, only used for python bindings.
- SLEEF for vectorized math, incorporated with vec.h. It's optionally used (disabled by defining
-DNO_SLEEF=1/#define NO_SLEEF 1
) and only applicable to using rnla.h through blaze-lib.
You can ignore both for most use cases.
- HyperLogLog Implementation [hll.h]
hll_t
/hllbase_t<HashStruct>
- Estimates the cardinality of a set using log(log(cardinality)) bits.
- Threadsafe unless
-DNOT_THREADSAFE
is passed. - Currently,
hll
is the only structure for which python bindings are available, but we intend to extend this in the future.
- HyperBitBit [hbb.h]
- Better per-bit accuracy than HyperLogLogs, but, at least currently, limited to 128 bits/16 bytes in sketch size.
- Bloom Filter [bf.h]
bf_t
/bfbase_t<HashStruct>
- Naive bloom filter
- Currently not threadsafe.
- Count-Min and Count Sketches
- ccm.h (
ccmbase_t<UpdatePolicy=Increment>/ccm_t
(usepccm_t
for Approximate Counting orcs_t
for a count sketch). - The Count sketch is threadsafe if
-DNOT_THREADSAFE
is not passed or if an atomic container is used. Count-Min sketches are currently not threadsafe due to the use of minimal updates. - Count-min sketches can support concept drift if
realccm_t
from mult.h is used.
- ccm.h (
- MinHash sketches
- mh.h (
RangeMinHash
is the currently verified implementation.) We recommend you build the sketch and then convert to a linear container (e.g., astd::vector
) usingto_container<ContainerType>()
or.finalize()
for faster comparisons.- BottomKHasher is an alternate that uses more space to reduce runtime, which finalizes() into the same structure.
- CountingRangeMinHash performs the same operations as RangeMinHash, but provides multiplicities, which facilitates
histogram_similarity
, a generalization of Jaccard with multiplicities. - Both CountingRangeMinHash and RangeMinHash can be finalized into containers for fast comparisons with
.finalize()
. - A draft HyperMinHash implementation is available as well, but it has not been thoroughly vetted.
- Range MinHash implementationsare not threadsafe.
- HyperMinHash implementation is threa
- mh.h (
- B-Bit MinHash
- bbmh.h
- One-permutation (partition) bbit minhash
- Threadsafe, bit-packed and fully SIMD-accelerated
- Power of two partitions are supported in BBitMinHasher, which is finalized into a FinalBBitMinHash sketch. This is faster than the alternative.
- We also support arbitrary divisions using fastmod64 with DivBBitMinHasher and its corresponding final sketch, FinalDivBBitMinHash.
- One-permutation counting bbit minhash
- Not threadsafe.
- ModHash sketches
- mod.h
- Estimates both containment and jaccard index, but takes a (potentially) unbounded space.
- This returns a FinalRMinHash sketch, reusing the infrastructure for minhash sketches, but which calculates Jaccard index and containment knowing that it was generated via mod, not min.
- HeavyKeeper
- hk.h
- Reference: https://www.usenix.org/conference/atc18/presentation/gong
- A seemingly unilateral improvement over count-min sketches.
- One drawback is the inability to delete items, which makes it unsuitable for sliding windows.
- It shares this characteristic with the Count-Min sketch with conservative update and the Count-Min Mean sketch.
- ntcard
- mult.h
- Threadsafe
- Reference: https://www.ncbi.nlm.nih.gov/pubmed/28453674
- Not SIMD-accelerated, but also general, supporting any arbitrary coverage level
- PCSA
- pc.h
- The HLL is more performant and better-optimized, but this is included for completeness.
- Not threadsafe.
- Reference: https://dl.acm.org/doi/10.1016/0022-0000%2885%2990041-8 Journal of Computer and System Sciences. September 1985 https://doi.org/10.1016/0022-0000(85)90041-8
- SetSketch
- See setsketch.h for continuous and discretized versions of the SetSketch.
- This also includes parameter-setting code.
To build and run the hll test case:
make test && ./test
To use as a header-only library:
using namespace sketch;
hll::hll_t hll(14); // Use 2**14 bytes for this structure
// Add hashed values for each element to the structure.
for(uint64_t i(0); i < 10000000ull; ++i) hll.addh(i);
fprintf(stderr, "Elements estimated: %lf. Error bounds: %lf.\n", hll.report(), hll.est_err());
The other structures work with a similar interface. See the type constructors for more information or view 10xdash for examples on using the same interface for a variety of data structures.
Simply #include sketch/<header_name>
, or, for one include #include <sketch/sketch.h>
,
which allows you to write sketch::bf_t
and sketch::hll_t
without the subnamespaces.
We use inline namespaces for individual types of sketches, e.g., sketch::minhash
or sketch::hll
can be used for clarity, or sketch::hll_t
can be used, omitting the hll
namespace.
Clang on OSX may fail to compile in AVX512 mode. We recommend using homebrew's gcc:
homebrew upgrade gcc || homebrew install gcc
and either setting the environmental variables for CXX and CC to the most recent g++/gcc or providing them as Makefile arguments.
At the time of writing, this is g++-10
and gcc-10
.
By default, updates to the hyperloglog structure to occur using atomic operations, though threading should be handled by the calling code. Otherwise, the flag -DNOT_THREADSAFE
should be passed. The cost of this is relatively minor, but in single-threaded situations, this would be preferred.
Python bindings are available via pybind11. Simply cd python && python setup.py install
.
The package has been renamed to sketch_ds
as of v0.19
Utilities include: 1. Sketching/serialization for sketch data structures 1. Supported: sketch_ds.bbmh.BBitMinHasher, sketch_ds.bf.bf, sketch_ds.hmh.hmh, sketch_ds.hll.hll 2. shs_isz, which computes the intersection size of sorted hash sets. 1. Supported: {uint,int}{32,64}, float32, float64 3. fastmod/fastdiv, which uses the fast modulo reduction to do faster division/mod than numpy. 1. Supportd: {uint,int}{32,64} 4. matrix generation functions - taking a list of sketches and creating the similarity function matrix. 1. Supported: sketch_ds.bbmh.BBitMinHasher, sketch_ds.bf.bf, sketch_ds.hmh.hmh, sketch_ds.hll.hll 2. Types: "jaccard_matrix", "intersection_matrix", "containment_matrix", "union_size_matrix", "symmetric_containment_matrix" 3. Returns a compressed distance matrix. 5. ccount_eq, pcount_eq compute the number of identical registers between integral registers. 1. Inspired by cdist and pdist from scipy.spatial.distance 2. ccount_eq computes the number of identical registers between all pairs of rows between two matrices A and B. 1. Size of returned matrix: (A.shape[0], A.shape[1]) 3. pcount_eq computes the number of identical registers between all pairs of rows in a single matrix A. 1. Size of returned matrix: (A.shape[0] * (A.shape[0]) - 1) / 2 4. pcount_eq output can be transformed from similarities to distances via -np.log(distmat / A.shape[1]).