This repository contains benchmarks for the AMS-Filter, a high-performance blocked Bloom filter based on the implementation presented in the paper Performance-Optimal Filtering: Bloom Overtakes Cuckoo at High Throughput.
- A CPU with AVX2 support (or AVX-512). E.g, AMD Ryzen, Intel Haswell (or later)
- CUDA v9.0 or later.
- A C++14-capable compiler compatible with your version of CUDA.
- The Boost C++ Libraries, version 1.58 or later
git clone git@github.com:harald-lang/amsfilter.git
cd amsfilter
git submodule update --remote --recursive --init
mkdir build
cd build/
cmake -DCMAKE_BUILD_TYPE=Release ..
make -j 6
- host: runs a CPU-only benchmark
- device: runs the benchmark on a (single) GPU
- device_raw: similar to device, but the performance measurements do NOT include any data transfers between host and device.
- coproc: runs the CPU-GPU co-processing benchmark
The benchmarks are parameterized through environment variables which are described in the following.
-
CONFIG
: used to configure the Bloom filter. The configuration consists of a comma-separated list of four integer values:- The number of words per block (w), which defines the size of a block within the Bloom filter. Each block consists of one or more 32-bit words.
- The number of sectors per block (s).
- The number of sector groups (z).
- The number of hash functions (k).
By default, a register-blocked Bloom filter is used (=1,1,1,4).
-
GEN
: defines how the data is generated. (default: uniform)- uniform: generates random integers, uniformly distributed.
- markov,f: generates random integers using a Markov process. The parameter f refers to the clustering factor which is the average number of consecutive identical integers. For instance, the clustering factor of the integer sequence 1, 1, 1, 2, 2, 3, 3, 3, 4, 4 is 2.5, as there are 4 runs and the total length of the sequence is 10. The Markov process generates a uniformly distributed integer sequence if f = 1. - Note, that the actual f is at most 2% off.
-
INSERT_CNT_LOG2
: the log2 of the number of keys to insert during the build phase (default: 24) -
LOOKUP_CNT_LOG2
: the log2 of the number of keys to lookup during the probe phase (default: 28) -
KEYS_PINNED
: 0 = keys are located in pageable memory, 1 = keys are located in pinned memory (default: 1) -
VALIDATE
: 0 = do not validate the results (default: 1)
BATCH_SIZE_LOG2
: the log2 of the number of keys to lookup in one go (default: 10)THREAD_CNT
: the number of threads that concurrently probe the filter
BATCH_SIZE_LOG2
: the log2 of the number of keys to lookup in one go (default: 10)DEVICE_NO
: specifies the CUDA device to use (default: 0)
HOST_BATCH_SIZE_LOG2
: the log2 of the number of keys to lookup in one go on the host side (default: 10)DEVICE_BATCH_SIZE_LOG2
: the log2 of the number of keys to lookup in one go on the device side (default: 20)THREAD_CNT
: the number of CPU threads that concurrently probe the filter on the host side (default: # of available cores)CO_THREAD_CNT
: the number of CPU threads that dispatch work to a GPU (default: THREAD_CNT)DEVICE_NO
: specifies the CUDA device(s) to use. Multiple devices can be specified as a comma-separated list (default: 0)
The following examples shows how to benchmark a cache-sectorized Bloom filter using all CPU cores and two GPU devices:
CONFIG=32,32,2,8 DEVICE_NO=0,1 VALIDATE=0 ./benchmark_coproc