-
Notifications
You must be signed in to change notification settings - Fork 4
/
README
64 lines (54 loc) · 4.41 KB
/
README
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
This library implements a bucket-based selection algorithm on GPUs
More details can be found in
* T. Ribizel and H. Anzt, "Approximate and Exact Selection on GPUs," 2019 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), Rio de Janeiro, Brazil, 2019, pp. 471-478.
doi: 10.1109/IPDPSW.2019.00088
* T. Ribizel, H. Anzt, "Parallel selection on GPUs," Parallel Computing, Volume 91, 2020, doi: 10.1016/j.parco.2019.102588
It uses Catch2 as a test framework and the CUB library as a reference implementation for sorting.
The tests can be run by simply executing `app/unittests`, the benchmarks can be run by executing `app/benchmark` with one of the following parameters
[full] The full benchmark for exact single and multiple selection and the individual kernels (sample, count, reduce, filter)
[full-multionly] The full benchmark for multiple selection only
[approx] The full benchmark for approximate selection with shared-memory atomics
[approx-g] The full benchmark for approximate selection with global-memory atomics
[multi] The full benchmark for multiple selection with different numbers of ranks
[test] A small benchmark that only executes a single benchmark with small input size
The output of these tests is the following:
On stdout, they print error messages in case the algorithm execution produces invalid results. For the approx tests, additionally the exact and approximate rank are being output in CSV format.
On stderr, they print the individual timings of the kernels in CSV format for different input sizes given by the first CSV field. Runtime breakdowns are listed within parentheses ().
`app/benchmark-sort` contains a benchmark for the CUB radix sort implementation as a performance baseline for the multiple selection.
Structure of the project
include/cpu_reference.hpp - Reference implementations for testing
include/verification.hpp - Validation functions for testing
include/cuda_definitions.cuh - Type definitions and hardware limits
include/cuda_error.cuh - Wrapper for CUDA error handling
include/cuda_memory.cuh - Wrapper for CUDA memory allocations
include/cuda_timer.cuh - Wrapper for CUDA timing measurements
include/kernel_config.cuh - Configuration struct for kernel templates
include/launcher_fwd.cuh - Forward-declarations of launcher and kernel templates
lib/generated/* - Explicit template instantiations to parallelize compilation
lib/cpu_reference.cpp - Reference implementations for testing
lib/verification.cpp - Validation functions for testing
lib/qs_launchers.cuh - Wrappers for quickselect kernels
lib/qs_recursion.cuh - Kernels for quickselect single-selection
lib/qs_recursion_multi.cuh - Kernels for quickselect multi-selection
lib/qs_reduce.cuh - Kernels for reducing quickselect partial sums
lib/qs_scan.cuh - Kernels for quickselect bipartitioning
lib/ssss_build_searchtree.cuh - Kernels for sampleselect sampling
lib/ssss_collect.cuh - Kernels for sampleselect single-selection filtering
lib/ssss_collect_multi.cuh - Kernels for sampleselect multi-selection filtering
lib/ssss_count.cuh - Kernels for sampleselect counting
lib/ssss_launchers.cuh - Wrappers for sampleselect kernels
lib/ssss_merged.cuh - Kernels for multiple simultaneous sampleselects
lib/ssss_merged_memory.cuh - Auxiliary data structure for sampleselect multi-selection
lib/ssss_recursion.cuh - Kernels for sampleselect single-selection
lib/ssss_recursion_multi.cuh - Kernels for sampleselect multi-selection
lib/ssss_reduce.cuh - Kernels for reducing sampleselect partial sums
lib/utils_basecase.cuh - Kernels for recursion basecase
lib/utils_bytestorage.cuh - Auxiliary functions for reading/writing unaligned bytes
lib/utils_mask.cuh - Auxiliary functions for bitmasks
lib/utils_prefixsum.cuh - Auxiliary functions for tree-based partial sums
lib/utils_sampling.cuh - Auxiliary functions for sampling
lib/utils_search.cuh - Auxiliary functions for binary and warp-ary searches
lib/utils_sort.cuh - Auxiliary functions for bitonic sorting
lib/utils_warpaggr.cuh - Auxiliary functions for warp-aggregation
lib/utils_work.cuh - Auxiliary functions for work-distribution
lib/utils.cuh - Auxiliary wrappers for basic operations