Release rocPRIM 3.3.0 for ROCm 6.3.0 · ROCm/rocPRIM

Added

Changed the default value of rmake.py -a to default_gpus. This is equivalent to gfx906:xnack-,gfx1030,gfx1100,gfx1101,gfx1102,gfx1151,gfx1200,gfx1201.
The --test smoke option has been added to rtest.py. When rtest.py is called with this option it runs a subset of tests such that the total test time is 5 minutes. Use python3 ./rtest.py --test smoke or python3 ./rtest.py -t smoke to run the smoke test.
The --seed option has been added to run_benchmarks.py. The --seed option specifies a seed for the generation of random inputs. When the option is omitted, the default behavior is to use a random seed for each benchmark measurement.
Added configuration autotuning to device partition (rocprim::partition, rocprim::partition_two_way, and rocprim::partition_three_way), to device select (rocprim::select, rocprim::unique, and rocprim::unique_by_key), and to device reduce by key (rocprim::reduce_by_key) to improve performance on selected architectures.
Added rocprim::uninitialized_array to provide uninitialized storage in local memory for user-defined types.
Added large segment support for rocprim::segmented_reduce.
Added a parallel nth_element device function similar to std::nth_element. nth_element places elements that are smaller than the nth element before the nth element, and elements that are bigger than the nth element after the nth element.
Added deterministic (bitwise reproducible) algorithm variants rocprim::deterministic_inclusive_scan, rocprim::deterministic_exclusive_scan, rocprim::deterministic_inclusive_scan_by_key, rocprim::deterministic_exclusive_scan_by_key, and rocprim::deterministic_reduce_by_key. These provide run-to-run stable results with non-associative operators such as float operations, at the cost of reduced performance.
Added parallel partial_sort and partial_sort_copy device functions similar to std::partial_sort and std::partial_sort_copy. partial_sort and partial_sort_copy arrange elements such that the elements are in the same order as a sorted list up to and including the middle index.
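
The ordering guarantees described for nth_element and partial_sort can be illustrated on the host with their standard library counterparts. This is only a sketch of the semantics using std::nth_element and std::partial_sort. It does not call rocPRIM, and the helper names select_nth, sort_prefix, and is_partitioned_around are hypothetical. Note that std::partial_sort takes an exclusive end iterator, so the sketch adds one to match the "up to and including the middle index" wording above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side analogue only: C++ standard library, not the rocPRIM device API.
// Returns a copy of `v` rearranged so that v[n] holds the value it would have
// in a fully sorted sequence.
inline std::vector<int> select_nth(std::vector<int> v, std::size_t n)
{
    assert(n < v.size());
    std::nth_element(v.begin(), v.begin() + static_cast<std::ptrdiff_t>(n), v.end());
    return v;
}

// Returns a copy of `v` whose prefix v[0..middle] (inclusive) is sorted.
// std::partial_sort expects an exclusive end, hence the `+ 1`.
inline std::vector<int> sort_prefix(std::vector<int> v, std::size_t middle)
{
    assert(middle < v.size());
    std::partial_sort(v.begin(),
                      v.begin() + static_cast<std::ptrdiff_t>(middle) + 1,
                      v.end());
    return v;
}

// Checks the nth_element postcondition: nothing before index n is greater
// than v[n], and nothing after index n is smaller than v[n].
inline bool is_partitioned_around(const std::vector<int>& v, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i)
        if (v[i] > v[n]) return false;
    for (std::size_t i = n + 1; i < v.size(); ++i)
        if (v[i] < v[n]) return false;
    return true;
}
```

Elements on either side of the nth position are left in an unspecified order, which is why nth_element can be cheaper than a full sort.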
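
The motivation for the deterministic variants can be seen in a small host-side sketch: floating-point addition is not associative, so the order in which partial results are combined can change the outcome. The example below is plain C++ and does not use rocPRIM. The function names are hypothetical, and it assumes IEEE 754 single-precision arithmetic without fast-math style reassociation.

```cpp
#include <cassert>
#include <vector>

// Sums in the order ((x0 + x1) + x2) + ...
inline float sum_left_to_right(const std::vector<float>& x)
{
    float acc = 0.0f;
    for (float v : x)
        acc += v;
    return acc;
}

// Sums the same values starting from the last element.
inline float sum_right_to_left(const std::vector<float>& x)
{
    float acc = 0.0f;
    for (auto it = x.rbegin(); it != x.rend(); ++it)
        acc += *it;
    return acc;
}
```

For the input {1e8f, -1e8f, 1.0f}, the left-to-right sum is 1 while the right-to-left sum is 0, because 1 is lost to rounding when it is added to -1e8 first. A parallel scan or reduction whose combination order varies between runs can therefore return different bits for the same input, which is what the deterministic variants avoid.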

Changed

Modified the input size in device adjacent difference benchmarks. Observed performance with these benchmarks might be different.
Changed the default seed for device_benchmark_segmented_reduce.

Removed

rocprim::thread_load() and rocprim::thread_store() have been deprecated. Use dereference() instead.

Resolved issues

Fixed an issue in rmake.py where the list storing cmake options would contain individual characters instead of a full string of options.
Resolved an issue in rtest.py where it crashed if the build folder was created without release or debug subdirectories.
Resolved an issue with rtest.py on Windows where passing an absolute path to --install_dir caused a FileNotFound error.
rocPRIM functions are no longer forcefully inlined on Windows. This significantly reduces the build time of debug builds.
block_load, block_store, block_shuffle, block_exchange, and warp_exchange now use placement new instead of copy assignment (operator=) when writing to local memory. This fixes the behavior of custom types with non-trivial copy assignments.
Fixed a bug in the generation of input data for benchmarks, which caused incorrect performance to be reported in specific cases. It may affect the reported performance for one-byte types (uint8_t and int8_t) and instantiations of custom_type. Specifically, device binary search, device histogram, device merge and warp sort are affected.
Fixed a bug for rocprim::merge_path_search where using unsigned offsets would produce incorrect results.
Fixed a bug for rocprim::thread_load and rocprim::thread_store where float and double were not cast to the correct type, resulting in incorrect results.
Resolved an issue where tests were failing when they were compiled with -D_GLIBCXX_ASSERTIONS=ON.
Resolved an issue where algorithms that use an internal serial merge routine caused a memory access fault, which could lead to performance drops when using block sort, device merge sort (block merge), device merge, device partial sort, and device sort (merge sort).
Fixed memory leaks in unit tests due to missing calls to hipFree() and the incorrect use of hipGraphs.
Fixed an issue where certain inputs to block_sort_merge(), device_merge_sort_merge_path(), device_merge(), and warp_sort_stable() caused an assertion error during the call to serial_merge().
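
The placement new change noted above for block_load, block_store, block_shuffle, block_exchange, and warp_exchange can be illustrated with a host-side sketch. Copy assignment requires an already constructed object on the left-hand side, whereas placement new starts a new object's lifetime in raw storage through the copy constructor. The type counted and the function write_with_placement_new are hypothetical and are not part of rocPRIM. A stack buffer stands in for uninitialized local memory.

```cpp
#include <cassert>
#include <new>

// Hypothetical user-defined type with non-trivial copy operations that
// record how they were invoked.
struct counted
{
    int value;
    static int copy_constructions;
    static int copy_assignments;

    explicit counted(int v) : value(v) {}
    counted(const counted& other) : value(other.value) { ++copy_constructions; }
    counted& operator=(const counted& other)
    {
        value = other.value;
        ++copy_assignments;
        return *this;
    }
};
int counted::copy_constructions = 0;
int counted::copy_assignments   = 0;

// Writes `src` into raw, uninitialized storage. No `counted` object exists in
// `storage` beforehand, so assigning to it would be invalid. Placement new
// constructs a fresh object there instead.
inline int write_with_placement_new(const counted& src)
{
    alignas(counted) unsigned char storage[sizeof(counted)];
    counted* p = ::new (static_cast<void*>(storage)) counted(src);
    const int result = p->value;
    p->~counted(); // end the lifetime explicitly before the storage goes away
    return result;
}
```

After one call, the copy constructor has run once and the copy assignment operator has not run at all, so a type whose operator= inspects or releases prior state never sees uninitialized memory.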
