rocPRIM 3.3.0 for ROCm 6.3.0

@rocm-ci rocm-ci released this 03 Dec 19:49
1eefdb7

Added

  • Changed the default value of rmake.py -a to default_gpus. This is equivalent to gfx906:xnack-,gfx1030,gfx1100,gfx1101,gfx1102,gfx1151,gfx1200,gfx1201.
  • The --test smoke option has been added to rtest.py. When rtest.py is called with this option it runs a subset of tests such that the total test time is 5 minutes. Use python3 ./rtest.py --test smoke or python3 ./rtest.py -t smoke to run the smoke test.
  • The --seed option has been added to run_benchmarks.py. The --seed option specifies a seed for the generation of random inputs. When the option is omitted, the default behavior is to use a random seed for each benchmark measurement.
  • Added configuration autotuning to device partition (rocprim::partition, rocprim::partition_two_way, and rocprim::partition_three_way), to device select (rocprim::select, rocprim::unique, and rocprim::unique_by_key), and to device reduce by key (rocprim::reduce_by_key) to improve performance on selected architectures.
  • Added rocprim::uninitialized_array to provide uninitialized storage in local memory for user-defined types.
  • Added large segment support for rocprim::segmented_reduce.
  • Added a parallel nth_element device function similar to std::nth_element. nth_element places elements that are smaller than the nth element before the nth element, and elements that are bigger than the nth element after the nth element.
  • Added deterministic (bitwise reproducible) algorithm variants rocprim::deterministic_inclusive_scan, rocprim::deterministic_exclusive_scan, rocprim::deterministic_inclusive_scan_by_key, rocprim::deterministic_exclusive_scan_by_key, and rocprim::deterministic_reduce_by_key. These provide run-to-run stable results with non-associative operators such as float operations, at the cost of reduced performance.
  • Added parallel partial_sort and partial_sort_copy device functions similar to std::partial_sort and std::partial_sort_copy. partial_sort and partial_sort_copy arrange elements such that the elements are in the same order as a sorted list up to and including the middle index.

Changed

  • Modified the input size in device adjacent difference benchmarks. Observed performance with these benchmarks might be different.
  • Changed the default seed for device_benchmark_segmented_reduce.

Removed

  • rocprim::thread_load() and rocprim::thread_store() have been deprecated. Use dereference() instead.

Resolved issues

  • Fixed an issue in rmake.py where the list storing cmake options would contain individual characters instead of a full string of options.
  • Resolved an issue in rtest.py where it crashed if the build folder was created without release or debug subdirectories.
  • Resolved an issue with rtest.py on Windows where passing an absolute path to --install_dir caused a FileNotFound error.
  • rocPRIM functions are no longer forcefully inlined on Windows. This significantly reduces the build time of debug builds.
  • block_load, block_store, block_shuffle, block_exchange, and warp_exchange now use placement new instead of copy assignment (operator=) when writing to local memory. This fixes the behavior of custom types with non-trivial copy assignments.
  • Fixed a bug in the generation of input data for benchmarks, which caused incorrect performance to be reported in specific cases. It may affect the reported performance for one-byte types (uint8_t and int8_t) and instantiations of custom_type. Specifically, device binary search, device histogram, device merge and warp sort are affected.
  • Fixed a bug for rocprim::merge_path_search where using unsigned offsets would produce incorrect results.
  • Fixed a bug for rocprim::thread_load and rocprim::thread_store where float and double were not cast to the correct type, resulting in incorrect results.
  • Resolved an issue where tests were failing when they were compiled with -D_GLIBCXX_ASSERTIONS=ON.
  • Resolved an issue where algorithms that used an internal serial merge routine caused a memory access fault, which could result in performance drops when using block sort, device merge sort (block merge), device merge, device partial sort, and device sort (merge sort).
  • Fixed memory leaks in unit tests due to missing calls to hipFree() and the incorrect use of hipGraphs.
  • Fixed an issue where certain inputs to block_sort_merge(), device_merge_sort_merge_path(), device_merge(), and warp_sort_stable() caused an assertion error during the call to serial_merge().