diff --git a/CHANGELOG.md b/CHANGELOG.md
index f4adc45a1..49615b2c8 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -1,50 +1,43 @@
 # Changelog for rocPRIM
 
-Documentation for rocPRIM is available at
-[https://rocm.docs.amd.com/projects/rocPRIM/en/latest/](https://rocm.docs.amd.com/projects/rocPRIM/en/latest/).
+Full documentation for rocPRIM is available at [https://rocm.docs.amd.com/projects/rocPRIM/en/latest/](https://rocm.docs.amd.com/projects/rocPRIM/en/latest/).
 
-## Unreleased rocPRIM-3.3.0 for ROCm 6.3.0
+## rocPRIM 3.3.0 for ROCm 6.3.0
 
 ### Added
 
-* Add --test smoke option in rtest.py. It will run a subset of tests such that the total test time is in 5 minutes. Use python3 ./rtest.py --test smoke or python3 ./rtest.py -t smoke to execute smoke test.
-* Option `--seed` to benchmarks to specify a seed for the generation of random inputs. The default behavior is to keep using a random seed per benchmark measurement.
-* Added configuration autotuning to device partition (`rocprim::partition`, `rocprim::partition_two_way`, and `rocprim::partition_three_way`), device select (`rocprim::select`, `rocprim::unique`, and `rocprim::unique_by_key`), and device reduce by key (`rocprim::reduce_by_key`) for improved performance on selected architectures.
-* Added `rocprim::uninitialized_array` which provides uninitialized storage in local memory for user-defined types.
+* The `--test smoke` option has been added to `rtest.py`. When `rtest.py` is called with this option it runs a subset of tests such that the total test time is 5 minutes. Use `python3 ./rtest.py --test smoke` or `python3 ./rtest.py -t smoke` to run the smoke test.
+* The `--seed` option has been added to `run_benchmarks.py`. The `--seed` option specifies a seed for the generation of random inputs. When the option is omitted, the default behavior is to use a random seed for each benchmark measurement.
+* Added configuration autotuning to device partition (`rocprim::partition`, `rocprim::partition_two_way`, and `rocprim::partition_three_way`), to device select (`rocprim::select`, `rocprim::unique`, and `rocprim::unique_by_key`), and to device reduce by key (`rocprim::reduce_by_key`) to improve performance on selected architectures.
+* Added `rocprim::uninitialized_array` to provide uninitialized storage in local memory for user-defined types.
 * Added large segment support for `rocprim:segmented_reduce`.
-* Added a parallel `nth_element` device function similar to `std::nth_element`, this function rearranges elements smaller than the n-th before and bigger than the n-th after the n-th element.
+* Added a parallel `nth_element` device function similar to `std::nth_element`. `nth_element` places elements that are smaller than the nth element before the nth element, and elements that are bigger than the nth element after the nth element.
 * Added deterministic (bitwise reproducible) algorithm variants `rocprim::deterministic_inclusive_scan`, `rocprim::deterministic_exclusive_scan`, `rocprim::deterministic_inclusive_scan_by_key`, `rocprim::deterministic_exclusive_scan_by_key`, and `rocprim::deterministic_reduce_by_key`. These provide run-to-run stable results with non-associative operators such as float operations, at the cost of reduced performance.
-* Added a parallel `partial_sort` and `partial_sort_copy` device function similar to `std::partial_sort` and `std::partial_sort_copy`, these functions rearranges elements such that the elements are the same as a sorted list up to and including the middle index.
+* Added parallel `partial_sort` and `partial_sort_copy` device functions similar to `std::partial_sort` and `std::partial_sort_copy`. `partial_sort` and `partial_sort_copy` arrange elements such that the elements are in the same order as a sorted list up to and including the middle index.
 
 ### Changed
 
 * Modified the input size in device adjacent difference benchmarks. Observed performance with these benchmarks might be different.
 * Changed the default seed for `device_benchmark_segmented_reduce`.
 
+### Removed
+
+* `rocprim::thread_load()` and `rocprim::thread_store()` have been deprecated. Use `dereference()` instead.
+
 ### Resolved issues
 
-* Fixed an issue in rtest.py where if the build folder was made without release or debug directory it would crash the program
-* Fixed an issue where while running rtest.py on windows and passing in an absolute path to `--install_dir` causes a `FileNotFound` error.
-* rocPRIM functions are no longer forcefully inlined on Windows, significantly reducing the build
-  time in debug builds.
-* `block_load`, `block_store`, `block_shuffle`, `block_exchange` and `warp_exchange` now use placement `new` instead of copy
-  assignment (`operator=`) when writing to local memory. This fixes the behavior of custom types with non-trivial copy assignments.
+* Resolved an issue in `rtest.py` where it crashed if the `build` folder was created without `release` or `debug` subdirectories.
+* Resolved an issue with `rtest.py` on Windows where passing an absolute path to `--install_dir` caused a `FileNotFound` error.
+* rocPRIM functions are no longer forcefully inlined on Windows. This significantly reduces the build
+  time of debug builds.
+* `block_load`, `block_store`, `block_shuffle`, `block_exchange`, and `warp_exchange` now use placement `new` instead of copy assignment (`operator=`) when writing to local memory. This fixes the behavior of custom types with non-trivial copy assignments.
 * Fixed a bug in the generation of input data for benchmarks, which caused incorrect performance to be reported in specific cases. It may affect the reported performance for one-byte types (`uint8_t` and `int8_t`) and instantiations of `custom_type`. Specifically, device binary search, device histogram, device merge and warp sort are affected.
-* Fixed a bug for `rocprim::merge_path_search` where using `unsigned` offsets would output wrong results.
-* Fixed a bug for `rocprim::thread_load` and `rocprim::thread_store` where `float` and `double` were not casted to the correct type resulting in wrong results.
-* Fix tests failing when compiling with `-D_GLIBCXX_ASSERTIONS=ON`.
-* Fixed a bug for algorithms that use an internal serial merge routine that causes a memory access fault. This may result in a performance drop when using:
-  * block sort,
-  * device merge sort (block merge),
-  * device merge,
-  * device partial sort, and/or
-  * device sort (merge sort).
-* Fixed memory leaks in unit tests that were due to missing hipFree calls and incorrect use of hipGraphs
-* Fixed an issue where on certain inputs to block_sort_merge, device_merge_sort_merge_path, device_merge, and warp_sort_stable would cause an assertion error during its call to serial_merge
-
-### Upcoming changes
-
-* `rocprim::thread_load` and `rocprim::thread_store` are deprecated. Use dereference instead. Not all of those functions are available on every device architecture, and their usage can hurt performance, because inline assembly inhibits optimizations.
+* Fixed a bug for `rocprim::merge_path_search` where using `unsigned` offsets would produce incorrect results.
+* Fixed a bug for `rocprim::thread_load` and `rocprim::thread_store` where `float` and `double` were not cast to the correct type, resulting in incorrect results.
+* Resolved an issue where tests were failing when they were compiled with `-D_GLIBCXX_ASSERTIONS=ON`.
+* Resolved an issue where algorithms that used an internal serial merge routine caused a memory access fault. This fix might result in a performance drop when using block sort, device merge sort (block merge), device merge, device partial sort, and device sort (merge sort).
+* Fixed memory leaks in unit tests due to missing calls to `hipFree()` and the incorrect use of hipGraphs.
+* Fixed an issue where certain inputs to `block_sort_merge()`, `device_merge_sort_merge_path()`, `device_merge()`, and `warp_sort_stable()` caused an assertion error during the call to `serial_merge()`.
 
 ## rocPRIM-3.2.1 for ROCm 6.2.1