
Releases: ROCm/hipCUB

hipCUB 3.2.0 for ROCm 6.2.2

27 Sep 16:01
1875530

hipCUB code for ROCm 6.2.2 did not change. The library was rebuilt for the updated ROCm 6.2.2 stack.

hipCUB 3.2.0 for ROCm 6.2.1

20 Sep 19:57
1875530

hipCUB code for ROCm 6.2.1 did not change. The library was rebuilt for the updated ROCm 6.2.1 stack.

hipCUB 3.2.0 for ROCm 6.2.0

02 Aug 16:15
1875530

Added

  • Added the DeviceCopy function for parity with CUB.

  • In the rocPRIM backend, added enum WarpExchangeAlgorithm, which is used as the new optional template argument for WarpExchange.

    • The potential values for the enum are WARP_EXCHANGE_SMEM and WARP_EXCHANGE_SHUFFLE.
    • WARP_EXCHANGE_SMEM stands for the previous algorithm, while WARP_EXCHANGE_SHUFFLE performs the exchange via shuffle operations.
    • WARP_EXCHANGE_SHUFFLE does not require any pre-allocated shared memory, but ItemsPerThread must be a divisor of WarpSize.
  • Added tuple.hpp which defines templates hipcub::tuple, hipcub::tuple_element, hipcub::tuple_element_t and hipcub::tuple_size.

  • Added new overloaded member functions to BlockRadixSort and DeviceRadixSort that expose a decomposer argument. Keys of a custom
    type (key_type) can be sorted via these overloads, if an appropriate decomposer is passed. The decomposer has to implement
    operator(const key_type&) which returns a hipcub::tuple of references pointing to members of key_type.

  • On AMD GPUs (using the HIP backend), hipCUB API calls can be issued inside
    hipGraphs, with the following exceptions:

    • CachingDeviceAllocator
    • GridBarrierLifetime
    • DeviceSegmentedRadixSort
    • DeviceRunLengthEncode

    These classes currently rely on one or more synchronous calls to function correctly, so they cannot be used inside hipGraphs.
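
The decomposer requirement for the new BlockRadixSort and DeviceRadixSort overloads can be illustrated with a host-only sketch. It is not hipCUB code: std::tuple stands in for hipcub::tuple, std::sort stands in for the radix sort, and custom_key, custom_key_decomposer and sort_by_decomposer are hypothetical names introduced here. The sketch also assumes that the first tuple member is the most significant one when ordering keys.

```cpp
#include <algorithm>
#include <cassert>
#include <tuple>
#include <vector>

// Hypothetical key type for illustration; not part of hipCUB.
struct custom_key
{
    int   group;
    float score;
};

// Decomposer in the shape the release notes describe: operator() takes the
// key and returns a tuple of references to its members.
// Assumption: std::tuple is used here in place of hipcub::tuple so that the
// sketch builds with a plain host compiler.
struct custom_key_decomposer
{
    std::tuple<const int&, const float&> operator()(const custom_key& key) const
    {
        return std::tie(key.group, key.score);
    }
};

// Host-side stand-in that orders keys by their decomposed members,
// treating the first tuple member as the most significant (assumption).
// This uses std::sort, not a radix sort.
template <class Key, class Decomposer>
void sort_by_decomposer(std::vector<Key>& keys, Decomposer decomposer)
{
    auto less = [&](const Key& lhs, const Key& rhs) {
        return decomposer(lhs) < decomposer(rhs);
    };
    std::sort(keys.begin(), keys.end(), less);
    assert(std::is_sorted(keys.begin(), keys.end(), less));
}
```

Calling sort_by_decomposer(keys, custom_key_decomposer{}) orders the keys by group first and by score within each group. The real overloads take additional arguments that are not described in these notes.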

Changed

  • The NVIDIA backend now requires CUB, Thrust and libcu++ 2.2.0. If they are not found, they are downloaded from the NVIDIA CCCL repository.

Fixed

  • Fixed the accumulator type derivation for device scan algorithms in the rocPRIM backend, which differed from CUB.
    The accumulator type is now derived as the result type of the binary operator.
  • debug_synchronous was deprecated in hipCUB 2.13.2 and no longer has any effect. With this release, passing debug_synchronous
    to the device functions results in a deprecation warning both at runtime and at compile time.
    • The synchronization that was previously achievable by passing debug_synchronous=true can now be achieved at compile time
      by setting the CUB_DEBUG_SYNC (or higher debug level) or the HIPCUB_DEBUG_SYNC preprocessor definition.
    • The compile-time deprecation warnings can be disabled by defining the HIPCUB_IGNORE_DEPRECATED_API preprocessor macro.
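
The accumulator type fix for device scan can be illustrated with a host-only C++17 sketch. It is not the rocPRIM or CUB implementation: accumulator_t, widening_plus and inclusive_scan_reference are hypothetical names, and the exact arguments that the libraries pass to the operator when deducing the type are an assumption here. The point is only that the running value takes the operator's result type rather than the input type.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>
#include <vector>

// Accumulator type taken from what the binary operator returns.
// Assumption: the operator is invoked on two values of the input type.
template <class BinaryOp, class InputT>
using accumulator_t = std::decay_t<std::invoke_result_t<BinaryOp, InputT, InputT>>;

// Hypothetical operator that returns int regardless of the input type.
struct widening_plus
{
    int operator()(int lhs, int rhs) const { return lhs + rhs; }
};

// Sequential inclusive scan whose running value has the accumulator type,
// so narrow inputs do not wrap around when the operator returns a wider type.
template <class InputT, class BinaryOp>
std::vector<accumulator_t<BinaryOp, InputT>>
inclusive_scan_reference(const std::vector<InputT>& input, BinaryOp op)
{
    using AccumT = accumulator_t<BinaryOp, InputT>;
    std::vector<AccumT> output;
    output.reserve(input.size());
    if (input.empty())
    {
        return output;
    }
    AccumT running = input[0];
    output.push_back(running);
    for (std::size_t i = 1; i < input.size(); ++i)
    {
        running = op(running, input[i]);
        output.push_back(running);
    }
    assert(output.size() == input.size());
    return output;
}
```

With unsigned char inputs and widening_plus, the running sum is held in an int, so partial sums above 255 are preserved instead of wrapping.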

hipCUB 3.1.0 for ROCm 6.1.2

04 Jun 16:53
44aa2e0

hipCUB code for ROCm 6.1.2 did not change. The library was rebuilt for the updated ROCm 6.1.2 stack.

hipCUB 3.1.0 for ROCm 6.1.1

08 May 17:59
44aa2e0

hipCUB code for ROCm 6.1.1 did not change. The library was rebuilt for the updated ROCm 6.1.1 stack.

hipCUB 3.1.0 for ROCm 6.1.0

16 Apr 19:09
44aa2e0

Changed

  • The CUB backend references CUB and Thrust version 2.1.0.
  • Updated HIPCUB_HOST_WARP_THREADS macro definition to match host_warp_size changes from rocPRIM 3.0.
  • Implemented __int128_t and __uint128_t support for radix_sort.

Fixed

  • Fixed build issues with rmake.py on Windows when using VS 2017 15.8 or later due to a breaking fix with extended aligned storage.

Added

  • Added interface DeviceMemcpy::Batched for batched memcpy from rocPRIM and CUB.
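
What a batched memcpy computes can be shown with a host-only reference. This is not the DeviceMemcpy::Batched interface, whose parameters are not listed in these notes; batched_memcpy_reference is a hypothetical helper that only shows the semantics: buffer i is copied from source i to destination i with its own size.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Host-side reference for a batched copy: one source pointer, one
// destination pointer and one byte count per buffer.
// Assumption: source and destination ranges do not overlap.
void batched_memcpy_reference(const std::vector<const void*>& sources,
                              const std::vector<void*>&       destinations,
                              const std::vector<std::size_t>& sizes_in_bytes)
{
    assert(sources.size() == destinations.size());
    assert(sources.size() == sizes_in_bytes.size());
    for (std::size_t i = 0; i < sources.size(); ++i)
    {
        std::memcpy(destinations[i], sources[i], sizes_in_bytes[i]);
    }
}
```

A device implementation performs the same copies in parallel, which helps when there are many small buffers of differing sizes.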

hipCUB 3.0.0 for ROCm 6.0.2

31 Jan 20:12
761fccb

hipCUB code for ROCm 6.0.2 did not change. The library was rebuilt for the updated ROCm 6.0.2 stack.

hipCUB 3.0.0 for ROCm 6.0.0

15 Dec 18:30
761fccb

Changed

  • Removed DOWNLOAD_ROCPRIM. To force rocPRIM to be downloaded, use DEPENDENCIES_FORCE_DOWNLOAD.

hipCUB 2.13.1 for ROCm 5.7.1

13 Oct 18:57

hipCUB code for ROCm 5.7.1 did not change. The library was rebuilt for the updated ROCm 5.7.1 stack.

hipCUB 2.13.1 for ROCm 5.7.0

15 Sep 17:29

Changed

  • The CUB backend references CUB and Thrust version 2.0.1.
  • Fixed DeviceSegmentedReduce::ArgMin and DeviceSegmentedReduce::ArgMax by returning the segment-relative index instead of the absolute one.
  • Fixed DeviceSegmentedReduce::ArgMin for inputs where the segment minimum is smaller than the value returned for empty segments. An equivalent fix is applied to DeviceSegmentedReduce::ArgMax.
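
The difference between a segment-relative and an absolute index can be shown with a host-only reference. This is not the hipCUB implementation: segmented_argmin_reference is a hypothetical helper, segments are assumed to be described by an offsets array with one more entry than there are segments, and the result reported for an empty segment is passed in by the caller because these notes do not state what the library returns.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Returns one (index, value) pair per segment, where index counts from the
// start of the segment rather than from the start of the whole input.
// Segment s covers values[offsets[s]] up to, not including, values[offsets[s + 1]].
// empty_result is reported for zero-length segments (caller-chosen placeholder).
std::vector<std::pair<std::size_t, int>>
segmented_argmin_reference(const std::vector<int>&         values,
                           const std::vector<std::size_t>& offsets,
                           std::pair<std::size_t, int>     empty_result)
{
    std::vector<std::pair<std::size_t, int>> results;
    for (std::size_t s = 0; s + 1 < offsets.size(); ++s)
    {
        const std::size_t begin = offsets[s];
        const std::size_t end   = offsets[s + 1];
        assert(begin <= end && end <= values.size());
        if (begin == end)
        {
            results.push_back(empty_result);
            continue;
        }
        std::size_t best = begin;
        for (std::size_t i = begin + 1; i < end; ++i)
        {
            if (values[i] < values[best])
            {
                best = i;
            }
        }
        // Subtract the segment start to make the index segment-relative.
        results.push_back(std::make_pair(best - begin, values[best]));
    }
    return results;
}
```

For values {5, 3, 8, 7, 1} and offsets {0, 3, 3, 5}, the last segment's minimum sits at absolute position 4 but is reported with index 1, its position within the segment.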

Known Issues

  • debug_synchronous no longer works on the CUDA platform. CUB_DEBUG_SYNC should be used to enable those checks.
  • DeviceReduce::Sum does not compile on the CUDA platform for mixed extended-floating-point/floating-point InputT and OutputT types.
  • DeviceHistogram::HistogramEven fails on the CUDA platform for [LevelT, SampleIteratorT] = [int, int].
  • DeviceHistogram::MultiHistogramEven fails on the CUDA platform for [LevelT, SampleIteratorT] = [int, int/unsigned short/float/double] and [LevelT, SampleIteratorT] = [float, double].