Conversation

@elstehle
Contributor

Description

Closes #3407

@elstehle elstehle requested a review from a team as a code owner January 22, 2025 05:20
@elstehle elstehle requested a review from gevtushenko January 22, 2025 05:20
@github-actions
Contributor

🟨 CI finished in 2h 27m: Pass: 97%/78 | Total: 2d 06h | Avg: 41m 37s | Max: 1h 13m | Hits: 156%/12720
  • 🟨 cccl_c_parallel: Pass: 50%/2 | Total: 10m 16s | Avg: 5m 08s | Max: 8m 04s

    🚨 jobs: Test 🚨
      🟩 Build              Pass: 100%/1   | Total:  2m 12s | Avg:  2m 12s | Max:  2m 12s
      🔥 Test               Pass:   0%/1   | Total:  8m 04s | Avg:  8m 04s | Max:  8m 04s
    🟨 cpu
      🟨 amd64              Pass:  50%/2   | Total: 10m 16s | Avg:  5m 08s | Max:  8m 04s
    🟨 ctk
      🟨 12.6               Pass:  50%/2   | Total: 10m 16s | Avg:  5m 08s | Max:  8m 04s
    🟨 cudacxx
      🟨 nvcc12.6           Pass:  50%/2   | Total: 10m 16s | Avg:  5m 08s | Max:  8m 04s
    🟨 cudacxx_family
      🟨 nvcc               Pass:  50%/2   | Total: 10m 16s | Avg:  5m 08s | Max:  8m 04s
    🟨 cxx
      🟨 GCC13              Pass:  50%/2   | Total: 10m 16s | Avg:  5m 08s | Max:  8m 04s
    🟨 cxx_family
      🟨 GCC                Pass:  50%/2   | Total: 10m 16s | Avg:  5m 08s | Max:  8m 04s
    🟨 gpu
      🟨 v100               Pass:  50%/2   | Total: 10m 16s | Avg:  5m 08s | Max:  8m 04s
    
  • 🟥 python: Pass: 0%/1 | Total: 8m 18s | Avg: 8m 18s | Max: 8m 18s

    🟥 cpu
      🟥 amd64              Pass:   0%/1   | Total:  8m 18s | Avg:  8m 18s | Max:  8m 18s
    🟥 ctk
      🟥 12.6               Pass:   0%/1   | Total:  8m 18s | Avg:  8m 18s | Max:  8m 18s
    🟥 cudacxx
      🟥 nvcc12.6           Pass:   0%/1   | Total:  8m 18s | Avg:  8m 18s | Max:  8m 18s
    🟥 cudacxx_family
      🟥 nvcc               Pass:   0%/1   | Total:  8m 18s | Avg:  8m 18s | Max:  8m 18s
    🟥 cxx
      🟥 GCC13              Pass:   0%/1   | Total:  8m 18s | Avg:  8m 18s | Max:  8m 18s
    🟥 cxx_family
      🟥 GCC                Pass:   0%/1   | Total:  8m 18s | Avg:  8m 18s | Max:  8m 18s
    🟥 gpu
      🟥 v100               Pass:   0%/1   | Total:  8m 18s | Avg:  8m 18s | Max:  8m 18s
    🟥 jobs
      🟥 Test               Pass:   0%/1   | Total:  8m 18s | Avg:  8m 18s | Max:  8m 18s
    
  • 🟩 cub: Pass: 100%/38 | Total: 1d 09h | Avg: 52m 22s | Max: 1h 13m | Hits: 160%/3540

    🟩 cpu
      🟩 amd64              Pass: 100%/36  | Total:  1d 07h | Avg: 51m 52s | Max:  1h 13m | Hits: 160%/3540  
      🟩 arm64              Pass: 100%/2   | Total:  2h 02m | Avg:  1h 01m | Max:  1h 02m
    🟩 ctk
      🟩 12.0               Pass: 100%/5   | Total:  5h 03m | Avg:  1h 00m | Max:  1h 11m | Hits: 158%/885   
      🟩 12.5               Pass: 100%/2   | Total:  2h 19m | Avg:  1h 09m | Max:  1h 13m
      🟩 12.6               Pass: 100%/31  | Total:  1d 01h | Avg: 49m 53s | Max:  1h 11m | Hits: 161%/2655  
    🟩 cudacxx
      🟩 ClangCUDA18        Pass: 100%/2   | Total:  2h 07m | Avg:  1h 03m | Max:  1h 04m
      🟩 nvcc12.0           Pass: 100%/5   | Total:  5h 03m | Avg:  1h 00m | Max:  1h 11m | Hits: 158%/885   
      🟩 nvcc12.5           Pass: 100%/2   | Total:  2h 19m | Avg:  1h 09m | Max:  1h 13m
      🟩 nvcc12.6           Pass: 100%/29  | Total: 23h 38m | Avg: 48m 55s | Max:  1h 11m | Hits: 161%/2655  
    🟩 cudacxx_family
      🟩 ClangCUDA          Pass: 100%/2   | Total:  2h 07m | Avg:  1h 03m | Max:  1h 04m
      🟩 nvcc               Pass: 100%/36  | Total:  1d 07h | Avg: 51m 43s | Max:  1h 13m | Hits: 160%/3540  
    🟩 cxx
      🟩 Clang14            Pass: 100%/4   | Total:  3h 48m | Avg: 57m 09s | Max:  1h 00m
      🟩 Clang15            Pass: 100%/1   | Total: 58m 01s | Avg: 58m 01s | Max: 58m 01s
      🟩 Clang16            Pass: 100%/1   | Total: 56m 26s | Avg: 56m 26s | Max: 56m 26s
      🟩 Clang17            Pass: 100%/1   | Total:  1h 01m | Avg:  1h 01m | Max:  1h 01m
      🟩 Clang18            Pass: 100%/7   | Total:  5h 50m | Avg: 50m 07s | Max:  1h 04m
      🟩 GCC7               Pass: 100%/2   | Total:  1h 57m | Avg: 58m 41s | Max: 59m 47s
      🟩 GCC8               Pass: 100%/1   | Total:  1h 01m | Avg:  1h 01m | Max:  1h 01m
      🟩 GCC9               Pass: 100%/2   | Total:  1h 55m | Avg: 57m 32s | Max: 58m 57s
      🟩 GCC10              Pass: 100%/1   | Total: 56m 37s | Avg: 56m 37s | Max: 56m 37s
      🟩 GCC11              Pass: 100%/1   | Total:  1h 00m | Avg:  1h 00m | Max:  1h 00m
      🟩 GCC12              Pass: 100%/3   | Total:  1h 46m | Avg: 35m 20s | Max: 59m 13s
      🟩 GCC13              Pass: 100%/8   | Total:  4h 56m | Avg: 37m 00s | Max:  1h 02m
      🟩 MSVC14.29          Pass: 100%/2   | Total:  2h 21m | Avg:  1h 10m | Max:  1h 11m | Hits: 164%/1770  
      🟩 MSVC14.39          Pass: 100%/2   | Total:  2h 20m | Avg:  1h 10m | Max:  1h 11m | Hits: 157%/1770  
      🟩 NVHPC24.7          Pass: 100%/2   | Total:  2h 19m | Avg:  1h 09m | Max:  1h 13m
    🟩 cxx_family
      🟩 Clang              Pass: 100%/14  | Total: 12h 35m | Avg: 53m 57s | Max:  1h 04m
      🟩 GCC                Pass: 100%/18  | Total: 13h 33m | Avg: 45m 11s | Max:  1h 02m
      🟩 MSVC               Pass: 100%/4   | Total:  4h 41m | Avg:  1h 10m | Max:  1h 11m | Hits: 160%/3540  
      🟩 NVHPC              Pass: 100%/2   | Total:  2h 19m | Avg:  1h 09m | Max:  1h 13m
    🟩 gpu
      🟩 h100               Pass: 100%/2   | Total: 46m 47s | Avg: 23m 23s | Max: 27m 14s
      🟩 v100               Pass: 100%/36  | Total:  1d 08h | Avg: 53m 59s | Max:  1h 13m | Hits: 160%/3540  
    🟩 jobs
      🟩 Build              Pass: 100%/31  | Total:  1d 06h | Avg: 59m 13s | Max:  1h 13m | Hits: 160%/3540  
      🟩 DeviceLaunch       Pass: 100%/1   | Total: 23m 17s | Avg: 23m 17s | Max: 23m 17s
      🟩 GraphCapture       Pass: 100%/1   | Total: 15m 20s | Avg: 15m 20s | Max: 15m 20s
      🟩 HostLaunch         Pass: 100%/3   | Total:  1h 04m | Avg: 21m 37s | Max: 23m 59s
      🟩 TestGPU            Pass: 100%/2   | Total: 50m 55s | Avg: 25m 27s | Max: 25m 52s
    🟩 sm
      🟩 90                 Pass: 100%/2   | Total: 46m 47s | Avg: 23m 23s | Max: 27m 14s
      🟩 90a                Pass: 100%/1   | Total: 23m 56s | Avg: 23m 56s | Max: 23m 56s
    🟩 std
      🟩 17                 Pass: 100%/14  | Total: 14h 28m | Avg:  1h 02m | Max:  1h 11m | Hits: 162%/2655  
      🟩 20                 Pass: 100%/24  | Total: 18h 41m | Avg: 46m 44s | Max:  1h 13m | Hits: 156%/885   
    
  • 🟩 thrust: Pass: 100%/37 | Total: 20h 38m | Avg: 33m 27s | Max: 1h 07m | Hits: 155%/9180

    🟩 cmake_options
      🟩 -DTHRUST_DISPATCH_TYPE=Force32bit Pass: 100%/2   | Total: 36m 43s | Avg: 18m 21s | Max: 25m 26s
    🟩 cpu
      🟩 amd64              Pass: 100%/35  | Total: 19h 39m | Avg: 33m 41s | Max:  1h 07m | Hits: 155%/9180  
      🟩 arm64              Pass: 100%/2   | Total: 58m 43s | Avg: 29m 21s | Max: 31m 31s
    🟩 ctk
      🟩 12.0               Pass: 100%/5   | Total:  3h 15m | Avg: 39m 08s | Max:  1h 04m | Hits:  90%/1836  
      🟩 12.5               Pass: 100%/2   | Total:  1h 53m | Avg: 56m 38s | Max: 58m 32s
      🟩 12.6               Pass: 100%/30  | Total: 15h 29m | Avg: 30m 58s | Max:  1h 07m | Hits: 171%/7344  
    🟩 cudacxx
      🟩 ClangCUDA18        Pass: 100%/2   | Total: 53m 20s | Avg: 26m 40s | Max: 27m 23s
      🟩 nvcc12.0           Pass: 100%/5   | Total:  3h 15m | Avg: 39m 08s | Max:  1h 04m | Hits:  90%/1836  
      🟩 nvcc12.5           Pass: 100%/2   | Total:  1h 53m | Avg: 56m 38s | Max: 58m 32s
      🟩 nvcc12.6           Pass: 100%/28  | Total: 14h 35m | Avg: 31m 16s | Max:  1h 07m | Hits: 171%/7344  
    🟩 cudacxx_family
      🟩 ClangCUDA          Pass: 100%/2   | Total: 53m 20s | Avg: 26m 40s | Max: 27m 23s
      🟩 nvcc               Pass: 100%/35  | Total: 19h 44m | Avg: 33m 50s | Max:  1h 07m | Hits: 155%/9180  
    🟩 cxx
      🟩 Clang14            Pass: 100%/4   | Total:  2h 03m | Avg: 30m 56s | Max: 32m 29s
      🟩 Clang15            Pass: 100%/1   | Total: 30m 49s | Avg: 30m 49s | Max: 30m 49s
      🟩 Clang16            Pass: 100%/1   | Total: 32m 21s | Avg: 32m 21s | Max: 32m 21s
      🟩 Clang17            Pass: 100%/1   | Total: 34m 12s | Avg: 34m 12s | Max: 34m 12s
      🟩 Clang18            Pass: 100%/7   | Total:  2h 44m | Avg: 23m 29s | Max: 33m 09s
      🟩 GCC7               Pass: 100%/2   | Total:  1h 08m | Avg: 34m 01s | Max: 34m 42s
      🟩 GCC8               Pass: 100%/1   | Total: 32m 17s | Avg: 32m 17s | Max: 32m 17s
      🟩 GCC9               Pass: 100%/2   | Total:  1h 05m | Avg: 32m 48s | Max: 34m 41s
      🟩 GCC10              Pass: 100%/1   | Total: 31m 20s | Avg: 31m 20s | Max: 31m 20s
      🟩 GCC11              Pass: 100%/1   | Total: 33m 23s | Avg: 33m 23s | Max: 33m 23s
      🟩 GCC12              Pass: 100%/1   | Total: 35m 27s | Avg: 35m 27s | Max: 35m 27s
      🟩 GCC13              Pass: 100%/8   | Total:  2h 57m | Avg: 22m 10s | Max: 37m 11s
      🟩 MSVC14.29          Pass: 100%/2   | Total:  2h 08m | Avg:  1h 04m | Max:  1h 04m | Hits:  92%/3672  
      🟩 MSVC14.39          Pass: 100%/3   | Total:  2h 47m | Avg: 55m 54s | Max:  1h 07m | Hits: 196%/5508  
      🟩 NVHPC24.7          Pass: 100%/2   | Total:  1h 53m | Avg: 56m 38s | Max: 58m 32s
    🟩 cxx_family
      🟩 Clang              Pass: 100%/14  | Total:  6h 25m | Avg: 27m 32s | Max: 34m 12s
      🟩 GCC                Pass: 100%/16  | Total:  7h 23m | Avg: 27m 43s | Max: 37m 11s
      🟩 MSVC               Pass: 100%/5   | Total:  4h 55m | Avg: 59m 09s | Max:  1h 07m | Hits: 155%/9180  
      🟩 NVHPC              Pass: 100%/2   | Total:  1h 53m | Avg: 56m 38s | Max: 58m 32s
    🟩 gpu
      🟩 v100               Pass: 100%/37  | Total: 20h 38m | Avg: 33m 27s | Max:  1h 07m | Hits: 155%/9180  
    🟩 jobs
      🟩 Build              Pass: 100%/31  | Total: 19h 09m | Avg: 37m 05s | Max:  1h 07m | Hits: 102%/7344  
      🟩 TestCPU            Pass: 100%/3   | Total: 51m 38s | Avg: 17m 12s | Max: 36m 09s | Hits: 365%/1836  
      🟩 TestGPU            Pass: 100%/3   | Total: 36m 32s | Avg: 12m 10s | Max: 13m 33s
    🟩 sm
      🟩 90a                Pass: 100%/1   | Total: 18m 28s | Avg: 18m 28s | Max: 18m 28s
    🟩 std
      🟩 17                 Pass: 100%/14  | Total:  9h 29m | Avg: 40m 42s | Max:  1h 04m | Hits: 106%/5508  
      🟩 20                 Pass: 100%/21  | Total: 10h 31m | Avg: 30m 04s | Max:  1h 07m | Hits: 228%/3672  
    

👃 Inspect Changes

Modifications in project?

Project
CCCL Infrastructure
libcu++
+/- CUB
Thrust
CUDA Experimental
python
CCCL C Parallel Library
Catch2Helper

Modifications in project or dependencies?

Project
CCCL Infrastructure
libcu++
+/- CUB
+/- Thrust
CUDA Experimental
+/- python
+/- CCCL C Parallel Library
+/- Catch2Helper

🏃‍ Runner counts (total jobs: 78)

# Runner
53 linux-amd64-cpu16
11 linux-amd64-gpu-v100-latest-1
9 windows-amd64-cpu16
4 linux-arm64-cpu16
1 linux-amd64-gpu-h100-latest-1-testing

@bernhardmgruber bernhardmgruber added cub For all items related to CUB breaking Breaking change labels Jan 22, 2025
@elstehle elstehle force-pushed the enh/kernels-to-detail-ns branch from e4dcb79 to f0320dc Compare January 22, 2025 12:11
@elstehle elstehle requested a review from a team as a code owner January 22, 2025 12:11
@github-actions
Contributor

🟩 CI finished in 2h 51m: Pass: 100%/78 | Total: 2d 08h | Avg: 43m 44s | Max: 1h 41m | Hits: 217%/12720
  • 🟩 cub: Pass: 100%/38 | Total: 1d 11h | Avg: 55m 55s | Max: 1h 41m | Hits: 232%/3540

    🟩 cpu
      🟩 amd64              Pass: 100%/36  | Total:  1d 09h | Avg: 55m 41s | Max:  1h 41m | Hits: 232%/3540  
      🟩 arm64              Pass: 100%/2   | Total:  2h 00m | Avg:  1h 00m | Max:  1h 01m
    🟩 ctk
      🟩 12.0               Pass: 100%/5   | Total:  4h 48m | Avg: 57m 45s | Max:  1h 03m | Hits: 232%/885   
      🟩 12.5               Pass: 100%/2   | Total:  2h 16m | Avg:  1h 08m | Max:  1h 12m
      🟩 12.6               Pass: 100%/31  | Total:  1d 04h | Avg: 54m 50s | Max:  1h 41m | Hits: 232%/2655  
    🟩 cudacxx
      🟩 ClangCUDA18        Pass: 100%/2   | Total:  1h 59m | Avg: 59m 43s | Max:  1h 02m
      🟩 nvcc12.0           Pass: 100%/5   | Total:  4h 48m | Avg: 57m 45s | Max:  1h 03m | Hits: 232%/885   
      🟩 nvcc12.5           Pass: 100%/2   | Total:  2h 16m | Avg:  1h 08m | Max:  1h 12m
      🟩 nvcc12.6           Pass: 100%/29  | Total:  1d 02h | Avg: 54m 30s | Max:  1h 41m | Hits: 232%/2655  
    🟩 cudacxx_family
      🟩 ClangCUDA          Pass: 100%/2   | Total:  1h 59m | Avg: 59m 43s | Max:  1h 02m
      🟩 nvcc               Pass: 100%/36  | Total:  1d 09h | Avg: 55m 43s | Max:  1h 41m | Hits: 232%/3540  
    🟩 cxx
      🟩 Clang14            Pass: 100%/4   | Total:  3h 53m | Avg: 58m 17s | Max:  1h 00m
      🟩 Clang15            Pass: 100%/1   | Total: 54m 22s | Avg: 54m 22s | Max: 54m 22s
      🟩 Clang16            Pass: 100%/1   | Total: 55m 33s | Avg: 55m 33s | Max: 55m 33s
      🟩 Clang17            Pass: 100%/1   | Total: 56m 39s | Avg: 56m 39s | Max: 56m 39s
      🟩 Clang18            Pass: 100%/7   | Total:  5h 39m | Avg: 48m 30s | Max:  1h 02m
      🟩 GCC7               Pass: 100%/2   | Total:  1h 49m | Avg: 54m 57s | Max: 55m 05s
      🟩 GCC8               Pass: 100%/1   | Total: 53m 36s | Avg: 53m 36s | Max: 53m 36s
      🟩 GCC9               Pass: 100%/2   | Total:  1h 52m | Avg: 56m 14s | Max: 56m 51s
      🟩 GCC10              Pass: 100%/1   | Total:  1h 04m | Avg:  1h 04m | Max:  1h 04m
      🟩 GCC11              Pass: 100%/1   | Total: 59m 17s | Avg: 59m 17s | Max: 59m 17s
      🟩 GCC12              Pass: 100%/3   | Total:  1h 42m | Avg: 34m 18s | Max: 57m 49s
      🟩 GCC13              Pass: 100%/8   | Total:  7h 59m | Avg: 59m 55s | Max:  1h 41m
      🟩 MSVC14.29          Pass: 100%/2   | Total:  2h 08m | Avg:  1h 04m | Max:  1h 04m | Hits: 232%/1770  
      🟩 MSVC14.39          Pass: 100%/2   | Total:  2h 19m | Avg:  1h 09m | Max:  1h 11m | Hits: 231%/1770  
      🟩 NVHPC24.7          Pass: 100%/2   | Total:  2h 16m | Avg:  1h 08m | Max:  1h 12m
    🟩 cxx_family
      🟩 Clang              Pass: 100%/14  | Total: 12h 19m | Avg: 52m 48s | Max:  1h 02m
      🟩 GCC                Pass: 100%/18  | Total: 16h 22m | Avg: 54m 33s | Max:  1h 41m
      🟩 MSVC               Pass: 100%/4   | Total:  4h 27m | Avg:  1h 06m | Max:  1h 11m | Hits: 232%/3540  
      🟩 NVHPC              Pass: 100%/2   | Total:  2h 16m | Avg:  1h 08m | Max:  1h 12m
    🟩 gpu
      🟩 h100               Pass: 100%/2   | Total: 45m 07s | Avg: 22m 33s | Max: 25m 38s
      🟩 v100               Pass: 100%/36  | Total:  1d 10h | Avg: 57m 47s | Max:  1h 41m | Hits: 232%/3540  
    🟩 jobs
      🟩 Build              Pass: 100%/31  | Total:  1d 05h | Avg: 57m 47s | Max:  1h 12m | Hits: 232%/3540  
      🟩 DeviceLaunch       Pass: 100%/1   | Total:  1h 33m | Avg:  1h 33m | Max:  1h 33m
      🟩 GraphCapture       Pass: 100%/1   | Total: 35m 18s | Avg: 35m 18s | Max: 35m 18s
      🟩 HostLaunch         Pass: 100%/3   | Total:  2h 19m | Avg: 46m 39s | Max:  1h 41m
      🟩 TestGPU            Pass: 100%/2   | Total:  1h 04m | Avg: 32m 24s | Max: 40m 10s
    🟩 sm
      🟩 90                 Pass: 100%/2   | Total: 45m 07s | Avg: 22m 33s | Max: 25m 38s
      🟩 90a                Pass: 100%/1   | Total: 25m 11s | Avg: 25m 11s | Max: 25m 11s
    🟩 std
      🟩 17                 Pass: 100%/14  | Total: 13h 51m | Avg: 59m 21s | Max:  1h 08m | Hits: 232%/2655  
      🟩 20                 Pass: 100%/24  | Total: 21h 34m | Avg: 53m 55s | Max:  1h 41m | Hits: 231%/885   
    
  • 🟩 thrust: Pass: 100%/37 | Total: 20h 15m | Avg: 32m 51s | Max: 1h 04m | Hits: 212%/9180

    🟩 cmake_options
      🟩 -DTHRUST_DISPATCH_TYPE=Force32bit Pass: 100%/2   | Total: 41m 46s | Avg: 20m 53s | Max: 28m 15s
    🟩 cpu
      🟩 amd64              Pass: 100%/35  | Total: 19h 17m | Avg: 33m 04s | Max:  1h 04m | Hits: 212%/9180  
      🟩 arm64              Pass: 100%/2   | Total: 57m 59s | Avg: 28m 59s | Max: 30m 38s
    🟩 ctk
      🟩 12.0               Pass: 100%/5   | Total:  2h 54m | Avg: 34m 49s | Max: 54m 12s | Hits: 173%/1836  
      🟩 12.5               Pass: 100%/2   | Total:  1h 51m | Avg: 55m 49s | Max: 57m 11s
      🟩 12.6               Pass: 100%/30  | Total: 15h 29m | Avg: 30m 59s | Max:  1h 04m | Hits: 221%/7344  
    🟩 cudacxx
      🟩 ClangCUDA18        Pass: 100%/2   | Total: 54m 03s | Avg: 27m 01s | Max: 28m 09s
      🟩 nvcc12.0           Pass: 100%/5   | Total:  2h 54m | Avg: 34m 49s | Max: 54m 12s | Hits: 173%/1836  
      🟩 nvcc12.5           Pass: 100%/2   | Total:  1h 51m | Avg: 55m 49s | Max: 57m 11s
      🟩 nvcc12.6           Pass: 100%/28  | Total: 14h 35m | Avg: 31m 16s | Max:  1h 04m | Hits: 221%/7344  
    🟩 cudacxx_family
      🟩 ClangCUDA          Pass: 100%/2   | Total: 54m 03s | Avg: 27m 01s | Max: 28m 09s
      🟩 nvcc               Pass: 100%/35  | Total: 19h 21m | Avg: 33m 11s | Max:  1h 04m | Hits: 212%/9180  
    🟩 cxx
      🟩 Clang14            Pass: 100%/4   | Total:  2h 01m | Avg: 30m 20s | Max: 32m 59s
      🟩 Clang15            Pass: 100%/1   | Total: 30m 34s | Avg: 30m 34s | Max: 30m 34s
      🟩 Clang16            Pass: 100%/1   | Total: 31m 11s | Avg: 31m 11s | Max: 31m 11s
      🟩 Clang17            Pass: 100%/1   | Total: 29m 57s | Avg: 29m 57s | Max: 29m 57s
      🟩 Clang18            Pass: 100%/7   | Total:  2h 52m | Avg: 24m 35s | Max: 32m 05s
      🟩 GCC7               Pass: 100%/2   | Total:  1h 01m | Avg: 30m 58s | Max: 32m 06s
      🟩 GCC8               Pass: 100%/1   | Total: 33m 12s | Avg: 33m 12s | Max: 33m 12s
      🟩 GCC9               Pass: 100%/2   | Total:  1h 03m | Avg: 31m 53s | Max: 32m 28s
      🟩 GCC10              Pass: 100%/1   | Total: 31m 05s | Avg: 31m 05s | Max: 31m 05s
      🟩 GCC11              Pass: 100%/1   | Total: 34m 57s | Avg: 34m 57s | Max: 34m 57s
      🟩 GCC12              Pass: 100%/1   | Total: 35m 58s | Avg: 35m 58s | Max: 35m 58s
      🟩 GCC13              Pass: 100%/8   | Total:  3h 02m | Avg: 22m 49s | Max: 34m 46s
      🟩 MSVC14.29          Pass: 100%/2   | Total:  1h 51m | Avg: 55m 39s | Max: 57m 06s | Hits: 173%/3672  
      🟩 MSVC14.39          Pass: 100%/3   | Total:  2h 43m | Avg: 54m 39s | Max:  1h 04m | Hits: 237%/5508  
      🟩 NVHPC24.7          Pass: 100%/2   | Total:  1h 51m | Avg: 55m 49s | Max: 57m 11s
    🟩 cxx_family
      🟩 Clang              Pass: 100%/14  | Total:  6h 25m | Avg: 27m 30s | Max: 32m 59s
      🟩 GCC                Pass: 100%/16  | Total:  7h 23m | Avg: 27m 43s | Max: 35m 58s
      🟩 MSVC               Pass: 100%/5   | Total:  4h 35m | Avg: 55m 03s | Max:  1h 04m | Hits: 212%/9180  
      🟩 NVHPC              Pass: 100%/2   | Total:  1h 51m | Avg: 55m 49s | Max: 57m 11s
    🟩 gpu
      🟩 v100               Pass: 100%/37  | Total: 20h 15m | Avg: 32m 51s | Max:  1h 04m | Hits: 212%/9180  
    🟩 jobs
      🟩 Build              Pass: 100%/31  | Total: 18h 32m | Avg: 35m 54s | Max:  1h 04m | Hits: 173%/7344  
      🟩 TestCPU            Pass: 100%/3   | Total: 55m 31s | Avg: 18m 30s | Max: 39m 22s | Hits: 365%/1836  
      🟩 TestGPU            Pass: 100%/3   | Total: 47m 06s | Avg: 15m 42s | Max: 20m 05s
    🟩 sm
      🟩 90a                Pass: 100%/1   | Total: 19m 57s | Avg: 19m 57s | Max: 19m 57s
    🟩 std
      🟩 17                 Pass: 100%/14  | Total:  8h 55m | Avg: 38m 13s | Max: 59m 48s | Hits: 173%/5508  
      🟩 20                 Pass: 100%/21  | Total: 10h 38m | Avg: 30m 24s | Max:  1h 04m | Hits: 269%/3672  
    
  • 🟩 cccl_c_parallel: Pass: 100%/2 | Total: 12m 34s | Avg: 6m 17s | Max: 10m 15s

    🟩 cpu
      🟩 amd64              Pass: 100%/2   | Total: 12m 34s | Avg:  6m 17s | Max: 10m 15s
    🟩 ctk
      🟩 12.6               Pass: 100%/2   | Total: 12m 34s | Avg:  6m 17s | Max: 10m 15s
    🟩 cudacxx
      🟩 nvcc12.6           Pass: 100%/2   | Total: 12m 34s | Avg:  6m 17s | Max: 10m 15s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/2   | Total: 12m 34s | Avg:  6m 17s | Max: 10m 15s
    🟩 cxx
      🟩 GCC13              Pass: 100%/2   | Total: 12m 34s | Avg:  6m 17s | Max: 10m 15s
    🟩 cxx_family
      🟩 GCC                Pass: 100%/2   | Total: 12m 34s | Avg:  6m 17s | Max: 10m 15s
    🟩 gpu
      🟩 v100               Pass: 100%/2   | Total: 12m 34s | Avg:  6m 17s | Max: 10m 15s
    🟩 jobs
      🟩 Build              Pass: 100%/1   | Total:  2m 19s | Avg:  2m 19s | Max:  2m 19s
      🟩 Test               Pass: 100%/1   | Total: 10m 15s | Avg: 10m 15s | Max: 10m 15s
    
  • 🟩 python: Pass: 100%/1 | Total: 58m 50s | Avg: 58m 50s | Max: 58m 50s

    🟩 cpu
      🟩 amd64              Pass: 100%/1   | Total: 58m 50s | Avg: 58m 50s | Max: 58m 50s
    🟩 ctk
      🟩 12.6               Pass: 100%/1   | Total: 58m 50s | Avg: 58m 50s | Max: 58m 50s
    🟩 cudacxx
      🟩 nvcc12.6           Pass: 100%/1   | Total: 58m 50s | Avg: 58m 50s | Max: 58m 50s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/1   | Total: 58m 50s | Avg: 58m 50s | Max: 58m 50s
    🟩 cxx
      🟩 GCC13              Pass: 100%/1   | Total: 58m 50s | Avg: 58m 50s | Max: 58m 50s
    🟩 cxx_family
      🟩 GCC                Pass: 100%/1   | Total: 58m 50s | Avg: 58m 50s | Max: 58m 50s
    🟩 gpu
      🟩 v100               Pass: 100%/1   | Total: 58m 50s | Avg: 58m 50s | Max: 58m 50s
    🟩 jobs
      🟩 Test               Pass: 100%/1   | Total: 58m 50s | Avg: 58m 50s | Max: 58m 50s
    

👃 Inspect Changes

Modifications in project?

Project
CCCL Infrastructure
libcu++
+/- CUB
Thrust
CUDA Experimental
python
+/- CCCL C Parallel Library
Catch2Helper

Modifications in project or dependencies?

Project
CCCL Infrastructure
libcu++
+/- CUB
+/- Thrust
CUDA Experimental
+/- python
+/- CCCL C Parallel Library
+/- Catch2Helper

🏃‍ Runner counts (total jobs: 78)

# Runner
53 linux-amd64-cpu16
11 linux-amd64-gpu-v100-latest-1
9 windows-amd64-cpu16
4 linux-arm64-cpu16
1 linux-amd64-gpu-h100-latest-1-testing

Contributor

@rwgk rwgk left a comment

Looks good to me for the changes in c/parallel/src/reduce.cu

@elstehle elstehle merged commit 34f1c69 into NVIDIA:main Jan 22, 2025
90 of 93 checks passed
davebayer pushed a commit to davebayer/cccl that referenced this pull request Jan 22, 2025
* moves emptykernel to detail ns

* second batch

* third batch

* fourth batch

* fixes cuda parallel

* concatenates nested namespaces
davebayer pushed a commit to davebayer/cccl that referenced this pull request Jan 22, 2025
* moves emptykernel to detail ns

* second batch

* third batch

* fourth batch

* fixes cuda parallel

* concatenates nested namespaces
davebayer added a commit to davebayer/cccl that referenced this pull request Jan 22, 2025
update docs

update docs

add `memcmp`, `memmove` and `memchr` implementations

implement tests
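The `memchr` commit above implements the C semantics; a minimal Python sketch of what `memchr` computes (illustrative only — the libcu++ implementation is in C++):

```python
def memchr(buf: bytes, ch: int, n: int):
    """Mirror C memchr: index of the first byte equal to ch in buf[:n], or None."""
    idx = buf[:n].find(bytes([ch]))
    return None if idx == -1 else idx

print(memchr(b"hello", ord("l"), 5))  # 2
print(memchr(b"hello", ord("z"), 5))  # None
```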

Use cuda::std::min/max in Thrust (NVIDIA#3364)

Implement `cuda::std::numeric_limits` for `__half` and `__nv_bfloat16` (NVIDIA#3361)

* implement `cuda::std::numeric_limits` for `__half` and `__nv_bfloat16`

Cleanup util_arch (NVIDIA#2773)

Deprecate thrust::null_type (NVIDIA#3367)

Deprecate cub::DeviceSpmv (NVIDIA#3320)

Fixes: NVIDIA#896

Improves `DeviceSegmentedSort` test run time for large number of items and segments (NVIDIA#3246)

* fixes segment offset generation

* switches to analytical verification

* switches to analytical verification for pairs

* fixes spelling

* adds tests for large number of segments

* fixes narrowing conversion in tests

* addresses review comments

* fixes includes
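The "analytical verification" in the commits above means checking the sorted output against a closed-form expectation instead of running a reference sort over every segment. A library-agnostic sketch of the idea (the generator and names here are hypothetical, not the test's actual code):

```python
def make_segment(seg_len):
    # Generate keys in reverse order, so the sorted segment is known
    # analytically to be 0..seg_len-1 without sorting a reference copy.
    return list(range(seg_len - 1, -1, -1))

def expected_sorted(seg_len):
    return list(range(seg_len))

seg = make_segment(5)
result = sorted(seg)  # stands in for DeviceSegmentedSort's output
print(result)  # [0, 1, 2, 3, 4]
```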

Compile basic infra test with C++17 (NVIDIA#3377)

Adds support for large number of items and large number of segments to `DeviceSegmentedSort` (NVIDIA#3308)

* fixes segment offset generation

* switches to analytical verification

* switches to analytical verification for pairs

* addresses review comments

* introduces segment offset type

* adds tests for large number of segments

* adds support for large number of segments

* drops segment offset type

* fixes thrust namespace

* removes about-to-be-deprecated cub iterators

* no exec specifier on defaulted ctor

* fixes gcc7 linker error

* uses local_segment_index_t throughout

* determine offset type based on type returned by segment iterator begin/end iterators

* minor style improvements

Exit with error when RAPIDS CI fails. (NVIDIA#3385)

cuda.parallel: Support structured types as algorithm inputs (NVIDIA#3218)

* Introduce gpu_struct decorator and typing

* Enable `reduce` to accept arrays of structs as inputs

* Add test for reducing arrays-of-struct

* Update documentation

* Use a numpy array rather than ctypes object

* Change zeros -> empty for output array and temp storage

* Add a TODO for typing GpuStruct

* Documentation updates

* Remove test_reduce_struct_type from test_reduce.py

* Revert to `to_cccl_value()` accepting ndarray + GpuStruct

* Bump copyrights

---------

Co-authored-by: Ashwin Srinath <shwina@users.noreply.github.com>
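The structured-input support above can be illustrated outside cuda.parallel with a plain NumPy sketch — the `pixel` dtype and `min_op` below are hypothetical stand-ins, not the library's `gpu_struct` API:

```python
import numpy as np

# A struct type sketched as a NumPy structured dtype (cuda.parallel expresses
# the same idea with a gpu_struct-decorated class).
pixel = np.dtype([("r", np.int32), ("g", np.int32), ("b", np.int32)])

def min_op(a, b):
    # Field-wise minimum across the struct's members.
    out = np.zeros((), dtype=pixel)
    for name in pixel.names:
        out[name] = min(a[name], b[name])
    return out

data = np.array([(3, 9, 4), (1, 7, 8), (2, 5, 6)], dtype=pixel)

acc = data[0]
for item in data[1:]:
    acc = min_op(acc, item)

print(acc["r"], acc["g"], acc["b"])  # 1 5 4
```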

Deprecate thrust::async (NVIDIA#3324)

Fixes: NVIDIA#100

Review/Deprecate CUB `util.ptx` for CCCL 2.x (NVIDIA#3342)

Fix broken `_CCCL_BUILTIN_ASSUME` macro (NVIDIA#3314)

* add compiler-specific path
* fix device code path
* add _CCC_ASSUME

Deprecate thrust::numeric_limits (NVIDIA#3366)

Replace `typedef` with `using` in libcu++ (NVIDIA#3368)

Deprecate thrust::optional (NVIDIA#3307)

Fixes: NVIDIA#3306

Upgrade to Catch2 3.8  (NVIDIA#3310)

Fixes: NVIDIA#1724

refactor `<cuda/std/cstdint>` (NVIDIA#3325)

Co-authored-by: Bernhard Manfred Gruber <bernhardmgruber@gmail.com>

Update CODEOWNERS (NVIDIA#3331)

* Update CODEOWNERS

* Update CODEOWNERS

* Update CODEOWNERS

* [pre-commit.ci] auto code formatting

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

Fix sign-compare warning (NVIDIA#3408)

Implement more cmath functions to be usable on host and device (NVIDIA#3382)

* Implement more cmath functions to be usable on host and device

* Implement math roots functions

* Implement exponential functions

Redefine and deprecate thrust::remove_cvref (NVIDIA#3394)

* Redefine and deprecate thrust::remove_cvref

Co-authored-by: Michael Schellenberger Costa <miscco@nvidia.com>

Fix assert definition for NVHPC due to constexpr issues (NVIDIA#3418)

NVHPC cannot decide at compile time where the code would run, so _CCCL_ASSERT within a constexpr function breaks it.

Fix this by always using the host definition which should also work on device.

Fixes NVIDIA#3411

Extend CUB reduce benchmarks (NVIDIA#3401)

* Rename max.cu to custom.cu, since it uses a custom operator
* Extend types covered by min.cu to all fundamental types
* Add some notes on how to collect tuning parameters

Fixes: NVIDIA#3283

Update upload-pages-artifact to v3 (NVIDIA#3423)

* Update upload-pages-artifact to v3

* Empty commit

---------

Co-authored-by: Ashwin Srinath <shwina@users.noreply.github.com>

Replace and deprecate thrust::cuda_cub::terminate (NVIDIA#3421)

`std::linalg` accessors and `transposed_layout` (NVIDIA#2962)

Add round up/down to multiple (NVIDIA#3234)
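The round-to-multiple utilities added above can be sketched for positive integers — function names here are illustrative, not the CCCL signatures:

```python
def round_up(x: int, multiple: int) -> int:
    """Smallest multiple of `multiple` that is >= x (positive ints assumed)."""
    return (x + multiple - 1) // multiple * multiple

def round_down(x: int, multiple: int) -> int:
    """Largest multiple of `multiple` that is <= x (positive ints assumed)."""
    return x // multiple * multiple

print(round_up(13, 4), round_down(13, 4))  # 16 12
```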

[FEA]: Introduce Python module with CCCL headers (NVIDIA#3201)

* Add cccl/python/cuda_cccl directory and use from cuda_parallel, cuda_cooperative

* Run `copy_cccl_headers_to_aude_include()` before `setup()`

* Create python/cuda_cccl/cuda/_include/__init__.py, then simply import cuda._include to find the include path.

* Add cuda.cccl._version exactly as for cuda.cooperative and cuda.parallel

* Bug fix: cuda/_include only exists after shutil.copytree() ran.

* Use `f"cuda-cccl @ file://{cccl_path}/python/cuda_cccl"` in setup.py

* Remove CustomBuildCommand, CustomWheelBuild in cuda_parallel/setup.py (they are equivalent to the default functions)

* Replace := operator (needs Python 3.8+)

* Fix oversights: remove `pip3 install ./cuda_cccl` lines from README.md

* Restore original README.md: `pip3 install -e` now works on first pass.

* cuda_cccl/README.md: FOR INTERNAL USE ONLY

* Remove `$pymajor.$pyminor.` prefix in cuda_cccl _version.py (as suggested under NVIDIA#3201 (comment))

Command used: ci/update_version.sh 2 8 0

* Modernize pyproject.toml, setup.py

Trigger for this change:

* NVIDIA#3201 (comment)

* NVIDIA#3201 (comment)

* Install CCCL headers under cuda.cccl.include

Trigger for this change:

* NVIDIA#3201 (comment)

Unexpected accidental discovery: cuda.cooperative unit tests pass without CCCL headers entirely.

* Factor out cuda_cccl/cuda/cccl/include_paths.py

* Reuse cuda_cccl/cuda/cccl/include_paths.py from cuda_cooperative

* Add missing Copyright notice.

* Add missing __init__.py (cuda.cccl)

* Add `"cuda.cccl"` to `autodoc.mock_imports`

* Move cuda.cccl.include_paths into function where it is used. (Attempt to resolve Build and Verify Docs failure.)

* Add # TODO: move this to a module-level import

* Modernize cuda_cooperative/pyproject.toml, setup.py

* Convert cuda_cooperative to use hatchling as build backend.

* Revert "Convert cuda_cooperative to use hatchling as build backend."

This reverts commit 61637d6.

* Move numpy from [build-system] requires -> [project] dependencies

* Move pyproject.toml [project] dependencies -> setup.py install_requires, to be able to use CCCL_PATH

* Remove copy_license() and use license_files=["../../LICENSE"] instead.

* Further modernize cuda_cccl/setup.py to use pathlib

* Trivial simplifications in cuda_cccl/pyproject.toml

* Further simplify cuda_cccl/pyproject.toml, setup.py: remove inconsequential code

* Make cuda_cooperative/pyproject.toml more similar to cuda_cccl/pyproject.toml

* Add taplo-pre-commit to .pre-commit-config.yaml

* taplo-pre-commit auto-fixes

* Use pathlib in cuda_cooperative/setup.py

* CCCL_PYTHON_PATH in cuda_cooperative/setup.py

* Modernize cuda_parallel/pyproject.toml, setup.py

* Use pathlib in cuda_parallel/setup.py

* Add `# TOML lint & format` comment.

* Replace MANIFEST.in with `[tool.setuptools.package-data]` section in pyproject.toml

* Use pathlib in cuda/cccl/include_paths.py

* pre-commit autoupdate (EXCEPT clang-format, which was manually restored)

* Fixes after git merge main

* Resolve warning: AttributeError: '_Reduce' object has no attribute 'build_result'

```
=========================================================================== warnings summary ===========================================================================
tests/test_reduce.py::test_reduce_non_contiguous
  /home/coder/cccl/python/devenv/lib/python3.12/site-packages/_pytest/unraisableexception.py:85: PytestUnraisableExceptionWarning: Exception ignored in: <function _Reduce.__del__ at 0x7bf123139080>

  Traceback (most recent call last):
    File "/home/coder/cccl/python/cuda_parallel/cuda/parallel/experimental/algorithms/reduce.py", line 132, in __del__
      bindings.cccl_device_reduce_cleanup(ctypes.byref(self.build_result))
                                                       ^^^^^^^^^^^^^^^^^
  AttributeError: '_Reduce' object has no attribute 'build_result'

    warnings.warn(pytest.PytestUnraisableExceptionWarning(msg))

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
============================================================= 1 passed, 93 deselected, 1 warning in 0.44s ==============================================================
```
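The `build_result` warning above is the classic partially-constructed-object pitfall: if `__init__` raises before an attribute is set, `__del__` still runs and touches the missing attribute. A minimal, library-agnostic sketch of the guard (the `Reducer` class here is illustrative, not cuda.parallel's actual code):

```python
class Reducer:
    def __init__(self, fail: bool = False):
        if fail:
            raise RuntimeError("build failed before build_result was set")
        self.build_result = object()  # stands in for the CCCL build handle

    def __del__(self):
        # Guard: __del__ may run on a partially constructed object.
        if getattr(self, "build_result", None) is not None:
            pass  # cleanup of self.build_result would go here

try:
    Reducer(fail=True)
except RuntimeError:
    pass  # the guarded __del__ no longer trips an AttributeError

ok = Reducer()
print("ok")
```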

* Move `copy_cccl_headers_to_cuda_cccl_include()` functionality to `class CustomBuildPy`

* Introduce cuda_cooperative/constraints.txt

* Also add cuda_parallel/constraints.txt

* Add `--constraint constraints.txt` in ci/test_python.sh

* Update Copyright dates

* Switch to https://github.com/ComPWA/taplo-pre-commit (the other repo has been archived by the owner on Jul 1, 2024)

For completeness: The other repo took a long time to install into the pre-commit cache; so long it led to timeouts in the CCCL CI.

* Remove unused cuda_parallel jinja2 dependency (noticed by chance).

* Remove constraints.txt files, advertise running `pip install cuda-cccl` first instead.

* Make cuda_cooperative, cuda_parallel testing completely independent.

* Run only test_python.sh [skip-rapids][skip-matx][skip-docs][skip-vdc]

* Try using another runner (because V100 runners seem to be stuck) [skip-rapids][skip-matx][skip-docs][skip-vdc]

* Fix sign-compare warning (NVIDIA#3408) [skip-rapids][skip-matx][skip-docs][skip-vdc]

* Revert "Try using another runner (because V100 runners seem to be stuck) [skip-rapids][skip-matx][skip-docs][skip-vdc]"

This reverts commit ea33a21.

Error message: NVIDIA#3201 (comment)

* Try using A100 runner (because V100 runners still seem to be stuck) [skip-rapids][skip-matx][skip-docs][skip-vdc]

* Also show cuda-cooperative site-packages, cuda-parallel site-packages (after pip install) [skip-rapids][skip-matx][skip-docs][skip-vdc]

* Try using l4 runner (because V100 runners still seem to be stuck) [skip-rapids][skip-matx][skip-docs][skip-vdc]

* Restore original ci/matrix.yaml [skip-rapids]

* Use a `for` loop in test_python.sh to avoid code duplication.

* Run only test_python.sh [skip-rapids][skip-matx][skip-docs][skip-vdc][skip pre-commit.ci]

* Comment out taplo-lint in pre-commit config [skip-rapids][skip-matx][skip-docs][skip-vdc]

* Revert "Run only test_python.sh [skip-rapids][skip-matx][skip-docs][skip-vdc][skip pre-commit.ci]"

This reverts commit ec206fd.

* Implement suggestion by @shwina (NVIDIA#3201 (review))

* Address feedback by @leofang

---------

Co-authored-by: Bernhard Manfred Gruber <bernhardmgruber@gmail.com>

cuda.parallel: Add optional stream argument to reduce_into() (NVIDIA#3348)

* Add optional stream argument to reduce_into()

* Add tests to check for reduce_into() stream behavior

* Move protocol related utils to separate file and rework __cuda_stream__ error messages

* Fix synchronization issue in stream test and add one more invalid stream test case

* Rename cuda stream validation function after removing leading underscore

* Unpack values from __cuda_stream__ instead of indexing

* Fix linting errors

* Handle TypeError when unpacking invalid __cuda_stream__ return

* Use stream to allocate cupy memory in new stream test
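Unpacking the `__cuda_stream__` return value (rather than indexing it) and catching `TypeError` can be sketched roughly as follows; `validate_stream`, `FakeStream`, and the error messages are illustrative, not the actual cuda.parallel implementation:

```python
def validate_stream(stream):
    """Return the raw handle from an object implementing __cuda_stream__."""
    try:
        cuda_stream = stream.__cuda_stream__()
    except AttributeError:
        raise TypeError(f"{stream!r} does not implement __cuda_stream__")
    try:
        # Unpack instead of indexing so any 2-element iterable works, and so
        # a wrong-shaped return is reported as one clear TypeError.
        version, handle = cuda_stream
    except (TypeError, ValueError):
        raise TypeError(
            f"__cuda_stream__ must return a (version, handle) pair, got {cuda_stream!r}"
        )
    return handle


class FakeStream:
    def __cuda_stream__(self):
        return (0, 42)  # (protocol version, raw stream handle)
```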

Upgrade to actions/deploy-pages@v4 (from v2), as suggested by @leofang (NVIDIA#3434)

Deprecate `cub::{min, max}` and replace internal uses with those from libcu++ (NVIDIA#3419)

* Deprecate `cub::{min, max}` and replace internal uses with those from libcu++

Fixes NVIDIA#3404

Fix CI issues (NVIDIA#3443)

Remove deprecated `cub::min` (NVIDIA#3450)

* Remove deprecated `cub::{min, max}`

* Drop unused `thrust::remove_cvref` file

Fix typo in builtin (NVIDIA#3451)

Moves agents to `detail::<algorithm_name>` namespace (NVIDIA#3435)

uses unsigned offset types in thrust's scan dispatch (NVIDIA#3436)
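The motivation is presumably the usual one for offset types: an unsigned 32-bit offset can index twice as many items as a signed one before a 64-bit dispatch path is needed. A toy dispatch-selection sketch (the function and names are illustrative, not Thrust's actual dispatch logic):

```python
def select_offset_type(num_items):
    """Pick the narrowest offset type able to index num_items items."""
    if num_items <= 2**32 - 1:
        return "uint32"  # with signed offsets the cutoff would be 2**31 - 1
    return "uint64"
```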

Default transform_iterator's copy ctor (NVIDIA#3395)

Fixes: NVIDIA#2393

Turn C++ dialect warning into error (NVIDIA#3453)

Uses unsigned offset types in thrust's sort algorithm calling into `DispatchMergeSort` (NVIDIA#3437)

* uses thrust's dynamic dispatch for merge_sort

* [pre-commit.ci] auto code formatting

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

Refactor allocator handling of contiguous_storage (NVIDIA#3050)

Co-authored-by: Michael Schellenberger Costa <miscco@nvidia.com>

Drop thrust::detail::integer_traits (NVIDIA#3391)

Add cuda::is_floating_point supporting half and bfloat (NVIDIA#3379)

Co-authored-by: Michael Schellenberger Costa <miscco@nvidia.com>

Improve docs of std headers (NVIDIA#3416)

Drop C++11 and C++14 support for all of cccl (NVIDIA#3417)

* Drop C++11 and C++14 support for all of cccl

---------

Co-authored-by: Bernhard Manfred Gruber <bernhardmgruber@gmail.com>

Deprecate a few CUB macros (NVIDIA#3456)

Deprecate thrust universal iterator categories (NVIDIA#3461)

Fix launch args order (NVIDIA#3465)

Add `--extended-lambda` to the list of removed clangd flags (NVIDIA#3432)

add `_CCCL_HAS_NVFP8` macro (NVIDIA#3429)

Add `_CCCL_BUILTIN_PREFETCH` (NVIDIA#3433)

Drop universal iterator categories (NVIDIA#3474)

Ensure that headers in `<cuda/*>` can be built with a C++-only compiler (NVIDIA#3472)

Specialize __is_extended_floating_point for FP8 types (NVIDIA#3470)

Also ensure that we can actually enable FP8, given its FP16 and BF16 requirements

Co-authored-by: Michael Schellenberger Costa <miscco@nvidia.com>

Moves CUB kernel entry points to a detail namespace (NVIDIA#3468)

* moves emptykernel to detail ns

* second batch

* third batch

* fourth batch

* fixes cuda parallel

* concatenates nested namespaces

Deprecate block/warp algo specializations (NVIDIA#3455)

Fixes: NVIDIA#3409

Refactor CUB's util_debug (NVIDIA#3345)
davebayer added a commit to davebayer/cccl that referenced this pull request Jan 23, 2025
Cleanup util_arch (NVIDIA#2773)

Improves `DeviceSegmentedSort` test run time for large number of items and segments (NVIDIA#3246)

* fixes segment offset generation

* switches to analytical verification

* switches to analytical verification for pairs

* fixes spelling

* adds tests for large number of segments

* fixes narrowing conversion in tests

* addresses review comments

* fixes includes
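"Analytical verification" here means generating inputs whose sorted result is known in closed form, so the test never has to sort reference data on the host. A rough NumPy illustration of the idea (the helper names are made up for this sketch):

```python
import numpy as np

def make_input(offsets):
    """Fill each segment with a reversed iota so the sorted reference for each
    segment is simply arange(len(segment)) -- known analytically, with no
    host-side sort of reference data."""
    out = np.empty(offsets[-1], dtype=np.int64)
    for begin, end in zip(offsets[:-1], offsets[1:]):
        out[begin:end] = np.arange(end - begin)[::-1]
    return out

def check_sorted(keys_out, offsets):
    """Verify each output segment against its analytically known expectation."""
    for begin, end in zip(offsets[:-1], offsets[1:]):
        if not np.array_equal(keys_out[begin:end], np.arange(end - begin)):
            return False
    return True

offsets = np.array([0, 3, 3, 7])  # includes an empty segment
keys = make_input(offsets)        # [2 1 0 | (empty) | 3 2 1 0]
# Stand-in for the device-side segmented sort under test:
sorted_keys = np.concatenate(
    [np.sort(keys[b:e]) for b, e in zip(offsets[:-1], offsets[1:])]
)
```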

Adds support for large number of items and large number of segments to `DeviceSegmentedSort` (NVIDIA#3308)

* fixes segment offset generation

* switches to analytical verification

* switches to analytical verification for pairs

* addresses review comments

* introduces segment offset type

* adds tests for large number of segments

* adds support for large number of segments

* drops segment offset type

* fixes thrust namespace

* removes about-to-be-deprecated cub iterators

* no exec specifier on defaulted ctor

* fixes gcc7 linker error

* uses local_segment_index_t throughout

* determine offset type based on type returned by segment iterator begin/end iterators

* minor style improvements

cuda.parallel: Support structured types as algorithm inputs (NVIDIA#3218)

* Introduce gpu_struct decorator and typing

* Enable `reduce` to accept arrays of structs as inputs

* Add test for reducing arrays-of-struct

* Update documentation

* Use a numpy array rather than ctypes object

* Change zeros -> empty for output array and temp storage

* Add a TODO for typing GpuStruct

* Documentation updates

* Remove test_reduce_struct_type from test_reduce.py

* Revert to `to_cccl_value()` accepting ndarray + GpuStruct

* Bump copyrights

---------

Co-authored-by: Ashwin Srinath <shwina@users.noreply.github.com>
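Structured-type support rests on mapping a Python-level struct to a NumPy structured dtype so arrays of structs can be passed through as flat device buffers. A hypothetical sketch (the `gpu_struct` decorator name follows the commit, but this implementation is purely illustrative):

```python
import numpy as np

def gpu_struct(cls):
    """Derive a NumPy structured dtype from the class's field annotations."""
    fields = [(name, np.dtype(typ)) for name, typ in cls.__annotations__.items()]
    cls.dtype = np.dtype(fields)
    return cls

@gpu_struct
class Pixel:
    r: np.int32
    g: np.int32
    b: np.int32

# Arrays of structs can then be built host-side before handing them to reduce:
pixels = np.zeros(4, dtype=Pixel.dtype)
pixels["r"] = [1, 2, 3, 4]
```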

Deprecate thrust::async (NVIDIA#3324)

Fixes: NVIDIA#100

Review/Deprecate CUB `util.ptx` for CCCL 2.x (NVIDIA#3342)

Deprecate thrust::numeric_limits (NVIDIA#3366)

Upgrade to Catch2 3.8  (NVIDIA#3310)

Fixes: NVIDIA#1724

Fix sign-compare warning (NVIDIA#3408)

Implement more cmath functions to be usable on host and device (NVIDIA#3382)

* Implement more cmath functions to be usable on host and device

* Implement math roots functions

* Implement exponential functions

Redefine and deprecate thrust::remove_cvref (NVIDIA#3394)

* Redefine and deprecate thrust::remove_cvref

Co-authored-by: Michael Schellenberger Costa <miscco@nvidia.com>

cuda.parallel: Add optional stream argument to reduce_into() (NVIDIA#3348)

* Add optional stream argument to reduce_into()

* Add tests to check for reduce_into() stream behavior

* Move protocol related utils to separate file and rework __cuda_stream__ error messages

* Fix synchronization issue in stream test and add one more invalid stream test case

* Rename cuda stream validation function after removing leading underscore

* Unpack values from __cuda_stream__ instead of indexing

* Fix linting errors

* Handle TypeError when unpacking invalid __cuda_stream__ return

* Use stream to allocate cupy memory in new stream test

Deprecate `cub::{min, max}` and replace internal uses with those from libcu++ (NVIDIA#3419)

* Deprecate `cub::{min, max}` and replace internal uses with those from libcu++

Fixes NVIDIA#3404

Remove deprecated `cub::min` (NVIDIA#3450)

* Remove deprecated `cub::{min, max}`

* Drop unused `thrust::remove_cvref` file

Fix typo in builtin (NVIDIA#3451)

Moves agents to `detail::<algorithm_name>` namespace (NVIDIA#3435)

Drop thrust::detail::integer_traits (NVIDIA#3391)

Add cuda::is_floating_point supporting half and bfloat (NVIDIA#3379)

Co-authored-by: Michael Schellenberger Costa <miscco@nvidia.com>

add `_CCCL_HAS_NVFP8` macro (NVIDIA#3429)

Specialize __is_extended_floating_point for FP8 types (NVIDIA#3470)

Also ensure that we can actually enable FP8, given its FP16 and BF16 requirements

Co-authored-by: Michael Schellenberger Costa <miscco@nvidia.com>

Moves CUB kernel entry points to a detail namespace (NVIDIA#3468)

* moves emptykernel to detail ns

* second batch

* third batch

* fourth batch

* fixes cuda parallel

* concatenates nested namespaces

Deprecate block/warp algo specializations (NVIDIA#3455)

Fixes: NVIDIA#3409

fix documentation

Labels: breaking (Breaking change), cub (For all items related to CUB)

Projects: Archived in project

Development: Successfully merging this pull request may close these issues: Move CUB kernel entry points to a detail namespace

3 participants