Uses unsigned offset types in thrust's sort algorithm calling into `DispatchMergeSort` #3437

elstehle · 2025-01-17T13:14:02Z

Description

PR #3328 limited the offset types for which the kernel templates of `DeviceMergeSort` are instantiated to unsigned offset types. This PR reflects that switch in thrust, so that thrust can benefit from future tunings done for unsigned offset types.
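As a rough illustration of the idea (not the code in this PR), the sketch below maps a signed item count to the narrowest unsigned offset type before calling a dispatch that is templated on that type. The names `merge_sort_stub` and `dispatch_with_unsigned_offset` are hypothetical stand-ins and are not part of thrust or CUB; the real change calls into `DispatchMergeSort`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <type_traits>

// Hypothetical stand-in for a sort dispatch templated on its offset type.
// It only reports the width (in bytes) of the offset type it was instantiated with.
template <typename OffsetT>
std::size_t merge_sort_stub(OffsetT num_items)
{
  static_assert(std::is_unsigned<OffsetT>::value,
                "this sketch only instantiates unsigned offset types");
  (void) num_items; // a real dispatch would launch kernels here
  return sizeof(OffsetT);
}

// Hypothetical helper: converts a signed count to an unsigned offset type,
// choosing 32 bits when the count fits and 64 bits otherwise.
template <typename SignedCount>
std::size_t dispatch_with_unsigned_offset(SignedCount count)
{
  static_assert(std::is_signed<SignedCount>::value, "expects a signed count type");
  assert(count >= 0 && "item counts must be non-negative");

  using unsigned_count_t = typename std::make_unsigned<SignedCount>::type;
  const std::uint64_t n  = static_cast<unsigned_count_t>(count);

  if (n <= std::numeric_limits<std::uint32_t>::max())
  {
    return merge_sort_stub(static_cast<std::uint32_t>(n));
  }
  return merge_sort_stub(static_cast<std::uint64_t>(n));
}
```

With this shape, only two instantiations of the templated dispatch exist (32-bit and 64-bit unsigned), regardless of which signed count type the caller passes in.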

github-actions · 2025-01-17T14:36:58Z

🟩 CI finished in 1h 20m: Pass: 100%/78 | Total: 1d 11h | Avg: 27m 02s | Max: 1h 01m | Hits: 393%/12760

🟩 cub: Pass: 100%/38 | Total: 23h 52m | Avg: 37m 42s | Max: 1h 01m | Hits: 523%/3540

🟩 cpu
  🟩 amd64              Pass: 100%/36  | Total: 22h 20m | Avg: 37m 14s | Max:  1h 01m | Hits: 523%/3540  
  🟩 arm64              Pass: 100%/2   | Total:  1h 31m | Avg: 45m 57s | Max: 47m 17s
🟩 ctk
  🟩 12.0               Pass: 100%/5   | Total:  3h 29m | Avg: 41m 57s | Max: 55m 03s | Hits: 523%/885   
  🟩 12.5               Pass: 100%/2   | Total:  1h 27m | Avg: 43m 50s | Max: 45m 23s
  🟩 12.6               Pass: 100%/31  | Total: 18h 55m | Avg: 36m 37s | Max:  1h 01m | Hits: 523%/2655  
🟩 cudacxx
  🟩 ClangCUDA18        Pass: 100%/2   | Total:  1h 43m | Avg: 51m 30s | Max: 51m 38s
  🟩 nvcc12.0           Pass: 100%/5   | Total:  3h 29m | Avg: 41m 57s | Max: 55m 03s | Hits: 523%/885   
  🟩 nvcc12.5           Pass: 100%/2   | Total:  1h 27m | Avg: 43m 50s | Max: 45m 23s
  🟩 nvcc12.6           Pass: 100%/29  | Total: 17h 12m | Avg: 35m 35s | Max:  1h 01m | Hits: 523%/2655  
🟩 cudacxx_family
  🟩 ClangCUDA          Pass: 100%/2   | Total:  1h 43m | Avg: 51m 30s | Max: 51m 38s
  🟩 nvcc               Pass: 100%/36  | Total: 22h 09m | Avg: 36m 55s | Max:  1h 01m | Hits: 523%/3540  
🟩 cxx
  🟩 Clang14            Pass: 100%/4   | Total:  2h 36m | Avg: 39m 08s | Max: 42m 02s
  🟩 Clang15            Pass: 100%/1   | Total: 37m 46s | Avg: 37m 46s | Max: 37m 46s
  🟩 Clang16            Pass: 100%/1   | Total: 39m 33s | Avg: 39m 33s | Max: 39m 33s
  🟩 Clang17            Pass: 100%/1   | Total: 37m 44s | Avg: 37m 44s | Max: 37m 44s
  🟩 Clang18            Pass: 100%/7   | Total:  4h 35m | Avg: 39m 18s | Max: 51m 38s
  🟩 GCC7               Pass: 100%/2   | Total:  1h 15m | Avg: 37m 47s | Max: 37m 52s
  🟩 GCC8               Pass: 100%/1   | Total: 36m 55s | Avg: 36m 55s | Max: 36m 55s
  🟩 GCC9               Pass: 100%/2   | Total:  1h 19m | Avg: 39m 46s | Max: 40m 05s
  🟩 GCC10              Pass: 100%/1   | Total: 37m 50s | Avg: 37m 50s | Max: 37m 50s
  🟩 GCC11              Pass: 100%/1   | Total: 39m 19s | Avg: 39m 19s | Max: 39m 19s
  🟩 GCC12              Pass: 100%/3   | Total:  1h 14m | Avg: 24m 55s | Max: 38m 05s
  🟩 GCC13              Pass: 100%/8   | Total:  3h 40m | Avg: 27m 31s | Max: 44m 37s
  🟩 MSVC14.29          Pass: 100%/2   | Total:  1h 54m | Avg: 57m 03s | Max: 59m 04s | Hits: 523%/1770  
  🟩 MSVC14.39          Pass: 100%/2   | Total:  1h 59m | Avg: 59m 57s | Max:  1h 01m | Hits: 523%/1770  
  🟩 NVHPC24.7          Pass: 100%/2   | Total:  1h 27m | Avg: 43m 50s | Max: 45m 23s
🟩 cxx_family
  🟩 Clang              Pass: 100%/14  | Total:  9h 06m | Avg: 39m 03s | Max: 51m 38s
  🟩 GCC                Pass: 100%/18  | Total:  9h 24m | Avg: 31m 20s | Max: 44m 37s
  🟩 MSVC               Pass: 100%/4   | Total:  3h 54m | Avg: 58m 30s | Max:  1h 01m | Hits: 523%/3540  
  🟩 NVHPC              Pass: 100%/2   | Total:  1h 27m | Avg: 43m 50s | Max: 45m 23s
🟩 gpu
  🟩 h100               Pass: 100%/2   | Total: 36m 42s | Avg: 18m 21s | Max: 19m 38s
  🟩 v100               Pass: 100%/36  | Total: 23h 15m | Avg: 38m 46s | Max:  1h 01m | Hits: 523%/3540  
🟩 jobs
  🟩 Build              Pass: 100%/31  | Total: 21h 24m | Avg: 41m 26s | Max:  1h 01m | Hits: 523%/3540  
  🟩 DeviceLaunch       Pass: 100%/1   | Total: 20m 44s | Avg: 20m 44s | Max: 20m 44s
  🟩 GraphCapture       Pass: 100%/1   | Total: 16m 59s | Avg: 16m 59s | Max: 16m 59s
  🟩 HostLaunch         Pass: 100%/3   | Total:  1h 04m | Avg: 21m 21s | Max: 22m 36s
  🟩 TestGPU            Pass: 100%/2   | Total: 46m 07s | Avg: 23m 03s | Max: 25m 16s
🟩 sm
  🟩 90                 Pass: 100%/2   | Total: 36m 42s | Avg: 18m 21s | Max: 19m 38s
  🟩 90a                Pass: 100%/1   | Total: 15m 57s | Avg: 15m 57s | Max: 15m 57s
🟩 std
  🟩 17                 Pass: 100%/14  | Total: 10h 19m | Avg: 44m 16s | Max: 59m 04s | Hits: 523%/2655  
  🟩 20                 Pass: 100%/24  | Total: 13h 32m | Avg: 33m 52s | Max:  1h 01m | Hits: 523%/885

🟩 thrust: Pass: 100%/37 | Total: 10h 27m | Avg: 16m 57s | Max: 40m 33s | Hits: 343%/9220

🟩 cmake_options
  🟩 -DTHRUST_DISPATCH_TYPE=Force32bit Pass: 100%/2   | Total: 24m 13s | Avg: 12m 06s | Max: 12m 09s
🟩 cpu
  🟩 amd64              Pass: 100%/35  | Total: 10h 03m | Avg: 17m 14s | Max: 40m 33s | Hits: 343%/9220  
  🟩 arm64              Pass: 100%/2   | Total: 23m 51s | Avg: 11m 55s | Max: 12m 24s
🟩 ctk
  🟩 12.0               Pass: 100%/5   | Total:  1h 28m | Avg: 17m 40s | Max: 33m 51s | Hits: 336%/1844  
  🟩 12.5               Pass: 100%/2   | Total: 57m 11s | Avg: 28m 35s | Max: 28m 49s
  🟩 12.6               Pass: 100%/30  | Total:  8h 01m | Avg: 16m 03s | Max: 40m 33s | Hits: 344%/7376  
🟩 cudacxx
  🟩 ClangCUDA18        Pass: 100%/2   | Total: 25m 37s | Avg: 12m 48s | Max: 13m 02s
  🟩 nvcc12.0           Pass: 100%/5   | Total:  1h 28m | Avg: 17m 40s | Max: 33m 51s | Hits: 336%/1844  
  🟩 nvcc12.5           Pass: 100%/2   | Total: 57m 11s | Avg: 28m 35s | Max: 28m 49s
  🟩 nvcc12.6           Pass: 100%/28  | Total:  7h 36m | Avg: 16m 17s | Max: 40m 33s | Hits: 344%/7376  
🟩 cudacxx_family
  🟩 ClangCUDA          Pass: 100%/2   | Total: 25m 37s | Avg: 12m 48s | Max: 13m 02s
  🟩 nvcc               Pass: 100%/35  | Total: 10h 01m | Avg: 17m 11s | Max: 40m 33s | Hits: 343%/9220  
🟩 cxx
  🟩 Clang14            Pass: 100%/4   | Total: 54m 00s | Avg: 13m 30s | Max: 13m 50s
  🟩 Clang15            Pass: 100%/1   | Total: 14m 08s | Avg: 14m 08s | Max: 14m 08s
  🟩 Clang16            Pass: 100%/1   | Total: 12m 38s | Avg: 12m 38s | Max: 12m 38s
  🟩 Clang17            Pass: 100%/1   | Total: 12m 36s | Avg: 12m 36s | Max: 12m 36s
  🟩 Clang18            Pass: 100%/7   | Total:  1h 21m | Avg: 11m 35s | Max: 13m 02s
  🟩 GCC7               Pass: 100%/2   | Total: 25m 44s | Avg: 12m 52s | Max: 13m 11s
  🟩 GCC8               Pass: 100%/1   | Total: 13m 21s | Avg: 13m 21s | Max: 13m 21s
  🟩 GCC9               Pass: 100%/2   | Total: 28m 39s | Avg: 14m 19s | Max: 14m 31s
  🟩 GCC10              Pass: 100%/1   | Total: 13m 58s | Avg: 13m 58s | Max: 13m 58s
  🟩 GCC11              Pass: 100%/1   | Total: 13m 00s | Avg: 13m 00s | Max: 13m 00s
  🟩 GCC12              Pass: 100%/1   | Total: 13m 57s | Avg: 13m 57s | Max: 13m 57s
  🟩 GCC13              Pass: 100%/8   | Total:  1h 42m | Avg: 12m 50s | Max: 17m 22s
  🟩 MSVC14.29          Pass: 100%/2   | Total:  1h 09m | Avg: 34m 46s | Max: 35m 41s | Hits: 336%/3688  
  🟩 MSVC14.39          Pass: 100%/3   | Total:  1h 54m | Avg: 38m 16s | Max: 40m 33s | Hits: 347%/5532  
  🟩 NVHPC24.7          Pass: 100%/2   | Total: 57m 11s | Avg: 28m 35s | Max: 28m 49s
🟩 cxx_family
  🟩 Clang              Pass: 100%/14  | Total:  2h 54m | Avg: 12m 27s | Max: 14m 08s
  🟩 GCC                Pass: 100%/16  | Total:  3h 31m | Avg: 13m 12s | Max: 17m 22s
  🟩 MSVC               Pass: 100%/5   | Total:  3h 04m | Avg: 36m 52s | Max: 40m 33s | Hits: 343%/9220  
  🟩 NVHPC              Pass: 100%/2   | Total: 57m 11s | Avg: 28m 35s | Max: 28m 49s
🟩 gpu
  🟩 v100               Pass: 100%/37  | Total: 10h 27m | Avg: 16m 57s | Max: 40m 33s | Hits: 343%/9220  
🟩 jobs
  🟩 Build              Pass: 100%/31  | Total:  9h 01m | Avg: 17m 28s | Max: 40m 33s | Hits: 337%/7376  
  🟩 TestCPU            Pass: 100%/3   | Total: 48m 56s | Avg: 16m 18s | Max: 34m 08s | Hits: 365%/1844  
  🟩 TestGPU            Pass: 100%/3   | Total: 36m 39s | Avg: 12m 13s | Max: 12m 51s
🟩 sm
  🟩 90a                Pass: 100%/1   | Total: 17m 22s | Avg: 17m 22s | Max: 17m 22s
🟩 std
  🟩 17                 Pass: 100%/14  | Total:  4h 33m | Avg: 19m 30s | Max: 40m 08s | Hits: 338%/5532  
  🟩 20                 Pass: 100%/21  | Total:  5h 30m | Avg: 15m 42s | Max: 40m 33s | Hits: 351%/3688

🟩 cccl_c_parallel: Pass: 100%/2 | Total: 8m 50s | Avg: 4m 25s | Max: 6m 50s

🟩 cpu
  🟩 amd64              Pass: 100%/2   | Total:  8m 50s | Avg:  4m 25s | Max:  6m 50s
🟩 ctk
  🟩 12.6               Pass: 100%/2   | Total:  8m 50s | Avg:  4m 25s | Max:  6m 50s
🟩 cudacxx
  🟩 nvcc12.6           Pass: 100%/2   | Total:  8m 50s | Avg:  4m 25s | Max:  6m 50s
🟩 cudacxx_family
  🟩 nvcc               Pass: 100%/2   | Total:  8m 50s | Avg:  4m 25s | Max:  6m 50s
🟩 cxx
  🟩 GCC13              Pass: 100%/2   | Total:  8m 50s | Avg:  4m 25s | Max:  6m 50s
🟩 cxx_family
  🟩 GCC                Pass: 100%/2   | Total:  8m 50s | Avg:  4m 25s | Max:  6m 50s
🟩 gpu
  🟩 v100               Pass: 100%/2   | Total:  8m 50s | Avg:  4m 25s | Max:  6m 50s
🟩 jobs
  🟩 Build              Pass: 100%/1   | Total:  2m 00s | Avg:  2m 00s | Max:  2m 00s
  🟩 Test               Pass: 100%/1   | Total:  6m 50s | Avg:  6m 50s | Max:  6m 50s

🟩 python: Pass: 100%/1 | Total: 41m 06s | Avg: 41m 06s | Max: 41m 06s

🟩 cpu
  🟩 amd64              Pass: 100%/1   | Total: 41m 06s | Avg: 41m 06s | Max: 41m 06s
🟩 ctk
  🟩 12.6               Pass: 100%/1   | Total: 41m 06s | Avg: 41m 06s | Max: 41m 06s
🟩 cudacxx
  🟩 nvcc12.6           Pass: 100%/1   | Total: 41m 06s | Avg: 41m 06s | Max: 41m 06s
🟩 cudacxx_family
  🟩 nvcc               Pass: 100%/1   | Total: 41m 06s | Avg: 41m 06s | Max: 41m 06s
🟩 cxx
  🟩 GCC13              Pass: 100%/1   | Total: 41m 06s | Avg: 41m 06s | Max: 41m 06s
🟩 cxx_family
  🟩 GCC                Pass: 100%/1   | Total: 41m 06s | Avg: 41m 06s | Max: 41m 06s
🟩 gpu
  🟩 v100               Pass: 100%/1   | Total: 41m 06s | Avg: 41m 06s | Max: 41m 06s
🟩 jobs
  🟩 Test               Pass: 100%/1   | Total: 41m 06s | Avg: 41m 06s | Max: 41m 06s

👃 Inspect Changes

Modifications in project?

	Project
	CCCL Infrastructure
	libcu++
+/-	CUB
+/-	Thrust
	CUDA Experimental
	python
	CCCL C Parallel Library
	Catch2Helper

Modifications in project or dependencies?

	Project
	CCCL Infrastructure
	libcu++
+/-	CUB
+/-	Thrust
	CUDA Experimental
+/-	python
+/-	CCCL C Parallel Library
+/-	Catch2Helper

🏃‍ Runner counts (total jobs: 78)

#	Runner
53	`linux-amd64-cpu16`
11	`linux-amd64-gpu-v100-latest-1`
9	`windows-amd64-cpu16`
4	`linux-arm64-cpu16`
1	`linux-amd64-gpu-h100-latest-1-testing`

thrust/thrust/system/cuda/detail/sort.h

github-actions · 2025-01-20T15:01:43Z

🟩 CI finished in 1h 22m: Pass: 100%/78 | Total: 1d 11h | Avg: 27m 41s | Max: 1h 03m | Hits: 393%/12720

🟩 cub: Pass: 100%/38 | Total: 1d 00h | Avg: 38m 57s | Max: 1h 03m | Hits: 523%/3540

🟩 cpu
  🟩 amd64              Pass: 100%/36  | Total: 23h 08m | Avg: 38m 34s | Max:  1h 03m | Hits: 523%/3540  
  🟩 arm64              Pass: 100%/2   | Total:  1h 31m | Avg: 45m 36s | Max: 47m 03s
🟩 ctk
  🟩 12.0               Pass: 100%/5   | Total:  3h 34m | Avg: 42m 55s | Max: 56m 41s | Hits: 523%/885   
  🟩 12.5               Pass: 100%/2   | Total:  1h 31m | Avg: 45m 32s | Max: 45m 34s
  🟩 12.6               Pass: 100%/31  | Total: 19h 34m | Avg: 37m 53s | Max:  1h 03m | Hits: 523%/2655  
🟩 cudacxx
  🟩 ClangCUDA18        Pass: 100%/2   | Total:  1h 45m | Avg: 52m 40s | Max: 53m 29s
  🟩 nvcc12.0           Pass: 100%/5   | Total:  3h 34m | Avg: 42m 55s | Max: 56m 41s | Hits: 523%/885   
  🟩 nvcc12.5           Pass: 100%/2   | Total:  1h 31m | Avg: 45m 32s | Max: 45m 34s
  🟩 nvcc12.6           Pass: 100%/29  | Total: 17h 49m | Avg: 36m 52s | Max:  1h 03m | Hits: 523%/2655  
🟩 cudacxx_family
  🟩 ClangCUDA          Pass: 100%/2   | Total:  1h 45m | Avg: 52m 40s | Max: 53m 29s
  🟩 nvcc               Pass: 100%/36  | Total: 22h 54m | Avg: 38m 11s | Max:  1h 03m | Hits: 523%/3540  
🟩 cxx
  🟩 Clang14            Pass: 100%/4   | Total:  2h 37m | Avg: 39m 18s | Max: 40m 17s
  🟩 Clang15            Pass: 100%/1   | Total: 38m 51s | Avg: 38m 51s | Max: 38m 51s
  🟩 Clang16            Pass: 100%/1   | Total: 39m 44s | Avg: 39m 44s | Max: 39m 44s
  🟩 Clang17            Pass: 100%/1   | Total: 41m 12s | Avg: 41m 12s | Max: 41m 12s
  🟩 Clang18            Pass: 100%/7   | Total:  4h 41m | Avg: 40m 09s | Max: 53m 29s
  🟩 GCC7               Pass: 100%/2   | Total:  1h 18m | Avg: 39m 18s | Max: 39m 58s
  🟩 GCC8               Pass: 100%/1   | Total: 37m 30s | Avg: 37m 30s | Max: 37m 30s
  🟩 GCC9               Pass: 100%/2   | Total:  1h 18m | Avg: 39m 13s | Max: 39m 46s
  🟩 GCC10              Pass: 100%/1   | Total: 39m 44s | Avg: 39m 44s | Max: 39m 44s
  🟩 GCC11              Pass: 100%/1   | Total: 36m 58s | Avg: 36m 58s | Max: 36m 58s
  🟩 GCC12              Pass: 100%/3   | Total:  1h 13m | Avg: 24m 23s | Max: 38m 02s
  🟩 GCC13              Pass: 100%/8   | Total:  4h 03m | Avg: 30m 26s | Max: 44m 09s
  🟩 MSVC14.29          Pass: 100%/2   | Total:  1h 58m | Avg: 59m 28s | Max:  1h 02m | Hits: 523%/1770  
  🟩 MSVC14.39          Pass: 100%/2   | Total:  2h 04m | Avg:  1h 02m | Max:  1h 03m | Hits: 523%/1770  
  🟩 NVHPC24.7          Pass: 100%/2   | Total:  1h 31m | Avg: 45m 32s | Max: 45m 34s
🟩 cxx_family
  🟩 Clang              Pass: 100%/14  | Total:  9h 18m | Avg: 39m 51s | Max: 53m 29s
  🟩 GCC                Pass: 100%/18  | Total:  9h 47m | Avg: 32m 39s | Max: 44m 09s
  🟩 MSVC               Pass: 100%/4   | Total:  4h 03m | Avg:  1h 00m | Max:  1h 03m | Hits: 523%/3540  
  🟩 NVHPC              Pass: 100%/2   | Total:  1h 31m | Avg: 45m 32s | Max: 45m 34s
🟩 gpu
  🟩 h100               Pass: 100%/2   | Total: 35m 08s | Avg: 17m 34s | Max: 19m 31s
  🟩 v100               Pass: 100%/36  | Total:  1d 00h | Avg: 40m 08s | Max:  1h 03m | Hits: 523%/3540  
🟩 jobs
  🟩 Build              Pass: 100%/31  | Total: 21h 44m | Avg: 42m 05s | Max:  1h 03m | Hits: 523%/3540  
  🟩 DeviceLaunch       Pass: 100%/1   | Total: 24m 56s | Avg: 24m 56s | Max: 24m 56s
  🟩 GraphCapture       Pass: 100%/1   | Total: 19m 53s | Avg: 19m 53s | Max: 19m 53s
  🟩 HostLaunch         Pass: 100%/3   | Total:  1h 08m | Avg: 22m 58s | Max: 25m 24s
  🟩 TestGPU            Pass: 100%/2   | Total:  1h 01m | Avg: 30m 49s | Max: 33m 10s
🟩 sm
  🟩 90                 Pass: 100%/2   | Total: 35m 08s | Avg: 17m 34s | Max: 19m 31s
  🟩 90a                Pass: 100%/1   | Total: 16m 24s | Avg: 16m 24s | Max: 16m 24s
🟩 std
  🟩 17                 Pass: 100%/14  | Total: 10h 29m | Avg: 44m 55s | Max:  1h 03m | Hits: 523%/2655  
  🟩 20                 Pass: 100%/24  | Total: 14h 11m | Avg: 35m 27s | Max:  1h 00m | Hits: 523%/885

🟩 thrust: Pass: 100%/37 | Total: 10h 29m | Avg: 17m 00s | Max: 44m 13s | Hits: 342%/9180

🟩 cmake_options
  🟩 -DTHRUST_DISPATCH_TYPE=Force32bit Pass: 100%/2   | Total: 25m 18s | Avg: 12m 39s | Max: 13m 43s
🟩 cpu
  🟩 amd64              Pass: 100%/35  | Total: 10h 03m | Avg: 17m 15s | Max: 44m 13s | Hits: 342%/9180  
  🟩 arm64              Pass: 100%/2   | Total: 25m 05s | Avg: 12m 32s | Max: 13m 27s
🟩 ctk
  🟩 12.0               Pass: 100%/5   | Total:  1h 29m | Avg: 17m 57s | Max: 38m 00s | Hits: 336%/1836  
  🟩 12.5               Pass: 100%/2   | Total: 52m 59s | Avg: 26m 29s | Max: 26m 43s
  🟩 12.6               Pass: 100%/30  | Total:  8h 06m | Avg: 16m 12s | Max: 44m 13s | Hits: 344%/7344  
🟩 cudacxx
  🟩 ClangCUDA18        Pass: 100%/2   | Total: 24m 08s | Avg: 12m 04s | Max: 12m 11s
  🟩 nvcc12.0           Pass: 100%/5   | Total:  1h 29m | Avg: 17m 57s | Max: 38m 00s | Hits: 336%/1836  
  🟩 nvcc12.5           Pass: 100%/2   | Total: 52m 59s | Avg: 26m 29s | Max: 26m 43s
  🟩 nvcc12.6           Pass: 100%/28  | Total:  7h 42m | Avg: 16m 30s | Max: 44m 13s | Hits: 344%/7344  
🟩 cudacxx_family
  🟩 ClangCUDA          Pass: 100%/2   | Total: 24m 08s | Avg: 12m 04s | Max: 12m 11s
  🟩 nvcc               Pass: 100%/35  | Total: 10h 04m | Avg: 17m 17s | Max: 44m 13s | Hits: 342%/9180  
🟩 cxx
  🟩 Clang14            Pass: 100%/4   | Total: 51m 38s | Avg: 12m 54s | Max: 14m 02s
  🟩 Clang15            Pass: 100%/1   | Total: 12m 45s | Avg: 12m 45s | Max: 12m 45s
  🟩 Clang16            Pass: 100%/1   | Total: 13m 08s | Avg: 13m 08s | Max: 13m 08s
  🟩 Clang17            Pass: 100%/1   | Total: 14m 10s | Avg: 14m 10s | Max: 14m 10s
  🟩 Clang18            Pass: 100%/7   | Total:  1h 26m | Avg: 12m 23s | Max: 15m 46s
  🟩 GCC7               Pass: 100%/2   | Total: 27m 07s | Avg: 13m 33s | Max: 13m 48s
  🟩 GCC8               Pass: 100%/1   | Total: 12m 27s | Avg: 12m 27s | Max: 12m 27s
  🟩 GCC9               Pass: 100%/2   | Total: 26m 17s | Avg: 13m 08s | Max: 13m 30s
  🟩 GCC10              Pass: 100%/1   | Total: 13m 57s | Avg: 13m 57s | Max: 13m 57s
  🟩 GCC11              Pass: 100%/1   | Total: 14m 06s | Avg: 14m 06s | Max: 14m 06s
  🟩 GCC12              Pass: 100%/1   | Total: 14m 58s | Avg: 14m 58s | Max: 14m 58s
  🟩 GCC13              Pass: 100%/8   | Total:  1h 39m | Avg: 12m 26s | Max: 15m 30s
  🟩 MSVC14.29          Pass: 100%/2   | Total:  1h 14m | Avg: 37m 04s | Max: 38m 00s | Hits: 336%/3672  
  🟩 MSVC14.39          Pass: 100%/3   | Total:  1h 55m | Avg: 38m 22s | Max: 44m 13s | Hits: 346%/5508  
  🟩 NVHPC24.7          Pass: 100%/2   | Total: 52m 59s | Avg: 26m 29s | Max: 26m 43s
🟩 cxx_family
  🟩 Clang              Pass: 100%/14  | Total:  2h 58m | Avg: 12m 44s | Max: 15m 46s
  🟩 GCC                Pass: 100%/16  | Total:  3h 28m | Avg: 13m 01s | Max: 15m 30s
  🟩 MSVC               Pass: 100%/5   | Total:  3h 09m | Avg: 37m 51s | Max: 44m 13s | Hits: 342%/9180  
  🟩 NVHPC              Pass: 100%/2   | Total: 52m 59s | Avg: 26m 29s | Max: 26m 43s
🟩 gpu
  🟩 v100               Pass: 100%/37  | Total: 10h 29m | Avg: 17m 00s | Max: 44m 13s | Hits: 342%/9180  
🟩 jobs
  🟩 Build              Pass: 100%/31  | Total:  8h 52m | Avg: 17m 11s | Max: 44m 13s | Hits: 337%/7344  
  🟩 TestCPU            Pass: 100%/3   | Total: 51m 13s | Avg: 17m 04s | Max: 34m 39s | Hits: 365%/1836  
  🟩 TestGPU            Pass: 100%/3   | Total: 44m 59s | Avg: 14m 59s | Max: 15m 46s
🟩 sm
  🟩 90a                Pass: 100%/1   | Total:  8m 58s | Avg:  8m 58s | Max:  8m 58s
🟩 std
  🟩 17                 Pass: 100%/14  | Total:  4h 27m | Avg: 19m 06s | Max: 38m 00s | Hits: 337%/5508  
  🟩 20                 Pass: 100%/21  | Total:  5h 36m | Avg: 16m 00s | Max: 44m 13s | Hits: 350%/3672

🟩 cccl_c_parallel: Pass: 100%/2 | Total: 9m 15s | Avg: 4m 37s | Max: 7m 09s

🟩 cpu
  🟩 amd64              Pass: 100%/2   | Total:  9m 15s | Avg:  4m 37s | Max:  7m 09s
🟩 ctk
  🟩 12.6               Pass: 100%/2   | Total:  9m 15s | Avg:  4m 37s | Max:  7m 09s
🟩 cudacxx
  🟩 nvcc12.6           Pass: 100%/2   | Total:  9m 15s | Avg:  4m 37s | Max:  7m 09s
🟩 cudacxx_family
  🟩 nvcc               Pass: 100%/2   | Total:  9m 15s | Avg:  4m 37s | Max:  7m 09s
🟩 cxx
  🟩 GCC13              Pass: 100%/2   | Total:  9m 15s | Avg:  4m 37s | Max:  7m 09s
🟩 cxx_family
  🟩 GCC                Pass: 100%/2   | Total:  9m 15s | Avg:  4m 37s | Max:  7m 09s
🟩 gpu
  🟩 v100               Pass: 100%/2   | Total:  9m 15s | Avg:  4m 37s | Max:  7m 09s
🟩 jobs
  🟩 Build              Pass: 100%/1   | Total:  2m 06s | Avg:  2m 06s | Max:  2m 06s
  🟩 Test               Pass: 100%/1   | Total:  7m 09s | Avg:  7m 09s | Max:  7m 09s

🟩 python: Pass: 100%/1 | Total: 41m 31s | Avg: 41m 31s | Max: 41m 31s

🟩 cpu
  🟩 amd64              Pass: 100%/1   | Total: 41m 31s | Avg: 41m 31s | Max: 41m 31s
🟩 ctk
  🟩 12.6               Pass: 100%/1   | Total: 41m 31s | Avg: 41m 31s | Max: 41m 31s
🟩 cudacxx
  🟩 nvcc12.6           Pass: 100%/1   | Total: 41m 31s | Avg: 41m 31s | Max: 41m 31s
🟩 cudacxx_family
  🟩 nvcc               Pass: 100%/1   | Total: 41m 31s | Avg: 41m 31s | Max: 41m 31s
🟩 cxx
  🟩 GCC13              Pass: 100%/1   | Total: 41m 31s | Avg: 41m 31s | Max: 41m 31s
🟩 cxx_family
  🟩 GCC                Pass: 100%/1   | Total: 41m 31s | Avg: 41m 31s | Max: 41m 31s
🟩 gpu
  🟩 v100               Pass: 100%/1   | Total: 41m 31s | Avg: 41m 31s | Max: 41m 31s
🟩 jobs
  🟩 Test               Pass: 100%/1   | Total: 41m 31s | Avg: 41m 31s | Max: 41m 31s

👃 Inspect Changes

Modifications in project?

	Project
	CCCL Infrastructure
	libcu++
+/-	CUB
+/-	Thrust
	CUDA Experimental
	python
	CCCL C Parallel Library
	Catch2Helper

Modifications in project or dependencies?

	Project
	CCCL Infrastructure
	libcu++
+/-	CUB
+/-	Thrust
	CUDA Experimental
+/-	python
+/-	CCCL C Parallel Library
+/-	Catch2Helper

🏃‍ Runner counts (total jobs: 78)

#	Runner
53	`linux-amd64-cpu16`
11	`linux-amd64-gpu-v100-latest-1`
9	`windows-amd64-cpu16`
4	`linux-arm64-cpu16`
1	`linux-amd64-gpu-h100-latest-1-testing`

elstehle · 2025-01-21T09:57:52Z

pre-commit.ci autofix

copy-pr-bot · 2025-01-21T09:58:40Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

elstehle · 2025-01-21T09:59:49Z

/ok to test

github-actions · 2025-01-21T11:30:32Z

🟩 CI finished in 1h 29m: Pass: 100%/78 | Total: 1d 10h | Avg: 26m 17s | Max: 1h 04m | Hits: 397%/12720

🟩 cub: Pass: 100%/38 | Total: 22h 57m | Avg: 36m 14s | Max: 1h 04m | Hits: 523%/3540

🟩 cpu
  🟩 amd64              Pass: 100%/36  | Total: 21h 26m | Avg: 35m 44s | Max:  1h 04m | Hits: 523%/3540  
  🟩 arm64              Pass: 100%/2   | Total:  1h 30m | Avg: 45m 18s | Max: 45m 25s
🟩 ctk
  🟩 12.0               Pass: 100%/5   | Total:  3h 31m | Avg: 42m 20s | Max: 55m 45s | Hits: 523%/885   
  🟩 12.5               Pass: 100%/2   | Total:  1h 27m | Avg: 43m 44s | Max: 45m 14s
  🟩 12.6               Pass: 100%/31  | Total: 17h 58m | Avg: 34m 46s | Max:  1h 04m | Hits: 523%/2655  
🟩 cudacxx
  🟩 ClangCUDA18        Pass: 100%/2   | Total:  9m 14s | Avg:  4m 37s | Max:  4m 37s
  🟩 nvcc12.0           Pass: 100%/5   | Total:  3h 31m | Avg: 42m 20s | Max: 55m 45s | Hits: 523%/885   
  🟩 nvcc12.5           Pass: 100%/2   | Total:  1h 27m | Avg: 43m 44s | Max: 45m 14s
  🟩 nvcc12.6           Pass: 100%/29  | Total: 17h 48m | Avg: 36m 51s | Max:  1h 04m | Hits: 523%/2655  
🟩 cudacxx_family
  🟩 ClangCUDA          Pass: 100%/2   | Total:  9m 14s | Avg:  4m 37s | Max:  4m 37s
  🟩 nvcc               Pass: 100%/36  | Total: 22h 48m | Avg: 38m 00s | Max:  1h 04m | Hits: 523%/3540  
🟩 cxx
  🟩 Clang14            Pass: 100%/4   | Total:  2h 36m | Avg: 39m 00s | Max: 40m 50s
  🟩 Clang15            Pass: 100%/1   | Total: 39m 12s | Avg: 39m 12s | Max: 39m 12s
  🟩 Clang16            Pass: 100%/1   | Total: 37m 43s | Avg: 37m 43s | Max: 37m 43s
  🟩 Clang17            Pass: 100%/1   | Total: 37m 46s | Avg: 37m 46s | Max: 37m 46s
  🟩 Clang18            Pass: 100%/7   | Total:  2h 58m | Avg: 25m 33s | Max: 45m 11s
  🟩 GCC7               Pass: 100%/2   | Total:  1h 16m | Avg: 38m 21s | Max: 39m 07s
  🟩 GCC8               Pass: 100%/1   | Total: 39m 18s | Avg: 39m 18s | Max: 39m 18s
  🟩 GCC9               Pass: 100%/2   | Total:  1h 20m | Avg: 40m 29s | Max: 41m 18s
  🟩 GCC10              Pass: 100%/1   | Total: 39m 29s | Avg: 39m 29s | Max: 39m 29s
  🟩 GCC11              Pass: 100%/1   | Total: 38m 51s | Avg: 38m 51s | Max: 38m 51s
  🟩 GCC12              Pass: 100%/3   | Total:  1h 17m | Avg: 25m 56s | Max: 41m 02s
  🟩 GCC13              Pass: 100%/8   | Total:  4h 05m | Avg: 30m 44s | Max: 45m 25s
  🟩 MSVC14.29          Pass: 100%/2   | Total:  1h 58m | Avg: 59m 12s | Max:  1h 02m | Hits: 523%/1770  
  🟩 MSVC14.39          Pass: 100%/2   | Total:  2h 02m | Avg:  1h 01m | Max:  1h 04m | Hits: 523%/1770  
  🟩 NVHPC24.7          Pass: 100%/2   | Total:  1h 27m | Avg: 43m 44s | Max: 45m 14s
🟩 cxx_family
  🟩 Clang              Pass: 100%/14  | Total:  7h 29m | Avg: 32m 06s | Max: 45m 11s
  🟩 GCC                Pass: 100%/18  | Total:  9h 59m | Avg: 33m 17s | Max: 45m 25s
  🟩 MSVC               Pass: 100%/4   | Total:  4h 01m | Avg:  1h 00m | Max:  1h 04m | Hits: 523%/3540  
  🟩 NVHPC              Pass: 100%/2   | Total:  1h 27m | Avg: 43m 44s | Max: 45m 14s
🟩 gpu
  🟩 h100               Pass: 100%/2   | Total: 36m 47s | Avg: 18m 23s | Max: 19m 29s
  🟩 v100               Pass: 100%/36  | Total: 22h 20m | Avg: 37m 14s | Max:  1h 04m | Hits: 523%/3540  
🟩 jobs
  🟩 Build              Pass: 100%/31  | Total: 20h 06m | Avg: 38m 55s | Max:  1h 04m | Hits: 523%/3540  
  🟩 DeviceLaunch       Pass: 100%/1   | Total: 32m 04s | Avg: 32m 04s | Max: 32m 04s
  🟩 GraphCapture       Pass: 100%/1   | Total: 24m 16s | Avg: 24m 16s | Max: 24m 16s
  🟩 HostLaunch         Pass: 100%/3   | Total:  1h 02m | Avg: 20m 44s | Max: 23m 11s
  🟩 TestGPU            Pass: 100%/2   | Total: 51m 54s | Avg: 25m 57s | Max: 27m 26s
🟩 sm
  🟩 90                 Pass: 100%/2   | Total: 36m 47s | Avg: 18m 23s | Max: 19m 29s
  🟩 90a                Pass: 100%/1   | Total: 15m 39s | Avg: 15m 39s | Max: 15m 39s
🟩 std
  🟩 17                 Pass: 100%/14  | Total:  9h 41m | Avg: 41m 32s | Max:  1h 02m | Hits: 523%/2655  
  🟩 20                 Pass: 100%/24  | Total: 13h 15m | Avg: 33m 09s | Max:  1h 04m | Hits: 523%/885

🟩 thrust: Pass: 100%/37 | Total: 10h 20m | Avg: 16m 46s | Max: 38m 35s | Hits: 348%/9180

🟩 cmake_options
  🟩 -DTHRUST_DISPATCH_TYPE=Force32bit Pass: 100%/2   | Total: 25m 55s | Avg: 12m 57s | Max: 13m 48s
🟩 cpu
  🟩 amd64              Pass: 100%/35  | Total:  9h 55m | Avg: 17m 00s | Max: 38m 35s | Hits: 348%/9180  
  🟩 arm64              Pass: 100%/2   | Total: 25m 00s | Avg: 12m 30s | Max: 13m 15s
🟩 ctk
  🟩 12.0               Pass: 100%/5   | Total:  1h 26m | Avg: 17m 18s | Max: 31m 02s | Hits: 344%/1836  
  🟩 12.5               Pass: 100%/2   | Total: 56m 17s | Avg: 28m 08s | Max: 28m 22s
  🟩 12.6               Pass: 100%/30  | Total:  7h 57m | Avg: 15m 55s | Max: 38m 35s | Hits: 349%/7344  
🟩 cudacxx
  🟩 ClangCUDA18        Pass: 100%/2   | Total: 10m 46s | Avg:  5m 23s | Max:  5m 24s
  🟩 nvcc12.0           Pass: 100%/5   | Total:  1h 26m | Avg: 17m 18s | Max: 31m 02s | Hits: 344%/1836  
  🟩 nvcc12.5           Pass: 100%/2   | Total: 56m 17s | Avg: 28m 08s | Max: 28m 22s
  🟩 nvcc12.6           Pass: 100%/28  | Total:  7h 46m | Avg: 16m 40s | Max: 38m 35s | Hits: 349%/7344  
🟩 cudacxx_family
  🟩 ClangCUDA          Pass: 100%/2   | Total: 10m 46s | Avg:  5m 23s | Max:  5m 24s
  🟩 nvcc               Pass: 100%/35  | Total: 10h 09m | Avg: 17m 25s | Max: 38m 35s | Hits: 348%/9180  
🟩 cxx
  🟩 Clang14            Pass: 100%/4   | Total: 54m 07s | Avg: 13m 31s | Max: 14m 07s
  🟩 Clang15            Pass: 100%/1   | Total: 13m 18s | Avg: 13m 18s | Max: 13m 18s
  🟩 Clang16            Pass: 100%/1   | Total: 12m 40s | Avg: 12m 40s | Max: 12m 40s
  🟩 Clang17            Pass: 100%/1   | Total: 12m 47s | Avg: 12m 47s | Max: 12m 47s
  🟩 Clang18            Pass: 100%/7   | Total:  1h 07m | Avg:  9m 40s | Max: 13m 25s
  🟩 GCC7               Pass: 100%/2   | Total: 28m 50s | Avg: 14m 25s | Max: 14m 55s
  🟩 GCC8               Pass: 100%/1   | Total: 13m 53s | Avg: 13m 53s | Max: 13m 53s
  🟩 GCC9               Pass: 100%/2   | Total: 28m 23s | Avg: 14m 11s | Max: 14m 30s
  🟩 GCC10              Pass: 100%/1   | Total: 14m 05s | Avg: 14m 05s | Max: 14m 05s
  🟩 GCC11              Pass: 100%/1   | Total: 14m 04s | Avg: 14m 04s | Max: 14m 04s
  🟩 GCC12              Pass: 100%/1   | Total: 13m 43s | Avg: 13m 43s | Max: 13m 43s
  🟩 GCC13              Pass: 100%/8   | Total:  1h 54m | Avg: 14m 17s | Max: 29m 36s
  🟩 MSVC14.29          Pass: 100%/2   | Total:  1h 05m | Avg: 32m 54s | Max: 34m 46s | Hits: 344%/3672  
  🟩 MSVC14.39          Pass: 100%/3   | Total:  1h 50m | Avg: 36m 51s | Max: 38m 35s | Hits: 351%/5508  
  🟩 NVHPC24.7          Pass: 100%/2   | Total: 56m 17s | Avg: 28m 08s | Max: 28m 22s
🟩 cxx_family
  🟩 Clang              Pass: 100%/14  | Total:  2h 40m | Avg: 11m 28s | Max: 14m 07s
  🟩 GCC                Pass: 100%/16  | Total:  3h 47m | Avg: 14m 12s | Max: 29m 36s
  🟩 MSVC               Pass: 100%/5   | Total:  2h 56m | Avg: 35m 16s | Max: 38m 35s | Hits: 348%/9180  
  🟩 NVHPC              Pass: 100%/2   | Total: 56m 17s | Avg: 28m 08s | Max: 28m 22s
🟩 gpu
  🟩 v100               Pass: 100%/37  | Total: 10h 20m | Avg: 16m 46s | Max: 38m 35s | Hits: 348%/9180  
🟩 jobs
  🟩 Build              Pass: 100%/31  | Total:  8h 32m | Avg: 16m 31s | Max: 38m 35s | Hits: 344%/7344  
  🟩 TestCPU            Pass: 100%/3   | Total: 52m 44s | Avg: 17m 34s | Max: 37m 44s | Hits: 365%/1836  
  🟩 TestGPU            Pass: 100%/3   | Total: 55m 24s | Avg: 18m 28s | Max: 29m 36s
🟩 sm
  🟩 90a                Pass: 100%/1   | Total:  8m 25s | Avg:  8m 25s | Max:  8m 25s
🟩 std
  🟩 17                 Pass: 100%/14  | Total:  4h 18m | Avg: 18m 27s | Max: 34m 46s | Hits: 344%/5508  
  🟩 20                 Pass: 100%/21  | Total:  5h 36m | Avg: 16m 00s | Max: 38m 35s | Hits: 354%/3672

🟩 cccl_c_parallel: Pass: 100%/2 | Total: 11m 23s | Avg: 5m 41s | Max: 9m 20s

🟩 cpu
  🟩 amd64              Pass: 100%/2   | Total: 11m 23s | Avg:  5m 41s | Max:  9m 20s
🟩 ctk
  🟩 12.6               Pass: 100%/2   | Total: 11m 23s | Avg:  5m 41s | Max:  9m 20s
🟩 cudacxx
  🟩 nvcc12.6           Pass: 100%/2   | Total: 11m 23s | Avg:  5m 41s | Max:  9m 20s
🟩 cudacxx_family
  🟩 nvcc               Pass: 100%/2   | Total: 11m 23s | Avg:  5m 41s | Max:  9m 20s
🟩 cxx
  🟩 GCC13              Pass: 100%/2   | Total: 11m 23s | Avg:  5m 41s | Max:  9m 20s
🟩 cxx_family
  🟩 GCC                Pass: 100%/2   | Total: 11m 23s | Avg:  5m 41s | Max:  9m 20s
🟩 gpu
  🟩 v100               Pass: 100%/2   | Total: 11m 23s | Avg:  5m 41s | Max:  9m 20s
🟩 jobs
  🟩 Build              Pass: 100%/1   | Total:  2m 03s | Avg:  2m 03s | Max:  2m 03s
  🟩 Test               Pass: 100%/1   | Total:  9m 20s | Avg:  9m 20s | Max:  9m 20s

🟩 python: Pass: 100%/1 | Total: 41m 46s | Avg: 41m 46s | Max: 41m 46s

🟩 cpu
  🟩 amd64              Pass: 100%/1   | Total: 41m 46s | Avg: 41m 46s | Max: 41m 46s
🟩 ctk
  🟩 12.6               Pass: 100%/1   | Total: 41m 46s | Avg: 41m 46s | Max: 41m 46s
🟩 cudacxx
  🟩 nvcc12.6           Pass: 100%/1   | Total: 41m 46s | Avg: 41m 46s | Max: 41m 46s
🟩 cudacxx_family
  🟩 nvcc               Pass: 100%/1   | Total: 41m 46s | Avg: 41m 46s | Max: 41m 46s
🟩 cxx
  🟩 GCC13              Pass: 100%/1   | Total: 41m 46s | Avg: 41m 46s | Max: 41m 46s
🟩 cxx_family
  🟩 GCC                Pass: 100%/1   | Total: 41m 46s | Avg: 41m 46s | Max: 41m 46s
🟩 gpu
  🟩 v100               Pass: 100%/1   | Total: 41m 46s | Avg: 41m 46s | Max: 41m 46s
🟩 jobs
  🟩 Test               Pass: 100%/1   | Total: 41m 46s | Avg: 41m 46s | Max: 41m 46s

👃 Inspect Changes

Modifications in project?

	Project
	CCCL Infrastructure
	libcu++
+/-	CUB
+/-	Thrust
	CUDA Experimental
	python
	CCCL C Parallel Library
	Catch2Helper

Modifications in project or dependencies?

	Project
	CCCL Infrastructure
	libcu++
+/-	CUB
+/-	Thrust
	CUDA Experimental
+/-	python
+/-	CCCL C Parallel Library
+/-	Catch2Helper

🏃‍ Runner counts (total jobs: 78)

#	Runner
53	`linux-amd64-cpu16`
11	`linux-amd64-gpu-v100-latest-1`
9	`windows-amd64-cpu16`
4	`linux-arm64-cpu16`
1	`linux-amd64-gpu-h100-latest-1-testing`

…ispatchMergeSort` (NVIDIA#3437)

* uses thrust's dynamic dispatch for merge_sort
* [pre-commit.ci] auto code formatting

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>



elstehle requested review from a team as code owners January 17, 2025 13:14

elstehle requested a review from bernhardmgruber January 17, 2025 13:14

miscco approved these changes Jan 20, 2025

thrust/thrust/system/cuda/detail/sort.h

uses thrust's dynamic dispatch for merge_sort

13ddb58
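
For readers unfamiliar with the idea behind this commit, here is a rough sketch of how a runtime choice between unsigned offset types could look. It is plain host C++, not thrust's or CUB's actual code: `sort_with_offset` and `dispatch_sort` are hypothetical names, and `std::sort` stands in for the real merge sort kernels. The point is only that the algorithm is instantiated for `std::uint32_t` and `std::uint64_t`, and the item count decides which instantiation runs.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <type_traits>
#include <vector>

// Hypothetical stand-in for an algorithm templated on its offset type.
// Sorts the first `num_items` elements and returns the offset type's width
// in bytes so the caller can see which instantiation was selected.
template <typename OffsetT>
std::size_t sort_with_offset(std::vector<int>& data, OffsetT num_items)
{
  static_assert(std::is_unsigned<OffsetT>::value, "offset type must be unsigned");
  std::sort(data.begin(), data.begin() + static_cast<std::ptrdiff_t>(num_items));
  return sizeof(OffsetT);
}

// Hypothetical runtime dispatch: use a 32-bit unsigned offset when the item
// count fits, and fall back to a 64-bit unsigned offset otherwise.
inline std::size_t dispatch_sort(std::vector<int>& data)
{
  const std::uint64_t count = data.size();
  if (count <= std::numeric_limits<std::uint32_t>::max())
  {
    return sort_with_offset(data, static_cast<std::uint32_t>(count));
  }
  return sort_with_offset(data, count);
}
```

Calling `dispatch_sort(v)` on a small vector takes the 32-bit path. Only two instantiations exist, so tuning done for unsigned offset types applies to both.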

elstehle force-pushed the enh/thrust-merge-sort-offset-types branch from b6472f9 to 13ddb58 Compare January 20, 2025 13:37

[pre-commit.ci] auto code formatting

d68b947

bernhardmgruber approved these changes Jan 21, 2025

elstehle merged commit 46ad4c1 into NVIDIA:main Jan 21, 2025
90 of 93 checks passed

github-actions bot commented Jan 17, 2025

🟩 cub: Pass: 100%/38 | Total: 23h 52m | Avg: 37m 42s | Max: 1h 01m | Hits: 523%/3540

🟩 thrust: Pass: 100%/37 | Total: 10h 27m | Avg: 16m 57s | Max: 40m 33s | Hits: 343%/9220

🟩 cccl_c_parallel: Pass: 100%/2 | Total: 8m 50s | Avg: 4m 25s | Max: 6m 50s

🟩 python: Pass: 100%/1 | Total: 41m 06s | Avg: 41m 06s | Max: 41m 06s

github-actions bot commented Jan 20, 2025

🟩 cub: Pass: 100%/38 | Total: 1d 00h | Avg: 38m 57s | Max: 1h 03m | Hits: 523%/3540

🟩 thrust: Pass: 100%/37 | Total: 10h 29m | Avg: 17m 00s | Max: 44m 13s | Hits: 342%/9180

🟩 cccl_c_parallel: Pass: 100%/2 | Total: 9m 15s | Avg: 4m 37s | Max: 7m 09s

🟩 python: Pass: 100%/1 | Total: 41m 31s | Avg: 41m 31s | Max: 41m 31s

elstehle commented Jan 21, 2025

copy-pr-bot bot commented Jan 21, 2025

elstehle commented Jan 21, 2025

github-actions bot commented Jan 21, 2025

🟩 cub: Pass: 100%/38 | Total: 22h 57m | Avg: 36m 14s | Max: 1h 04m | Hits: 523%/3540

🟩 thrust: Pass: 100%/37 | Total: 10h 20m | Avg: 16m 46s | Max: 38m 35s | Hits: 348%/9180

🟩 cccl_c_parallel: Pass: 100%/2 | Total: 11m 23s | Avg: 5m 41s | Max: 9m 20s

🟩 python: Pass: 100%/1 | Total: 41m 46s | Avg: 41m 46s | Max: 41m 46s

3 participants
