
Conversation

@elstehle (Contributor) commented on Jan 6, 2025

Description

Closes #3222

Reduces the per-test run time from six minutes to six seconds.

Once this PR is merged, I'm planning to apply a similar approach to DeviceSegmentedRadixSort in #3245.

This PR touches two tests:

  1. The test for verifying that large segments are sorted correctly
  2. The test for verifying that a large number of segments are sorted correctly

For (1), we switched from invoking std::stable_sort to verify that the items were sorted correctly to using histograms over the input items. This lowered the per-test-instance run time from six minutes to six seconds for these tests.
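To illustrate the idea, here is a minimal host-side sketch of a histogram-based check (not the actual test code; `verify_sorted_output` and the 256-value key domain are assumptions for illustration). Instead of sorting a reference copy with std::stable_sort, we only need the per-key counts of the input and a single pass over the output.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: checks that `output` is the sorted permutation of `input`
// by comparing it against the histogram of the input keys.
// `num_key_values` is an assumed bound on the key domain (e.g., 256 for uint8_t keys).
bool verify_sorted_output(const std::vector<std::uint8_t>& input,
                          const std::vector<std::uint8_t>& output,
                          std::size_t num_key_values = 256)
{
  if (input.size() != output.size())
  {
    return false;
  }

  // Histogram over the input keys
  std::vector<std::size_t> histogram(num_key_values, 0);
  for (auto key : input)
  {
    ++histogram[key];
  }

  // The sorted output must consist of histogram[0] copies of key 0,
  // followed by histogram[1] copies of key 1, and so on.
  std::size_t offset = 0;
  for (std::size_t key = 0; key < num_key_values; ++key)
  {
    for (std::size_t i = 0; i < histogram[key]; ++i, ++offset)
    {
      if (output[offset] != key)
      {
        return false;
      }
    }
  }
  return true;
}
```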

For (2), (a) the tests never finished and (b) segment generation produced overlapping segments, which led to test failures, because overlapping segments create a race on which of the segments pointing to the same output region would be sorted first. So, we switched from generating random inputs to generating the repeating sequence 0, 1, 2, ..., max_histo_size-1, 0, 1, 2, .... We chunk this input sequence into segments of a fixed size, say, every 1000 items. We then use an analytical model to compute the histogram over the input values of a given segment and use that histogram to determine what the sorted output of that segment must look like. E.g., if we know 0 is repeated four times in the first segment, we know the sorted sequence must start with four 0s and, beginning at offset four, continue with key 1. And so on.
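Below is a minimal sketch of such an analytical model, under assumed names (`key_count_in_segment` and `expected_key_at` are hypothetical helpers for illustration, not the PR's actual code): for the repeating sequence 0, 1, ..., max_histo_size-1, the histogram of any fixed-size segment has a closed form, and the expected key at a given output offset follows from its prefix sums.

```cpp
#include <cstddef>

// Number of occurrences of `key` within the global index range
// [segment_begin, segment_begin + segment_size) of the repeating sequence
// 0, 1, ..., max_histo_size - 1, 0, 1, ...
std::size_t key_count_in_segment(std::size_t key,
                                 std::size_t segment_begin,
                                 std::size_t segment_size,
                                 std::size_t max_histo_size)
{
  // Occurrences of `key` in the prefix [0, end) of the repeating sequence:
  // one per full period, plus one if the partial period already covers `key`.
  auto count_below = [&](std::size_t end) {
    return end / max_histo_size + ((end % max_histo_size) > key ? 1 : 0);
  };
  return count_below(segment_begin + segment_size) - count_below(segment_begin);
}

// Expected key at `output_offset` within the sorted output of that segment:
// walk the analytical histogram until its prefix sum exceeds the offset.
std::size_t expected_key_at(std::size_t output_offset,
                            std::size_t segment_begin,
                            std::size_t segment_size,
                            std::size_t max_histo_size)
{
  std::size_t prefix = 0;
  for (std::size_t key = 0; key < max_histo_size; ++key)
  {
    prefix += key_count_in_segment(key, segment_begin, segment_size, max_histo_size);
    if (output_offset < prefix)
    {
      return key;
    }
  }
  return max_histo_size; // unreachable for valid offsets
}
```

For example, with max_histo_size = 4 and a 1000-item segment starting at global offset 0, each key occurs 250 times, so expected_key_at(0, 0, 1000, 4) yields 0 and expected_key_at(250, 0, 1000, 4) yields 1.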

Checklist

  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@elstehle requested review from a team as code owners on January 6, 2025 20:24
github-actions bot (Contributor) commented on Jan 6, 2025

🟩 CI finished in 1h 07m: Pass: 100%/96 | Total: 20h 57m | Avg: 13m 05s | Max: 42m 53s | Hits: 98%/12392
  • 🟩 cub: Pass: 100%/47 | Total: 13h 49m | Avg: 17m 39s | Max: 38m 22s | Hits: 94%/3132

    🟩 cpu
      🟩 amd64              Pass: 100%/45  | Total: 13h 17m | Avg: 17m 43s | Max: 38m 22s | Hits:  94%/3132  
      🟩 arm64              Pass: 100%/2   | Total: 32m 40s | Avg: 16m 20s | Max: 16m 28s
    🟩 ctk
      🟩 11.1               Pass: 100%/7   | Total:  2h 09m | Avg: 18m 32s | Max: 38m 22s | Hits:  93%/783   
      🟩 12.5               Pass: 100%/2   | Total: 44m 38s | Avg: 22m 19s | Max: 22m 44s
      🟩 12.6               Pass: 100%/38  | Total: 10h 55m | Avg: 17m 15s | Max: 28m 52s | Hits:  94%/2349  
    🟩 cudacxx
      🟩 ClangCUDA18        Pass: 100%/2   | Total: 23m 14s | Avg: 11m 37s | Max: 11m 42s
      🟩 nvcc11.1           Pass: 100%/7   | Total:  2h 09m | Avg: 18m 32s | Max: 38m 22s | Hits:  93%/783   
      🟩 nvcc12.5           Pass: 100%/2   | Total: 44m 38s | Avg: 22m 19s | Max: 22m 44s
      🟩 nvcc12.6           Pass: 100%/36  | Total: 10h 32m | Avg: 17m 33s | Max: 28m 52s | Hits:  94%/2349  
    🟩 cudacxx_family
      🟩 ClangCUDA          Pass: 100%/2   | Total: 23m 14s | Avg: 11m 37s | Max: 11m 42s
      🟩 nvcc               Pass: 100%/45  | Total: 13h 26m | Avg: 17m 55s | Max: 38m 22s | Hits:  94%/3132  
    🟩 cxx
      🟩 Clang9             Pass: 100%/4   | Total:  1h 04m | Avg: 16m 11s | Max: 18m 13s
      🟩 Clang10            Pass: 100%/1   | Total: 16m 51s | Avg: 16m 51s | Max: 16m 51s
      🟩 Clang11            Pass: 100%/1   | Total: 16m 55s | Avg: 16m 55s | Max: 16m 55s
      🟩 Clang12            Pass: 100%/1   | Total: 15m 03s | Avg: 15m 03s | Max: 15m 03s
      🟩 Clang13            Pass: 100%/1   | Total: 14m 34s | Avg: 14m 34s | Max: 14m 34s
      🟩 Clang14            Pass: 100%/1   | Total: 15m 55s | Avg: 15m 55s | Max: 15m 55s
      🟩 Clang15            Pass: 100%/1   | Total: 16m 48s | Avg: 16m 48s | Max: 16m 48s
      🟩 Clang16            Pass: 100%/1   | Total: 17m 07s | Avg: 17m 07s | Max: 17m 07s
      🟩 Clang17            Pass: 100%/1   | Total: 16m 19s | Avg: 16m 19s | Max: 16m 19s
      🟩 Clang18            Pass: 100%/7   | Total:  1h 48m | Avg: 15m 31s | Max: 20m 34s
      🟩 GCC6               Pass: 100%/2   | Total: 31m 50s | Avg: 15m 55s | Max: 16m 01s
      🟩 GCC7               Pass: 100%/2   | Total: 34m 32s | Avg: 17m 16s | Max: 17m 50s
      🟩 GCC8               Pass: 100%/1   | Total: 16m 02s | Avg: 16m 02s | Max: 16m 02s
      🟩 GCC9               Pass: 100%/3   | Total: 47m 32s | Avg: 15m 50s | Max: 17m 21s
      🟩 GCC10              Pass: 100%/1   | Total: 15m 48s | Avg: 15m 48s | Max: 15m 48s
      🟩 GCC11              Pass: 100%/1   | Total: 15m 47s | Avg: 15m 47s | Max: 15m 47s
      🟩 GCC12              Pass: 100%/3   | Total: 42m 00s | Avg: 14m 00s | Max: 17m 43s
      🟩 GCC13              Pass: 100%/8   | Total:  2h 19m | Avg: 17m 25s | Max: 25m 36s
      🟩 Intel2023.2.0      Pass: 100%/1   | Total: 18m 58s | Avg: 18m 58s | Max: 18m 58s
      🟩 MSVC14.16          Pass: 100%/1   | Total: 38m 22s | Avg: 38m 22s | Max: 38m 22s | Hits:  93%/783   
      🟩 MSVC14.29          Pass: 100%/1   | Total: 27m 27s | Avg: 27m 27s | Max: 27m 27s | Hits:  93%/783   
      🟩 MSVC14.39          Pass: 100%/2   | Total: 54m 39s | Avg: 27m 19s | Max: 28m 52s | Hits:  94%/1566  
      🟩 NVHPC24.7          Pass: 100%/2   | Total: 44m 38s | Avg: 22m 19s | Max: 22m 44s
    🟩 cxx_family
      🟩 Clang              Pass: 100%/19  | Total:  5h 02m | Avg: 15m 56s | Max: 20m 34s
      🟩 GCC                Pass: 100%/21  | Total:  5h 42m | Avg: 16m 19s | Max: 25m 36s
      🟩 Intel              Pass: 100%/1   | Total: 18m 58s | Avg: 18m 58s | Max: 18m 58s
      🟩 MSVC               Pass: 100%/4   | Total:  2h 00m | Avg: 30m 07s | Max: 38m 22s | Hits:  94%/3132  
      🟩 NVHPC              Pass: 100%/2   | Total: 44m 38s | Avg: 22m 19s | Max: 22m 44s
    🟩 gpu
      🟩 h100               Pass: 100%/2   | Total: 24m 17s | Avg: 12m 08s | Max: 15m 57s
      🟩 v100               Pass: 100%/45  | Total: 13h 25m | Avg: 17m 54s | Max: 38m 22s | Hits:  94%/3132  
    🟩 jobs
      🟩 Build              Pass: 100%/40  | Total: 11h 33m | Avg: 17m 20s | Max: 38m 22s | Hits:  94%/3132  
      🟩 DeviceLaunch       Pass: 100%/1   | Total: 20m 38s | Avg: 20m 38s | Max: 20m 38s
      🟩 GraphCapture       Pass: 100%/1   | Total: 14m 48s | Avg: 14m 48s | Max: 14m 48s
      🟩 HostLaunch         Pass: 100%/3   | Total: 54m 24s | Avg: 18m 08s | Max: 20m 00s
      🟩 TestGPU            Pass: 100%/2   | Total: 46m 10s | Avg: 23m 05s | Max: 25m 36s
    🟩 sm
      🟩 90                 Pass: 100%/2   | Total: 24m 17s | Avg: 12m 08s | Max: 15m 57s
      🟩 90a                Pass: 100%/1   | Total:  8m 53s | Avg:  8m 53s | Max:  8m 53s
    🟩 std
      🟩 11                 Pass: 100%/5   | Total:  1h 21m | Avg: 16m 15s | Max: 17m 50s
      🟩 14                 Pass: 100%/4   | Total:  1h 29m | Avg: 22m 19s | Max: 38m 22s | Hits:  93%/783   
      🟩 17                 Pass: 100%/12  | Total:  3h 37m | Avg: 18m 06s | Max: 27m 27s | Hits:  93%/1566  
      🟩 20                 Pass: 100%/26  | Total:  7h 22m | Avg: 17m 00s | Max: 28m 52s | Hits:  94%/783   
    
  • 🟩 thrust: Pass: 100%/46 | Total: 6h 31m | Avg: 8m 30s | Max: 42m 53s | Hits: 99%/9260

    🟩 cmake_options
      🟩 -DTHRUST_DISPATCH_TYPE=Force32bit Pass: 100%/2   | Total: 19m 17s | Avg:  9m 38s | Max: 12m 57s
    🟩 cpu
      🟩 amd64              Pass: 100%/44  | Total:  6h 21m | Avg:  8m 40s | Max: 42m 53s | Hits:  99%/9260  
      🟩 arm64              Pass: 100%/2   | Total:  9m 42s | Avg:  4m 51s | Max:  5m 13s
    🟩 ctk
      🟩 11.1               Pass: 100%/7   | Total: 44m 17s | Avg:  6m 19s | Max: 18m 28s | Hits:  99%/1852  
      🟩 12.5               Pass: 100%/2   | Total: 29m 19s | Avg: 14m 39s | Max: 15m 25s
      🟩 12.6               Pass: 100%/37  | Total:  5h 17m | Avg:  8m 34s | Max: 42m 53s | Hits:  99%/7408  
    🟩 cudacxx
      🟩 ClangCUDA18        Pass: 100%/2   | Total: 10m 05s | Avg:  5m 02s | Max:  5m 05s
      🟩 nvcc11.1           Pass: 100%/7   | Total: 44m 17s | Avg:  6m 19s | Max: 18m 28s | Hits:  99%/1852  
      🟩 nvcc12.5           Pass: 100%/2   | Total: 29m 19s | Avg: 14m 39s | Max: 15m 25s
      🟩 nvcc12.6           Pass: 100%/35  | Total:  5h 07m | Avg:  8m 46s | Max: 42m 53s | Hits:  99%/7408  
    🟩 cudacxx_family
      🟩 ClangCUDA          Pass: 100%/2   | Total: 10m 05s | Avg:  5m 02s | Max:  5m 05s
      🟩 nvcc               Pass: 100%/44  | Total:  6h 21m | Avg:  8m 39s | Max: 42m 53s | Hits:  99%/9260  
    🟩 cxx
      🟩 Clang9             Pass: 100%/4   | Total: 20m 50s | Avg:  5m 12s | Max:  6m 14s
      🟩 Clang10            Pass: 100%/1   | Total:  6m 32s | Avg:  6m 32s | Max:  6m 32s
      🟩 Clang11            Pass: 100%/1   | Total:  5m 09s | Avg:  5m 09s | Max:  5m 09s
      🟩 Clang12            Pass: 100%/1   | Total:  4m 57s | Avg:  4m 57s | Max:  4m 57s
      🟩 Clang13            Pass: 100%/1   | Total:  5m 00s | Avg:  5m 00s | Max:  5m 00s
      🟩 Clang14            Pass: 100%/1   | Total:  4m 58s | Avg:  4m 58s | Max:  4m 58s
      🟩 Clang15            Pass: 100%/1   | Total:  5m 32s | Avg:  5m 32s | Max:  5m 32s
      🟩 Clang16            Pass: 100%/1   | Total:  5m 24s | Avg:  5m 24s | Max:  5m 24s
      🟩 Clang17            Pass: 100%/1   | Total:  5m 49s | Avg:  5m 49s | Max:  5m 49s
      🟩 Clang18            Pass: 100%/7   | Total: 44m 09s | Avg:  6m 18s | Max: 10m 39s
      🟩 GCC6               Pass: 100%/2   | Total:  8m 12s | Avg:  4m 06s | Max:  4m 31s
      🟩 GCC7               Pass: 100%/2   | Total: 10m 22s | Avg:  5m 11s | Max:  5m 27s
      🟩 GCC8               Pass: 100%/1   | Total:  5m 39s | Avg:  5m 39s | Max:  5m 39s
      🟩 GCC9               Pass: 100%/3   | Total: 14m 52s | Avg:  4m 57s | Max:  6m 05s
      🟩 GCC10              Pass: 100%/1   | Total:  5m 09s | Avg:  5m 09s | Max:  5m 09s
      🟩 GCC11              Pass: 100%/1   | Total: 42m 53s | Avg: 42m 53s | Max: 42m 53s
      🟩 GCC12              Pass: 100%/1   | Total:  5m 59s | Avg:  5m 59s | Max:  5m 59s
      🟩 GCC13              Pass: 100%/8   | Total:  1h 01m | Avg:  7m 42s | Max: 12m 57s
      🟩 Intel2023.2.0      Pass: 100%/1   | Total:  7m 04s | Avg:  7m 04s | Max:  7m 04s
      🟩 MSVC14.16          Pass: 100%/1   | Total: 18m 28s | Avg: 18m 28s | Max: 18m 28s | Hits:  99%/1852  
      🟩 MSVC14.29          Pass: 100%/1   | Total: 16m 27s | Avg: 16m 27s | Max: 16m 27s | Hits:  99%/1852  
      🟩 MSVC14.39          Pass: 100%/3   | Total: 56m 39s | Avg: 18m 53s | Max: 22m 33s | Hits:  99%/5556  
      🟩 NVHPC24.7          Pass: 100%/2   | Total: 29m 19s | Avg: 14m 39s | Max: 15m 25s
    🟩 cxx_family
      🟩 Clang              Pass: 100%/19  | Total:  1h 48m | Avg:  5m 42s | Max: 10m 39s
      🟩 GCC                Pass: 100%/19  | Total:  2h 34m | Avg:  8m 08s | Max: 42m 53s
      🟩 Intel              Pass: 100%/1   | Total:  7m 04s | Avg:  7m 04s | Max:  7m 04s
      🟩 MSVC               Pass: 100%/5   | Total:  1h 31m | Avg: 18m 18s | Max: 22m 33s | Hits:  99%/9260  
      🟩 NVHPC              Pass: 100%/2   | Total: 29m 19s | Avg: 14m 39s | Max: 15m 25s
    🟩 gpu
      🟩 v100               Pass: 100%/46  | Total:  6h 31m | Avg:  8m 30s | Max: 42m 53s | Hits:  99%/9260  
    🟩 jobs
      🟩 Build              Pass: 100%/40  | Total:  5h 16m | Avg:  7m 55s | Max: 42m 53s | Hits:  99%/7408  
      🟩 TestCPU            Pass: 100%/3   | Total: 38m 48s | Avg: 12m 56s | Max: 22m 33s | Hits:  99%/1852  
      🟩 TestGPU            Pass: 100%/3   | Total: 35m 25s | Avg: 11m 48s | Max: 12m 57s
    🟩 sm
      🟩 90a                Pass: 100%/1   | Total:  4m 39s | Avg:  4m 39s | Max:  4m 39s
    🟩 std
      🟩 11                 Pass: 100%/5   | Total: 22m 43s | Avg:  4m 32s | Max:  5m 46s
      🟩 14                 Pass: 100%/4   | Total: 34m 40s | Avg:  8m 40s | Max: 18m 28s | Hits:  99%/1852  
      🟩 17                 Pass: 100%/12  | Total:  1h 39m | Avg:  8m 17s | Max: 17m 42s | Hits:  99%/3704  
      🟩 20                 Pass: 100%/23  | Total:  3h 34m | Avg:  9m 20s | Max: 42m 53s | Hits:  99%/3704  
    
  • 🟩 cccl_c_parallel: Pass: 100%/2 | Total: 8m 50s | Avg: 4m 25s | Max: 6m 40s

    🟩 cpu
      🟩 amd64              Pass: 100%/2   | Total:  8m 50s | Avg:  4m 25s | Max:  6m 40s
    🟩 ctk
      🟩 12.6               Pass: 100%/2   | Total:  8m 50s | Avg:  4m 25s | Max:  6m 40s
    🟩 cudacxx
      🟩 nvcc12.6           Pass: 100%/2   | Total:  8m 50s | Avg:  4m 25s | Max:  6m 40s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/2   | Total:  8m 50s | Avg:  4m 25s | Max:  6m 40s
    🟩 cxx
      🟩 GCC13              Pass: 100%/2   | Total:  8m 50s | Avg:  4m 25s | Max:  6m 40s
    🟩 cxx_family
      🟩 GCC                Pass: 100%/2   | Total:  8m 50s | Avg:  4m 25s | Max:  6m 40s
    🟩 gpu
      🟩 v100               Pass: 100%/2   | Total:  8m 50s | Avg:  4m 25s | Max:  6m 40s
    🟩 jobs
      🟩 Build              Pass: 100%/1   | Total:  2m 10s | Avg:  2m 10s | Max:  2m 10s
      🟩 Test               Pass: 100%/1   | Total:  6m 40s | Avg:  6m 40s | Max:  6m 40s
    
  • 🟩 python: Pass: 100%/1 | Total: 27m 07s | Avg: 27m 07s | Max: 27m 07s

    🟩 cpu
      🟩 amd64              Pass: 100%/1   | Total: 27m 07s | Avg: 27m 07s | Max: 27m 07s
    🟩 ctk
      🟩 12.6               Pass: 100%/1   | Total: 27m 07s | Avg: 27m 07s | Max: 27m 07s
    🟩 cudacxx
      🟩 nvcc12.6           Pass: 100%/1   | Total: 27m 07s | Avg: 27m 07s | Max: 27m 07s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/1   | Total: 27m 07s | Avg: 27m 07s | Max: 27m 07s
    🟩 cxx
      🟩 GCC13              Pass: 100%/1   | Total: 27m 07s | Avg: 27m 07s | Max: 27m 07s
    🟩 cxx_family
      🟩 GCC                Pass: 100%/1   | Total: 27m 07s | Avg: 27m 07s | Max: 27m 07s
    🟩 gpu
      🟩 v100               Pass: 100%/1   | Total: 27m 07s | Avg: 27m 07s | Max: 27m 07s
    🟩 jobs
      🟩 Test               Pass: 100%/1   | Total: 27m 07s | Avg: 27m 07s | Max: 27m 07s
    

👃 Inspect Changes

Modifications in project?

Project
CCCL Infrastructure
libcu++
+/- CUB
Thrust
CUDA Experimental
python
CCCL C Parallel Library
Catch2Helper

Modifications in project or dependencies?

Project
CCCL Infrastructure
libcu++
+/- CUB
+/- Thrust
CUDA Experimental
+/- python
+/- CCCL C Parallel Library
+/- Catch2Helper

🏃‍ Runner counts (total jobs: 96)

# Runner
71 linux-amd64-cpu16
11 linux-amd64-gpu-v100-latest-1
9 windows-amd64-cpu16
4 linux-arm64-cpu16
1 linux-amd64-gpu-h100-latest-1-testing

github-actions bot (Contributor) commented on Jan 7, 2025

🟩 CI finished in 1h 53m: Pass: 100%/96 | Total: 13h 56m | Avg: 8m 42s | Max: 35m 43s | Hits: 99%/12392
  • 🟩 cub: Pass: 100%/47 | Total: 6h 37m | Avg: 8m 26s | Max: 24m 19s | Hits: 99%/3132

    🟩 cpu
      🟩 amd64              Pass: 100%/45  | Total:  6h 26m | Avg:  8m 35s | Max: 24m 19s | Hits:  99%/3132  
      🟩 arm64              Pass: 100%/2   | Total: 10m 03s | Avg:  5m 01s | Max:  5m 10s
    🟩 ctk
      🟩 11.1               Pass: 100%/7   | Total: 41m 20s | Avg:  5m 54s | Max: 15m 11s | Hits:  99%/783   
      🟩 12.5               Pass: 100%/2   | Total: 18m 21s | Avg:  9m 10s | Max:  9m 14s
      🟩 12.6               Pass: 100%/38  | Total:  5h 37m | Avg:  8m 52s | Max: 24m 19s | Hits:  99%/2349  
    🟩 cudacxx
      🟩 ClangCUDA18        Pass: 100%/2   | Total:  8m 34s | Avg:  4m 17s | Max:  4m 19s
      🟩 nvcc11.1           Pass: 100%/7   | Total: 41m 20s | Avg:  5m 54s | Max: 15m 11s | Hits:  99%/783   
      🟩 nvcc12.5           Pass: 100%/2   | Total: 18m 21s | Avg:  9m 10s | Max:  9m 14s
      🟩 nvcc12.6           Pass: 100%/36  | Total:  5h 28m | Avg:  9m 07s | Max: 24m 19s | Hits:  99%/2349  
    🟩 cudacxx_family
      🟩 ClangCUDA          Pass: 100%/2   | Total:  8m 34s | Avg:  4m 17s | Max:  4m 19s
      🟩 nvcc               Pass: 100%/45  | Total:  6h 28m | Avg:  8m 37s | Max: 24m 19s | Hits:  99%/3132  
    🟩 cxx
      🟩 Clang9             Pass: 100%/4   | Total: 21m 29s | Avg:  5m 22s | Max:  6m 41s
      🟩 Clang10            Pass: 100%/1   | Total:  6m 27s | Avg:  6m 27s | Max:  6m 27s
      🟩 Clang11            Pass: 100%/1   | Total:  5m 29s | Avg:  5m 29s | Max:  5m 29s
      🟩 Clang12            Pass: 100%/1   | Total:  5m 17s | Avg:  5m 17s | Max:  5m 17s
      🟩 Clang13            Pass: 100%/1   | Total:  5m 45s | Avg:  5m 45s | Max:  5m 45s
      🟩 Clang14            Pass: 100%/1   | Total:  5m 22s | Avg:  5m 22s | Max:  5m 22s
      🟩 Clang15            Pass: 100%/1   | Total:  5m 58s | Avg:  5m 58s | Max:  5m 58s
      🟩 Clang16            Pass: 100%/1   | Total:  5m 49s | Avg:  5m 49s | Max:  5m 49s
      🟩 Clang17            Pass: 100%/1   | Total:  5m 25s | Avg:  5m 25s | Max:  5m 25s
      🟩 Clang18            Pass: 100%/7   | Total:  1h 06m | Avg:  9m 27s | Max: 24m 19s
      🟩 GCC6               Pass: 100%/2   | Total:  8m 44s | Avg:  4m 22s | Max:  4m 27s
      🟩 GCC7               Pass: 100%/2   | Total: 11m 27s | Avg:  5m 43s | Max:  6m 12s
      🟩 GCC8               Pass: 100%/1   | Total:  5m 11s | Avg:  5m 11s | Max:  5m 11s
      🟩 GCC9               Pass: 100%/3   | Total: 14m 14s | Avg:  4m 44s | Max:  5m 39s
      🟩 GCC10              Pass: 100%/1   | Total:  5m 17s | Avg:  5m 17s | Max:  5m 17s
      🟩 GCC11              Pass: 100%/1   | Total:  5m 39s | Avg:  5m 39s | Max:  5m 39s
      🟩 GCC12              Pass: 100%/3   | Total: 33m 03s | Avg: 11m 01s | Max: 22m 56s
      🟩 GCC13              Pass: 100%/8   | Total:  1h 40m | Avg: 12m 31s | Max: 23m 16s
      🟩 Intel2023.2.0      Pass: 100%/1   | Total:  6m 53s | Avg:  6m 53s | Max:  6m 53s
      🟩 MSVC14.16          Pass: 100%/1   | Total: 15m 11s | Avg: 15m 11s | Max: 15m 11s | Hits:  99%/783   
      🟩 MSVC14.29          Pass: 100%/1   | Total: 12m 06s | Avg: 12m 06s | Max: 12m 06s | Hits:  99%/783   
      🟩 MSVC14.39          Pass: 100%/2   | Total: 27m 28s | Avg: 13m 44s | Max: 14m 05s | Hits:  99%/1566  
      🟩 NVHPC24.7          Pass: 100%/2   | Total: 18m 21s | Avg:  9m 10s | Max:  9m 14s
    🟩 cxx_family
      🟩 Clang              Pass: 100%/19  | Total:  2h 13m | Avg:  7m 00s | Max: 24m 19s
      🟩 GCC                Pass: 100%/21  | Total:  3h 03m | Avg:  8m 45s | Max: 23m 16s
      🟩 Intel              Pass: 100%/1   | Total:  6m 53s | Avg:  6m 53s | Max:  6m 53s
      🟩 MSVC               Pass: 100%/4   | Total: 54m 45s | Avg: 13m 41s | Max: 15m 11s | Hits:  99%/3132  
      🟩 NVHPC              Pass: 100%/2   | Total: 18m 21s | Avg:  9m 10s | Max:  9m 14s
    🟩 gpu
      🟩 h100               Pass: 100%/2   | Total: 27m 15s | Avg: 13m 37s | Max: 22m 56s
      🟩 v100               Pass: 100%/45  | Total:  6h 09m | Avg:  8m 13s | Max: 24m 19s | Hits:  99%/3132  
    🟩 jobs
      🟩 Build              Pass: 100%/40  | Total:  4h 13m | Avg:  6m 20s | Max: 15m 11s | Hits:  99%/3132  
      🟩 DeviceLaunch       Pass: 100%/1   | Total: 19m 23s | Avg: 19m 23s | Max: 19m 23s
      🟩 GraphCapture       Pass: 100%/1   | Total: 19m 38s | Avg: 19m 38s | Max: 19m 38s
      🟩 HostLaunch         Pass: 100%/3   | Total: 56m 55s | Avg: 18m 58s | Max: 22m 56s
      🟩 TestGPU            Pass: 100%/2   | Total: 47m 35s | Avg: 23m 47s | Max: 24m 19s
    🟩 sm
      🟩 90                 Pass: 100%/2   | Total: 27m 15s | Avg: 13m 37s | Max: 22m 56s
      🟩 90a                Pass: 100%/1   | Total:  4m 44s | Avg:  4m 44s | Max:  4m 44s
    🟩 std
      🟩 11                 Pass: 100%/5   | Total: 24m 02s | Avg:  4m 48s | Max:  5m 58s
      🟩 14                 Pass: 100%/4   | Total: 32m 31s | Avg:  8m 07s | Max: 15m 11s | Hits:  99%/783   
      🟩 17                 Pass: 100%/12  | Total:  1h 24m | Avg:  7m 00s | Max: 14m 05s | Hits:  99%/1566  
      🟩 20                 Pass: 100%/26  | Total:  4h 16m | Avg:  9m 51s | Max: 24m 19s | Hits:  99%/783   
    
  • 🟩 thrust: Pass: 100%/46 | Total: 6h 33m | Avg: 8m 33s | Max: 28m 00s | Hits: 99%/9260

    🟩 cmake_options
      🟩 -DTHRUST_DISPATCH_TYPE=Force32bit Pass: 100%/2   | Total: 21m 24s | Avg: 10m 42s | Max: 15m 22s
    🟩 cpu
      🟩 amd64              Pass: 100%/44  | Total:  6h 23m | Avg:  8m 42s | Max: 28m 00s | Hits:  99%/9260  
      🟩 arm64              Pass: 100%/2   | Total:  9m 59s | Avg:  4m 59s | Max:  5m 28s
    🟩 ctk
      🟩 11.1               Pass: 100%/7   | Total:  1h 10m | Avg: 10m 00s | Max: 28m 00s | Hits:  99%/1852  
      🟩 12.5               Pass: 100%/2   | Total: 32m 12s | Avg: 16m 06s | Max: 16m 13s
      🟩 12.6               Pass: 100%/37  | Total:  4h 51m | Avg:  7m 52s | Max: 21m 46s | Hits:  99%/7408  
    🟩 cudacxx
      🟩 ClangCUDA18        Pass: 100%/2   | Total: 10m 39s | Avg:  5m 19s | Max:  5m 21s
      🟩 nvcc11.1           Pass: 100%/7   | Total:  1h 10m | Avg: 10m 00s | Max: 28m 00s | Hits:  99%/1852  
      🟩 nvcc12.5           Pass: 100%/2   | Total: 32m 12s | Avg: 16m 06s | Max: 16m 13s
      🟩 nvcc12.6           Pass: 100%/35  | Total:  4h 40m | Avg:  8m 00s | Max: 21m 46s | Hits:  99%/7408  
    🟩 cudacxx_family
      🟩 ClangCUDA          Pass: 100%/2   | Total: 10m 39s | Avg:  5m 19s | Max:  5m 21s
      🟩 nvcc               Pass: 100%/44  | Total:  6h 22m | Avg:  8m 41s | Max: 28m 00s | Hits:  99%/9260  
    🟩 cxx
      🟩 Clang9             Pass: 100%/4   | Total: 21m 19s | Avg:  5m 19s | Max:  6m 22s
      🟩 Clang10            Pass: 100%/1   | Total:  6m 40s | Avg:  6m 40s | Max:  6m 40s
      🟩 Clang11            Pass: 100%/1   | Total:  5m 16s | Avg:  5m 16s | Max:  5m 16s
      🟩 Clang12            Pass: 100%/1   | Total:  5m 38s | Avg:  5m 38s | Max:  5m 38s
      🟩 Clang13            Pass: 100%/1   | Total:  5m 24s | Avg:  5m 24s | Max:  5m 24s
      🟩 Clang14            Pass: 100%/1   | Total:  5m 15s | Avg:  5m 15s | Max:  5m 15s
      🟩 Clang15            Pass: 100%/1   | Total:  5m 54s | Avg:  5m 54s | Max:  5m 54s
      🟩 Clang16            Pass: 100%/1   | Total:  5m 50s | Avg:  5m 50s | Max:  5m 50s
      🟩 Clang17            Pass: 100%/1   | Total:  5m 52s | Avg:  5m 52s | Max:  5m 52s
      🟩 Clang18            Pass: 100%/7   | Total: 45m 57s | Avg:  6m 33s | Max: 12m 28s
      🟩 GCC6               Pass: 100%/2   | Total: 32m 20s | Avg: 16m 10s | Max: 28m 00s
      🟩 GCC7               Pass: 100%/2   | Total:  9m 57s | Avg:  4m 58s | Max:  5m 27s
      🟩 GCC8               Pass: 100%/1   | Total:  5m 27s | Avg:  5m 27s | Max:  5m 27s
      🟩 GCC9               Pass: 100%/3   | Total: 14m 46s | Avg:  4m 55s | Max:  5m 58s
      🟩 GCC10              Pass: 100%/1   | Total:  5m 51s | Avg:  5m 51s | Max:  5m 51s
      🟩 GCC11              Pass: 100%/1   | Total:  5m 35s | Avg:  5m 35s | Max:  5m 35s
      🟩 GCC12              Pass: 100%/1   | Total:  5m 44s | Avg:  5m 44s | Max:  5m 44s
      🟩 GCC13              Pass: 100%/8   | Total:  1h 08m | Avg:  8m 31s | Max: 16m 27s
      🟩 Intel2023.2.0      Pass: 100%/1   | Total:  6m 31s | Avg:  6m 31s | Max:  6m 31s
      🟩 MSVC14.16          Pass: 100%/1   | Total: 19m 36s | Avg: 19m 36s | Max: 19m 36s | Hits:  99%/1852  
      🟩 MSVC14.29          Pass: 100%/1   | Total: 16m 46s | Avg: 16m 46s | Max: 16m 46s | Hits:  99%/1852  
      🟩 MSVC14.39          Pass: 100%/3   | Total: 57m 16s | Avg: 19m 05s | Max: 21m 46s | Hits:  99%/5556  
      🟩 NVHPC24.7          Pass: 100%/2   | Total: 32m 12s | Avg: 16m 06s | Max: 16m 13s
    🟩 cxx_family
      🟩 Clang              Pass: 100%/19  | Total:  1h 53m | Avg:  5m 57s | Max: 12m 28s
      🟩 GCC                Pass: 100%/19  | Total:  2h 27m | Avg:  7m 47s | Max: 28m 00s
      🟩 Intel              Pass: 100%/1   | Total:  6m 31s | Avg:  6m 31s | Max:  6m 31s
      🟩 MSVC               Pass: 100%/5   | Total:  1h 33m | Avg: 18m 43s | Max: 21m 46s | Hits:  99%/9260  
      🟩 NVHPC              Pass: 100%/2   | Total: 32m 12s | Avg: 16m 06s | Max: 16m 13s
    🟩 gpu
      🟩 v100               Pass: 100%/46  | Total:  6h 33m | Avg:  8m 33s | Max: 28m 00s | Hits:  99%/9260  
    🟩 jobs
      🟩 Build              Pass: 100%/40  | Total:  5h 12m | Avg:  7m 48s | Max: 28m 00s | Hits:  99%/7408  
      🟩 TestCPU            Pass: 100%/3   | Total: 36m 39s | Avg: 12m 13s | Max: 21m 46s | Hits:  99%/1852  
      🟩 TestGPU            Pass: 100%/3   | Total: 44m 17s | Avg: 14m 45s | Max: 16m 27s
    🟩 sm
      🟩 90a                Pass: 100%/1   | Total:  4m 48s | Avg:  4m 48s | Max:  4m 48s
    🟩 std
      🟩 11                 Pass: 100%/5   | Total: 46m 47s | Avg:  9m 21s | Max: 28m 00s
      🟩 14                 Pass: 100%/4   | Total: 35m 45s | Avg:  8m 56s | Max: 19m 36s | Hits:  99%/1852  
      🟩 17                 Pass: 100%/12  | Total:  1h 40m | Avg:  8m 24s | Max: 16m 46s | Hits:  99%/3704  
      🟩 20                 Pass: 100%/23  | Total:  3h 08m | Avg:  8m 11s | Max: 21m 46s | Hits:  99%/3704  
    
  • 🟩 cccl_c_parallel: Pass: 100%/2 | Total: 10m 24s | Avg: 5m 12s | Max: 8m 17s

    🟩 cpu
      🟩 amd64              Pass: 100%/2   | Total: 10m 24s | Avg:  5m 12s | Max:  8m 17s
    🟩 ctk
      🟩 12.6               Pass: 100%/2   | Total: 10m 24s | Avg:  5m 12s | Max:  8m 17s
    🟩 cudacxx
      🟩 nvcc12.6           Pass: 100%/2   | Total: 10m 24s | Avg:  5m 12s | Max:  8m 17s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/2   | Total: 10m 24s | Avg:  5m 12s | Max:  8m 17s
    🟩 cxx
      🟩 GCC13              Pass: 100%/2   | Total: 10m 24s | Avg:  5m 12s | Max:  8m 17s
    🟩 cxx_family
      🟩 GCC                Pass: 100%/2   | Total: 10m 24s | Avg:  5m 12s | Max:  8m 17s
    🟩 gpu
      🟩 v100               Pass: 100%/2   | Total: 10m 24s | Avg:  5m 12s | Max:  8m 17s
    🟩 jobs
      🟩 Build              Pass: 100%/1   | Total:  2m 07s | Avg:  2m 07s | Max:  2m 07s
      🟩 Test               Pass: 100%/1   | Total:  8m 17s | Avg:  8m 17s | Max:  8m 17s
    
  • 🟩 python: Pass: 100%/1 | Total: 35m 43s | Avg: 35m 43s | Max: 35m 43s

    🟩 cpu
      🟩 amd64              Pass: 100%/1   | Total: 35m 43s | Avg: 35m 43s | Max: 35m 43s
    🟩 ctk
      🟩 12.6               Pass: 100%/1   | Total: 35m 43s | Avg: 35m 43s | Max: 35m 43s
    🟩 cudacxx
      🟩 nvcc12.6           Pass: 100%/1   | Total: 35m 43s | Avg: 35m 43s | Max: 35m 43s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/1   | Total: 35m 43s | Avg: 35m 43s | Max: 35m 43s
    🟩 cxx
      🟩 GCC13              Pass: 100%/1   | Total: 35m 43s | Avg: 35m 43s | Max: 35m 43s
    🟩 cxx_family
      🟩 GCC                Pass: 100%/1   | Total: 35m 43s | Avg: 35m 43s | Max: 35m 43s
    🟩 gpu
      🟩 v100               Pass: 100%/1   | Total: 35m 43s | Avg: 35m 43s | Max: 35m 43s
    🟩 jobs
      🟩 Test               Pass: 100%/1   | Total: 35m 43s | Avg: 35m 43s | Max: 35m 43s
    

👃 Inspect Changes

Modifications in project?

Project
CCCL Infrastructure
libcu++
+/- CUB
Thrust
CUDA Experimental
python
CCCL C Parallel Library
Catch2Helper

Modifications in project or dependencies?

Project
CCCL Infrastructure
libcu++
+/- CUB
+/- Thrust
CUDA Experimental
+/- python
+/- CCCL C Parallel Library
+/- Catch2Helper

🏃‍ Runner counts (total jobs: 96)

# Runner
71 linux-amd64-cpu16
11 linux-amd64-gpu-v100-latest-1
9 windows-amd64-cpu16
4 linux-arm64-cpu16
1 linux-amd64-gpu-h100-latest-1-testing

github-actions bot (Contributor) commented on Jan 9, 2025

🟨 CI finished in 1h 45m: Pass: 98%/92 | Total: 1d 03h | Avg: 18m 02s | Max: 1h 16m | Hits: 160%/9748
  • 🟨 cub: Pass: 97%/45 | Total: 16h 40m | Avg: 22m 13s | Max: 1h 13m | Hits: 187%/2340

    🔍 cpu: amd64 🔍
      🔍 amd64              Pass:  97%/43  | Total: 16h 05m | Avg: 22m 26s | Max:  1h 13m | Hits: 187%/2340  
      🟩 arm64              Pass: 100%/2   | Total: 35m 03s | Avg: 17m 31s | Max: 17m 34s
    🔍 ctk: 12.6 🔍
      🟩 11.1               Pass: 100%/6   | Total:  1h 29m | Avg: 14m 59s | Max: 16m 35s
      🟩 12.5               Pass: 100%/2   | Total:  2h 20m | Avg:  1h 10m | Max:  1h 11m
      🔍 12.6               Pass:  97%/37  | Total: 12h 49m | Avg: 20m 47s | Max:  1h 13m | Hits: 187%/2340  
    🔍 cudacxx: nvcc12.6 🔍
      🟩 ClangCUDA18        Pass: 100%/2   | Total: 24m 01s | Avg: 12m 00s | Max: 12m 30s
      🟩 nvcc11.1           Pass: 100%/6   | Total:  1h 29m | Avg: 14m 59s | Max: 16m 35s
      🟩 nvcc12.5           Pass: 100%/2   | Total:  2h 20m | Avg:  1h 10m | Max:  1h 11m
      🔍 nvcc12.6           Pass:  97%/35  | Total: 12h 25m | Avg: 21m 18s | Max:  1h 13m | Hits: 187%/2340  
    🔍 cudacxx_family: nvcc 🔍
      🟩 ClangCUDA          Pass: 100%/2   | Total: 24m 01s | Avg: 12m 00s | Max: 12m 30s
      🔍 nvcc               Pass:  97%/43  | Total: 16h 16m | Avg: 22m 41s | Max:  1h 13m | Hits: 187%/2340  
    🔍 cxx: GCC13 🔍
      🟩 Clang9             Pass: 100%/4   | Total:  1h 03m | Avg: 15m 51s | Max: 16m 47s
      🟩 Clang10            Pass: 100%/1   | Total: 16m 21s | Avg: 16m 21s | Max: 16m 21s
      🟩 Clang11            Pass: 100%/1   | Total: 15m 31s | Avg: 15m 31s | Max: 15m 31s
      🟩 Clang12            Pass: 100%/1   | Total: 15m 20s | Avg: 15m 20s | Max: 15m 20s
      🟩 Clang13            Pass: 100%/1   | Total: 16m 35s | Avg: 16m 35s | Max: 16m 35s
      🟩 Clang14            Pass: 100%/1   | Total: 16m 08s | Avg: 16m 08s | Max: 16m 08s
      🟩 Clang15            Pass: 100%/1   | Total: 16m 11s | Avg: 16m 11s | Max: 16m 11s
      🟩 Clang16            Pass: 100%/1   | Total: 17m 29s | Avg: 17m 29s | Max: 17m 29s
      🟩 Clang17            Pass: 100%/1   | Total: 14m 53s | Avg: 14m 53s | Max: 14m 53s
      🟩 Clang18            Pass: 100%/7   | Total:  2h 13m | Avg: 19m 00s | Max: 35m 55s
      🟩 GCC7               Pass: 100%/4   | Total:  1h 02m | Avg: 15m 33s | Max: 17m 31s
      🟩 GCC8               Pass: 100%/1   | Total: 16m 03s | Avg: 16m 03s | Max: 16m 03s
      🟩 GCC9               Pass: 100%/3   | Total: 47m 22s | Avg: 15m 47s | Max: 16m 35s
      🟩 GCC10              Pass: 100%/1   | Total: 15m 14s | Avg: 15m 14s | Max: 15m 14s
      🟩 GCC11              Pass: 100%/1   | Total: 14m 57s | Avg: 14m 57s | Max: 14m 57s
      🟩 GCC12              Pass: 100%/3   | Total: 40m 56s | Avg: 13m 38s | Max: 16m 00s
      🔍 GCC13              Pass:  87%/8   | Total:  2h 10m | Avg: 16m 21s | Max: 31m 15s
      🟩 MSVC14.29          Pass: 100%/1   | Total:  1h 02m | Avg:  1h 02m | Max:  1h 02m | Hits: 188%/780   
      🟩 MSVC14.39          Pass: 100%/2   | Total:  2h 24m | Avg:  1h 12m | Max:  1h 13m | Hits: 186%/1560  
      🟩 NVHPC24.7          Pass: 100%/2   | Total:  2h 20m | Avg:  1h 10m | Max:  1h 11m
    🔍 cxx_family: GCC 🔍
      🟩 Clang              Pass: 100%/19  | Total:  5h 24m | Avg: 17m 06s | Max: 35m 55s
      🔍 GCC                Pass:  95%/21  | Total:  5h 27m | Avg: 15m 36s | Max: 31m 15s
      🟩 MSVC               Pass: 100%/3   | Total:  3h 26m | Avg:  1h 08m | Max:  1h 13m | Hits: 187%/2340  
      🟩 NVHPC              Pass: 100%/2   | Total:  2h 20m | Avg:  1h 10m | Max:  1h 11m
    🔍 gpu: v100 🔍
      🟩 h100               Pass: 100%/2   | Total: 25m 01s | Avg: 12m 30s | Max: 16m 00s
      🔍 v100               Pass:  97%/43  | Total: 16h 15m | Avg: 22m 40s | Max:  1h 13m | Hits: 187%/2340  
    🚨 jobs: DeviceLaunch 🚨
      🟩 Build              Pass: 100%/38  | Total: 14h 09m | Avg: 22m 21s | Max:  1h 13m | Hits: 187%/2340  
      🔥 DeviceLaunch       Pass:   0%/1   | Total:  3m 21s | Avg:  3m 21s | Max:  3m 21s
      🟩 GraphCapture       Pass: 100%/1   | Total: 16m 08s | Avg: 16m 08s | Max: 16m 08s
      🟩 HostLaunch         Pass: 100%/3   | Total:  1h 03m | Avg: 21m 12s | Max: 26m 50s
      🟩 TestGPU            Pass: 100%/2   | Total:  1h 07m | Avg: 33m 35s | Max: 35m 55s
    🔍 std: 20 🔍
      🟩 11                 Pass: 100%/5   | Total:  1h 19m | Avg: 15m 49s | Max: 16m 46s
      🟩 14                 Pass: 100%/2   | Total: 34m 18s | Avg: 17m 09s | Max: 17m 31s
      🟩 17                 Pass: 100%/12  | Total:  5h 36m | Avg: 28m 03s | Max:  1h 11m | Hits: 188%/1560  
      🔍 20                 Pass:  96%/26  | Total:  9h 09m | Avg: 21m 09s | Max:  1h 13m | Hits: 185%/780   
    🟩 sm
      🟩 90                 Pass: 100%/2   | Total: 25m 01s | Avg: 12m 30s | Max: 16m 00s
      🟩 90a                Pass: 100%/1   | Total:  8m 54s | Avg:  8m 54s | Max:  8m 54s
    
  • 🟩 thrust: Pass: 100%/44 | Total: 10h 18m | Avg: 14m 03s | Max: 1h 16m | Hits: 151%/7408

    🟩 cmake_options
      🟩 -DTHRUST_DISPATCH_TYPE=Force32bit Pass: 100%/2   | Total: 28m 14s | Avg: 14m 07s | Max: 22m 06s
    🟩 cpu
      🟩 amd64              Pass: 100%/42  | Total: 10h 09m | Avg: 14m 30s | Max:  1h 16m | Hits: 151%/7408  
      🟩 arm64              Pass: 100%/2   | Total:  9m 21s | Avg:  4m 40s | Max:  4m 53s
    🟩 ctk
      🟩 11.1               Pass: 100%/6   | Total: 24m 11s | Avg:  4m 01s | Max:  4m 27s
      🟩 12.5               Pass: 100%/2   | Total:  2h 24m | Avg:  1h 12m | Max:  1h 16m
      🟩 12.6               Pass: 100%/36  | Total:  7h 29m | Avg: 12m 29s | Max:  1h 02m | Hits: 151%/7408  
    🟩 cudacxx
      🟩 ClangCUDA18        Pass: 100%/2   | Total: 10m 11s | Avg:  5m 05s | Max:  5m 18s
      🟩 nvcc11.1           Pass: 100%/6   | Total: 24m 11s | Avg:  4m 01s | Max:  4m 27s
      🟩 nvcc12.5           Pass: 100%/2   | Total:  2h 24m | Avg:  1h 12m | Max:  1h 16m
      🟩 nvcc12.6           Pass: 100%/34  | Total:  7h 19m | Avg: 12m 55s | Max:  1h 02m | Hits: 151%/7408  
    🟩 cudacxx_family
      🟩 ClangCUDA          Pass: 100%/2   | Total: 10m 11s | Avg:  5m 05s | Max:  5m 18s
      🟩 nvcc               Pass: 100%/42  | Total: 10h 08m | Avg: 14m 29s | Max:  1h 16m | Hits: 151%/7408  
    🟩 cxx
      🟩 Clang9             Pass: 100%/4   | Total: 20m 00s | Avg:  5m 00s | Max:  6m 21s
      🟩 Clang10            Pass: 100%/1   | Total:  7m 11s | Avg:  7m 11s | Max:  7m 11s
      🟩 Clang11            Pass: 100%/1   | Total:  4m 55s | Avg:  4m 55s | Max:  4m 55s
      🟩 Clang12            Pass: 100%/1   | Total:  5m 05s | Avg:  5m 05s | Max:  5m 05s
      🟩 Clang13            Pass: 100%/1   | Total:  4m 58s | Avg:  4m 58s | Max:  4m 58s
      🟩 Clang14            Pass: 100%/1   | Total:  5m 30s | Avg:  5m 30s | Max:  5m 30s
      🟩 Clang15            Pass: 100%/1   | Total:  5m 31s | Avg:  5m 31s | Max:  5m 31s
      🟩 Clang16            Pass: 100%/1   | Total:  5m 17s | Avg:  5m 17s | Max:  5m 17s
      🟩 Clang17            Pass: 100%/1   | Total:  5m 13s | Avg:  5m 13s | Max:  5m 13s
      🟩 Clang18            Pass: 100%/7   | Total: 56m 29s | Avg:  8m 04s | Max: 22m 50s
      🟩 GCC7               Pass: 100%/4   | Total: 17m 49s | Avg:  4m 27s | Max:  5m 25s
      🟩 GCC8               Pass: 100%/1   | Total:  5m 03s | Avg:  5m 03s | Max:  5m 03s
      🟩 GCC9               Pass: 100%/3   | Total: 13m 44s | Avg:  4m 34s | Max:  5m 34s
      🟩 GCC10              Pass: 100%/1   | Total:  5m 17s | Avg:  5m 17s | Max:  5m 17s
      🟩 GCC11              Pass: 100%/1   | Total:  5m 42s | Avg:  5m 42s | Max:  5m 42s
      🟩 GCC12              Pass: 100%/1   | Total:  6m 08s | Avg:  6m 08s | Max:  6m 08s
      🟩 GCC13              Pass: 100%/8   | Total:  1h 20m | Avg: 10m 01s | Max: 23m 04s
      🟩 MSVC14.29          Pass: 100%/1   | Total:  1h 02m | Avg:  1h 02m | Max:  1h 02m | Hits:  80%/1852  
      🟩 MSVC14.39          Pass: 100%/3   | Total:  2h 36m | Avg: 52m 18s | Max:  1h 02m | Hits: 175%/5556  
      🟩 NVHPC24.7          Pass: 100%/2   | Total:  2h 24m | Avg:  1h 12m | Max:  1h 16m
    🟩 cxx_family
      🟩 Clang              Pass: 100%/19  | Total:  2h 00m | Avg:  6m 19s | Max: 22m 50s
      🟩 GCC                Pass: 100%/19  | Total:  2h 13m | Avg:  7m 02s | Max: 23m 04s
      🟩 MSVC               Pass: 100%/4   | Total:  3h 39m | Avg: 54m 57s | Max:  1h 02m | Hits: 151%/7408  
      🟩 NVHPC              Pass: 100%/2   | Total:  2h 24m | Avg:  1h 12m | Max:  1h 16m
    🟩 gpu
      🟩 v100               Pass: 100%/44  | Total: 10h 18m | Avg: 14m 03s | Max:  1h 16m | Hits: 151%/7408  
    🟩 jobs
      🟩 Build              Pass: 100%/38  | Total:  8h 16m | Avg: 13m 04s | Max:  1h 16m | Hits:  80%/5556  
      🟩 TestCPU            Pass: 100%/3   | Total: 53m 51s | Avg: 17m 57s | Max: 37m 42s | Hits: 365%/1852  
      🟩 TestGPU            Pass: 100%/3   | Total:  1h 08m | Avg: 22m 40s | Max: 23m 04s
    🟩 sm
      🟩 90a                Pass: 100%/1   | Total:  4m 15s | Avg:  4m 15s | Max:  4m 15s
    🟩 std
      🟩 11                 Pass: 100%/5   | Total: 21m 14s | Avg:  4m 14s | Max:  5m 21s
      🟩 14                 Pass: 100%/2   | Total: 11m 46s | Avg:  5m 53s | Max:  6m 21s
      🟩 17                 Pass: 100%/12  | Total:  4h 04m | Avg: 20m 21s | Max:  1h 16m | Hits:  80%/3704  
      🟩 20                 Pass: 100%/23  | Total:  5h 13m | Avg: 13m 37s | Max:  1h 08m | Hits: 223%/3704  
    
  • 🟩 cccl_c_parallel: Pass: 100%/2 | Total: 13m 45s | Avg: 6m 52s | Max: 11m 46s

    🟩 cpu
      🟩 amd64              Pass: 100%/2   | Total: 13m 45s | Avg:  6m 52s | Max: 11m 46s
    🟩 ctk
      🟩 12.6               Pass: 100%/2   | Total: 13m 45s | Avg:  6m 52s | Max: 11m 46s
    🟩 cudacxx
      🟩 nvcc12.6           Pass: 100%/2   | Total: 13m 45s | Avg:  6m 52s | Max: 11m 46s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/2   | Total: 13m 45s | Avg:  6m 52s | Max: 11m 46s
    🟩 cxx
      🟩 GCC13              Pass: 100%/2   | Total: 13m 45s | Avg:  6m 52s | Max: 11m 46s
    🟩 cxx_family
      🟩 GCC                Pass: 100%/2   | Total: 13m 45s | Avg:  6m 52s | Max: 11m 46s
    🟩 gpu
      🟩 v100               Pass: 100%/2   | Total: 13m 45s | Avg:  6m 52s | Max: 11m 46s
    🟩 jobs
      🟩 Build              Pass: 100%/1   | Total:  1m 59s | Avg:  1m 59s | Max:  1m 59s
      🟩 Test               Pass: 100%/1   | Total: 11m 46s | Avg: 11m 46s | Max: 11m 46s
    
  • 🟩 python: Pass: 100%/1 | Total: 27m 18s | Avg: 27m 18s | Max: 27m 18s

    🟩 cpu
      🟩 amd64              Pass: 100%/1   | Total: 27m 18s | Avg: 27m 18s | Max: 27m 18s
    🟩 ctk
      🟩 12.6               Pass: 100%/1   | Total: 27m 18s | Avg: 27m 18s | Max: 27m 18s
    🟩 cudacxx
      🟩 nvcc12.6           Pass: 100%/1   | Total: 27m 18s | Avg: 27m 18s | Max: 27m 18s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/1   | Total: 27m 18s | Avg: 27m 18s | Max: 27m 18s
    🟩 cxx
      🟩 GCC13              Pass: 100%/1   | Total: 27m 18s | Avg: 27m 18s | Max: 27m 18s
    🟩 cxx_family
      🟩 GCC                Pass: 100%/1   | Total: 27m 18s | Avg: 27m 18s | Max: 27m 18s
    🟩 gpu
      🟩 v100               Pass: 100%/1   | Total: 27m 18s | Avg: 27m 18s | Max: 27m 18s
    🟩 jobs
      🟩 Test               Pass: 100%/1   | Total: 27m 18s | Avg: 27m 18s | Max: 27m 18s
    

👃 Inspect Changes

Modifications in project?

Project
CCCL Infrastructure
libcu++
+/- CUB
Thrust
CUDA Experimental
python
CCCL C Parallel Library
Catch2Helper

Modifications in project or dependencies?

Project
CCCL Infrastructure
libcu++
+/- CUB
+/- Thrust
CUDA Experimental
+/- python
+/- CCCL C Parallel Library
+/- Catch2Helper

🏃‍ Runner counts (total jobs: 92)

# Runner
69 linux-amd64-cpu16
11 linux-amd64-gpu-v100-latest-1
7 windows-amd64-cpu16
4 linux-arm64-cpu16
1 linux-amd64-gpu-h100-latest-1-testing

github-actions bot (Contributor) commented on Jan 9, 2025

🟩 CI finished in 2h 37m: Pass: 100%/92 | Total: 1d 03h | Avg: 18m 11s | Max: 1h 16m | Hits: 160%/9748
  • 🟩 cub: Pass: 100%/45 | Total: 16h 54m | Avg: 22m 32s | Max: 1h 13m | Hits: 187%/2340

    🟩 cpu
      🟩 amd64              Pass: 100%/43  | Total: 16h 19m | Avg: 22m 46s | Max:  1h 13m | Hits: 187%/2340  
      🟩 arm64              Pass: 100%/2   | Total: 35m 03s | Avg: 17m 31s | Max: 17m 34s
    🟩 ctk
      🟩 11.1               Pass: 100%/6   | Total:  1h 29m | Avg: 14m 59s | Max: 16m 35s
      🟩 12.5               Pass: 100%/2   | Total:  2h 20m | Avg:  1h 10m | Max:  1h 11m
      🟩 12.6               Pass: 100%/37  | Total: 13h 03m | Avg: 21m 11s | Max:  1h 13m | Hits: 187%/2340  
    🟩 cudacxx
      🟩 ClangCUDA18        Pass: 100%/2   | Total: 24m 01s | Avg: 12m 00s | Max: 12m 30s
      🟩 nvcc11.1           Pass: 100%/6   | Total:  1h 29m | Avg: 14m 59s | Max: 16m 35s
      🟩 nvcc12.5           Pass: 100%/2   | Total:  2h 20m | Avg:  1h 10m | Max:  1h 11m
      🟩 nvcc12.6           Pass: 100%/35  | Total: 12h 39m | Avg: 21m 42s | Max:  1h 13m | Hits: 187%/2340  
    🟩 cudacxx_family
      🟩 ClangCUDA          Pass: 100%/2   | Total: 24m 01s | Avg: 12m 00s | Max: 12m 30s
      🟩 nvcc               Pass: 100%/43  | Total: 16h 30m | Avg: 23m 01s | Max:  1h 13m | Hits: 187%/2340  
    🟩 cxx
      🟩 Clang9             Pass: 100%/4   | Total:  1h 03m | Avg: 15m 51s | Max: 16m 47s
      🟩 Clang10            Pass: 100%/1   | Total: 16m 21s | Avg: 16m 21s | Max: 16m 21s
      🟩 Clang11            Pass: 100%/1   | Total: 15m 31s | Avg: 15m 31s | Max: 15m 31s
      🟩 Clang12            Pass: 100%/1   | Total: 15m 20s | Avg: 15m 20s | Max: 15m 20s
      🟩 Clang13            Pass: 100%/1   | Total: 16m 35s | Avg: 16m 35s | Max: 16m 35s
      🟩 Clang14            Pass: 100%/1   | Total: 16m 08s | Avg: 16m 08s | Max: 16m 08s
      🟩 Clang15            Pass: 100%/1   | Total: 16m 11s | Avg: 16m 11s | Max: 16m 11s
      🟩 Clang16            Pass: 100%/1   | Total: 17m 29s | Avg: 17m 29s | Max: 17m 29s
      🟩 Clang17            Pass: 100%/1   | Total: 14m 53s | Avg: 14m 53s | Max: 14m 53s
      🟩 Clang18            Pass: 100%/7   | Total:  2h 13m | Avg: 19m 00s | Max: 35m 55s
      🟩 GCC7               Pass: 100%/4   | Total:  1h 02m | Avg: 15m 33s | Max: 17m 31s
      🟩 GCC8               Pass: 100%/1   | Total: 16m 03s | Avg: 16m 03s | Max: 16m 03s
      🟩 GCC9               Pass: 100%/3   | Total: 47m 22s | Avg: 15m 47s | Max: 16m 35s
      🟩 GCC10              Pass: 100%/1   | Total: 15m 14s | Avg: 15m 14s | Max: 15m 14s
      🟩 GCC11              Pass: 100%/1   | Total: 14m 57s | Avg: 14m 57s | Max: 14m 57s
      🟩 GCC12              Pass: 100%/3   | Total: 40m 56s | Avg: 13m 38s | Max: 16m 00s
      🟩 GCC13              Pass: 100%/8   | Total:  2h 25m | Avg: 18m 09s | Max: 31m 15s
      🟩 MSVC14.29          Pass: 100%/1   | Total:  1h 02m | Avg:  1h 02m | Max:  1h 02m | Hits: 188%/780   
      🟩 MSVC14.39          Pass: 100%/2   | Total:  2h 24m | Avg:  1h 12m | Max:  1h 13m | Hits: 186%/1560  
      🟩 NVHPC24.7          Pass: 100%/2   | Total:  2h 20m | Avg:  1h 10m | Max:  1h 11m
    🟩 cxx_family
      🟩 Clang              Pass: 100%/19  | Total:  5h 24m | Avg: 17m 06s | Max: 35m 55s
      🟩 GCC                Pass: 100%/21  | Total:  5h 42m | Avg: 16m 17s | Max: 31m 15s
      🟩 MSVC               Pass: 100%/3   | Total:  3h 26m | Avg:  1h 08m | Max:  1h 13m | Hits: 187%/2340  
      🟩 NVHPC              Pass: 100%/2   | Total:  2h 20m | Avg:  1h 10m | Max:  1h 11m
    🟩 gpu
      🟩 h100               Pass: 100%/2   | Total: 25m 01s | Avg: 12m 30s | Max: 16m 00s
      🟩 v100               Pass: 100%/43  | Total: 16h 29m | Avg: 23m 00s | Max:  1h 13m | Hits: 187%/2340  
    🟩 jobs
      🟩 Build              Pass: 100%/38  | Total: 14h 09m | Avg: 22m 21s | Max:  1h 13m | Hits: 187%/2340  
      🟩 DeviceLaunch       Pass: 100%/1   | Total: 17m 42s | Avg: 17m 42s | Max: 17m 42s
      🟩 GraphCapture       Pass: 100%/1   | Total: 16m 08s | Avg: 16m 08s | Max: 16m 08s
      🟩 HostLaunch         Pass: 100%/3   | Total:  1h 03m | Avg: 21m 12s | Max: 26m 50s
      🟩 TestGPU            Pass: 100%/2   | Total:  1h 07m | Avg: 33m 35s | Max: 35m 55s
    🟩 sm
      🟩 90                 Pass: 100%/2   | Total: 25m 01s | Avg: 12m 30s | Max: 16m 00s
      🟩 90a                Pass: 100%/1   | Total:  8m 54s | Avg:  8m 54s | Max:  8m 54s
    🟩 std
      🟩 11                 Pass: 100%/5   | Total:  1h 19m | Avg: 15m 49s | Max: 16m 46s
      🟩 14                 Pass: 100%/2   | Total: 34m 18s | Avg: 17m 09s | Max: 17m 31s
      🟩 17                 Pass: 100%/12  | Total:  5h 36m | Avg: 28m 03s | Max:  1h 11m | Hits: 188%/1560  
      🟩 20                 Pass: 100%/26  | Total:  9h 24m | Avg: 21m 42s | Max:  1h 13m | Hits: 185%/780   
    
  • 🟩 thrust: Pass: 100%/44 | Total: 10h 18m | Avg: 14m 03s | Max: 1h 16m | Hits: 151%/7408

    🟩 cmake_options
      🟩 -DTHRUST_DISPATCH_TYPE=Force32bit Pass: 100%/2   | Total: 28m 14s | Avg: 14m 07s | Max: 22m 06s
    🟩 cpu
      🟩 amd64              Pass: 100%/42  | Total: 10h 09m | Avg: 14m 30s | Max:  1h 16m | Hits: 151%/7408  
      🟩 arm64              Pass: 100%/2   | Total:  9m 21s | Avg:  4m 40s | Max:  4m 53s
    🟩 ctk
      🟩 11.1               Pass: 100%/6   | Total: 24m 11s | Avg:  4m 01s | Max:  4m 27s
      🟩 12.5               Pass: 100%/2   | Total:  2h 24m | Avg:  1h 12m | Max:  1h 16m
      🟩 12.6               Pass: 100%/36  | Total:  7h 29m | Avg: 12m 29s | Max:  1h 02m | Hits: 151%/7408  
    🟩 cudacxx
      🟩 ClangCUDA18        Pass: 100%/2   | Total: 10m 11s | Avg:  5m 05s | Max:  5m 18s
      🟩 nvcc11.1           Pass: 100%/6   | Total: 24m 11s | Avg:  4m 01s | Max:  4m 27s
      🟩 nvcc12.5           Pass: 100%/2   | Total:  2h 24m | Avg:  1h 12m | Max:  1h 16m
      🟩 nvcc12.6           Pass: 100%/34  | Total:  7h 19m | Avg: 12m 55s | Max:  1h 02m | Hits: 151%/7408  
    🟩 cudacxx_family
      🟩 ClangCUDA          Pass: 100%/2   | Total: 10m 11s | Avg:  5m 05s | Max:  5m 18s
      🟩 nvcc               Pass: 100%/42  | Total: 10h 08m | Avg: 14m 29s | Max:  1h 16m | Hits: 151%/7408  
    🟩 cxx
      🟩 Clang9             Pass: 100%/4   | Total: 20m 00s | Avg:  5m 00s | Max:  6m 21s
      🟩 Clang10            Pass: 100%/1   | Total:  7m 11s | Avg:  7m 11s | Max:  7m 11s
      🟩 Clang11            Pass: 100%/1   | Total:  4m 55s | Avg:  4m 55s | Max:  4m 55s
      🟩 Clang12            Pass: 100%/1   | Total:  5m 05s | Avg:  5m 05s | Max:  5m 05s
      🟩 Clang13            Pass: 100%/1   | Total:  4m 58s | Avg:  4m 58s | Max:  4m 58s
      🟩 Clang14            Pass: 100%/1   | Total:  5m 30s | Avg:  5m 30s | Max:  5m 30s
      🟩 Clang15            Pass: 100%/1   | Total:  5m 31s | Avg:  5m 31s | Max:  5m 31s
      🟩 Clang16            Pass: 100%/1   | Total:  5m 17s | Avg:  5m 17s | Max:  5m 17s
      🟩 Clang17            Pass: 100%/1   | Total:  5m 13s | Avg:  5m 13s | Max:  5m 13s
      🟩 Clang18            Pass: 100%/7   | Total: 56m 29s | Avg:  8m 04s | Max: 22m 50s
      🟩 GCC7               Pass: 100%/4   | Total: 17m 49s | Avg:  4m 27s | Max:  5m 25s
      🟩 GCC8               Pass: 100%/1   | Total:  5m 03s | Avg:  5m 03s | Max:  5m 03s
      🟩 GCC9               Pass: 100%/3   | Total: 13m 44s | Avg:  4m 34s | Max:  5m 34s
      🟩 GCC10              Pass: 100%/1   | Total:  5m 17s | Avg:  5m 17s | Max:  5m 17s
      🟩 GCC11              Pass: 100%/1   | Total:  5m 42s | Avg:  5m 42s | Max:  5m 42s
      🟩 GCC12              Pass: 100%/1   | Total:  6m 08s | Avg:  6m 08s | Max:  6m 08s
      🟩 GCC13              Pass: 100%/8   | Total:  1h 20m | Avg: 10m 01s | Max: 23m 04s
      🟩 MSVC14.29          Pass: 100%/1   | Total:  1h 02m | Avg:  1h 02m | Max:  1h 02m | Hits:  80%/1852  
      🟩 MSVC14.39          Pass: 100%/3   | Total:  2h 36m | Avg: 52m 18s | Max:  1h 02m | Hits: 175%/5556  
      🟩 NVHPC24.7          Pass: 100%/2   | Total:  2h 24m | Avg:  1h 12m | Max:  1h 16m
    🟩 cxx_family
      🟩 Clang              Pass: 100%/19  | Total:  2h 00m | Avg:  6m 19s | Max: 22m 50s
      🟩 GCC                Pass: 100%/19  | Total:  2h 13m | Avg:  7m 02s | Max: 23m 04s
      🟩 MSVC               Pass: 100%/4   | Total:  3h 39m | Avg: 54m 57s | Max:  1h 02m | Hits: 151%/7408  
      🟩 NVHPC              Pass: 100%/2   | Total:  2h 24m | Avg:  1h 12m | Max:  1h 16m
    🟩 gpu
      🟩 v100               Pass: 100%/44  | Total: 10h 18m | Avg: 14m 03s | Max:  1h 16m | Hits: 151%/7408  
    🟩 jobs
      🟩 Build              Pass: 100%/38  | Total:  8h 16m | Avg: 13m 04s | Max:  1h 16m | Hits:  80%/5556  
      🟩 TestCPU            Pass: 100%/3   | Total: 53m 51s | Avg: 17m 57s | Max: 37m 42s | Hits: 365%/1852  
      🟩 TestGPU            Pass: 100%/3   | Total:  1h 08m | Avg: 22m 40s | Max: 23m 04s
    🟩 sm
      🟩 90a                Pass: 100%/1   | Total:  4m 15s | Avg:  4m 15s | Max:  4m 15s
    🟩 std
      🟩 11                 Pass: 100%/5   | Total: 21m 14s | Avg:  4m 14s | Max:  5m 21s
      🟩 14                 Pass: 100%/2   | Total: 11m 46s | Avg:  5m 53s | Max:  6m 21s
      🟩 17                 Pass: 100%/12  | Total:  4h 04m | Avg: 20m 21s | Max:  1h 16m | Hits:  80%/3704  
      🟩 20                 Pass: 100%/23  | Total:  5h 13m | Avg: 13m 37s | Max:  1h 08m | Hits: 223%/3704  
    
  • 🟩 cccl_c_parallel: Pass: 100%/2 | Total: 13m 45s | Avg: 6m 52s | Max: 11m 46s

    🟩 cpu
      🟩 amd64              Pass: 100%/2   | Total: 13m 45s | Avg:  6m 52s | Max: 11m 46s
    🟩 ctk
      🟩 12.6               Pass: 100%/2   | Total: 13m 45s | Avg:  6m 52s | Max: 11m 46s
    🟩 cudacxx
      🟩 nvcc12.6           Pass: 100%/2   | Total: 13m 45s | Avg:  6m 52s | Max: 11m 46s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/2   | Total: 13m 45s | Avg:  6m 52s | Max: 11m 46s
    🟩 cxx
      🟩 GCC13              Pass: 100%/2   | Total: 13m 45s | Avg:  6m 52s | Max: 11m 46s
    🟩 cxx_family
      🟩 GCC                Pass: 100%/2   | Total: 13m 45s | Avg:  6m 52s | Max: 11m 46s
    🟩 gpu
      🟩 v100               Pass: 100%/2   | Total: 13m 45s | Avg:  6m 52s | Max: 11m 46s
    🟩 jobs
      🟩 Build              Pass: 100%/1   | Total:  1m 59s | Avg:  1m 59s | Max:  1m 59s
      🟩 Test               Pass: 100%/1   | Total: 11m 46s | Avg: 11m 46s | Max: 11m 46s
    
  • 🟩 python: Pass: 100%/1 | Total: 27m 18s | Avg: 27m 18s | Max: 27m 18s

    🟩 cpu
      🟩 amd64              Pass: 100%/1   | Total: 27m 18s | Avg: 27m 18s | Max: 27m 18s
    🟩 ctk
      🟩 12.6               Pass: 100%/1   | Total: 27m 18s | Avg: 27m 18s | Max: 27m 18s
    🟩 cudacxx
      🟩 nvcc12.6           Pass: 100%/1   | Total: 27m 18s | Avg: 27m 18s | Max: 27m 18s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/1   | Total: 27m 18s | Avg: 27m 18s | Max: 27m 18s
    🟩 cxx
      🟩 GCC13              Pass: 100%/1   | Total: 27m 18s | Avg: 27m 18s | Max: 27m 18s
    🟩 cxx_family
      🟩 GCC                Pass: 100%/1   | Total: 27m 18s | Avg: 27m 18s | Max: 27m 18s
    🟩 gpu
      🟩 v100               Pass: 100%/1   | Total: 27m 18s | Avg: 27m 18s | Max: 27m 18s
    🟩 jobs
      🟩 Test               Pass: 100%/1   | Total: 27m 18s | Avg: 27m 18s | Max: 27m 18s
    

👃 Inspect Changes

Modifications in project?

Project
CCCL Infrastructure
libcu++
+/- CUB
Thrust
CUDA Experimental
python
CCCL C Parallel Library
Catch2Helper

Modifications in project or dependencies?

Project
CCCL Infrastructure
libcu++
+/- CUB
+/- Thrust
CUDA Experimental
+/- python
+/- CCCL C Parallel Library
+/- Catch2Helper

🏃‍ Runner counts (total jobs: 92)

# Runner
69 linux-amd64-cpu16
11 linux-amd64-gpu-v100-latest-1
7 windows-amd64-cpu16
4 linux-arm64-cpu16
1 linux-amd64-gpu-h100-latest-1-testing

@elstehle force-pushed the enh/fix-large-seg-sort-testing-time branch from 2228a87 to b8cedc1 on January 9, 2025 16:16
github-actions bot (Contributor) commented on Jan 9, 2025

🟩 CI finished in 2h 01m: Pass: 100%/96 | Total: 2d 16h | Avg: 40m 14s | Max: 1h 06m | Hits: 303%/15012
  • 🟩 cub: Pass: 100%/47 | Total: 1d 15h | Avg: 50m 27s | Max: 1h 06m | Hits: 410%/3900

    🟩 cpu
      🟩 amd64              Pass: 100%/45  | Total:  1d 13h | Avg: 50m 11s | Max:  1h 06m | Hits: 410%/3900  
      🟩 arm64              Pass: 100%/2   | Total:  1h 53m | Avg: 56m 35s | Max: 58m 47s
    🟩 ctk
      🟩 12.0               Pass: 100%/8   | Total:  7h 21m | Avg: 55m 09s | Max:  1h 02m | Hits: 421%/1560  
      🟩 12.5               Pass: 100%/2   | Total:  2h 09m | Avg:  1h 04m | Max:  1h 06m
      🟩 12.6               Pass: 100%/37  | Total:  1d 06h | Avg: 48m 40s | Max:  1h 06m | Hits: 403%/2340  
    🟩 cudacxx
      🟩 ClangCUDA18        Pass: 100%/2   | Total:  1h 59m | Avg: 59m 36s | Max:  1h 00m
      🟩 nvcc12.0           Pass: 100%/8   | Total:  7h 21m | Avg: 55m 09s | Max:  1h 02m | Hits: 421%/1560  
      🟩 nvcc12.5           Pass: 100%/2   | Total:  2h 09m | Avg:  1h 04m | Max:  1h 06m
      🟩 nvcc12.6           Pass: 100%/35  | Total:  1d 04h | Avg: 48m 03s | Max:  1h 06m | Hits: 403%/2340  
    🟩 cudacxx_family
      🟩 ClangCUDA          Pass: 100%/2   | Total:  1h 59m | Avg: 59m 36s | Max:  1h 00m
      🟩 nvcc               Pass: 100%/45  | Total:  1d 13h | Avg: 50m 03s | Max:  1h 06m | Hits: 410%/3900  
    🟩 cxx
      🟩 Clang9             Pass: 100%/4   | Total:  3h 44m | Avg: 56m 01s | Max: 58m 33s
      🟩 Clang10            Pass: 100%/1   | Total: 54m 01s | Avg: 54m 01s | Max: 54m 01s
      🟩 Clang11            Pass: 100%/1   | Total: 53m 23s | Avg: 53m 23s | Max: 53m 23s
      🟩 Clang12            Pass: 100%/1   | Total: 51m 15s | Avg: 51m 15s | Max: 51m 15s
      🟩 Clang13            Pass: 100%/1   | Total: 55m 00s | Avg: 55m 00s | Max: 55m 00s
      🟩 Clang14            Pass: 100%/1   | Total: 52m 22s | Avg: 52m 22s | Max: 52m 22s
      🟩 Clang15            Pass: 100%/1   | Total: 56m 26s | Avg: 56m 26s | Max: 56m 26s
      🟩 Clang16            Pass: 100%/1   | Total: 57m 32s | Avg: 57m 32s | Max: 57m 32s
      🟩 Clang17            Pass: 100%/1   | Total: 51m 39s | Avg: 51m 39s | Max: 51m 39s
      🟩 Clang18            Pass: 100%/7   | Total:  6h 08m | Avg: 52m 34s | Max:  1h 00m
      🟩 GCC7               Pass: 100%/4   | Total:  3h 35m | Avg: 53m 55s | Max: 57m 03s
      🟩 GCC8               Pass: 100%/1   | Total: 56m 29s | Avg: 56m 29s | Max: 56m 29s
      🟩 GCC9               Pass: 100%/3   | Total:  2h 38m | Avg: 52m 59s | Max: 54m 37s
      🟩 GCC10              Pass: 100%/1   | Total: 59m 36s | Avg: 59m 36s | Max: 59m 36s
      🟩 GCC11              Pass: 100%/1   | Total: 52m 30s | Avg: 52m 30s | Max: 52m 30s
      🟩 GCC12              Pass: 100%/3   | Total:  1h 36m | Avg: 32m 12s | Max: 54m 34s
      🟩 GCC13              Pass: 100%/8   | Total:  4h 43m | Avg: 35m 25s | Max: 58m 51s
      🟩 MSVC14.29          Pass: 100%/3   | Total:  2h 48m | Avg: 56m 13s | Max:  1h 02m | Hits: 422%/2340  
      🟩 MSVC14.39          Pass: 100%/2   | Total:  2h 06m | Avg:  1h 03m | Max:  1h 06m | Hits: 392%/1560  
      🟩 NVHPC24.7          Pass: 100%/2   | Total:  2h 09m | Avg:  1h 04m | Max:  1h 06m
    🟩 cxx_family
      🟩 Clang              Pass: 100%/19  | Total: 17h 03m | Avg: 53m 53s | Max:  1h 00m
      🟩 GCC                Pass: 100%/21  | Total: 15h 23m | Avg: 43m 57s | Max: 59m 36s
      🟩 MSVC               Pass: 100%/5   | Total:  4h 55m | Avg: 59m 03s | Max:  1h 06m | Hits: 410%/3900  
      🟩 NVHPC              Pass: 100%/2   | Total:  2h 09m | Avg:  1h 04m | Max:  1h 06m
    🟩 gpu
      🟩 h100               Pass: 100%/2   | Total: 42m 04s | Avg: 21m 02s | Max: 26m 03s
      🟩 v100               Pass: 100%/45  | Total:  1d 14h | Avg: 51m 46s | Max:  1h 06m | Hits: 410%/3900  
    🟩 jobs
      🟩 Build              Pass: 100%/40  | Total:  1d 12h | Avg: 54m 44s | Max:  1h 06m | Hits: 410%/3900  
      🟩 DeviceLaunch       Pass: 100%/1   | Total: 22m 17s | Avg: 22m 17s | Max: 22m 17s
      🟩 GraphCapture       Pass: 100%/1   | Total: 27m 12s | Avg: 27m 12s | Max: 27m 12s
      🟩 HostLaunch         Pass: 100%/3   | Total: 52m 41s | Avg: 17m 33s | Max: 19m 21s
      🟩 TestGPU            Pass: 100%/2   | Total:  1h 20m | Avg: 40m 05s | Max:  1h 00m
    🟩 sm
      🟩 90                 Pass: 100%/2   | Total: 42m 04s | Avg: 21m 02s | Max: 26m 03s
      🟩 90a                Pass: 100%/1   | Total: 24m 58s | Avg: 24m 58s | Max: 24m 58s
    🟩 std
      🟩 11                 Pass: 100%/5   | Total:  4h 29m | Avg: 53m 51s | Max: 58m 33s
      🟩 14                 Pass: 100%/3   | Total:  2h 51m | Avg: 57m 13s | Max:  1h 02m | Hits: 421%/780   
      🟩 17                 Pass: 100%/13  | Total: 12h 07m | Avg: 55m 59s | Max:  1h 02m | Hits: 410%/2340  
      🟩 20                 Pass: 100%/26  | Total: 20h 02m | Avg: 46m 16s | Max:  1h 06m | Hits: 401%/780   
    
  • 🟩 thrust: Pass: 100%/46 | Total: 1d 00h | Avg: 31m 37s | Max: 56m 10s | Hits: 266%/11112

    🟩 cmake_options
      🟩 -DTHRUST_DISPATCH_TYPE=Force32bit Pass: 100%/2   | Total: 37m 41s | Avg: 18m 50s | Max: 25m 58s
    🟩 cpu
      🟩 amd64              Pass: 100%/44  | Total: 23h 14m | Avg: 31m 41s | Max: 56m 10s | Hits: 266%/11112 
      🟩 arm64              Pass: 100%/2   | Total:  1h 00m | Avg: 30m 08s | Max: 31m 42s
    🟩 ctk
      🟩 12.0               Pass: 100%/8   | Total:  4h 36m | Avg: 34m 34s | Max: 54m 01s | Hits: 248%/3704  
      🟩 12.5               Pass: 100%/2   | Total:  1h 41m | Avg: 50m 39s | Max: 51m 23s
      🟩 12.6               Pass: 100%/36  | Total: 17h 57m | Avg: 29m 55s | Max: 56m 10s | Hits: 274%/7408  
    🟩 cudacxx
      🟩 ClangCUDA18        Pass: 100%/2   | Total: 53m 15s | Avg: 26m 37s | Max: 27m 43s
      🟩 nvcc12.0           Pass: 100%/8   | Total:  4h 36m | Avg: 34m 34s | Max: 54m 01s | Hits: 248%/3704  
      🟩 nvcc12.5           Pass: 100%/2   | Total:  1h 41m | Avg: 50m 39s | Max: 51m 23s
      🟩 nvcc12.6           Pass: 100%/34  | Total: 17h 03m | Avg: 30m 06s | Max: 56m 10s | Hits: 274%/7408  
    🟩 cudacxx_family
      🟩 ClangCUDA          Pass: 100%/2   | Total: 53m 15s | Avg: 26m 37s | Max: 27m 43s
      🟩 nvcc               Pass: 100%/44  | Total: 23h 21m | Avg: 31m 51s | Max: 56m 10s | Hits: 266%/11112 
    🟩 cxx
      🟩 Clang9             Pass: 100%/4   | Total:  1h 57m | Avg: 29m 26s | Max: 33m 07s
      🟩 Clang10            Pass: 100%/1   | Total: 33m 21s | Avg: 33m 21s | Max: 33m 21s
      🟩 Clang11            Pass: 100%/1   | Total: 31m 12s | Avg: 31m 12s | Max: 31m 12s
      🟩 Clang12            Pass: 100%/1   | Total: 33m 06s | Avg: 33m 06s | Max: 33m 06s
      🟩 Clang13            Pass: 100%/1   | Total: 30m 41s | Avg: 30m 41s | Max: 30m 41s
      🟩 Clang14            Pass: 100%/1   | Total: 29m 14s | Avg: 29m 14s | Max: 29m 14s
      🟩 Clang15            Pass: 100%/1   | Total: 31m 56s | Avg: 31m 56s | Max: 31m 56s
      🟩 Clang16            Pass: 100%/1   | Total: 33m 28s | Avg: 33m 28s | Max: 33m 28s
      🟩 Clang17            Pass: 100%/1   | Total: 31m 12s | Avg: 31m 12s | Max: 31m 12s
      🟩 Clang18            Pass: 100%/7   | Total:  2h 43m | Avg: 23m 17s | Max: 32m 23s
      🟩 GCC7               Pass: 100%/4   | Total:  1h 53m | Avg: 28m 15s | Max: 32m 20s
      🟩 GCC8               Pass: 100%/1   | Total: 31m 45s | Avg: 31m 45s | Max: 31m 45s
      🟩 GCC9               Pass: 100%/3   | Total:  1h 35m | Avg: 31m 53s | Max: 35m 21s
      🟩 GCC10              Pass: 100%/1   | Total: 31m 20s | Avg: 31m 20s | Max: 31m 20s
      🟩 GCC11              Pass: 100%/1   | Total: 31m 21s | Avg: 31m 21s | Max: 31m 21s
      🟩 GCC12              Pass: 100%/1   | Total: 35m 50s | Avg: 35m 50s | Max: 35m 50s
      🟩 GCC13              Pass: 100%/8   | Total:  3h 01m | Avg: 22m 38s | Max: 37m 57s
      🟩 MSVC14.29          Pass: 100%/3   | Total:  2h 33m | Avg: 51m 06s | Max: 54m 01s | Hits: 248%/5556  
      🟩 MSVC14.39          Pass: 100%/3   | Total:  2h 25m | Avg: 48m 26s | Max: 56m 10s | Hits: 283%/5556  
      🟩 NVHPC24.7          Pass: 100%/2   | Total:  1h 41m | Avg: 50m 39s | Max: 51m 23s
    🟩 cxx_family
      🟩 Clang              Pass: 100%/19  | Total:  8h 54m | Avg: 28m 09s | Max: 33m 28s
      🟩 GCC                Pass: 100%/19  | Total:  8h 40m | Avg: 27m 22s | Max: 37m 57s
      🟩 MSVC               Pass: 100%/6   | Total:  4h 58m | Avg: 49m 46s | Max: 56m 10s | Hits: 266%/11112 
      🟩 NVHPC              Pass: 100%/2   | Total:  1h 41m | Avg: 50m 39s | Max: 51m 23s
    🟩 gpu
      🟩 v100               Pass: 100%/46  | Total:  1d 00h | Avg: 31m 37s | Max: 56m 10s | Hits: 266%/11112 
    🟩 jobs
      🟩 Build              Pass: 100%/40  | Total: 22h 44m | Avg: 34m 07s | Max: 56m 10s | Hits: 246%/9260  
      🟩 TestCPU            Pass: 100%/3   | Total: 50m 53s | Avg: 16m 57s | Max: 35m 57s | Hits: 365%/1852  
      🟩 TestGPU            Pass: 100%/3   | Total: 39m 31s | Avg: 13m 10s | Max: 16m 06s
    🟩 sm
      🟩 90a                Pass: 100%/1   | Total: 18m 36s | Avg: 18m 36s | Max: 18m 36s
    🟩 std
      🟩 11                 Pass: 100%/5   | Total:  2h 08m | Avg: 25m 37s | Max: 27m 18s
      🟩 14                 Pass: 100%/3   | Total:  1h 52m | Avg: 37m 28s | Max: 48m 41s | Hits: 247%/1852  
      🟩 17                 Pass: 100%/13  | Total:  8h 18m | Avg: 38m 21s | Max: 54m 01s | Hits: 247%/5556  
      🟩 20                 Pass: 100%/23  | Total: 11h 18m | Avg: 29m 29s | Max: 56m 10s | Hits: 303%/3704  
    
  • 🟩 cccl_c_parallel: Pass: 100%/2 | Total: 10m 13s | Avg: 5m 06s | Max: 8m 04s

    🟩 cpu
      🟩 amd64              Pass: 100%/2   | Total: 10m 13s | Avg:  5m 06s | Max:  8m 04s
    🟩 ctk
      🟩 12.6               Pass: 100%/2   | Total: 10m 13s | Avg:  5m 06s | Max:  8m 04s
    🟩 cudacxx
      🟩 nvcc12.6           Pass: 100%/2   | Total: 10m 13s | Avg:  5m 06s | Max:  8m 04s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/2   | Total: 10m 13s | Avg:  5m 06s | Max:  8m 04s
    🟩 cxx
      🟩 GCC13              Pass: 100%/2   | Total: 10m 13s | Avg:  5m 06s | Max:  8m 04s
    🟩 cxx_family
      🟩 GCC                Pass: 100%/2   | Total: 10m 13s | Avg:  5m 06s | Max:  8m 04s
    🟩 gpu
      🟩 v100               Pass: 100%/2   | Total: 10m 13s | Avg:  5m 06s | Max:  8m 04s
    🟩 jobs
      🟩 Build              Pass: 100%/1   | Total:  2m 09s | Avg:  2m 09s | Max:  2m 09s
      🟩 Test               Pass: 100%/1   | Total:  8m 04s | Avg:  8m 04s | Max:  8m 04s
    
  • 🟩 python: Pass: 100%/1 | Total: 26m 09s | Avg: 26m 09s | Max: 26m 09s

    🟩 cpu
      🟩 amd64              Pass: 100%/1   | Total: 26m 09s | Avg: 26m 09s | Max: 26m 09s
    🟩 ctk
      🟩 12.6               Pass: 100%/1   | Total: 26m 09s | Avg: 26m 09s | Max: 26m 09s
    🟩 cudacxx
      🟩 nvcc12.6           Pass: 100%/1   | Total: 26m 09s | Avg: 26m 09s | Max: 26m 09s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/1   | Total: 26m 09s | Avg: 26m 09s | Max: 26m 09s
    🟩 cxx
      🟩 GCC13              Pass: 100%/1   | Total: 26m 09s | Avg: 26m 09s | Max: 26m 09s
    🟩 cxx_family
      🟩 GCC                Pass: 100%/1   | Total: 26m 09s | Avg: 26m 09s | Max: 26m 09s
    🟩 gpu
      🟩 v100               Pass: 100%/1   | Total: 26m 09s | Avg: 26m 09s | Max: 26m 09s
    🟩 jobs
      🟩 Test               Pass: 100%/1   | Total: 26m 09s | Avg: 26m 09s | Max: 26m 09s
    

👃 Inspect Changes

Modifications in project?

Project
CCCL Infrastructure
libcu++
+/- CUB
Thrust
CUDA Experimental
python
CCCL C Parallel Library
Catch2Helper

Modifications in project or dependencies?

Project
CCCL Infrastructure
libcu++
+/- CUB
+/- Thrust
CUDA Experimental
+/- python
+/- CCCL C Parallel Library
+/- Catch2Helper

🏃‍ Runner counts (total jobs: 96)

# Runner
69 linux-amd64-cpu16
11 linux-amd64-gpu-v100-latest-1
11 windows-amd64-cpu16
4 linux-arm64-cpu16
1 linux-amd64-gpu-h100-latest-1-testing

@fbusato
Contributor

fbusato commented Jan 10, 2025

could you please summarize the changes that helped to reduce the runtime?

@elstehle
Contributor Author

elstehle commented Jan 10, 2025

The PR is touching two tests:

  1. The test for verifying that large segments are sorted correctly
  2. The test for verifying that a large number of segments are sorted correctly

For (1), we switched from invoking std::stable_sort as a means of verifying that the items were sorted correctly to using histograms over the input items. This lowered per-test-instance run time from six minutes to six seconds for these tests.
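
A minimal host-side sketch of such a histogram check (illustrative names and an 8-bit key type for brevity; the actual test verifies device output): instead of sorting a reference copy with std::stable_sort, compare per-key counts of input and output and require the output to be non-decreasing.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Returns true iff `output` is a correctly sorted permutation of `input`,
// verified via per-key counts instead of a reference std::stable_sort.
bool verify_sorted_by_histogram(const std::vector<std::uint8_t>& input,
                                const std::vector<std::uint8_t>& output)
{
  if (input.size() != output.size())
  {
    return false;
  }
  std::vector<std::size_t> in_counts(256), out_counts(256);
  for (const std::uint8_t key : input)
  {
    ++in_counts[key];
  }
  for (const std::uint8_t key : output)
  {
    ++out_counts[key];
  }
  // Same multiset of keys and a non-decreasing output => correctly sorted.
  return in_counts == out_counts && std::is_sorted(output.begin(), output.end());
}
```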

For (2), (a) the tests never finished and (b) segment generation was producing overlapping segments, which led to test failures because it creates a race over which of the segments pointing to the same output region would be sorted first. So we switched from generating random inputs to generating the repeating sequence 0, 1, 2, ..., max_histo_size-1, 0, 1, 2, .... We use a fixed segment size over this input sequence, chunking it up, say, every 1000 items. We then use an analytical model to compute the histogram over the input values for a given segment and use that histogram to determine what the sorted output range of that segment must look like. E.g., if we know 0 is repeated four times in the first segment, we know the sorted sequence should start with 0 and, beginning at offset four, should continue with key 1. And so forth.
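
A host-side sketch of that analytical model (a minimal sketch with illustrative names; the actual test runs on device data with larger offset types): for the repeating input i % modulus chunked into fixed-size segments, the per-key count within a segment, and hence the expected key at any offset of the sorted segment, has a closed form.

```cpp
#include <cstddef>
#include <cstdint>

// Number of indices i in [0, n) with i % modulus == key (requires key < modulus).
constexpr std::size_t count_key_below(std::size_t n, std::size_t modulus, std::size_t key)
{
  return (n > key) ? (n - key + modulus - 1) / modulus : 0;
}

// Expected key at position `offset` of the *sorted* segment [segment_begin,
// segment_end) of the repeating input i % modulus.
constexpr std::uint32_t expected_sorted_key(
  std::size_t segment_begin, std::size_t segment_end, std::size_t modulus, std::size_t offset)
{
  std::size_t covered = 0;
  for (std::size_t key = 0; key < modulus; ++key)
  {
    // Analytical per-key histogram entry for this segment:
    const std::size_t count =
      count_key_below(segment_end, modulus, key) - count_key_below(segment_begin, modulus, key);
    if (offset < covered + count)
    {
      return static_cast<std::uint32_t>(key);
    }
    covered += count;
  }
  return static_cast<std::uint32_t>(modulus); // offset is out of range
}
```

For example, with modulus = 250 and the first segment [0, 1000), every key occurs four times, so expected_sorted_key(0, 1000, 250, 4) returns 1, matching the description above.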

@elstehle elstehle force-pushed the enh/fix-large-seg-sort-testing-time branch from b8cedc1 to 742f993 on January 14, 2025 05:12
@github-actions
Contributor

🟩 CI finished in 1h 12m: Pass: 100%/78 | Total: 18h 43m | Avg: 14m 24s | Max: 38m 50s | Hits: 416%/12340
  • 🟩 cub: Pass: 100%/38 | Total: 11h 52m | Avg: 18m 44s | Max: 38m 50s | Hits: 568%/3120

    🟩 cpu
      🟩 amd64              Pass: 100%/36  | Total: 11h 19m | Avg: 18m 53s | Max: 38m 50s | Hits: 568%/3120  
      🟩 arm64              Pass: 100%/2   | Total: 32m 32s | Avg: 16m 16s | Max: 16m 52s
    🟩 ctk
      🟩 12.0               Pass: 100%/5   | Total:  1h 43m | Avg: 20m 42s | Max: 36m 12s | Hits: 568%/780   
      🟩 12.5               Pass: 100%/2   | Total: 47m 00s | Avg: 23m 30s | Max: 24m 45s
      🟩 12.6               Pass: 100%/31  | Total:  9h 21m | Avg: 18m 07s | Max: 38m 50s | Hits: 568%/2340  
    🟩 cudacxx
      🟩 ClangCUDA18        Pass: 100%/2   | Total: 24m 42s | Avg: 12m 21s | Max: 12m 40s
      🟩 nvcc12.0           Pass: 100%/5   | Total:  1h 43m | Avg: 20m 42s | Max: 36m 12s | Hits: 568%/780   
      🟩 nvcc12.5           Pass: 100%/2   | Total: 47m 00s | Avg: 23m 30s | Max: 24m 45s
      🟩 nvcc12.6           Pass: 100%/29  | Total:  8h 57m | Avg: 18m 31s | Max: 38m 50s | Hits: 568%/2340  
    🟩 cudacxx_family
      🟩 ClangCUDA          Pass: 100%/2   | Total: 24m 42s | Avg: 12m 21s | Max: 12m 40s
      🟩 nvcc               Pass: 100%/36  | Total: 11h 27m | Avg: 19m 06s | Max: 38m 50s | Hits: 568%/3120  
    🟩 cxx
      🟩 Clang14            Pass: 100%/4   | Total:  1h 06m | Avg: 16m 31s | Max: 17m 19s
      🟩 Clang15            Pass: 100%/1   | Total: 16m 52s | Avg: 16m 52s | Max: 16m 52s
      🟩 Clang16            Pass: 100%/1   | Total: 16m 41s | Avg: 16m 41s | Max: 16m 41s
      🟩 Clang17            Pass: 100%/1   | Total: 15m 25s | Avg: 15m 25s | Max: 15m 25s
      🟩 Clang18            Pass: 100%/7   | Total:  1h 58m | Avg: 16m 52s | Max: 23m 30s
      🟩 GCC7               Pass: 100%/2   | Total: 31m 18s | Avg: 15m 39s | Max: 16m 14s
      🟩 GCC8               Pass: 100%/1   | Total: 15m 14s | Avg: 15m 14s | Max: 15m 14s
      🟩 GCC9               Pass: 100%/2   | Total: 33m 43s | Avg: 16m 51s | Max: 17m 06s
      🟩 GCC10              Pass: 100%/1   | Total: 15m 40s | Avg: 15m 40s | Max: 15m 40s
      🟩 GCC11              Pass: 100%/1   | Total: 17m 04s | Avg: 17m 04s | Max: 17m 04s
      🟩 GCC12              Pass: 100%/3   | Total: 40m 18s | Avg: 13m 26s | Max: 15m 57s
      🟩 GCC13              Pass: 100%/8   | Total:  2h 10m | Avg: 16m 16s | Max: 20m 45s
      🟩 MSVC14.29          Pass: 100%/2   | Total:  1h 13m | Avg: 36m 41s | Max: 37m 11s | Hits: 568%/1560  
      🟩 MSVC14.39          Pass: 100%/2   | Total:  1h 15m | Avg: 37m 45s | Max: 38m 50s | Hits: 568%/1560  
      🟩 NVHPC24.7          Pass: 100%/2   | Total: 47m 00s | Avg: 23m 30s | Max: 24m 45s
    🟩 cxx_family
      🟩 Clang              Pass: 100%/14  | Total:  3h 53m | Avg: 16m 39s | Max: 23m 30s
      🟩 GCC                Pass: 100%/18  | Total:  4h 43m | Avg: 15m 44s | Max: 20m 45s
      🟩 MSVC               Pass: 100%/4   | Total:  2h 28m | Avg: 37m 13s | Max: 38m 50s | Hits: 568%/3120  
      🟩 NVHPC              Pass: 100%/2   | Total: 47m 00s | Avg: 23m 30s | Max: 24m 45s
    🟩 gpu
      🟩 h100               Pass: 100%/2   | Total: 24m 49s | Avg: 12m 24s | Max: 15m 57s
      🟩 v100               Pass: 100%/36  | Total: 11h 27m | Avg: 19m 06s | Max: 38m 50s | Hits: 568%/3120  
    🟩 jobs
      🟩 Build              Pass: 100%/31  | Total:  9h 37m | Avg: 18m 38s | Max: 38m 50s | Hits: 568%/3120  
      🟩 DeviceLaunch       Pass: 100%/1   | Total: 18m 28s | Avg: 18m 28s | Max: 18m 28s
      🟩 GraphCapture       Pass: 100%/1   | Total: 15m 07s | Avg: 15m 07s | Max: 15m 07s
      🟩 HostLaunch         Pass: 100%/3   | Total: 57m 18s | Avg: 19m 06s | Max: 23m 30s
      🟩 TestGPU            Pass: 100%/2   | Total: 43m 50s | Avg: 21m 55s | Max: 23m 05s
    🟩 sm
      🟩 90                 Pass: 100%/2   | Total: 24m 49s | Avg: 12m 24s | Max: 15m 57s
      🟩 90a                Pass: 100%/1   | Total:  9m 16s | Avg:  9m 16s | Max:  9m 16s
    🟩 std
      🟩 17                 Pass: 100%/14  | Total:  4h 53m | Avg: 20m 58s | Max: 38m 50s | Hits: 568%/2340  
      🟩 20                 Pass: 100%/24  | Total:  6h 58m | Avg: 17m 27s | Max: 36m 41s | Hits: 568%/780   
    
  • 🟩 thrust: Pass: 100%/37 | Total: 6h 13m | Avg: 10m 06s | Max: 37m 15s | Hits: 365%/9220

    🟩 cmake_options
      🟩 -DTHRUST_DISPATCH_TYPE=Force32bit Pass: 100%/2   | Total: 17m 23s | Avg:  8m 41s | Max: 11m 31s
    🟩 cpu
      🟩 amd64              Pass: 100%/35  | Total:  6h 04m | Avg: 10m 24s | Max: 37m 15s | Hits: 365%/9220  
      🟩 arm64              Pass: 100%/2   | Total:  9m 31s | Avg:  4m 45s | Max:  5m 00s
    🟩 ctk
      🟩 12.0               Pass: 100%/5   | Total: 48m 56s | Avg:  9m 47s | Max: 28m 42s | Hits: 365%/1844  
      🟩 12.5               Pass: 100%/2   | Total: 29m 44s | Avg: 14m 52s | Max: 15m 24s
      🟩 12.6               Pass: 100%/30  | Total:  4h 55m | Avg:  9m 50s | Max: 37m 15s | Hits: 365%/7376  
    🟩 cudacxx
      🟩 ClangCUDA18        Pass: 100%/2   | Total: 10m 33s | Avg:  5m 16s | Max:  5m 29s
      🟩 nvcc12.0           Pass: 100%/5   | Total: 48m 56s | Avg:  9m 47s | Max: 28m 42s | Hits: 365%/1844  
      🟩 nvcc12.5           Pass: 100%/2   | Total: 29m 44s | Avg: 14m 52s | Max: 15m 24s
      🟩 nvcc12.6           Pass: 100%/28  | Total:  4h 44m | Avg: 10m 09s | Max: 37m 15s | Hits: 365%/7376  
    🟩 cudacxx_family
      🟩 ClangCUDA          Pass: 100%/2   | Total: 10m 33s | Avg:  5m 16s | Max:  5m 29s
      🟩 nvcc               Pass: 100%/35  | Total:  6h 03m | Avg: 10m 22s | Max: 37m 15s | Hits: 365%/9220  
    🟩 cxx
      🟩 Clang14            Pass: 100%/4   | Total: 21m 09s | Avg:  5m 17s | Max:  5m 50s
      🟩 Clang15            Pass: 100%/1   | Total:  5m 49s | Avg:  5m 49s | Max:  5m 49s
      🟩 Clang16            Pass: 100%/1   | Total:  5m 35s | Avg:  5m 35s | Max:  5m 35s
      🟩 Clang17            Pass: 100%/1   | Total:  5m 35s | Avg:  5m 35s | Max:  5m 35s
      🟩 Clang18            Pass: 100%/7   | Total: 49m 06s | Avg:  7m 00s | Max: 15m 32s
      🟩 GCC7               Pass: 100%/2   | Total: 10m 27s | Avg:  5m 13s | Max:  5m 29s
      🟩 GCC8               Pass: 100%/1   | Total:  5m 45s | Avg:  5m 45s | Max:  5m 45s
      🟩 GCC9               Pass: 100%/2   | Total: 11m 33s | Avg:  5m 46s | Max:  6m 00s
      🟩 GCC10              Pass: 100%/1   | Total:  5m 30s | Avg:  5m 30s | Max:  5m 30s
      🟩 GCC11              Pass: 100%/1   | Total:  5m 26s | Avg:  5m 26s | Max:  5m 26s
      🟩 GCC12              Pass: 100%/1   | Total:  6m 18s | Avg:  6m 18s | Max:  6m 18s
      🟩 GCC13              Pass: 100%/8   | Total: 58m 47s | Avg:  7m 20s | Max: 12m 12s
      🟩 MSVC14.29          Pass: 100%/2   | Total: 56m 28s | Avg: 28m 14s | Max: 28m 42s | Hits: 365%/3688  
      🟩 MSVC14.39          Pass: 100%/3   | Total:  1h 36m | Avg: 32m 11s | Max: 37m 15s | Hits: 365%/5532  
      🟩 NVHPC24.7          Pass: 100%/2   | Total: 29m 44s | Avg: 14m 52s | Max: 15m 24s
    🟩 cxx_family
      🟩 Clang              Pass: 100%/14  | Total:  1h 27m | Avg:  6m 13s | Max: 15m 32s
      🟩 GCC                Pass: 100%/16  | Total:  1h 43m | Avg:  6m 29s | Max: 12m 12s
      🟩 MSVC               Pass: 100%/5   | Total:  2h 33m | Avg: 30m 36s | Max: 37m 15s | Hits: 365%/9220  
      🟩 NVHPC              Pass: 100%/2   | Total: 29m 44s | Avg: 14m 52s | Max: 15m 24s
    🟩 gpu
      🟩 v100               Pass: 100%/37  | Total:  6h 13m | Avg: 10m 06s | Max: 37m 15s | Hits: 365%/9220  
    🟩 jobs
      🟩 Build              Pass: 100%/31  | Total:  4h 42m | Avg:  9m 06s | Max: 30m 14s | Hits: 365%/7376  
      🟩 TestCPU            Pass: 100%/3   | Total: 52m 07s | Avg: 17m 22s | Max: 37m 15s | Hits: 365%/1844  
      🟩 TestGPU            Pass: 100%/3   | Total: 39m 15s | Avg: 13m 05s | Max: 15m 32s
    🟩 sm
      🟩 90a                Pass: 100%/1   | Total:  4m 36s | Avg:  4m 36s | Max:  4m 36s
    🟩 std
      🟩 17                 Pass: 100%/14  | Total:  2h 36m | Avg: 11m 12s | Max: 30m 14s | Hits: 365%/5532  
      🟩 20                 Pass: 100%/21  | Total:  3h 19m | Avg:  9m 29s | Max: 37m 15s | Hits: 365%/3688  
    
  • 🟩 cccl_c_parallel: Pass: 100%/2 | Total: 10m 27s | Avg: 5m 13s | Max: 8m 24s

    🟩 cpu
      🟩 amd64              Pass: 100%/2   | Total: 10m 27s | Avg:  5m 13s | Max:  8m 24s
    🟩 ctk
      🟩 12.6               Pass: 100%/2   | Total: 10m 27s | Avg:  5m 13s | Max:  8m 24s
    🟩 cudacxx
      🟩 nvcc12.6           Pass: 100%/2   | Total: 10m 27s | Avg:  5m 13s | Max:  8m 24s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/2   | Total: 10m 27s | Avg:  5m 13s | Max:  8m 24s
    🟩 cxx
      🟩 GCC13              Pass: 100%/2   | Total: 10m 27s | Avg:  5m 13s | Max:  8m 24s
    🟩 cxx_family
      🟩 GCC                Pass: 100%/2   | Total: 10m 27s | Avg:  5m 13s | Max:  8m 24s
    🟩 gpu
      🟩 v100               Pass: 100%/2   | Total: 10m 27s | Avg:  5m 13s | Max:  8m 24s
    🟩 jobs
      🟩 Build              Pass: 100%/1   | Total:  2m 03s | Avg:  2m 03s | Max:  2m 03s
      🟩 Test               Pass: 100%/1   | Total:  8m 24s | Avg:  8m 24s | Max:  8m 24s
    
  • 🟩 python: Pass: 100%/1 | Total: 27m 12s | Avg: 27m 12s | Max: 27m 12s

    🟩 cpu
      🟩 amd64              Pass: 100%/1   | Total: 27m 12s | Avg: 27m 12s | Max: 27m 12s
    🟩 ctk
      🟩 12.6               Pass: 100%/1   | Total: 27m 12s | Avg: 27m 12s | Max: 27m 12s
    🟩 cudacxx
      🟩 nvcc12.6           Pass: 100%/1   | Total: 27m 12s | Avg: 27m 12s | Max: 27m 12s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/1   | Total: 27m 12s | Avg: 27m 12s | Max: 27m 12s
    🟩 cxx
      🟩 GCC13              Pass: 100%/1   | Total: 27m 12s | Avg: 27m 12s | Max: 27m 12s
    🟩 cxx_family
      🟩 GCC                Pass: 100%/1   | Total: 27m 12s | Avg: 27m 12s | Max: 27m 12s
    🟩 gpu
      🟩 v100               Pass: 100%/1   | Total: 27m 12s | Avg: 27m 12s | Max: 27m 12s
    🟩 jobs
      🟩 Test               Pass: 100%/1   | Total: 27m 12s | Avg: 27m 12s | Max: 27m 12s
    

👃 Inspect Changes

Modifications in project?

Project
CCCL Infrastructure
libcu++
+/- CUB
Thrust
CUDA Experimental
python
CCCL C Parallel Library
Catch2Helper

Modifications in project or dependencies?

Project
CCCL Infrastructure
libcu++
+/- CUB
+/- Thrust
CUDA Experimental
+/- python
+/- CCCL C Parallel Library
+/- Catch2Helper

🏃‍ Runner counts (total jobs: 78)

# Runner
53 linux-amd64-cpu16
11 linux-amd64-gpu-v100-latest-1
9 windows-amd64-cpu16
4 linux-arm64-cpu16
1 linux-amd64-gpu-h100-latest-1-testing

@github-actions
Contributor

🟩 CI finished in 1h 10m: Pass: 100%/78 | Total: 19h 37m | Avg: 15m 06s | Max: 37m 37s | Hits: 416%/12340
  • 🟩 cub: Pass: 100%/38 | Total: 12h 56m | Avg: 20m 26s | Max: 37m 37s | Hits: 568%/3120

    🟩 cpu
      🟩 amd64              Pass: 100%/36  | Total: 12h 24m | Avg: 20m 40s | Max: 37m 37s | Hits: 568%/3120  
      🟩 arm64              Pass: 100%/2   | Total: 32m 41s | Avg: 16m 20s | Max: 16m 41s
    🟩 ctk
      🟩 12.0               Pass: 100%/5   | Total:  1h 39m | Avg: 19m 53s | Max: 33m 10s | Hits: 568%/780   
      🟩 12.5               Pass: 100%/2   | Total: 48m 12s | Avg: 24m 06s | Max: 24m 27s
      🟩 12.6               Pass: 100%/31  | Total: 10h 29m | Avg: 20m 17s | Max: 37m 37s | Hits: 568%/2340  
    🟩 cudacxx
      🟩 ClangCUDA18        Pass: 100%/2   | Total: 24m 14s | Avg: 12m 07s | Max: 12m 17s
      🟩 nvcc12.0           Pass: 100%/5   | Total:  1h 39m | Avg: 19m 53s | Max: 33m 10s | Hits: 568%/780   
      🟩 nvcc12.5           Pass: 100%/2   | Total: 48m 12s | Avg: 24m 06s | Max: 24m 27s
      🟩 nvcc12.6           Pass: 100%/29  | Total: 10h 04m | Avg: 20m 51s | Max: 37m 37s | Hits: 568%/2340  
    🟩 cudacxx_family
      🟩 ClangCUDA          Pass: 100%/2   | Total: 24m 14s | Avg: 12m 07s | Max: 12m 17s
      🟩 nvcc               Pass: 100%/36  | Total: 12h 32m | Avg: 20m 54s | Max: 37m 37s | Hits: 568%/3120  
    🟩 cxx
      🟩 Clang14            Pass: 100%/4   | Total:  1h 08m | Avg: 17m 01s | Max: 17m 38s
      🟩 Clang15            Pass: 100%/1   | Total: 16m 03s | Avg: 16m 03s | Max: 16m 03s
      🟩 Clang16            Pass: 100%/1   | Total: 16m 17s | Avg: 16m 17s | Max: 16m 17s
      🟩 Clang17            Pass: 100%/1   | Total: 15m 56s | Avg: 15m 56s | Max: 15m 56s
      🟩 Clang18            Pass: 100%/7   | Total:  2h 26m | Avg: 20m 58s | Max: 37m 33s
      🟩 GCC7               Pass: 100%/2   | Total: 33m 06s | Avg: 16m 33s | Max: 17m 23s
      🟩 GCC8               Pass: 100%/1   | Total: 16m 45s | Avg: 16m 45s | Max: 16m 45s
      🟩 GCC9               Pass: 100%/2   | Total: 32m 12s | Avg: 16m 06s | Max: 16m 23s
      🟩 GCC10              Pass: 100%/1   | Total: 16m 29s | Avg: 16m 29s | Max: 16m 29s
      🟩 GCC11              Pass: 100%/1   | Total: 17m 11s | Avg: 17m 11s | Max: 17m 11s
      🟩 GCC12              Pass: 100%/3   | Total: 42m 24s | Avg: 14m 08s | Max: 17m 44s
      🟩 GCC13              Pass: 100%/8   | Total:  2h 47m | Avg: 20m 55s | Max: 37m 37s
      🟩 MSVC14.29          Pass: 100%/2   | Total:  1h 08m | Avg: 34m 11s | Max: 35m 13s | Hits: 568%/1560  
      🟩 MSVC14.39          Pass: 100%/2   | Total:  1h 11m | Avg: 35m 46s | Max: 37m 30s | Hits: 568%/1560  
      🟩 NVHPC24.7          Pass: 100%/2   | Total: 48m 12s | Avg: 24m 06s | Max: 24m 27s
    🟩 cxx_family
      🟩 Clang              Pass: 100%/14  | Total:  4h 23m | Avg: 18m 47s | Max: 37m 33s
      🟩 GCC                Pass: 100%/18  | Total:  5h 25m | Avg: 18m 05s | Max: 37m 37s
      🟩 MSVC               Pass: 100%/4   | Total:  2h 19m | Avg: 34m 58s | Max: 37m 30s | Hits: 568%/3120  
      🟩 NVHPC              Pass: 100%/2   | Total: 48m 12s | Avg: 24m 06s | Max: 24m 27s
    🟩 gpu
      🟩 h100               Pass: 100%/2   | Total: 24m 40s | Avg: 12m 20s | Max: 16m 02s
      🟩 v100               Pass: 100%/36  | Total: 12h 32m | Avg: 20m 53s | Max: 37m 37s | Hits: 568%/3120  
    🟩 jobs
      🟩 Build              Pass: 100%/31  | Total:  9h 36m | Avg: 18m 35s | Max: 37m 30s | Hits: 568%/3120  
      🟩 DeviceLaunch       Pass: 100%/1   | Total: 37m 37s | Avg: 37m 37s | Max: 37m 37s
      🟩 GraphCapture       Pass: 100%/1   | Total: 16m 47s | Avg: 16m 47s | Max: 16m 47s
      🟩 HostLaunch         Pass: 100%/3   | Total:  1h 12m | Avg: 24m 10s | Max: 37m 33s
      🟩 TestGPU            Pass: 100%/2   | Total:  1h 13m | Avg: 36m 45s | Max: 37m 00s
    🟩 sm
      🟩 90                 Pass: 100%/2   | Total: 24m 40s | Avg: 12m 20s | Max: 16m 02s
      🟩 90a                Pass: 100%/1   | Total:  8m 47s | Avg:  8m 47s | Max:  8m 47s
    🟩 std
      🟩 17                 Pass: 100%/14  | Total:  4h 47m | Avg: 20m 30s | Max: 35m 13s | Hits: 568%/2340  
      🟩 20                 Pass: 100%/24  | Total:  8h 09m | Avg: 20m 24s | Max: 37m 37s | Hits: 568%/780   
    
  • 🟩 thrust: Pass: 100%/37 | Total: 6h 05m | Avg: 9m 52s | Max: 36m 03s | Hits: 365%/9220

    🟩 cmake_options
      🟩 -DTHRUST_DISPATCH_TYPE=Force32bit Pass: 100%/2   | Total: 17m 03s | Avg:  8m 31s | Max: 11m 29s
    🟩 cpu
      🟩 amd64              Pass: 100%/35  | Total:  5h 55m | Avg: 10m 09s | Max: 36m 03s | Hits: 365%/9220  
      🟩 arm64              Pass: 100%/2   | Total:  9m 41s | Avg:  4m 50s | Max:  5m 03s
    🟩 ctk
      🟩 12.0               Pass: 100%/5   | Total: 47m 32s | Avg:  9m 30s | Max: 26m 42s | Hits: 365%/1844  
      🟩 12.5               Pass: 100%/2   | Total: 29m 31s | Avg: 14m 45s | Max: 14m 50s
      🟩 12.6               Pass: 100%/30  | Total:  4h 48m | Avg:  9m 36s | Max: 36m 03s | Hits: 365%/7376  
    🟩 cudacxx
      🟩 ClangCUDA18        Pass: 100%/2   | Total: 10m 18s | Avg:  5m 09s | Max:  5m 25s
      🟩 nvcc12.0           Pass: 100%/5   | Total: 47m 32s | Avg:  9m 30s | Max: 26m 42s | Hits: 365%/1844  
      🟩 nvcc12.5           Pass: 100%/2   | Total: 29m 31s | Avg: 14m 45s | Max: 14m 50s
      🟩 nvcc12.6           Pass: 100%/28  | Total:  4h 37m | Avg:  9m 55s | Max: 36m 03s | Hits: 365%/7376  
    🟩 cudacxx_family
      🟩 ClangCUDA          Pass: 100%/2   | Total: 10m 18s | Avg:  5m 09s | Max:  5m 25s
      🟩 nvcc               Pass: 100%/35  | Total:  5h 54m | Avg: 10m 08s | Max: 36m 03s | Hits: 365%/9220  
    🟩 cxx
      🟩 Clang14            Pass: 100%/4   | Total: 20m 57s | Avg:  5m 14s | Max:  5m 45s
      🟩 Clang15            Pass: 100%/1   | Total:  5m 36s | Avg:  5m 36s | Max:  5m 36s
      🟩 Clang16            Pass: 100%/1   | Total:  5m 52s | Avg:  5m 52s | Max:  5m 52s
      🟩 Clang17            Pass: 100%/1   | Total:  5m 25s | Avg:  5m 25s | Max:  5m 25s
      🟩 Clang18            Pass: 100%/7   | Total: 46m 31s | Avg:  6m 38s | Max: 12m 47s
      🟩 GCC7               Pass: 100%/2   | Total: 10m 24s | Avg:  5m 12s | Max:  5m 19s
      🟩 GCC8               Pass: 100%/1   | Total:  5m 43s | Avg:  5m 43s | Max:  5m 43s
      🟩 GCC9               Pass: 100%/2   | Total: 11m 16s | Avg:  5m 38s | Max:  5m 40s
      🟩 GCC10              Pass: 100%/1   | Total:  5m 47s | Avg:  5m 47s | Max:  5m 47s
      🟩 GCC11              Pass: 100%/1   | Total:  5m 43s | Avg:  5m 43s | Max:  5m 43s
      🟩 GCC12              Pass: 100%/1   | Total:  6m 04s | Avg:  6m 04s | Max:  6m 04s
      🟩 GCC13              Pass: 100%/8   | Total: 59m 03s | Avg:  7m 22s | Max: 12m 54s
      🟩 MSVC14.29          Pass: 100%/2   | Total: 53m 06s | Avg: 26m 33s | Max: 26m 42s | Hits: 365%/3688  
      🟩 MSVC14.39          Pass: 100%/3   | Total:  1h 34m | Avg: 31m 24s | Max: 36m 03s | Hits: 365%/5532  
      🟩 NVHPC24.7          Pass: 100%/2   | Total: 29m 31s | Avg: 14m 45s | Max: 14m 50s
    🟩 cxx_family
      🟩 Clang              Pass: 100%/14  | Total:  1h 24m | Avg:  6m 01s | Max: 12m 47s
      🟩 GCC                Pass: 100%/16  | Total:  1h 44m | Avg:  6m 30s | Max: 12m 54s
      🟩 MSVC               Pass: 100%/5   | Total:  2h 27m | Avg: 29m 27s | Max: 36m 03s | Hits: 365%/9220  
      🟩 NVHPC              Pass: 100%/2   | Total: 29m 31s | Avg: 14m 45s | Max: 14m 50s
    🟩 gpu
      🟩 v100               Pass: 100%/37  | Total:  6h 05m | Avg:  9m 52s | Max: 36m 03s | Hits: 365%/9220  
    🟩 jobs
      🟩 Build              Pass: 100%/31  | Total:  4h 36m | Avg:  8m 54s | Max: 30m 22s | Hits: 365%/7376  
      🟩 TestCPU            Pass: 100%/3   | Total: 51m 51s | Avg: 17m 17s | Max: 36m 03s | Hits: 365%/1844  
      🟩 TestGPU            Pass: 100%/3   | Total: 37m 10s | Avg: 12m 23s | Max: 12m 54s
    🟩 sm
      🟩 90a                Pass: 100%/1   | Total:  4m 31s | Avg:  4m 31s | Max:  4m 31s
    🟩 std
      🟩 17                 Pass: 100%/14  | Total:  2h 30m | Avg: 10m 44s | Max: 27m 47s | Hits: 365%/5532  
      🟩 20                 Pass: 100%/21  | Total:  3h 17m | Avg:  9m 24s | Max: 36m 03s | Hits: 365%/3688  
    
  • 🟩 cccl_c_parallel: Pass: 100%/2 | Total: 10m 04s | Avg: 5m 02s | Max: 7m 06s

    🟩 cpu
      🟩 amd64              Pass: 100%/2   | Total: 10m 04s | Avg:  5m 02s | Max:  7m 06s
    🟩 ctk
      🟩 12.6               Pass: 100%/2   | Total: 10m 04s | Avg:  5m 02s | Max:  7m 06s
    🟩 cudacxx
      🟩 nvcc12.6           Pass: 100%/2   | Total: 10m 04s | Avg:  5m 02s | Max:  7m 06s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/2   | Total: 10m 04s | Avg:  5m 02s | Max:  7m 06s
    🟩 cxx
      🟩 GCC13              Pass: 100%/2   | Total: 10m 04s | Avg:  5m 02s | Max:  7m 06s
    🟩 cxx_family
      🟩 GCC                Pass: 100%/2   | Total: 10m 04s | Avg:  5m 02s | Max:  7m 06s
    🟩 gpu
      🟩 v100               Pass: 100%/2   | Total: 10m 04s | Avg:  5m 02s | Max:  7m 06s
    🟩 jobs
      🟩 Build              Pass: 100%/1   | Total:  2m 58s | Avg:  2m 58s | Max:  2m 58s
      🟩 Test               Pass: 100%/1   | Total:  7m 06s | Avg:  7m 06s | Max:  7m 06s
    
  • 🟩 python: Pass: 100%/1 | Total: 25m 54s | Avg: 25m 54s | Max: 25m 54s

    🟩 cpu
      🟩 amd64              Pass: 100%/1   | Total: 25m 54s | Avg: 25m 54s | Max: 25m 54s
    🟩 ctk
      🟩 12.6               Pass: 100%/1   | Total: 25m 54s | Avg: 25m 54s | Max: 25m 54s
    🟩 cudacxx
      🟩 nvcc12.6           Pass: 100%/1   | Total: 25m 54s | Avg: 25m 54s | Max: 25m 54s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/1   | Total: 25m 54s | Avg: 25m 54s | Max: 25m 54s
    🟩 cxx
      🟩 GCC13              Pass: 100%/1   | Total: 25m 54s | Avg: 25m 54s | Max: 25m 54s
    🟩 cxx_family
      🟩 GCC                Pass: 100%/1   | Total: 25m 54s | Avg: 25m 54s | Max: 25m 54s
    🟩 gpu
      🟩 v100               Pass: 100%/1   | Total: 25m 54s | Avg: 25m 54s | Max: 25m 54s
    🟩 jobs
      🟩 Test               Pass: 100%/1   | Total: 25m 54s | Avg: 25m 54s | Max: 25m 54s
    

👃 Inspect Changes

Modifications in project?

Project
CCCL Infrastructure
libcu++
+/- CUB
Thrust
CUDA Experimental
python
CCCL C Parallel Library
Catch2Helper

Modifications in project or dependencies?

Project
CCCL Infrastructure
libcu++
+/- CUB
+/- Thrust
CUDA Experimental
+/- python
+/- CCCL C Parallel Library
+/- Catch2Helper

🏃‍ Runner counts (total jobs: 78)

# Runner
53 linux-amd64-cpu16
11 linux-amd64-gpu-v100-latest-1
9 windows-amd64-cpu16
4 linux-arm64-cpu16
1 linux-amd64-gpu-h100-latest-1-testing

@elstehle elstehle merged commit 64a419a into NVIDIA:main Jan 14, 2025
92 checks passed
shwina pushed a commit to shwina/cccl that referenced this pull request Jan 16, 2025
…s and segments (NVIDIA#3246)

* fixes segment offset generation

* switches to analytical verification

* switches to analytical verification for pairs

* fixes spelling

* adds tests for large number of segments

* fixes narrowing conversion in tests

* addresses review comments

* fixes includes
davebayer pushed a commit to davebayer/cccl that referenced this pull request Jan 18, 2025
…s and segments (NVIDIA#3246)

* fixes segment offset generation

* switches to analytical verification

* switches to analytical verification for pairs

* fixes spelling

* adds tests for large number of segments

* fixes narrowing conversion in tests

* addresses review comments

* fixes includes
davebayer pushed a commit to davebayer/cccl that referenced this pull request Jan 18, 2025
…s and segments (NVIDIA#3246)

* fixes segment offset generation

* switches to analytical verification

* switches to analytical verification for pairs

* fixes spelling

* adds tests for large number of segments

* fixes narrowing conversion in tests

* addresses review comments

* fixes includes
davebayer added a commit to davebayer/cccl that referenced this pull request Jan 20, 2025
implement `add_sat`

split `signed`/`unsigned` implementation, improve implementation for MSVC

improve device `add_sat` implementation

add `add_sat` test

improve generic `add_sat` implementation for signed types

implement `sub_sat`

allow more msvc intrinsics on x86

add op tests

partially implement `mul_sat`

implement `div_sat` and `saturate_cast`

add `saturate_cast` test

simplify `div_sat` test
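
As context for the add_sat/sub_sat/mul_sat/div_sat/saturate_cast work above, a hand-rolled sketch of the saturating-add semantics (illustrative only, not the library's implementation): on overflow the result clamps to the type's limits instead of wrapping.

```cpp
#include <climits>
#include <cstdio>

// Clamp on overflow instead of wrapping; mirrors the semantics of a
// saturating integer addition for `int`.
constexpr int add_sat_sketch(int a, int b)
{
  if (b > 0 && a > INT_MAX - b) { return INT_MAX; } // would overflow upward
  if (b < 0 && a < INT_MIN - b) { return INT_MIN; } // would overflow downward
  return a + b;
}

static_assert(add_sat_sketch(INT_MAX, 1) == INT_MAX, "clamps instead of wrapping");
static_assert(add_sat_sketch(INT_MIN, -1) == INT_MIN, "clamps instead of wrapping");

int main()
{
  std::printf("%d\n", add_sat_sketch(40, 2)); // prints 42
  return 0;
}
```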

Deprecate C++11 and C++14 for libcu++ (#3173)

* Deprecate C++11 and C++14 for libcu++

Co-authored-by: Bernhard Manfred Gruber <bernhardmgruber@gmail.com>

Implement `abs` and `div` from `cstdlib` (#3153)

* implement integer abs functions
* improve tests, fix constexpr support
* just use the our implementation
* implement `cuda::std::div`
* prefer host's `div_t` like types
* provide `cuda::std::abs` overloads for floats
* allow fp abs for NVRTC
* silence msvc's warning about conversion from floating point to integral

Fix missing radix sort policies (#3174)

Fixes NVBug 5009941

Introduces new `DeviceReduce::Arg{Min,Max}` interface with two output iterators (#3148)

* introduces new arg{min,max} interface with two output iterators

* adds fp inf tests

* fixes docs

* improves code example

* fixes exec space specifier

* trying to fix deprecation warning for more compilers

* inlines unzip operator

* trying to fix deprecation warning for nvhpc

* integrates supression fixes in diagnostics

* pre-ctk 11.5 deprecation suppression

* fixes icc

* fix for pre-ctk11.5

* cleans up deprecation suppression

* cleanup

Extend tuning documentation (#3179)

Add codespell pre-commit hook, fix typos in CCCL (#3168)

* Add codespell pre-commit hook
* Automatic changes from codespell.
* Manual changes.

Fix parameter space for TUNE_LOAD in scan benchmark (#3176)

fix various old compiler checks (#3178)

implement C++26 `std::projected` (#3175)

Fix pre-commit config for codespell and remaining typos (#3182)

Massive cleanup of our config (#3155)

Fix UB in atomics with automatic storage (#2586)

* Adds specialized local cuda atomics and injects them into most atomics paths.

Co-authored-by: Georgy Evtushenko <evtushenko.georgy@gmail.com>
Co-authored-by: gonzalobg <65027571+gonzalobg@users.noreply.github.com>

* Allow CUDA 12.2 to keep perf, this addresses earlier comments in #478

* Remove extraneous double brackets in unformatted code.

* Merge unsafe atomic logic into `__cuda_is_local`.

* Use `const_cast` for type conversions in cuda_local.h

* Fix build issues from interface changes

* Fix missing __nanosleep on sm70-

* Guard __isLocal from NVHPC

* Use PTX instead of running nothing from NVHPC

* fixup /s/nvrtc/nvhpc

* Fixup missing CUDA ifdef surrounding device code

* Fix codegen

* Bypass some sort of compiler bug on GCC7

* Apply suggestions from code review

* Use unsafe automatic storage atomics in codegen tests

---------

Co-authored-by: Georgy Evtushenko <evtushenko.georgy@gmail.com>
Co-authored-by: gonzalobg <65027571+gonzalobg@users.noreply.github.com>
Co-authored-by: Michael Schellenberger Costa <miscco@nvidia.com>

Refactor the source code layout for `cuda.parallel` (#3177)

* Refactor the source layout for cuda.parallel

* Add copyright

* Address review feedback

* Don't import anything into `experimental` namespace

* fix import

---------

Co-authored-by: Ashwin Srinath <shwina@users.noreply.github.com>

new type-erased memory resources (#2824)

s/_LIBCUDACXX_DECLSPEC_EMPTY_BASES/_CCCL_DECLSPEC_EMPTY_BASES/g (#3186)

Document address stability of `thrust::transform` (#3181)

* Do not document _LIBCUDACXX_MARK_CAN_COPY_ARGUMENTS
* Reformat and fix UnaryFunction/BinaryFunction in transform docs
* Mention transform can use proclaim_copyable_arguments
* Document cuda::proclaims_copyable_arguments better
* Deprecate depending on transform functor argument addresses

Fixes: #3053

turn off cuda version check for clangd (#3194)

[STF] jacobi example based on parallel_for (#3187)

* Simple jacobi example with parallel for and reductions

* clang-format

* remove useless capture list

fixes pre-nv_diag suppression issues (#3189)

Prefer c2h::type_name over c2h::demangle (#3195)

Fix memcpy_async* tests (#3197)

* memcpy_async_tx: Fix bug in test

Two bugs, one of which occurs in practice:

1. There is a missing fence.proxy.space::global between the writes to
   global memory and the memcpy_async_tx. (Occurs in practice)

2. The end of the kernel should be fenced with `__syncthreads()`,
   because the barrier is invalidated in the destructor. If other
   threads are still waiting on it, there will be UB. (Has not yet
   manifested itself)

* cp_async_bulk_tensor: Pre-emptively fence more in test
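
A minimal sketch of the end-of-kernel hazard described in the commit message above (illustrative, not the actual test): every thread must be past its last use of a block-scope barrier before any thread may leave the kernel.

```cpp
#include <cuda/barrier>

// A block-scope barrier placed in shared memory is torn down when the block
// finishes, so the kernel must not end while any thread could still be
// waiting on it.
__global__ void barrier_lifetime_sketch()
{
  using barrier_t = cuda::barrier<cuda::thread_scope_block>;
  __shared__ barrier_t bar;
  if (threadIdx.x == 0)
  {
    init(&bar, blockDim.x); // construct the barrier once per block
  }
  __syncthreads(); // make the initialized barrier visible to all threads

  // ... work that arrives on and waits for `bar` goes here ...
  bar.arrive_and_wait();

  __syncthreads(); // fence the end of the kernel: no thread may still be
                   // waiting on `bar` when the block exits
}
```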

Add type annotations and mypy checks for `cuda.parallel`  (#3180)

* Refactor the source layout for cuda.parallel

* Add initial type annotations

* Update pre-commit config

* More typing

* Fix bad merge

* Fix TYPE_CHECKING and numpy annotations

* typing bindings.py correctly

* Address review feedback

---------

Co-authored-by: Ashwin Srinath <shwina@users.noreply.github.com>

Fix rendering of cuda.parallel docs (#3192)

* Fix pre-commit config for codespell and remaining typos

* Fix rendering of docs for cuda.parallel

---------

Co-authored-by: Ashwin Srinath <shwina@users.noreply.github.com>

Enable PDL for DeviceMergeSortBlockSortKernel (#3199)

The kernel already contains a call to _CCCL_PDL_GRID_DEPENDENCY_SYNC.
This commit enables PDL when launching the kernel.
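
For context, a hedged sketch of what enabling PDL at launch time looks like through the plain CUDA runtime API (the wrapper below is made up; CUB's launcher machinery does the equivalent internally):

```cpp
#include <cuda_runtime.h>

__global__ void consumer_kernel()
{
  // Block until the preceding kernel's global-memory writes are visible;
  // available when compiling for sm_90 or later. Macros such as
  // _CCCL_PDL_GRID_DEPENDENCY_SYNC wrap this call behind architecture checks
  // (an assumption about the macro's purpose based on its name).
  cudaGridDependencySynchronize();
  // ... consume the producer's results ...
}

// Hypothetical helper: enable programmatic dependent launch for one kernel.
cudaError_t launch_with_pdl(dim3 grid, dim3 block, cudaStream_t stream)
{
  cudaLaunchAttribute attribute{};
  attribute.id                                         = cudaLaunchAttributeProgrammaticStreamSerialization;
  attribute.val.programmaticStreamSerializationAllowed = 1;

  cudaLaunchConfig_t config{};
  config.gridDim  = grid;
  config.blockDim = block;
  config.stream   = stream;
  config.attrs    = &attribute;
  config.numAttrs = 1;

  return cudaLaunchKernelEx(&config, consumer_kernel);
}
```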

Adds support for large `num_items` to `DeviceReduce::{ArgMin,ArgMax}` (#2647)

* adds benchmarks for reduce::arg{min,max}

* preliminary streaming arg-extremum reduction

* fixes implicit conversion

* uses streaming dispatch class

* changes arg benches to use new streaming reduce

* streaming arg-extrema reduction

* fixes style

* fixes compilation failures

* cleanups

* adds rst style comments

* declare vars const and use clamp

* consolidates argmin argmax benchmarks

* fixes thrust usage

* drops offset type in arg-extrema benchmarks

* fixes clang cuda

* exec space macros

* switch to signed global offset type for slightly better perf

* clarifies documentation

* applies minor benchmark style changes from review comments

* fixes interface documentation and comments

* list-init accumulating output op

* improves style, comments, and tests

* cleans up aggregate init

* renames dispatch class usage in benchmarks

* fixes merge conflicts

* addresses review comments

* addresses review comments

* fixes assertion

* removes superseded implementation

* changes large problem tests to use new interface

* removes obsolete tests for deprecated interface

Fixes for Python 3.7 docs environment (#3206)

Co-authored-by: Ashwin Srinath <shwina@users.noreply.github.com>

Adds support for large number of items to `DeviceTransform` (#3172)

* moves large problem test helper to common file

* adds support for large num items to device transform

* adds tests for large number of items to device interface

* fixes format

* addresses review comments

cp_async_bulk: Fix test (#3198)

* memcpy_async_tx: Fix bug in test

Two bugs, one of which occurs in practice:

1. There is a missing fence.proxy.space::global between the writes to
   global memory and the memcpy_async_tx. (Occurs in practice)

2. The end of the kernel should be fenced with `__syncthreads()`,
   because the barrier is invalidated in the destructor. If other
   threads are still waiting on it, there will be UB. (Has not yet
   manifested itself)

* cp_async_bulk_tensor: Pre-emptively fence more in test

* cp_async_bulk: Fix test

The global memory pointer could be misaligned.

cudax fixes for msvc 14.41 (#3200)

avoid instantiating class templates in `is_same` implementation when possible (#3203)

Fix: make launchers a CUB detail; make kernel source functions hidden. (#3209)

* Fix: make launchers a CUB detail; make kernel source functions hidden.

* [pre-commit.ci] auto code formatting

* Address review comments, fix which macro gets fixed.

help the ranges concepts recognize standard contiguous iterators in c++14/17 (#3202)

unify macros and cmake options that control the suppression of deprecation warnings (#3220)

* unify macros and cmake options that control the suppression of deprecation warnings

* suppress nvcc warning #186 in thrust header tests

* suppress c++ dialect deprecation warnings in libcudacxx header tests

Fix thread-reduce performance regression (#3225)

cuda.parallel: In-memory caching of build objects (#3216)

* Define __eq__ and __hash__ for Iterators

* Define cache_with_key utility and use it to cache Reduce objects

* Add tests for caching Reduce objects

* Tighten up types

* Updates to support 3.7

* Address review feedback

* Introduce IteratorKind to hold iterator type information

* Use the .kind to generate an abi_name

* Remove __eq__ and __hash__ methods from IteratorBase

* Move helper function

* Formatting

* Don't unpack tuple in cache key

---------

Co-authored-by: Ashwin Srinath <shwina@users.noreply.github.com>

Just enough ranges for c++14 `span` (#3211)

use generalized concepts portability macros to simplify the `range` concept (#3217)

fixes some issues in the concepts portability macros and then re-implements the `range` concept with `_CCCL_REQUIRES_EXPR`

Use Ruff to sort imports (#3230)

* Update pyproject.tomls for import sorting

* Update files after running pre-commit

* Move ruff config to pyproject.toml

---------

Co-authored-by: Ashwin Srinath <shwina@users.noreply.github.com>

fix tuning_scan sm90 config issue (#3236)

Co-authored-by: Shijie Chen <shijiec@nvidia.com>

[STF] Logical token (#3196)

* Split the implementation of the void interface into the definition of the interface, and its implementations on streams and graphs.

* Add missing files

* Check if a task implementation can match a prototype where the void_interface arguments are ignored

* Implement ctx.abstract_logical_data() which relies on a void data interface

* Illustrate how to use abstract handles in local contexts

* Introduce an is_void_interface() virtual method in the data interface to potentially optimize some stages

* Small improvements in the examples

* Do not try to allocate or move void data

* Do not use I as a variable

* fix linkage error

* rename abstract_logical_data into logical_token

* Document logical token

* fix spelling error

* fix sphinx error

* reflect name changes

* use meaningful variable names

* simplify logical_token implementation because writeback is already disabled

* add a unit test for token elision

* implement token elision in host_launch

* Remove unused type

* Implement helpers to check if a function can be invoked from a tuple, or from a tuple where we removed tokens

* Much simpler is_tuple_invocable_with_filtered implementation

* Fix buggy test

* Factorize code

* Document that we can ignore tokens for task and host_launch

* Documentation for logical data freeze

Fix ReduceByKey tuning (#3240)

Fix RLE tuning (#3239)

cuda.parallel: Forbid non-contiguous arrays as inputs (or outputs) (#3233)

* Forbid non-contiguous arrays as inputs (or outputs)

* Implement a more robust way to check for contiguity

* Don't bother if cublas unavailable

* Fix how we check for zero-element arrays

* sort imports

---------

Co-authored-by: Ashwin Srinath <shwina@users.noreply.github.com>

expands support for more offset types in segmented benchmark (#3231)

Add escape hatches to the cmake configuration of the header tests so that we can tests deprecated compilers / dialects (#3253)

* Add escape hatches to the cmake configuration of the header tests so that we can tests deprecated compilers / dialects

* Do not add option twice

ptx: Add add_instruction.py (#3190)

This file helps create the necessary structure for new PTX instructions.

Co-authored-by: Allard Hendriksen <ahendriksen@nvidia.com>

Bump main to 2.9.0. (#3247)

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

Drop cub::Mutex (#3251)

Fixes: #3250

Remove legacy macros from CUB util_arch.cuh (#3257)

Fixes: #3256

Remove thrust::[unary|binary]_traits (#3260)

Fixes: #3259

Architecture and OS identification macros (#3237)

Bump main to 3.0.0. (#3265)

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

Drop thrust not1 and not2 (#3264)

Fixes: #3263

CCCL Internal macro documentation (#3238)

Deprecate GridBarrier and GridBarrierLifetime (#3258)

Fixes: #1389

Require at least gcc7 (#3268)

Fixes: #3267

Drop thrust::[unary|binary]_function (#3274)

Fixes: #3273

Drop ICC from CI (#3277)

[STF] Corruption of the capture list of an extended lambda with a parallel_for construct on a host execution place (#3270)

* Add a test to reproduce a bug observed with parallel_for on a host place

* clang-format

* use _CCCL_ASSERT

* Attempt to debug

* do not create a tuple with a universal reference that is out of scope when we use it, use an lvalue instead

* fix lambda expression

* clang-format

Enable thrust::identity test for non-MSVC (#3281)

This seems to be an oversight when the test was added

Co-authored-by: Michael Schellenberger Costa <miscco@nvidia.com>

Enable PDL in triple chevron launch (#3282)

It seems PDL was disabled by accident when _THRUST_HAS_PDL was renamed
to _CCCL_HAS_PDL during the review introducing the feature.

Disambiguate line continuations and macro continuations in <nv/target> (#3244)

Drop VS 2017 from CI (#3287)

Fixes: #3286

Drop ICC support in code (#3279)

* Drop ICC from code

Fixes: #3278

Co-authored-by: Michael Schellenberger Costa <miscco@nvidia.com>

Make CUB NVRTC commandline arguments come from a cmake template (#3292)

Propose the same components (thrust, cub, libc++, cudax, cuda.parallel,...) in the bug report template as in the feature request template (#3295)

Use process isolation instead of default hyper-v for Windows. (#3294)

Try improving build times by using process isolation instead of hyper-v

Co-authored-by: Michael Schellenberger Costa <miscco@nvidia.com>

[pre-commit.ci] pre-commit autoupdate (#3248)

* [pre-commit.ci] pre-commit autoupdate

updates:
- [github.com/pre-commit/mirrors-clang-format: v18.1.8 → v19.1.6](https://github.com/pre-commit/mirrors-clang-format/compare/v18.1.8...v19.1.6)
- [github.com/astral-sh/ruff-pre-commit: v0.8.3 → v0.8.6](https://github.com/astral-sh/ruff-pre-commit/compare/v0.8.3...v0.8.6)
- [github.com/pre-commit/mirrors-mypy: v1.13.0 → v1.14.1](https://github.com/pre-commit/mirrors-mypy/compare/v1.13.0...v1.14.1)

Co-authored-by: Michael Schellenberger Costa <miscco@nvidia.com>

Drop Thrust legacy arch macros (#3298)

Which were disabled and could be re-enabled using THRUST_PROVIDE_LEGACY_ARCH_MACROS

Drop Thrust's compiler_fence.h (#3300)

Drop CTK 11.x from CI (#3275)

* Add cuda12.0-gcc7 devcontainer
* Move MSVC2017 jobs to CTK 12.6
That is the only combination where rapidsai has devcontainers
* Add /Zc:__cplusplus for the libcudacxx tests
* Only add escape hatch for affected CTKs
* Workaround missing cudaLaunchKernelEx on MSVC
cudaLaunchKernelEx requires C++11, but unfortunately <cuda_runtime.h> checks this using the __cplusplus macro, which is reported wrongly for MSVC. CTK 12.3 fixed this by additionally detecting _MSC_VER. As a workaround, we provide our own copy of cudaLaunchKernelEx when it is not available from the CTK (a minimal sketch of the detection pattern follows this commit message).
* Workaround nvcc+MSVC issue
* Regenerate devcontainers

Fixes: #3249

Co-authored-by: Michael Schellenberger Costa <miscco@nvidia.com>
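
A minimal sketch of the detection pattern behind the cudaLaunchKernelEx workaround mentioned above (the feature-gate macro is hypothetical; this is not CCCL's actual code):

```cpp
// Hypothetical feature gate, illustrating the workaround's detection logic:
// MSVC reports __cplusplus as 199711L unless /Zc:__cplusplus is passed, so a
// header that gates a C++11-only declaration purely on __cplusplus hides it
// on MSVC; additionally checking _MSC_VER restores the declaration there.
#if __cplusplus >= 201103L || (defined(_MSC_VER) && _MSC_VER >= 1900)
#  define SKETCH_HAS_CXX11_API 1
#else
#  define SKETCH_HAS_CXX11_API 0
#endif
```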

Drop CUB's util_compiler.cuh (#3302)

All contained macros were deprecated

Update packman and repo_docs versions (#3293)

Co-authored-by: Ashwin Srinath <shwina@users.noreply.github.com>

Drop Thrust's deprecated compiler macros (#3301)

Drop CUB_RUNTIME_ENABLED and __THRUST_HAS_CUDART__ (#3305)

Adds support for large number of items to `DevicePartition::If` with the `ThreeWayPartition` overload (#2506)

* adds support for large number of items to three-way partition

* adapts interface to use choose_signed_offset_t

* integrates applicable feedback from device-select pr

* changes behavior for empty problems

* unifies grid constant macro

* fixes kernel template specialization mismatch

* integrates _CCCL_GRID_CONSTANT changes

* resolve merge conflicts

* fixes checks in test

* fixes test verification

* improves tests

* makes few improvements to streaming dispatch

* improves code comment on test

* fixes unrelated compiler error

* minor style improvements

Refactor scan tunings (#3262)

Require C++17 for compiling Thrust and CUB (#3255)

* Issue an unsuppressable warning when compiling with < C++17
* Remove C++11/14 presets
* Remove CCCL_IGNORE_DEPRECATED_CPP_DIALECT from headers
* Remove [CUB|THRUST|TCT]_IGNORE_DEPRECATED_CPP_[11|14]
* Remove CUB_ENABLE_DIALECT_CPP[11|14]
* Update CI runs
* Remove C++11/14 CI runs for CUB and Thrust
* Raise compiler minimum versions for C++17
* Update ReadMe
* Drop Thrust's cpp14_required.h
* Add escape hatch for C++17 removal

Fixes: #3252

Implement `views::empty` (#3254)

* Disable pair conversion of subrange with clang in C++17

* Fix namespace views

* Implement `views::empty`

This implements `std::ranges::views::empty`, see https://en.cppreference.com/w/cpp/ranges/empty_view

Refactor `limits` and `climits` (#3221)

* implement builtins for huge val, nan and nans

* change `INFINITY` and `NAN` implementation for NVRTC

cuda.parallel: Add documentation for the current iterators along with examples and tests (#3311)

* Add tests demonstrating usage of different iterators

* Update documentation of reduce_into by merging import code snippet with the rest of the example

* Add documentation for current iterators

* Run pre-commit checks and update accordingly

* Fix comments to refer to the proper lines in the code snippets in the docs

Drop clang<14 from CI, update devcontainers. (#3309)

Co-authored-by: Bernhard Manfred Gruber <bernhardmgruber@gmail.com>

[STF] Cleanup task dependencies object constructors (#3291)

* Define tag types for access modes

* - Rework how we build task_dep objects based on access mode tags
- pack_state is now responsible for using a const_cast for read only data

* Greatly simplify the previous attempt: do not define new types, but use integral constants based on the enums

* It seems the const_cast was not necessary, so we can simplify it and not even do any dispatch based on access modes

Disable test with a gcc-14 regression (#3297)

Deprecate Thrust's cpp_compatibility.h macros (#3299)

Remove dropped function objects from docs (#3319)

Document `NV_TARGET` macros (#3313)

[STF] Define ctx.pick_stream() which was missing for the unified context (#3326)

* Define ctx.pick_stream() which was missing for the unified context

* clang-format

Deprecate cub::IterateThreadStore (#3337)

Drop CUB's BinaryFlip operator (#3332)

Deprecate cub::Swap (#3333)

Clarify transform output can overlap input (#3323)

Drop CUB APIs with a debug_synchronous parameter (#3330)

Fixes: #3329

Drop CUB's util_compiler.cuh for real (#3340)

PR #3302 planned to drop the file, but only dropped its content. This
was an oversight. So let's drop the entire file.

Drop cub::ValueCache (#3346)

limits offset types for merge sort (#3328)

Drop CDPv1 (#3344)

Fixes: #3341

Drop thrust::void_t (#3362)

Use cuda::std::addressof in Thrust (#3363)

Fix all_of documentation for empty ranges (#3358)

all_of always returns true on an empty range.
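
A tiny host-side illustration of that contract (the predicate is arbitrary):

```cpp
#include <thrust/host_vector.h>
#include <thrust/logical.h>

#include <cassert>

int main()
{
  thrust::host_vector<int> empty; // deliberately empty range
  // No element can violate the predicate, so all_of is vacuously true.
  const bool result = thrust::all_of(empty.begin(), empty.end(), [](int x) { return x > 0; });
  assert(result);
  return 0;
}
```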

[STF] Do not keep track of dangling events in a CUDA graph backend (#3327)

* Unlike the CUDA stream backend, nodes in a CUDA graph are necessarily done when
the CUDA graph completes. Therefore keeping track of "dangling events" is a
waste of time and resources.

* replace can_ignore_dangling_events by track_dangling_events which leads to more readable code

* When not storing the dangling events, we must still perform the deinit operations that were producing these events!

Extract scan kernels into NVRTC-compilable header (#3334)

* Extract scan kernels into NVRTC-compilable header

* Update cub/cub/device/dispatch/dispatch_scan.cuh

Co-authored-by: Georgii Evtushenko <evtushenko.georgy@gmail.com>

---------

Co-authored-by: Ashwin Srinath <shwina@users.noreply.github.com>
Co-authored-by: Georgii Evtushenko <evtushenko.georgy@gmail.com>

Drop deprecated aliases in Thrust functional (#3272)

Fixes: #3271

Drop cub::DivideAndRoundUp (#3347)

Use cuda::std::min/max in Thrust (#3364)

Implement `cuda::std::numeric_limits` for `__half` and `__nv_bfloat16` (#3361)

* implement `cuda::std::numeric_limits` for `__half` and `__nv_bfloat16`

Cleanup util_arch (#2773)

Deprecate thrust::null_type (#3367)

Deprecate cub::DeviceSpmv (#3320)

Fixes: #896

Improves `DeviceSegmentedSort` test run time for large number of items and segments (#3246)

* fixes segment offset generation

* switches to analytical verification

* switches to analytical verification for pairs

* fixes spelling

* adds tests for large number of segments

* fixes narrowing conversion in tests

* addresses review comments

* fixes includes

Compile basic infra test with C++17 (#3377)

Adds support for large number of items and large number of segments to `DeviceSegmentedSort` (#3308)

* fixes segment offset generation

* switches to analytical verification

* switches to analytical verification for pairs

* addresses review comments

* introduces segment offset type

* adds tests for large number of segments

* adds support for large number of segments

* drops segment offset type

* fixes thrust namespace

* removes about-to-be-deprecated cub iterators

* no exec specifier on defaulted ctor

* fixes gcc7 linker error

* uses local_segment_index_t throughout

* determine offset type based on type returned by segment iterator begin/end iterators

* minor style improvements

Exit with error when RAPIDS CI fails. (#3385)

cuda.parallel: Support structured types as algorithm inputs (#3218)

* Introduce gpu_struct decorator and typing

* Enable `reduce` to accept arrays of structs as inputs

* Add test for reducing arrays-of-struct

* Update documentation

* Use a numpy array rather than ctypes object

* Change zeros -> empty for output array and temp storage

* Add a TODO for typing GpuStruct

* Documentation updates

* Remove test_reduce_struct_type from test_reduce.py

* Revert to `to_cccl_value()` accepting ndarray + GpuStruct

* Bump copyrights

---------

Co-authored-by: Ashwin Srinath <shwina@users.noreply.github.com>

Deprecate thrust::async (#3324)

Fixes: #100

Review/Deprecate CUB `util.ptx` for CCCL 2.x (#3342)

Fix broken `_CCCL_BUILTIN_ASSUME` macro (#3314)

* add compiler-specific path
* fix device code path
* add _CCCL_ASSUME

Deprecate thrust::numeric_limits (#3366)

Replace `typedef` with `using` in libcu++ (#3368)

Deprecate thrust::optional (#3307)

Fixes: #3306

Upgrade to Catch2 3.8  (#3310)

Fixes: #1724

refactor `<cuda/std/cstdint>` (#3325)

Co-authored-by: Bernhard Manfred Gruber <bernhardmgruber@gmail.com>

Update CODEOWNERS (#3331)

* Update CODEOWNERS

* Update CODEOWNERS

* Update CODEOWNERS

* [pre-commit.ci] auto code formatting

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

Fix sign-compare warning (#3408)

Implement more cmath functions to be usable on host and device (#3382)

* Implement more cmath functions to be usable on host and device

* Implement math roots functions

* Implement exponential functions

Redefine and deprecate thrust::remove_cvref (#3394)

* Redefine and deprecate thrust::remove_cvref

Co-authored-by: Michael Schellenberger Costa <miscco@nvidia.com>

Fix assert definition for NVHPC due to constexpr issues (#3418)

NVHPC cannot decide at compile time where the code would run so _CCCL_ASSERT within a constexpr function breaks it.

Fix this by always using the host definition which should also work on device.

Fixes #3411
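
A small host-only sketch of why an assertion used inside a constexpr function must remain usable during constant evaluation (plain assert stands in for _CCCL_ASSERT; this does not reproduce the NVHPC-specific failure):

```cpp
#include <cassert>

// If the asserted condition holds, the non-constexpr failure path is never
// evaluated, so the function remains usable in constant expressions.
constexpr int checked_index(int i, int n)
{
  assert(i >= 0 && i < n);
  return i;
}

static_assert(checked_index(2, 4) == 2, "must be usable during constant evaluation");

int main()
{
  return checked_index(0, 1);
}
```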

Extend CUB reduce benchmarks (#3401)

* Rename max.cu to custom.cu, since it uses a custom operator
* Extend types covered my min.cu to all fundamental types
* Add some notes on how to collect tuning parameters

Fixes: #3283

Update upload-pages-artifact to v3 (#3423)

* Update upload-pages-artifact to v3

* Empty commit

---------

Co-authored-by: Ashwin Srinath <shwina@users.noreply.github.com>

Replace and deprecate thrust::cuda_cub::terminate (#3421)

`std::linalg` accessors and `transposed_layout` (#2962)

Add round up/down to multiple (#3234)

[FEA]: Introduce Python module with CCCL headers (#3201)

* Add cccl/python/cuda_cccl directory and use from cuda_parallel, cuda_cooperative

* Run `copy_cccl_headers_to_aude_include()` before `setup()`

* Create python/cuda_cccl/cuda/_include/__init__.py, then simply import cuda._include to find the include path.

* Add cuda.cccl._version exactly as for cuda.cooperative and cuda.parallel

* Bug fix: cuda/_include only exists after shutil.copytree() ran.

* Use `f"cuda-cccl @ file://{cccl_path}/python/cuda_cccl"` in setup.py

* Remove CustomBuildCommand, CustomWheelBuild in cuda_parallel/setup.py (they are equivalent to the default functions)

* Replace := operator (needs Python 3.8+)

* Fix oversights: remove `pip3 install ./cuda_cccl` lines from README.md

* Restore original README.md: `pip3 install -e` now works on first pass.

* cuda_cccl/README.md: FOR INTERNAL USE ONLY

* Remove `$pymajor.$pyminor.` prefix in cuda_cccl _version.py (as suggested under https://github.com/NVIDIA/cccl/pull/3201#discussion_r1894035917)

Command used: ci/update_version.sh 2 8 0

* Modernize pyproject.toml, setup.py

Trigger for this change:

* https://github.com/NVIDIA/cccl/pull/3201#discussion_r1894043178

* https://github.com/NVIDIA/cccl/pull/3201#discussion_r1894044996

* Install CCCL headers under cuda.cccl.include

Trigger for this change:

* https://github.com/NVIDIA/cccl/pull/3201#discussion_r1894048562

Unexpected accidental discovery: cuda.cooperative unit tests pass without CCCL headers entirely.

* Factor out cuda_cccl/cuda/cccl/include_paths.py

* Reuse cuda_cccl/cuda/cccl/include_paths.py from cuda_cooperative

* Add missing Copyright notice.

* Add missing __init__.py (cuda.cccl)

* Add `"cuda.cccl"` to `autodoc.mock_imports`

* Move cuda.cccl.include_paths into function where it is used. (Attempt to resolve Build and Verify Docs failure.)

* Add # TODO: move this to a module-level import

* Modernize cuda_cooperative/pyproject.toml, setup.py

* Convert cuda_cooperative to use hatchling as build backend.

* Revert "Convert cuda_cooperative to use hatchling as build backend."

This reverts commit 61637d608da06fcf6851ef6197f88b5e7dbc3bbe.

* Move numpy from [build-system] requires -> [project] dependencies

* Move pyproject.toml [project] dependencies -> setup.py install_requires, to be able to use CCCL_PATH

* Remove copy_license() and use license_files=["../../LICENSE"] instead.

* Further modernize cuda_cccl/setup.py to use pathlib

* Trivial simplifications in cuda_cccl/pyproject.toml

* Further simplify cuda_cccl/pyproject.toml, setup.py: remove inconsequential code

* Make cuda_cooperative/pyproject.toml more similar to cuda_cccl/pyproject.toml

* Add taplo-pre-commit to .pre-commit-config.yaml

* taplo-pre-commit auto-fixes

* Use pathlib in cuda_cooperative/setup.py

* CCCL_PYTHON_PATH in cuda_cooperative/setup.py

* Modernize cuda_parallel/pyproject.toml, setup.py

* Use pathlib in cuda_parallel/setup.py

* Add `# TOML lint & format` comment.

* Replace MANIFEST.in with `[tool.setuptools.package-data]` section in pyproject.toml

* Use pathlib in cuda/cccl/include_paths.py

* pre-commit autoupdate (EXCEPT clang-format, which was manually restored)

* Fixes after git merge main

* Resolve warning: AttributeError: '_Reduce' object has no attribute 'build_result'

```
=========================================================================== warnings summary ===========================================================================
tests/test_reduce.py::test_reduce_non_contiguous
  /home/coder/cccl/python/devenv/lib/python3.12/site-packages/_pytest/unraisableexception.py:85: PytestUnraisableExceptionWarning: Exception ignored in: <function _Reduce.__del__ at 0x7bf123139080>

  Traceback (most recent call last):
    File "/home/coder/cccl/python/cuda_parallel/cuda/parallel/experimental/algorithms/reduce.py", line 132, in __del__
      bindings.cccl_device_reduce_cleanup(ctypes.byref(self.build_result))
                                                       ^^^^^^^^^^^^^^^^^
  AttributeError: '_Reduce' object has no attribute 'build_result'

    warnings.warn(pytest.PytestUnraisableExceptionWarning(msg))

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
============================================================= 1 passed, 93 deselected, 1 warning in 0.44s ==============================================================
```

* Move `copy_cccl_headers_to_cuda_cccl_include()` functionality to `class CustomBuildPy`

* Introduce cuda_cooperative/constraints.txt

* Also add cuda_parallel/constraints.txt

* Add `--constraint constraints.txt` in ci/test_python.sh

* Update Copyright dates

* Switch to https://github.com/ComPWA/taplo-pre-commit (the other repo has been archived by the owner on Jul 1, 2024)

For completeness: The other repo took a long time to install into the pre-commit cache; so long it led to timeouts in the CCCL CI.

* Remove unused cuda_parallel jinja2 dependency (noticed by chance).

* Remove constraints.txt files, advertise running `pip install cuda-cccl` first instead.

* Make cuda_cooperative, cuda_parallel testing completely independent.

* Run only test_python.sh [skip-rapids][skip-matx][skip-docs][skip-vdc]

* Try using another runner (because V100 runners seem to be stuck) [skip-rapids][skip-matx][skip-docs][skip-vdc]

* Fix sign-compare warning (#3408) [skip-rapids][skip-matx][skip-docs][skip-vdc]

* Revert "Try using another runner (because V100 runners seem to be stuck) [skip-rapids][skip-matx][skip-docs][skip-vdc]"

This reverts commit ea33a218ed77a075156cd1b332047202adb25aa2.

Error message: https://github.com/NVIDIA/cccl/pull/3201#issuecomment-2594012971

* Try using A100 runner (because V100 runners still seem to be stuck) [skip-rapids][skip-matx][skip-docs][skip-vdc]

* Also show cuda-cooperative site-packages, cuda-parallel site-packages (after pip install) [skip-rapids][skip-matx][skip-docs][skip-vdc]

* Try using l4 runner (because V100 runners still seem to be stuck) [skip-rapids][skip-matx][skip-docs][skip-vdc]

* Restore original ci/matrix.yaml [skip-rapids]

* Use for loop in test_python.sh to avoid code duplication.

* Run only test_python.sh [skip-rapids][skip-matx][skip-docs][skip-vdc][skip pre-commit.ci]

* Comment out taplo-lint in pre-commit config [skip-rapids][skip-matx][skip-docs][skip-vdc]

* Revert "Run only test_python.sh [skip-rapids][skip-matx][skip-docs][skip-vdc][skip pre-commit.ci]"

This reverts commit ec206fd8b50a6a293e00a5825b579e125010b13d.

* Implement suggestion by @shwina (https://github.com/NVIDIA/cccl/pull/3201#pullrequestreview-2556918460)

* Address feedback by @leofang

---------

Co-authored-by: Bernhard Manfred Gruber <bernhardmgruber@gmail.com>

cuda.parallel: Add optional stream argument to reduce_into() (#3348)

* Add optional stream argument to reduce_into()

* Add tests to check for reduce_into() stream behavior

* Move protocol related utils to separate file and rework __cuda_stream__ error messages

* Fix synchronization issue in stream test and add one more invalid stream test case

* Rename cuda stream validation function after removing leading underscore

* Unpack values from __cuda_stream__ instead of indexing

* Fix linting errors

* Handle TypeError when unpacking invalid __cuda_stream__ return

* Use stream to allocate cupy memory in new stream test

Upgrade to actions/deploy-pages@v4 (from v2), as suggested by @leofang (#3434)

Deprecate `cub::{min, max}` and replace internal uses with those from libcu++ (#3419)

* Deprecate `cub::{min, max}` and replace internal uses with those from libcu++

Fixes #3404

move to c++17, finalize device optimization

fix msvc compilation, update tests

Deprecate C++11 and C++14 for libcu++ (#3173)

* Deprecate C++11 and C++14 for libcu++

Co-authored-by: Bernhard Manfred Gruber <bernhardmgruber@gmail.com>

Implement `abs` and `div` from `cstdlib` (#3153)

* implement integer abs functions
* improve tests, fix constexpr support
* just use our implementation
* implement `cuda::std::div`
* prefer host's `div_t` like types
* provide `cuda::std::abs` overloads for floats
* allow fp abs for NVRTC
* silence msvc's warning about conversion from floating point to integral
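
A short usage sketch of the functions named above (assuming they are exposed via <cuda/std/cstdlib>, as the bullets suggest):

```cpp
#include <cuda/std/cstdlib>

__host__ __device__ int demo()
{
  auto qr = cuda::std::div(7, 3); // qr.quot == 2, qr.rem == 1
  return cuda::std::abs(-qr.rem); // integer abs, callable from device code
}
```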

Fix missing radix sort policies (#3174)

Fixes NVBug 5009941

Introduces new `DeviceReduce::Arg{Min,Max}` interface with two output iterators (#3148)

* introduces new arg{min,max} interface with two output iterators

* adds fp inf tests

* fixes docs

* improves code example

* fixes exec space specifier

* trying to fix deprecation warning for more compilers

* inlines unzip operator

* trying to fix deprecation warning for nvhpc

* integrates suppression fixes in diagnostics

* pre-ctk 11.5 deprecation suppression

* fixes icc

* fix for pre-ctk11.5

* cleans up deprecation suppression

* cleanup
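
A hedged usage sketch of the two-output-iterator interface: the two-phase CUB call pattern is standard, but the exact parameter order shown here (extremum output before index output) is an assumption for illustration, not a copy of the documented signature.

```cpp
#include <cub/device/device_reduce.cuh>
#include <thrust/device_vector.h>
#include <cstddef>

void argmin_example(const thrust::device_vector<float>& d_in,
                    thrust::device_vector<float>& d_min,  // one element
                    thrust::device_vector<int>& d_index)  // one element
{
  const float* in = thrust::raw_pointer_cast(d_in.data());
  float* min_out  = thrust::raw_pointer_cast(d_min.data());
  int* idx_out    = thrust::raw_pointer_cast(d_index.data());

  void* d_temp_storage           = nullptr;
  std::size_t temp_storage_bytes = 0;

  // First call sizes the temporary storage, second call runs the reduction.
  cub::DeviceReduce::ArgMin(d_temp_storage, temp_storage_bytes, in, min_out, idx_out, d_in.size());

  thrust::device_vector<unsigned char> temp(temp_storage_bytes);
  d_temp_storage = thrust::raw_pointer_cast(temp.data());

  cub::DeviceReduce::ArgMin(d_temp_storage, temp_storage_bytes, in, min_out, idx_out, d_in.size());
}
```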

Extend tuning documentation (#3179)

Add codespell pre-commit hook, fix typos in CCCL (#3168)

* Add codespell pre-commit hook
* Automatic changes from codespell.
* Manual changes.

Fix parameter space for TUNE_LOAD in scan benchmark (#3176)

fix various old compiler checks (#3178)

implement C++26 `std::projected` (#3175)

Fix pre-commit config for codespell and remaining typos (#3182)

Massive cleanup of our config (#3155)

Fix UB in atomics with automatic storage (#2586)

* Adds specialized local cuda atomics and injects them into most atomics paths.

Co-authored-by: Georgy Evtushenko <evtushenko.georgy@gmail.com>
Co-authored-by: gonzalobg <65027571+gonzalobg@users.noreply.github.com>

* Allow CUDA 12.2 to keep perf, this addresses earlier comments in #478

* Remove extraneous double brackets in unformatted code.

* Merge unsafe atomic logic into `__cuda_is_local`.

* Use `const_cast` for type conversions in cuda_local.h

* Fix build issues from interface changes

* Fix missing __nanosleep on sm70-

* Guard __isLocal from NVHPC

* Use PTX instead of running nothing from NVHPC

* fixup /s/nvrtc/nvhpc

* Fixup missing CUDA ifdef surrounding device code

* Fix codegen

* Bypass some sort of compiler bug on GCC7

* Apply suggestions from code review

* Use unsafe automatic storage atomics in codegen tests

---------

Co-authored-by: Georgy Evtushenko <evtushenko.georgy@gmail.com>
Co-authored-by: gonzalobg <65027571+gonzalobg@users.noreply.github.com>
Co-authored-by: Michael Schellenberger Costa <miscco@nvidia.com>
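
A sketch of the underlying problem (not the library's actual code path): hardware atomics cannot target a thread's automatic/local storage, so an atomic that happens to live there needs a plain single-thread fallback. __isLocal is the CUDA intrinsic that detects this case at run time.

```cpp
__device__ int add_relaxed(int* ptr, int v)
{
  if (__isLocal(ptr))
  {
    // Local storage is visible only to this thread, so a plain read-modify-write suffices.
    int old = *ptr;
    *ptr    = old + v;
    return old;
  }
  return atomicAdd(ptr, v); // shared/global storage: use the real hardware atomic
}
```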

Refactor the source code layout for `cuda.parallel` (#3177)

* Refactor the source layout for cuda.parallel

* Add copyright

* Address review feedback

* Don't import anything into `experimental` namespace

* fix import

---------

Co-authored-by: Ashwin Srinath <shwina@users.noreply.github.com>

new type-erased memory resources (#2824)

s/_LIBCUDACXX_DECLSPEC_EMPTY_BASES/_CCCL_DECLSPEC_EMPTY_BASES/g (#3186)

Document address stability of `thrust::transform` (#3181)

* Do not document _LIBCUDACXX_MARK_CAN_COPY_ARGUMENTS
* Reformat and fix UnaryFunction/BinaryFunction in transform docs
* Mention transform can use proclaim_copyable_arguments
* Document cuda::proclaims_copyable_arguments better
* Deprecate depending on transform functor argument addresses

Fixes: #3053
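
A small sketch of the opt-in mentioned above, assuming cuda::proclaim_copyable_arguments is available from <cuda/functional>: wrapping the functor tells thrust::transform it may pass copies of the elements instead of references into the input range.

```cpp
#include <cuda/functional>
#include <thrust/device_vector.h>
#include <thrust/transform.h>

struct doubler
{
  __host__ __device__ float operator()(float x) const { return 2.0f * x; }
};

void scale(thrust::device_vector<float>& v)
{
  thrust::transform(v.begin(), v.end(), v.begin(),
                    cuda::proclaim_copyable_arguments(doubler{}));
}
```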

turn off cuda version check for clangd (#3194)

[STF] jacobi example based on parallel_for (#3187)

* Simple jacobi example with parallel for and reductions

* clang-format

* remove useless capture list

fixes pre-nv_diag suppression issues (#3189)

Prefer c2h::type_name over c2h::demangle (#3195)

Fix memcpy_async* tests (#3197)

* memcpy_async_tx: Fix bug in test

Two bugs, one of which occurs in practice:

1. There is a missing fence.proxy.space::global between the writes to
   global memory and the memcpy_async_tx. (Occurs in practice)

2. The end of the kernel should be fenced with `__syncthreads()`,
   because the barrier is invalidated in the destructor. If other
   threads are still waiting on it, there will be UB. (Has not yet
   manifested itself)

* cp_async_bulk_tensor: Pre-emptively fence more in test
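
A narrow sketch of the second point, using a plain block-scope barrier instead of the test's memcpy_async_tx machinery: all threads must be done with the barrier before it is invalidated at the end of the kernel, hence the trailing __syncthreads().

```cpp
#include <cuda/barrier>

__global__ void barrier_demo()
{
  __shared__ cuda::barrier<cuda::thread_scope_block> bar;
  if (threadIdx.x == 0)
  {
    init(&bar, blockDim.x); // libcu++ friend function initializing the shared barrier
  }
  __syncthreads(); // publish the initialized barrier to all threads

  bar.arrive_and_wait(); // every thread arrives and waits for the whole block

  __syncthreads(); // fence the end of the kernel: no thread may still be waiting on bar
}
```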

Add type annotations and mypy checks for `cuda.parallel`  (#3180)

* Refactor the source layout for cuda.parallel

* Add initial type annotations

* Update pre-commit config

* More typing

* Fix bad merge

* Fix TYPE_CHECKING and numpy annotations

* typing bindings.py correctly

* Address review feedback

---------

Co-authored-by: Ashwin Srinath <shwina@users.noreply.github.com>

Fix rendering of cuda.parallel docs (#3192)

* Fix pre-commit config for codespell and remaining typos

* Fix rendering of docs for cuda.parallel

---------

Co-authored-by: Ashwin Srinath <shwina@users.noreply.github.com>

Enable PDL for DeviceMergeSortBlockSortKernel (#3199)

The kernel already contains a call to _CCCL_PDL_GRID_DEPENDENCY_SYNC.
This commit enables PDL when launching the kernel.
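
Roughly, "enabling PDL when launching the kernel" means attaching the programmatic-stream-serialization launch attribute; the kernel side then calls cudaGridDependencySynchronize() before consuming its predecessor's output. The sketch below illustrates the CUDA runtime mechanism, not CUB's internal launcher.

```cpp
#include <cuda_runtime.h>

// Requires compiling for an architecture with PDL support (e.g. -arch=sm_90).
__global__ void block_sort_kernel()
{
  cudaGridDependencySynchronize(); // conceptually what _CCCL_PDL_GRID_DEPENDENCY_SYNC does
  // ... sort ...
}

void launch_with_pdl(cudaStream_t stream)
{
  cudaLaunchConfig_t config{};
  config.gridDim  = 128;
  config.blockDim = 256;
  config.stream   = stream;

  cudaLaunchAttribute attr{};
  attr.id                                         = cudaLaunchAttributeProgrammaticStreamSerialization;
  attr.val.programmaticStreamSerializationAllowed = 1;

  config.attrs    = &attr;
  config.numAttrs = 1;

  cudaLaunchKernelEx(&config, block_sort_kernel);
}
```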

Adds support for large `num_items` to `DeviceReduce::{ArgMin,ArgMax}` (#2647)

* adds benchmarks for reduce::arg{min,max}

* preliminary streaming arg-extremum reduction

* fixes implicit conversion

* uses streaming dispatch class

* changes arg benches to use new streaming reduce

* streaming arg-extrema reduction

* fixes style

* fixes compilation failures

* cleanups

* adds rst style comments

* declare vars const and use clamp

* consolidates argmin argmax benchmarks

* fixes thrust usage

* drops offset type in arg-extrema benchmarks

* fixes clang cuda

* exec space macros

* switch to signed global offset type for slightly better perf

* clarifies documentation

* applies minor benchmark style changes from review comments

* fixes interface documentation and comments

* list-init accumulating output op

* improves style, comments, and tests

* cleans up aggregate init

* renames dispatch class usage in benchmarks

* fixes merge conflicts

* addresses review comments

* addresses review comments

* fixes assertion

* removes superseded implementation

* changes large problem tests to use new interface

* removes obsolete tests for deprecated interface

Fixes for Python 3.7 docs environment (#3206)

Co-authored-by: Ashwin Srinath <shwina@users.noreply.github.com>

Adds support for large number of items to `DeviceTransform` (#3172)

* moves large problem test helper to common file

* adds support for large num items to device transform

* adds tests for large number of items to device interface

* fixes format

* addresses review comments

cp_async_bulk: Fix test (#3198)

* memcpy_async_tx: Fix bug in test

Two bugs, one of which occurs in practice:

1. There is a missing fence.proxy.space::global between the writes to
   global memory and the memcpy_async_tx. (Occurs in practice)

2. The end of the kernel should be fenced with `__syncthreads()`,
   because the barrier is invalidated in the destructor. If other
   threads are still waiting on it, there will be UB. (Has not yet
   manifested itself)

* cp_async_bulk_tensor: Pre-emptively fence more in test

* cp_async_bulk: Fix test

The global memory pointer could be misaligned.

cudax fixes for msvc 14.41 (#3200)

avoid instantiating class templates in `is_same` implementation when possible (#3203)

Fix: make launchers a CUB detail; make kernel source functions hidden. (#3209)

* Fix: make launchers a CUB detail; make kernel source functions hidden.

* [pre-commit.ci] auto code formatting

* Address review comments, fix which macro gets fixed.

help the ranges concepts recognize standard contiguous iterators in c++14/17 (#3202)

unify macros and cmake options that control the suppression of deprecation warnings (#3220)

* unify macros and cmake options that control the suppression of deprecation warnings

* suppress nvcc warning #186 in thrust header tests

* suppress c++ dialect deprecation warnings in libcudacxx header tests

Fix thread-reduce performance regression (#3225)

cuda.parallel: In-memory caching of build objects (#3216)

* Define __eq__ and __hash__ for Iterators

* Define cache_with_key utility and use it to cache Reduce objects

* Add tests for caching Reduce objects

* Tighten up types

* Updates to support 3.7

* Address review feedback

* Introduce IteratorKind to hold iterator type information

* Use the .kind to generate an abi_name

* Remove __eq__ and __hash__ methods from IteratorBase

* Move helper function

* Formatting

* Don't unpack tuple in cache key

---------

Co-authored-by: Ashwin Srinath <shwina@users.noreply.github.com>

Just enough ranges for c++14 `span` (#3211)

use generalized concepts portability macros to simplify the `range` concept (#3217)

fixes some issues in the concepts portability macros and then re-implements the `range` concept with `_CCCL_REQUIRES_EXPR`

Use Ruff to sort imports (#3230)

* Update pyproject.tomls for import sorting

* Update files after running pre-commit

* Move ruff config to pyproject.toml

---------

Co-authored-by: Ashwin Srinath <shwina@users.noreply.github.com>

fix tuning_scan sm90 config issue (#3236)

Co-authored-by: Shijie Chen <shijiec@nvidia.com>

[STF] Logical token (#3196)

* Split the implementation of the void interface into the definition of the interface, and its implementations on streams and graphs.

* Add missing files

* Check if a task implementation can match a prototype where the void_interface arguments are ignored

* Implement ctx.abstract_logical_data() which relies on a void data interface

* Illustrate how to use abstract handles in local contexts

* Introduce an is_void_interface() virtual method in the data interface to potentially optimize some stages

* Small improvements in the examples

* Do not try to allocate or move void data

* Do not use I as a variable

* fix linkage error

* rename abtract_logical_data into logical_token

* Document logical token

* fix spelling error

* fix sphinx error

* reflect name changes

* use meaningful variable names

* simplify logical_token implementation because writeback is already disabled

* add a unit test for token elision

* implement token elision in host_launch

* Remove unused type

* Implement helpers to check if a function can be invoked from a tuple, or from a tuple where we removed tokens

* Much simpler is_tuple_invocable_with_filtered implementation

* Fix buggy test

* Factorize code

* Document that we can ignore tokens for task and host_launch

* Documentation for logical data freeze

Fix ReduceByKey tuning (#3240)

Fix RLE tuning (#3239)

cuda.parallel: Forbid non-contiguous arrays as inputs (or outputs) (#3233)

* Forbid non-contiguous arrays as inputs (or outputs)

* Implement a more robust way to check for contiguity

* Don't bother if cublas unavailable

* Fix how we check for zero-element arrays

* sort imports

---------

Co-authored-by: Ashwin Srinath <shwina@users.noreply.github.com>

expands support for more offset types in segmented benchmark (#3231)

Add escape hatches to the cmake configuration of the header tests so that we can test deprecated compilers / dialects (#3253)

* Add escape hatches to the cmake configuration of the header tests so that we can test deprecated compilers / dialects

* Do not add option twice

ptx: Add add_instruction.py (#3190)

This file helps create the necessary structure for new PTX instructions.

Co-authored-by: Allard Hendriksen <ahendriksen@nvidia.com>

Bump main to 2.9.0. (#3247)

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

Drop cub::Mutex (#3251)

Fixes: #3250

Remove legacy macros from CUB util_arch.cuh (#3257)

Fixes: #3256

Remove thrust::[unary|binary]_traits (#3260)

Fixes: #3259

Architecture and OS identification macros (#3237)

Bump main to 3.0.0. (#3265)

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

Drop thrust not1 and not2 (#3264)

Fixes: #3263

CCCL Internal macro documentation (#3238)

Deprecate GridBarrier and GridBarrierLifetime (#3258)

Fixes: #1389

Require at least gcc7 (#3268)

Fixes: #3267

Drop thrust::[unary|binary]_function (#3274)

Fixes: #3273

Drop ICC from CI (#3277)

[STF] Corruption of the capture list of an extended lambda with a parallel_for construct on a host execution place (#3270)

* Add a test to reproduce a bug observed with parallel_for on a host place

* clang-format

* use _CCCL_ASSERT

* Attempt to debug

* do not create a tuple with a universal reference that is out of scope when we use it, use an lvalue instead

* fix lambda expression

* clang-format

Enable thrust::identity test for non-MSVC (#3281)

This seems to be an oversight when the test was added

Co-authored-by: Michael Schellenberger Costa <miscco@nvidia.com>

Enable PDL in triple chevron launch (#3282)

It seems PDL was disabled by accident when _THRUST_HAS_PDL was renamed
to _CCCL_HAS_PDL during the review introducing the feature.

Disambiguate line continuations and macro continuations in <nv/target> (#3244)

Drop VS 2017 from CI (#3287)

Fixes: #3286

Drop ICC support in code (#3279)

* Drop ICC from code

Fixes: #3278

Co-authored-by: Michael Schellenberger Costa <miscco@nvidia.com>

Make CUB NVRTC commandline arguments come from a cmake template (#3292)

Propose the same components (thrust, cub, libc++, cudax, cuda.parallel,...) in the bug report template as in the feature request template (#3295)

Use process isolation instead of default hyper-v for Windows. (#3294)

Try improving build times by using process isolation instead of hyper-v

Co-authored-by: Michael Schellenberger Costa <miscco@nvidia.com>

[pre-commit.ci] pre-commit autoupdate (#3248)

* [pre-commit.ci] pre-commit autoupdate

updates:
- [github.com/pre-commit/mirrors-clang-format: v18.1.8 → v19.1.6](https://github.com/pre-commit/mirrors-clang-format/compare/v18.1.8...v19.1.6)
- [github.com/astral-sh/ruff-pre-commit: v0.8.3 → v0.8.6](https://github.com/astral-sh/ruff-pre-commit/compare/v0.8.3...v0.8.6)
- [github.com/pre-commit/mirrors-mypy: v1.13.0 → v1.14.1](https://github.com/pre-commit/mirrors-mypy/compare/v1.13.0...v1.14.1)

Co-authored-by: Michael Schellenberger Costa <miscco@nvidia.com>

Drop Thrust legacy arch macros (#3298)

Which were disabled and could be re-enabled using THRUST_PROVIDE_LEGACY_ARCH_MACROS

Drop Thrust's compiler_fence.h (#3300)

Drop CTK 11.x from CI (#3275)

* Add cuda12.0-gcc7 devcontainer
* Move MSVC2017 jobs to CTK 12.6
That is the only combination where rapidsai has devcontainers
* Add /Zc:__cplusplus for the libcudacxx tests
* Only add escape hatch for affected CTKs
* Workaround missing cudaLaunchKernelEx on MSVC
cudaLaunchKernelEx requires C++11, but unfortunately <cuda_runtime.h> checks this using the __cplusplus macro, which is reported wrongly for MSVC. CTK 12.3 fixed this by additionally detecting _MSC_VER. As a workaround, we provide our own copy of cudaLaunchKernelEx when it is not available from the CTK.
* Workaround nvcc+MSVC issue
* Regenerate devcontainers

Fixes: #3249

Co-authored-by: Michael Schellenberger Costa <miscco@nvidia.com>

Update packman and repo_docs versions (#3293)

Co-authored-by: Ashwin Srinath <shwina@users.noreply.github.com>

Drop Thrust's deprecated compiler macros (#3301)

Drop CUB_RUNTIME_ENABLED and __THRUST_HAS_CUDART__ (#3305)

Adds support for large number of items to `DevicePartition::If` with the `ThreeWayPartition` overload (#2506)

* adds support for large number of items to three-way partition

* adapts interface to use choose_signed_offset_t

* integrates applicable feedback from device-select pr

* changes behavior for empty problems

* unifies grid constant macro

* fixes kernel template specialization mismatch

* integrates _CCCL_GRID_CONSTANT changes

* resolve merge conflicts

* fixes checks in test

* fixes test verification

* improves tests

* makes few improvements to streaming dispatch

* improves code comment on test

* fixes unrelated compiler error

* minor style improvements

Refactor scan tunings (#3262)

Require C++17 for compiling Thrust and CUB (#3255)

* Issue an unsuppressable warning when compiling with < C++17
* Remove C++11/14 presets
* Remove CCCL_IGNORE_DEPRECATED_CPP_DIALECT from headers
* Remove [CUB|THRUST|TCT]_IGNORE_DEPRECATED_CPP_[11|14]
* Remove CUB_ENABLE_DIALECT_CPP[11|14]
* Update CI runs
* Remove C++11/14 CI runs for CUB and Thrust
* Raise compiler minimum versions for C++17
* Update ReadMe
* Drop Thrust's cpp14_required.h
* Add escape hatch for C++17 removal

Fixes: #3252

Implement `views::empty` (#3254)

* Disable pair conversion of subrange with clang in C++17

* Fix namespace views

* Implement `views::empty`

This implements `std::ranges::views::empty`, see https://en.cppreference.com/w/cpp/ranges/empty_view
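
A minimal usage sketch, assuming the view lands under cuda::std::ranges::views as the libcu++ counterpart of the std facility:

```cpp
#include <cuda/std/ranges>

__host__ __device__ void demo()
{
  constexpr auto e = cuda::std::ranges::views::empty<int>;
  static_assert(e.size() == 0, "an empty_view never has elements");
  static_assert(e.begin() == e.end(), "");
}
```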

Refactor `limits` and `climits` (#3221)

* implement builtins for huge val, nan and nans

* change `INFINITY` and `NAN` implementation for NVRTC

cuda.parallel: Add documentation for the current iterators along with examples and tests (#3311)

* Add tests demonstrating usage of different iterators

* Update documentation of reduce_into by merging import code snippet with the rest of the example

* Add documentation for current iterators

* Run pre-commit checks and update accordingly

* Fix comments to refer to the proper lines in the code snippets in the docs

Drop clang<14 from CI, update devcontainers. (#3309)

Co-authored-by: Bernhard Manfred Gruber <bernhardmgruber@gmail.com>

[STF] Cleanup task dependencies object constructors (#3291)

* Define tag types for access modes

* - Rework how we build task_dep objects based on access mode tags
- pack_state is now responsible for using a const_cast for read only data

* Greatly simplify the previous attempt : do not define new types, but use integral constants based on the enums

* It seems the const_cast was not necessary, so we can simplify it and not even do some dispatch based on access modes

Disable test with a gcc-14 regression (#3297)

Deprecate Thrust's cpp_compatibility.h macros (#3299)

Remove dropped function objects from docs (#3319)

Document `NV_TARGET` macros (#3313)

[STF] Define ctx.pick_stream() which was missing for the unified context (#3326)

* Define ctx.pick_stream() which was missing for the unified context

* clang-format

Deprecate cub::IterateThreadStore (#3337)

Drop CUB's BinaryFlip operator (#3332)

Deprecate cub::Swap (#3333)

Clarify transform output can overlap input (#3323)

Drop CUB APIs with a debug_synchronous parameter (#3330)

Fixes: #3329

Drop CUB's util_compiler.cuh for real (#3340)

PR #3302 planned to drop the file, but only dropped its content. This
was an oversight. So let's drop the entire file.

Drop cub::ValueCache (#3346)

limits offset types for merge sort (#3328)

Drop CDPv1 (#3344)

Fixes: #3341

Drop thrust::void_t (#3362)

Use cuda::std::addressof in Thrust (#3363)

Fix all_of documentation for empty ranges (#3358)

all_of always returns true on an empty range.
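
A small example of the documented behavior (vacuous truth on an empty range):

```cpp
#include <thrust/logical.h>
#include <thrust/execution_policy.h>
#include <vector>
#include <cassert>

int main()
{
  std::vector<int> empty; // no elements at all
  bool r = thrust::all_of(thrust::host, empty.begin(), empty.end(),
                          [](int x) { return x > 0; });
  assert(r); // vacuously true: there is no element for which the predicate fails
}
```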

[STF] Do not keep track of dangling events in a CUDA graph backend (#3327)

* Unlike the CUDA stream backend, nodes in a CUDA graph are necessarily done when
the CUDA graph completes. Therefore keeping track of "dangling events" is a
waste of time and resources.

* replace can_ignore_dangling_events by track_dangling_events which leads to more readable code

* When not storing the dangling events, we must still perform the deinit operations that were producing these events!

Extract scan kernels into NVRTC-compilable header (#3334)

* Extract scan kernels into NVRTC-compilable header

* Update cub/cub/device/dispatch/dispatch_scan.cuh

Co-authored-by: Georgii Evtushenko <evtushenko.georgy@gmail.com>

---------

Co-authored-by: Ashwin Srinath <shwina@users.noreply.github.com>
Co-authored-by: Georgii Evtushenko <evtushenko.georgy@gmail.com>

Drop deprecated aliases in Thrust functional (#3272)

Fixes: #3271

Drop cub::DivideAndRoundUp (#3347)

Use cuda::std::min/max in Thrust (#3364)

Implement `cuda::std::numeric_limits` for `__half` and `__nv_bfloat16` (#3361)

* implement `cuda::std::numeric_limits` for `__half` and `__nv_bfloat16`
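
A usage sketch of the new specializations (assuming a toolkit where __half is available and <cuda/std/limits> picks it up):

```cpp
#include <cuda/std/limits>
#include <cuda_fp16.h>

__global__ void fill_with_max(__half* out)
{
  *out = cuda::std::numeric_limits<__half>::max(); // largest finite __half
}
```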

Cleanup util_arch (#2773)

Deprecate thrust::null_type (#3367)

Deprecate cub::DeviceSpmv (#3320)

Fixes: #896

Improves `DeviceSegmentedSort` test run time for large number of items and segments (#3246)

* fixes segment offset generation

* switches to analytical verification

* switches to analytical verification for pairs

* fixes spelling

* adds tests for large number of segments

* fixes narrowing conversion in tests

* addresses review comments

* fixes includes
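
For context, a minimal host-side sketch of one way such analytical verification can be done for a single sorted segment (a hypothetical helper, not the test's actual code): histogram the segment's input keys, then check that the sorted output lists each key exactly as many times as it was counted, in order.

```cpp
#include <cstdint>
#include <cstddef>
#include <vector>

bool verify_sorted_segment(const std::vector<std::uint8_t>& segment_input,
                           const std::vector<std::uint8_t>& segment_output)
{
  if (segment_input.size() != segment_output.size())
  {
    return false;
  }

  std::size_t histogram[256] = {};
  for (auto key : segment_input)
  {
    ++histogram[key];
  }

  std::size_t offset = 0;
  for (int key = 0; key < 256; ++key)
  {
    for (std::size_t i = 0; i < histogram[key]; ++i)
    {
      if (segment_output[offset++] != key)
      {
        return false; // sorted output must repeat each key histogram[key] times, in order
      }
    }
  }
  return offset == segment_output.size();
}
```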

Compile basic infra test with C++17 (#3377)

Adds support for large number of items and large number of segments to `DeviceSegmentedSort` (#3308)

* fixes segment offset generation

* switches to analytical verification

* switches to analytical verification for pairs

* addresses review comments

* introduces segment offset type

* adds tests for large number of segments

* adds support for large number of segments

* drops segment offset type

* fixes thrust namespace

* removes about-to-be-deprecated cub iterators

* no exec specifier on defaulted ctor

* fixes gcc7 linker error

* uses local_segment_index_t throughout

* determine offset type based on type returned by segment iterator begin/end iterators

* minor style improvements
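
A tiny sketch of the idea behind the last bullet, with a hypothetical alias name: the offset type is deduced from what the user's begin/end offset iterators yield, rather than being fixed up front.

```cpp
#include <iterator>
#include <type_traits>

template <typename BeginOffsetIteratorT>
using segment_offset_t = typename std::iterator_traits<BeginOffsetIteratorT>::value_type;

static_assert(std::is_same<segment_offset_t<const long long*>, long long>::value,
              "64-bit offset iterators yield a 64-bit offset type");
```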

Exit with error when RAPIDS CI fails. (#3385)

cuda.parallel: Support structured types as algorithm inputs (#3218)

* Introduce gpu_struct decorator and typing

* Enable `reduce` to accept arrays of structs as inputs

* Add test for reducing arrays-of-struct

* Update documentation

* Use a numpy array rather than ctypes object

* Change zeros -> empty for output array and temp storage

* Add a TODO for typing GpuStruct

* Documentation updates

* Remove test_reduce_struct_type from test_reduce.py

* Revert to `to_cccl_value()` accepting ndarray + GpuStruct

* Bump copyrights

---------

Co-authored-by: Ashwin Srinath <shwina@users.noreply.github.com>

Deprecate thrust::async (#3324)

Fixes: #100

Review/Deprecate CUB `util.ptx` for CCCL 2.x (#3342)

Fix broken `_CCCL_BUILTIN_ASSUME` macro (#3314)

* add compiler-specific path
* fix device code path
* add _CCC_ASSUME

Deprecate thrust::numeric_limits (#3366)

Replace `typedef` with `using` in libcu++ (#3368)

Deprecate thrust::optional (#3307)

Fixes: #3306

Upgrade to Catch2 3.8  (#3310)

Fixes: #1724

refactor `<cuda/std/cstdint>` (#3325)

Co-authored-by: Bernhard Manfred Gruber <bernhardmgruber@gmail.com>

Update CODEOWNERS (#3331)

* Update CODEOWNERS

* Update CODEOWNERS

* Update CODEOWNERS

* [pre-commit.ci] auto code formatting

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

Fix CI issues (#3443)

update docs

fix review

restrict allowed types

replace constexpr implementations with generic

optimize `__is_arithmetic_integral`
davebayer pushed a commit to davebayer/cccl that referenced this pull request Jan 22, 2025
…s and segments (NVIDIA#3246)

* fixes segment offset generation

* switches to analytical verification

* switches to analytical verification for pairs

* fixes spelling

* adds tests for large number of segments

* fixes narrowing conversion in tests

* addresses review comments

* fixes includes
davebayer added a commit to davebayer/cccl that referenced this pull request Jan 22, 2025
update docs

update docs

add `memcmp`, `memmove` and `memchr` implementations

implement tests

Use cuda::std::min/max in Thrust (NVIDIA#3364)

Implement `cuda::std::numeric_limits` for `__half` and `__nv_bfloat16` (NVIDIA#3361)

* implement `cuda::std::numeric_limits` for `__half` and `__nv_bfloat16`

Cleanup util_arch (NVIDIA#2773)

Deprecate thrust::null_type (NVIDIA#3367)

Deprecate cub::DeviceSpmv (NVIDIA#3320)

Fixes: NVIDIA#896

Improves `DeviceSegmentedSort` test run time for large number of items and segments (NVIDIA#3246)

* fixes segment offset generation

* switches to analytical verification

* switches to analytical verification for pairs

* fixes spelling

* adds tests for large number of segments

* fixes narrowing conversion in tests

* addresses review comments

* fixes includes

Compile basic infra test with C++17 (NVIDIA#3377)

Adds support for large number of items and large number of segments to `DeviceSegmentedSort` (NVIDIA#3308)

* fixes segment offset generation

* switches to analytical verification

* switches to analytical verification for pairs

* addresses review comments

* introduces segment offset type

* adds tests for large number of segments

* adds support for large number of segments

* drops segment offset type

* fixes thrust namespace

* removes about-to-be-deprecated cub iterators

* no exec specifier on defaulted ctor

* fixes gcc7 linker error

* uses local_segment_index_t throughout

* determine offset type based on type returned by segment iterator begin/end iterators

* minor style improvements

Exit with error when RAPIDS CI fails. (NVIDIA#3385)

cuda.parallel: Support structured types as algorithm inputs (NVIDIA#3218)

* Introduce gpu_struct decorator and typing

* Enable `reduce` to accept arrays of structs as inputs

* Add test for reducing arrays-of-struct

* Update documentation

* Use a numpy array rather than ctypes object

* Change zeros -> empty for output array and temp storage

* Add a TODO for typing GpuStruct

* Documentation updates

* Remove test_reduce_struct_type from test_reduce.py

* Revert to `to_cccl_value()` accepting ndarray + GpuStruct

* Bump copyrights

---------

Co-authored-by: Ashwin Srinath <shwina@users.noreply.github.com>

Deprecate thrust::async (NVIDIA#3324)

Fixes: NVIDIA#100

Review/Deprecate CUB `util.ptx` for CCCL 2.x (NVIDIA#3342)

Fix broken `_CCCL_BUILTIN_ASSUME` macro (NVIDIA#3314)

* add compiler-specific path
* fix device code path
* add _CCC_ASSUME

Deprecate thrust::numeric_limits (NVIDIA#3366)

Replace `typedef` with `using` in libcu++ (NVIDIA#3368)

Deprecate thrust::optional (NVIDIA#3307)

Fixes: NVIDIA#3306

Upgrade to Catch2 3.8  (NVIDIA#3310)

Fixes: NVIDIA#1724

refactor `<cuda/std/cstdint>` (NVIDIA#3325)

Co-authored-by: Bernhard Manfred Gruber <bernhardmgruber@gmail.com>

Update CODEOWNERS (NVIDIA#3331)

* Update CODEOWNERS

* Update CODEOWNERS

* Update CODEOWNERS

* [pre-commit.ci] auto code formatting

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

Fix sign-compare warning (NVIDIA#3408)

Implement more cmath functions to be usable on host and device (NVIDIA#3382)

* Implement more cmath functions to be usable on host and device

* Implement math roots functions

* Implement exponential functions

Redefine and deprecate thrust::remove_cvref (NVIDIA#3394)

* Redefine and deprecate thrust::remove_cvref

Co-authored-by: Michael Schellenberger Costa <miscco@nvidia.com>

Fix assert definition for NVHPC due to constexpr issues (NVIDIA#3418)

NVHPC cannot decide at compile time where the code would run so _CCCL_ASSERT within a constexpr function breaks it.

Fix this by always using the host definition which should also work on device.

Fixes NVIDIA#3411

Extend CUB reduce benchmarks (NVIDIA#3401)

* Rename max.cu to custom.cu, since it uses a custom operator
* Extend types covered by min.cu to all fundamental types
* Add some notes on how to collect tuning parameters

Fixes: NVIDIA#3283

Update upload-pages-artifact to v3 (NVIDIA#3423)

* Update upload-pages-artifact to v3

* Empty commit

---------

Co-authored-by: Ashwin Srinath <shwina@users.noreply.github.com>

Replace and deprecate thrust::cuda_cub::terminate (NVIDIA#3421)

`std::linalg` accessors and `transposed_layout` (NVIDIA#2962)

Add round up/down to multiple (NVIDIA#3234)

[FEA]: Introduce Python module with CCCL headers (NVIDIA#3201)

* Add cccl/python/cuda_cccl directory and use from cuda_parallel, cuda_cooperative

* Run `copy_cccl_headers_to_aude_include()` before `setup()`

* Create python/cuda_cccl/cuda/_include/__init__.py, then simply import cuda._include to find the include path.

* Add cuda.cccl._version exactly as for cuda.cooperative and cuda.parallel

* Bug fix: cuda/_include only exists after shutil.copytree() ran.

* Use `f"cuda-cccl @ file://{cccl_path}/python/cuda_cccl"` in setup.py

* Remove CustomBuildCommand, CustomWheelBuild in cuda_parallel/setup.py (they are equivalent to the default functions)

* Replace := operator (needs Python 3.8+)

* Fix oversights: remove `pip3 install ./cuda_cccl` lines from README.md

* Restore original README.md: `pip3 install -e` now works on first pass.

* cuda_cccl/README.md: FOR INTERNAL USE ONLY

* Remove `$pymajor.$pyminor.` prefix in cuda_cccl _version.py (as suggested under NVIDIA#3201 (comment))

Command used: ci/update_version.sh 2 8 0

* Modernize pyproject.toml, setup.py

Trigger for this change:

* NVIDIA#3201 (comment)

* NVIDIA#3201 (comment)

* Install CCCL headers under cuda.cccl.include

Trigger for this change:

* NVIDIA#3201 (comment)

Unexpected accidental discovery: cuda.cooperative unit tests pass without CCCL headers entirely.

* Factor out cuda_cccl/cuda/cccl/include_paths.py

* Reuse cuda_cccl/cuda/cccl/include_paths.py from cuda_cooperative

* Add missing Copyright notice.

* Add missing __init__.py (cuda.cccl)

* Add `"cuda.cccl"` to `autodoc.mock_imports`

* Move cuda.cccl.include_paths into function where it is used. (Attempt to resolve Build and Verify Docs failure.)

* Add # TODO: move this to a module-level import

* Modernize cuda_cooperative/pyproject.toml, setup.py

* Convert cuda_cooperative to use hatchling as build backend.

* Revert "Convert cuda_cooperative to use hatchling as build backend."

This reverts commit 61637d6.

* Move numpy from [build-system] requires -> [project] dependencies

* Move pyproject.toml [project] dependencies -> setup.py install_requires, to be able to use CCCL_PATH

* Remove copy_license() and use license_files=["../../LICENSE"] instead.

* Further modernize cuda_cccl/setup.py to use pathlib

* Trivial simplifications in cuda_cccl/pyproject.toml

* Further simplify cuda_cccl/pyproject.toml, setup.py: remove inconsequential code

* Make cuda_cooperative/pyproject.toml more similar to cuda_cccl/pyproject.toml

* Add taplo-pre-commit to .pre-commit-config.yaml

* taplo-pre-commit auto-fixes

* Use pathlib in cuda_cooperative/setup.py

* CCCL_PYTHON_PATH in cuda_cooperative/setup.py

* Modernize cuda_parallel/pyproject.toml, setup.py

* Use pathlib in cuda_parallel/setup.py

* Add `# TOML lint & format` comment.

* Replace MANIFEST.in with `[tool.setuptools.package-data]` section in pyproject.toml

* Use pathlib in cuda/cccl/include_paths.py

* pre-commit autoupdate (EXCEPT clang-format, which was manually restored)

* Fixes after git merge main

* Resolve warning: AttributeError: '_Reduce' object has no attribute 'build_result'

```
=========================================================================== warnings summary ===========================================================================
tests/test_reduce.py::test_reduce_non_contiguous
  /home/coder/cccl/python/devenv/lib/python3.12/site-packages/_pytest/unraisableexception.py:85: PytestUnraisableExceptionWarning: Exception ignored in: <function _Reduce.__del__ at 0x7bf123139080>

  Traceback (most recent call last):
    File "/home/coder/cccl/python/cuda_parallel/cuda/parallel/experimental/algorithms/reduce.py", line 132, in __del__
      bindings.cccl_device_reduce_cleanup(ctypes.byref(self.build_result))
                                                       ^^^^^^^^^^^^^^^^^
  AttributeError: '_Reduce' object has no attribute 'build_result'

    warnings.warn(pytest.PytestUnraisableExceptionWarning(msg))

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
============================================================= 1 passed, 93 deselected, 1 warning in 0.44s ==============================================================
```

* Move `copy_cccl_headers_to_cuda_cccl_include()` functionality to `class CustomBuildPy`

* Introduce cuda_cooperative/constraints.txt

* Also add cuda_parallel/constraints.txt

* Add `--constraint constraints.txt` in ci/test_python.sh

* Update Copyright dates

* Switch to https://github.com/ComPWA/taplo-pre-commit (the other repo has been archived by the owner on Jul 1, 2024)

For completeness: The other repo took a long time to install into the pre-commit cache; so long it led to timeouts in the CCCL CI.

* Remove unused cuda_parallel jinja2 dependency (noticed by chance).

* Remove constraints.txt files, advertise running `pip install cuda-cccl` first instead.

* Make cuda_cooperative, cuda_parallel testing completely independent.

* Run only test_python.sh [skip-rapids][skip-matx][skip-docs][skip-vdc]

* Try using another runner (because V100 runners seem to be stuck) [skip-rapids][skip-matx][skip-docs][skip-vdc]

* Fix sign-compare warning (NVIDIA#3408) [skip-rapids][skip-matx][skip-docs][skip-vdc]

* Revert "Try using another runner (because V100 runners seem to be stuck) [skip-rapids][skip-matx][skip-docs][skip-vdc]"

This reverts commit ea33a21.

Error message: NVIDIA#3201 (comment)

* Try using A100 runner (because V100 runners still seem to be stuck) [skip-rapids][skip-matx][skip-docs][skip-vdc]

* Also show cuda-cooperative site-packages, cuda-parallel site-packages (after pip install) [skip-rapids][skip-matx][skip-docs][skip-vdc]

* Try using l4 runner (because V100 runners still seem to be stuck) [skip-rapids][skip-matx][skip-docs][skip-vdc]

* Restore original ci/matrix.yaml [skip-rapids]

* Use for loop in test_python.sh to avoid code duplication.

* Run only test_python.sh [skip-rapids][skip-matx][skip-docs][skip-vdc][skip pre-commit.ci]

* Comment out taplo-lint in pre-commit config [skip-rapids][skip-matx][skip-docs][skip-vdc]

* Revert "Run only test_python.sh [skip-rapids][skip-matx][skip-docs][skip-vdc][skip pre-commit.ci]"

This reverts commit ec206fd.

* Implement suggestion by @shwina (NVIDIA#3201 (review))

* Address feedback by @leofang

---------

Co-authored-by: Bernhard Manfred Gruber <bernhardmgruber@gmail.com>

cuda.parallel: Add optional stream argument to reduce_into() (NVIDIA#3348)

* Add optional stream argument to reduce_into()

* Add tests to check for reduce_into() stream behavior

* Move protocol related utils to separate file and rework __cuda_stream__ error messages

* Fix synchronization issue in stream test and add one more invalid stream test case

* Rename cuda stream validation function after removing leading underscore

* Unpack values from __cuda_stream__ instead of indexing

* Fix linting errors

* Handle TypeError when unpacking invalid __cuda_stream__ return

* Use stream to allocate cupy memory in new stream test

Upgrade to actions/deploy-pages@v4 (from v2), as suggested by @leofang (NVIDIA#3434)

Deprecate `cub::{min, max}` and replace internal uses with those from libcu++ (NVIDIA#3419)

* Deprecate `cub::{min, max}` and replace internal uses with those from libcu++

Fixes NVIDIA#3404

Fix CI issues (NVIDIA#3443)

Remove deprecated `cub::min` (NVIDIA#3450)

* Remove deprecated `cuda::{min,max}`

* Drop unused `thrust::remove_cvref` file

Fix typo in builtin (NVIDIA#3451)

Moves agents to `detail::<algorithm_name>` namespace (NVIDIA#3435)

uses unsigned offset types in thrust's scan dispatch (NVIDIA#3436)

Default transform_iterator's copy ctor (NVIDIA#3395)

Fixes: NVIDIA#2393

Turn C++ dialect warning into error (NVIDIA#3453)

Uses unsigned offset types in thrust's sort algorithm calling into `DispatchMergeSort` (NVIDIA#3437)

* uses thrust's dynamic dispatch for merge_sort

* [pre-commit.ci] auto code formatting

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

Refactor allocator handling of contiguous_storage (NVIDIA#3050)

Co-authored-by: Michael Schellenberger Costa <miscco@nvidia.com>

Drop thrust::detail::integer_traits (NVIDIA#3391)

Add cuda::is_floating_point supporting half and bfloat (NVIDIA#3379)

Co-authored-by: Michael Schellenberger Costa <miscco@nvidia.com>

Improve docs of std headers (NVIDIA#3416)

Drop C++11 and C++14 support for all of cccl (NVIDIA#3417)

* Drop C++11 and C++14 support for all of cccl

---------

Co-authored-by: Bernhard Manfred Gruber <bernhardmgruber@gmail.com>

Deprecate a few CUB macros (NVIDIA#3456)

Deprecate thrust universal iterator categories (NVIDIA#3461)

Fix launch args order (NVIDIA#3465)

Add `--extended-lambda` to the list of removed clangd flags (NVIDIA#3432)

add `_CCCL_HAS_NVFP8` macro (NVIDIA#3429)

Add `_CCCL_BUILTIN_PREFETCH` (NVIDIA#3433)

Drop universal iterator categories (NVIDIA#3474)

Ensure that headers in `<cuda/*>` can be build with a C++ only compiler (NVIDIA#3472)

Specialize __is_extended_floating_point for FP8 types (NVIDIA#3470)

Also ensure that we actually can enable FP8 due to FP16 and BF16 requirements

Co-authored-by: Michael Schellenberger Costa <miscco@nvidia.com>

Moves CUB kernel entry points to a detail namespace (NVIDIA#3468)

* moves emptykernel to detail ns

* second batch

* third batch

* fourth batch

* fixes cuda parallel

* concatenates nested namespaces

Deprecate block/warp algo specializations (NVIDIA#3455)

Fixes: NVIDIA#3409

Refactor CUB's util_debug (NVIDIA#3345)
davebayer pushed a commit to davebayer/cccl that referenced this pull request Jan 22, 2025
davebayer pushed a commit to davebayer/cccl that referenced this pull request Jan 23, 2025
davebayer added a commit to davebayer/cccl that referenced this pull request Jan 23, 2025
Cleanup util_arch (NVIDIA#2773)

Improves `DeviceSegmentedSort` test run time for large number of items and segments (NVIDIA#3246)

* fixes segment offset generation

* switches to analytical verification

* switches to analytical verification for pairs

* fixes spelling

* adds tests for large number of segments

* fixes narrowing conversion in tests

* addresses review comments

* fixes includes

Adds support for large number of items and large number of segments to `DeviceSegmentedSort` (NVIDIA#3308)

* fixes segment offset generation

* switches to analytical verification

* switches to analytical verification for pairs

* addresses review comments

* introduces segment offset type

* adds tests for large number of segments

* adds support for large number of segments

* drops segment offset type

* fixes thrust namespace

* removes about-to-be-deprecated cub iterators

* no exec specifier on defaulted ctor

* fixes gcc7 linker error

* uses local_segment_index_t throughout

* determine offset type based on type returned by segment iterator begin/end iterators

* minor style improvements

cuda.parallel: Support structured types as algorithm inputs (NVIDIA#3218)

* Introduce gpu_struct decorator and typing

* Enable `reduce` to accept arrays of structs as inputs

* Add test for reducing arrays-of-struct

* Update documentation

* Use a numpy array rather than ctypes object

* Change zeros -> empty for output array and temp storage

* Add a TODO for typing GpuStruct

* Documentation updates

* Remove test_reduce_struct_type from test_reduce.py

* Revert to `to_cccl_value()` accepting ndarray + GpuStruct

* Bump copyrights

---------

Co-authored-by: Ashwin Srinath <shwina@users.noreply.github.com>

Deprecate thrust::async (NVIDIA#3324)

Fixes: NVIDIA#100

Review/Deprecate CUB `util.ptx` for CCCL 2.x (NVIDIA#3342)

Deprecate thrust::numeric_limits (NVIDIA#3366)

Upgrade to Catch2 3.8  (NVIDIA#3310)

Fixes: NVIDIA#1724

Fix sign-compare warning (NVIDIA#3408)

Implement more cmath functions to be usable on host and device (NVIDIA#3382)

* Implement more cmath functions to be usable on host and device

* Implement math roots functions

* Implement exponential functions

Redefine and deprecate thrust::remove_cvref (NVIDIA#3394)

* Redefine and deprecate thrust::remove_cvref

Co-authored-by: Michael Schellenberger Costa <miscco@nvidia.com>

cuda.parallel: Add optional stream argument to reduce_into() (NVIDIA#3348)

* Add optional stream argument to reduce_into()

* Add tests to check for reduce_into() stream behavior

* Move protocol related utils to separate file and rework __cuda_stream__ error messages

* Fix synchronization issue in stream test and add one more invalid stream test case

* Rename cuda stream validation function after removing leading underscore

* Unpack values from __cuda_stream__ instead of indexing

* Fix linting errors

* Handle TypeError when unpacking invalid __cuda_stream__ return

* Use stream to allocate cupy memory in new stream test

Deprecate `cub::{min, max}` and replace internal uses with those from libcu++ (NVIDIA#3419)

* Deprecate `cub::{min, max}` and replace internal uses with those from libcu++

Fixes NVIDIA#3404

Remove deprecated `cub::min` (NVIDIA#3450)

* Remove deprecated `cuda::{min,max}`

* Drop unused `thrust::remove_cvref` file

Fix typo in builtin (NVIDIA#3451)

Moves agents to `detail::<algorithm_name>` namespace (NVIDIA#3435)

Drop thrust::detail::integer_traits (NVIDIA#3391)

Add cuda::is_floating_point supporting half and bfloat (NVIDIA#3379)

Co-authored-by: Michael Schellenberger Costa <miscco@nvidia.com>

add `_CCCL_HAS_NVFP8` macro (NVIDIA#3429)

Specialize __is_extended_floating_point for FP8 types (NVIDIA#3470)

Also ensure that we actually can enable FP8 due to FP16 and BF16 requirements

Co-authored-by: Michael Schellenberger Costa <miscco@nvidia.com>

Moves CUB kernel entry points to a detail namespace (NVIDIA#3468)

* moves emptykernel to detail ns

* second batch

* third batch

* fourth batch

* fixes cuda parallel

* concatenates nested namespaces

Deprecate block/warp algo specializations (NVIDIA#3455)

Fixes: NVIDIA#3409

fix documentation
davebayer pushed a commit to davebayer/cccl that referenced this pull request Jan 29, 2025