Conversation

@charan-003
Contributor

@charan-003 charan-003 commented Aug 27, 2025

  • Remove legacy include/host_device.h headers
  • Replace manual element assignment with std::initializer_list
  • Use range-based for loops where appropriate
  • Apply Thrust algorithms (thrust::generate) with lambdas
  • Use .size() instead of hardcoded array sizes
  • Improve semantic naming and inline usage
  • Maintain compatibility with current CUDA/Thrust version
  • Modernized to use thrust::generate, cuda::std::distance, thrust::sequence

Following: NVIDIA/thrust#753

- Remove legacy include/host_device.h headers from 40 example files
- Replace manual element assignment with std::initializer_list
- Use range-based for loops where appropriate
- Apply STL algorithms (std::generate) with lambdas
- Use .size() instead of hardcoded array sizes
- Improve semantic naming and inline usage
- Maintain compatibility with current CUDA/Thrust version
- Avoid thrust::enumerate (not available in current version); a before/after sketch of these patterns follows the file list below

Files modernized:
arbitrary_transformation.cu, basic_vector.cu, bounding_box.cu,
bucket_sort2d.cu, constant_iterator.cu, counting_iterator.cu,
device_ptr.cu, discrete_voronoi.cu, dot_products_with_zip.cu,
expand.cu, histogram.cu, lambda.cu, lexicographical_sort.cu,
max_abs_diff.cu, minmax.cu, mode.cu, monte_carlo.cu,
monte_carlo_disjoint_sequences.cu, norm.cu, padded_grid_reduction.cu,
permutation_iterator.cu, raw_reference_cast.cu, remove_points2d.cu,
repeated_range.cu, saxpy.cu, scan_matrix_by_rows.cu,
simple_moving_average.cu, sort.cu, sorting_aos_vs_soa.cu,
stream_compaction.cu, sum_rows.cu, summary_statistics.cu,
summed_area_table.cu, tiled_range.cu, transform_input_output_iterator.cu,
transform_iterator.cu, transform_output_iterator.cu,
uninitialized_vector.cu, weld_vertices.cu, word_count.cu
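
For illustration, a minimal before/after sketch of the patterns listed above (the names data and host_data and the values are hypothetical, not lifted from any particular example file):

#include <thrust/device_vector.h>
#include <thrust/generate.h>
#include <thrust/host_vector.h>
#include <thrust/random.h>

#include <iostream>

int main()
{
  // before: manual element assignment
  //   thrust::device_vector<int> data(4);
  //   data[0] = 10; data[1] = 20; data[2] = 30; data[3] = 40;
  // after: std::initializer_list construction
  thrust::device_vector<int> data = {10, 20, 30, 40};

  // before: a fill loop over a hardcoded size
  // after: thrust::generate with a lambda, sized via .size()
  thrust::default_random_engine rng(1337);
  thrust::uniform_int_distribution<int> dist(0, 9);
  thrust::host_vector<int> host_data(data.size());
  thrust::generate(host_data.begin(), host_data.end(), [&] { return dist(rng); });

  // before: an index loop; after: a range-based for loop
  for (int x : host_data)
    std::cout << x << ' ';
  std::cout << '\n';
}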
@charan-003 charan-003 requested a review from a team as a code owner August 27, 2025 03:00
@charan-003 charan-003 requested a review from wmaxey August 27, 2025 03:00
@github-project-automation github-project-automation bot moved this to Todo in CCCL Aug 27, 2025
@copy-pr-bot
Contributor

copy-pr-bot bot commented Aug 27, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@cccl-authenticator-app cccl-authenticator-app bot moved this from Todo to In Review in CCCL Aug 27, 2025
@charan-003 charan-003 marked this pull request as draft August 27, 2025 03:05
@cccl-authenticator-app cccl-authenticator-app bot moved this from In Review to In Progress in CCCL Aug 27, 2025
@charan-003 charan-003 marked this pull request as ready for review August 27, 2025 03:32
@cccl-authenticator-app cccl-authenticator-app bot moved this from In Progress to In Review in CCCL Aug 27, 2025
@miscco
Contributor

miscco commented Aug 27, 2025

/ok to test 3b62cc1

Contributor

@miscco miscco left a comment

Thanks a lot for improving the example. This is already much better. I have some concerns about using std:: algorithms, though: they only run on the host and will segfault with device memory.

We need to use the equivalent thrust algorithms.
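
A minimal sketch of the concern (illustrative, not code from the PR): std:: algorithms execute on the host, so handing them device memory is at best slow and at worst a crash, while the thrust equivalent dispatches to the device backend.

#include <thrust/device_vector.h>
#include <thrust/generate.h>

int main()
{
  thrust::device_vector<int> d(10);

  // Problematic: std::generate runs on the host. With raw device pointers it
  // would dereference device memory on the host and crash; even with
  // device_vector iterators it degrades to one transfer per element.
  // std::generate(d.begin(), d.end(), [] { return 42; });

  // Correct: thrust::generate dispatches to the device. The lambda must be
  // callable in device code, hence the __device__ annotation (this requires
  // compiling with nvcc and --extended-lambda).
  thrust::generate(d.begin(), d.end(), [] __device__ () { return 42; });
}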

@github-project-automation github-project-automation bot moved this from In Review to In Progress in CCCL Aug 27, 2025
@github-actions
Contributor

🟨 CI finished in 2h 13m: Pass: 64%/140 | Total: 1d 01h | Avg: 10m 58s | Max: 2h 10m | Hits: 99%/69320
  • 🟥 thrust: Pass: 0%/50 | Total: 7h 53m | Avg: 9m 27s | Max: 33m 16s

    🟥 cmake_options
      🟥 -DTHRUST_DISPATCH_TYPE=Force32bit Pass:   0%/2   | Total:  7m 32s | Avg:  3m 46s | Max:  7m 32s
    🟥 cpu
      🟥 amd64              Pass:   0%/48  | Total:  7h 41m | Avg:  9m 36s | Max: 33m 16s
      🟥 arm64              Pass:   0%/2   | Total: 12m 01s | Avg:  6m 00s | Max:  7m 00s
    🟥 ctk
      🟥 12.0               Pass:   0%/5   | Total: 53m 45s | Avg: 10m 45s | Max: 29m 32s
      🟥 12.9               Pass:   0%/45  | Total:  6h 59m | Avg:  9m 19s | Max: 33m 16s
    🟥 cudacxx
      🟥 ClangCUDA19        Pass:   0%/2   | Total: 10m 58s | Avg:  5m 29s | Max:  5m 46s
      🟥 nvcc12.0           Pass:   0%/5   | Total: 53m 45s | Avg: 10m 45s | Max: 29m 32s
      🟥 nvcc12.9           Pass:   0%/43  | Total:  6h 48m | Avg:  9m 29s | Max: 33m 16s
    🟥 cudacxx_family
      🟥 ClangCUDA          Pass:   0%/2   | Total: 10m 58s | Avg:  5m 29s | Max:  5m 46s
      🟥 nvcc               Pass:   0%/48  | Total:  7h 42m | Avg:  9m 37s | Max: 33m 16s
    🟥 cxx
      🟥 Clang14            Pass:   0%/4   | Total: 23m 08s | Avg:  5m 47s | Max:  6m 20s
      🟥 Clang15            Pass:   0%/2   | Total: 11m 57s | Avg:  5m 58s | Max:  6m 17s
      🟥 Clang16            Pass:   0%/2   | Total: 11m 23s | Avg:  5m 41s | Max:  5m 45s
      🟥 Clang17            Pass:   0%/2   | Total: 11m 31s | Avg:  5m 45s | Max:  5m 48s
      🟥 Clang18            Pass:   0%/2   | Total: 11m 49s | Avg:  5m 54s | Max:  6m 09s
      🟥 Clang19            Pass:   0%/7   | Total: 27m 14s | Avg:  3m 53s | Max:  5m 46s
      🟥 GCC7               Pass:   0%/2   | Total: 13m 28s | Avg:  6m 44s | Max:  6m 46s
      🟥 GCC8               Pass:   0%/1   | Total:  7m 20s | Avg:  7m 20s | Max:  7m 20s
      🟥 GCC9               Pass:   0%/2   | Total: 14m 24s | Avg:  7m 12s | Max:  7m 39s
      🟥 GCC10              Pass:   0%/2   | Total: 14m 21s | Avg:  7m 10s | Max:  7m 17s
      🟥 GCC11              Pass:   0%/2   | Total: 15m 11s | Avg:  7m 35s | Max:  7m 36s
      🟥 GCC12              Pass:   0%/2   | Total: 16m 08s | Avg:  8m 04s | Max:  8m 13s
      🟥 GCC13              Pass:   0%/11  | Total: 50m 40s | Avg:  4m 36s | Max:  8m 21s
      🟥 MSVC14.29          Pass:   0%/2   | Total:  1h 00m | Avg: 30m 07s | Max: 30m 42s
      🟥 MSVC14.43          Pass:   0%/5   | Total:  2h 01m | Avg: 24m 23s | Max: 32m 17s
      🟥 NVHPC25.5          Pass:   0%/2   | Total:  1h 02m | Avg: 31m 12s | Max: 33m 16s
    🟥 cxx_family
      🟥 Clang              Pass:   0%/19  | Total:  1h 37m | Avg:  5m 06s | Max:  6m 20s
      🟥 GCC                Pass:   0%/22  | Total:  2h 11m | Avg:  5m 58s | Max:  8m 21s
      🟥 MSVC               Pass:   0%/7   | Total:  3h 02m | Avg: 26m 01s | Max: 32m 17s
      🟥 NVHPC              Pass:   0%/2   | Total:  1h 02m | Avg: 31m 12s | Max: 33m 16s
    🟥 gpu
      🟥 h100               Pass:   0%/2   | Total:  6m 01s | Avg:  3m 00s | Max:  6m 01s
      🟥 rtx2080            Pass:   0%/38  | Total:  6h 53m | Avg: 10m 52s | Max: 33m 16s
      🟥 rtx4090            Pass:   0%/10  | Total: 53m 43s | Avg:  5m 22s | Max: 32m 17s
    🟥 jobs
      🟥 Build              Pass:   0%/43  | Total:  7h 53m | Avg: 11m 00s | Max: 33m 16s
      🟥 TestCPU            Pass:   0%/3  
      🟥 TestGPU            Pass:   0%/4  
    🟥 sm
      🟥 90                 Pass:   0%/2   | Total:  6m 01s | Avg:  3m 00s | Max:  6m 01s
      🟥 90;90a             Pass:   0%/2   | Total: 36m 24s | Avg: 18m 12s | Max: 29m 46s
      🟥 100;120            Pass:   0%/2   | Total: 34m 59s | Avg: 17m 29s | Max: 28m 00s
    🟥 std
      🟥 17                 Pass:   0%/21  | Total:  3h 56m | Avg: 11m 16s | Max: 33m 16s
      🟥 20                 Pass:   0%/27  | Total:  3h 48m | Avg:  8m 28s | Max: 32m 17s
    
  • 🟩 cub: Pass: 100%/50 | Total: 11h 10m | Avg: 13m 24s | Max: 38m 46s | Hits: 99%/53242

    🟩 cpu
      🟩 amd64              Pass: 100%/48  | Total: 10h 56m | Avg: 13m 40s | Max: 38m 46s | Hits:  99%/50648 
      🟩 arm64              Pass: 100%/2   | Total: 14m 41s | Avg:  7m 20s | Max:  8m 42s | Hits:  99%/2594  
    🟩 ctk
      🟩 12.0               Pass: 100%/5   | Total:  1h 04m | Avg: 12m 48s | Max: 35m 08s | Hits:  99%/6378  
      🟩 12.9               Pass: 100%/45  | Total: 10h 06m | Avg: 13m 28s | Max: 38m 46s | Hits:  99%/46864 
    🟩 cudacxx
      🟩 ClangCUDA19        Pass: 100%/2   | Total: 10m 12s | Avg:  5m 06s | Max:  5m 14s | Hits:  99%/2233  
      🟩 nvcc12.0           Pass: 100%/5   | Total:  1h 04m | Avg: 12m 48s | Max: 35m 08s | Hits:  99%/6378  
      🟩 nvcc12.9           Pass: 100%/43  | Total:  9h 56m | Avg: 13m 52s | Max: 38m 46s | Hits:  99%/44631 
    🟩 cudacxx_family
      🟩 ClangCUDA          Pass: 100%/2   | Total: 10m 12s | Avg:  5m 06s | Max:  5m 14s | Hits:  99%/2233  
      🟩 nvcc               Pass: 100%/48  | Total: 11h 00m | Avg: 13m 45s | Max: 38m 46s | Hits:  99%/51009 
    🟩 cxx
      🟩 Clang14            Pass: 100%/4   | Total: 26m 00s | Avg:  6m 30s | Max:  6m 49s | Hits:  99%/5190  
      🟩 Clang15            Pass: 100%/2   | Total: 13m 44s | Avg:  6m 52s | Max:  6m 53s | Hits:  99%/2591  
      🟩 Clang16            Pass: 100%/2   | Total: 13m 53s | Avg:  6m 56s | Max:  7m 12s | Hits:  99%/2591  
      🟩 Clang17            Pass: 100%/2   | Total: 14m 34s | Avg:  7m 17s | Max:  7m 23s | Hits:  99%/2591  
      🟩 Clang18            Pass: 100%/2   | Total: 14m 13s | Avg:  7m 06s | Max:  7m 07s | Hits:  99%/2591  
      🟩 Clang19            Pass: 100%/7   | Total:  1h 14m | Avg: 10m 38s | Max: 24m 08s | Hits:  99%/6120  
      🟩 GCC7               Pass: 100%/2   | Total: 16m 41s | Avg:  8m 20s | Max:  8m 35s | Hits:  99%/2594  
      🟩 GCC8               Pass: 100%/1   | Total:  8m 47s | Avg:  8m 47s | Max:  8m 47s | Hits:  99%/1297  
      🟩 GCC9               Pass: 100%/2   | Total: 17m 30s | Avg:  8m 45s | Max:  9m 04s | Hits:  99%/2594  
      🟩 GCC10              Pass: 100%/2   | Total: 18m 35s | Avg:  9m 17s | Max:  9m 27s | Hits:  99%/2595  
      🟩 GCC11              Pass: 100%/2   | Total: 18m 04s | Avg:  9m 02s | Max:  9m 05s | Hits:  99%/2591  
      🟩 GCC12              Pass: 100%/2   | Total: 19m 28s | Avg:  9m 44s | Max:  9m 52s | Hits:  99%/2591  
      🟩 GCC13              Pass: 100%/12  | Total:  3h 03m | Avg: 15m 16s | Max: 25m 20s | Hits:  99%/7785  
      🟩 MSVC14.29          Pass: 100%/2   | Total:  1h 09m | Avg: 34m 40s | Max: 35m 08s | Hits:  99%/2378  
      🟩 MSVC14.43          Pass: 100%/4   | Total:  2h 15m | Avg: 33m 46s | Max: 38m 46s | Hits:  99%/4756  
      🟩 NVHPC25.5          Pass: 100%/2   | Total: 27m 00s | Avg: 13m 30s | Max: 13m 55s | Hits:  98%/2387  
    🟩 cxx_family
      🟩 Clang              Pass: 100%/19  | Total:  2h 36m | Avg:  8m 15s | Max: 24m 08s | Hits:  99%/21674 
      🟩 GCC                Pass: 100%/23  | Total:  4h 42m | Avg: 12m 16s | Max: 25m 20s | Hits:  99%/22047 
      🟩 MSVC               Pass: 100%/6   | Total:  3h 24m | Avg: 34m 04s | Max: 38m 46s | Hits:  99%/7134  
      🟩 NVHPC              Pass: 100%/2   | Total: 27m 00s | Avg: 13m 30s | Max: 13m 55s | Hits:  98%/2387  
    🟩 gpu
      🟩 h100               Pass: 100%/3   | Total: 56m 50s | Avg: 18m 56s | Max: 25m 20s | Hits:  99%/1298  
      🟩 rtx2080            Pass: 100%/39  | Total:  7h 51m | Avg: 12m 04s | Max: 38m 46s | Hits:  99%/49350 
      🟩 rtxa6000           Pass: 100%/8   | Total:  2h 22m | Avg: 17m 50s | Max: 24m 08s | Hits:  99%/2594  
    🟩 jobs
      🟩 Build              Pass: 100%/42  | Total:  8h 15m | Avg: 11m 47s | Max: 38m 46s | Hits:  99%/53242 
      🟩 DeviceLaunch       Pass: 100%/1   | Total: 22m 07s | Avg: 22m 07s | Max: 22m 07s
      🟩 GraphCapture       Pass: 100%/1   | Total: 14m 49s | Avg: 14m 49s | Max: 14m 49s
      🟩 HostLaunch         Pass: 100%/3   | Total:  1h 11m | Avg: 23m 58s | Max: 24m 10s
      🟩 TestGPU            Pass: 100%/3   | Total:  1h 06m | Avg: 22m 16s | Max: 25m 20s
    🟩 sm
      🟩 90                 Pass: 100%/3   | Total: 56m 50s | Avg: 18m 56s | Max: 25m 20s | Hits:  99%/1298  
      🟩 90;90a             Pass: 100%/2   | Total: 38m 01s | Avg: 19m 00s | Max: 29m 58s | Hits:  99%/2487  
      🟩 100;120            Pass: 100%/2   | Total: 39m 22s | Avg: 19m 41s | Max: 31m 00s | Hits:  99%/2487  
    🟩 std
      🟩 17                 Pass: 100%/21  | Total:  4h 11m | Avg: 11m 59s | Max: 35m 22s | Hits:  99%/26612 
      🟩 20                 Pass: 100%/29  | Total:  6h 58m | Avg: 14m 26s | Max: 38m 46s | Hits:  99%/26630 
    
  • 🟩 cudax: Pass: 100%/28 | Total: 2h 57m | Avg: 6m 19s | Max: 28m 11s | Hits: 99%/15398

    🟩 cpu
      🟩 amd64              Pass: 100%/24  | Total:  2h 45m | Avg:  6m 52s | Max: 28m 11s | Hits:  99%/13026 
      🟩 arm64              Pass: 100%/4   | Total: 11m 56s | Avg:  2m 59s | Max:  3m 20s | Hits:  99%/2372  
    🟩 ctk
      🟩 12.0               Pass: 100%/3   | Total: 19m 09s | Avg:  6m 23s | Max: 12m 57s | Hits:  98%/1476  
      🟩 12.9               Pass: 100%/25  | Total:  2h 37m | Avg:  6m 18s | Max: 28m 11s | Hits:  99%/13922 
    🟩 cudacxx
      🟩 nvcc12.0           Pass: 100%/3   | Total: 19m 09s | Avg:  6m 23s | Max: 12m 57s | Hits:  98%/1476  
      🟩 nvcc12.9           Pass: 100%/25  | Total:  2h 37m | Avg:  6m 18s | Max: 28m 11s | Hits:  99%/13922 
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/28  | Total:  2h 57m | Avg:  6m 19s | Max: 28m 11s | Hits:  99%/15398 
    🟩 cxx
      🟩 Clang14            Pass: 100%/2   | Total:  6m 14s | Avg:  3m 07s | Max:  3m 23s | Hits: 100%/1188  
      🟩 Clang15            Pass: 100%/1   | Total:  3m 29s | Avg:  3m 29s | Max:  3m 29s | Hits: 100%/593   
      🟩 Clang16            Pass: 100%/1   | Total:  3m 12s | Avg:  3m 12s | Max:  3m 12s | Hits: 100%/593   
      🟩 Clang17            Pass: 100%/1   | Total:  3m 13s | Avg:  3m 13s | Max:  3m 13s | Hits: 100%/593   
      🟩 Clang18            Pass: 100%/1   | Total:  3m 23s | Avg:  3m 23s | Max:  3m 23s | Hits: 100%/593   
      🟩 Clang19            Pass: 100%/4   | Total: 36m 47s | Avg:  9m 11s | Max: 28m 11s | Hits: 100%/2372  
      🟩 GCC10              Pass: 100%/2   | Total:  7m 02s | Avg:  3m 31s | Max:  3m 41s | Hits:  99%/1188  
      🟩 GCC11              Pass: 100%/1   | Total:  3m 41s | Avg:  3m 41s | Max:  3m 41s | Hits:  99%/593   
      🟩 GCC12              Pass: 100%/1   | Total:  3m 56s | Avg:  3m 56s | Max:  3m 56s | Hits:  99%/593   
      🟩 GCC13              Pass: 100%/8   | Total: 38m 54s | Avg:  4m 51s | Max:  9m 49s | Hits:  99%/4744  
      🟩 MSVC14.39          Pass: 100%/1   | Total: 12m 57s | Avg: 12m 57s | Max: 12m 57s | Hits:  95%/290   
      🟩 MSVC14.43          Pass: 100%/3   | Total: 39m 22s | Avg: 13m 07s | Max: 13m 39s | Hits:  95%/876   
      🟩 NVHPC25.5          Pass: 100%/2   | Total: 14m 51s | Avg:  7m 25s | Max:  7m 45s | Hits:  97%/1182  
    🟩 cxx_family
      🟩 Clang              Pass: 100%/10  | Total: 56m 18s | Avg:  5m 37s | Max: 28m 11s | Hits: 100%/5932  
      🟩 GCC                Pass: 100%/12  | Total: 53m 33s | Avg:  4m 27s | Max:  9m 49s | Hits:  99%/7118  
      🟩 MSVC               Pass: 100%/4   | Total: 52m 19s | Avg: 13m 04s | Max: 13m 39s | Hits:  95%/1166  
      🟩 NVHPC              Pass: 100%/2   | Total: 14m 51s | Avg:  7m 25s | Max:  7m 45s | Hits:  97%/1182  
    🟩 gpu
      🟩 h100               Pass: 100%/2   | Total: 11m 25s | Avg:  5m 42s | Max:  8m 00s | Hits:  99%/1186  
      🟩 rtx2080            Pass: 100%/26  | Total:  2h 45m | Avg:  6m 22s | Max: 28m 11s | Hits:  99%/14212 
    🟩 jobs
      🟩 Build              Pass: 100%/25  | Total:  2h 11m | Avg:  5m 14s | Max: 13m 39s | Hits:  99%/13619 
      🟩 Test               Pass: 100%/3   | Total: 46m 00s | Avg: 15m 20s | Max: 28m 11s | Hits:  99%/1779  
    🟩 sm
      🟩 90                 Pass: 100%/2   | Total: 11m 25s | Avg:  5m 42s | Max:  8m 00s | Hits:  99%/1186  
      🟩 90;90a             Pass: 100%/2   | Total: 16m 25s | Avg:  8m 12s | Max: 12m 49s | Hits:  98%/885   
      🟩 100;120            Pass: 100%/2   | Total: 16m 32s | Avg:  8m 16s | Max: 12m 54s | Hits:  98%/885   
    🟩 std
      🟩 17                 Pass: 100%/3   | Total: 13m 42s | Avg:  4m 34s | Max:  7m 45s | Hits:  99%/1777  
      🟩 20                 Pass: 100%/25  | Total:  2h 43m | Avg:  6m 31s | Max: 28m 11s | Hits:  99%/13621 
    
  • 🟩 cccl_c_parallel: Pass: 100%/4 | Total: 2h 54m | Avg: 43m 33s | Max: 2h 10m | Hits: 98%/680

    🟩 cpu
      🟩 amd64              Pass: 100%/4   | Total:  2h 54m | Avg: 43m 33s | Max:  2h 10m | Hits:  98%/680   
    🟩 ctk
      🟩 12.9               Pass: 100%/4   | Total:  2h 54m | Avg: 43m 33s | Max:  2h 10m | Hits:  98%/680   
    🟩 cudacxx
      🟩 nvcc12.9           Pass: 100%/4   | Total:  2h 54m | Avg: 43m 33s | Max:  2h 10m | Hits:  98%/680   
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/4   | Total:  2h 54m | Avg: 43m 33s | Max:  2h 10m | Hits:  98%/680   
    🟩 cxx
      🟩 GCC13              Pass: 100%/4   | Total:  2h 54m | Avg: 43m 33s | Max:  2h 10m | Hits:  98%/680   
    🟩 cxx_family
      🟩 GCC                Pass: 100%/4   | Total:  2h 54m | Avg: 43m 33s | Max:  2h 10m | Hits:  98%/680   
    🟩 gpu
      🟩 h100               Pass: 100%/1   | Total: 22m 38s | Avg: 22m 38s | Max: 22m 38s | Hits:  98%/170   
      🟩 l4                 Pass: 100%/1   | Total: 19m 00s | Avg: 19m 00s | Max: 19m 00s | Hits:  98%/170   
      🟩 rtx2080            Pass: 100%/2   | Total:  2h 12m | Avg:  1h 06m | Max:  2h 10m | Hits:  98%/340   
    🟩 jobs
      🟩 Build              Pass: 100%/1   | Total:  2m 01s | Avg:  2m 01s | Max:  2m 01s | Hits:  98%/170   
      🟩 Test               Pass: 100%/3   | Total:  2h 52m | Avg: 57m 23s | Max:  2h 10m | Hits:  98%/510   
    
  • 🟩 packaging: Pass: 100%/4 | Total: 24m 54s | Avg: 6m 13s | Max: 8m 29s

    🟩 cpu
      🟩 amd64              Pass: 100%/4   | Total: 24m 54s | Avg:  6m 13s | Max:  8m 29s
    🟩 ctk
      🟩 12.0               Pass: 100%/2   | Total: 13m 59s | Avg:  6m 59s | Max:  8m 17s
      🟩 12.9               Pass: 100%/2   | Total: 10m 55s | Avg:  5m 27s | Max:  8m 29s
    🟩 cudacxx
      🟩 nvcc12.0           Pass: 100%/2   | Total: 13m 59s | Avg:  6m 59s | Max:  8m 17s
      🟩 nvcc12.9           Pass: 100%/2   | Total: 10m 55s | Avg:  5m 27s | Max:  8m 29s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/4   | Total: 24m 54s | Avg:  6m 13s | Max:  8m 29s
    🟩 cxx
      🟩 Clang14            Pass: 100%/1   | Total:  8m 17s | Avg:  8m 17s | Max:  8m 17s
      🟩 Clang19            Pass: 100%/1   | Total:  8m 29s | Avg:  8m 29s | Max:  8m 29s
      🟩 GCC12              Pass: 100%/1   | Total:  5m 42s | Avg:  5m 42s | Max:  5m 42s
      🟩 GCC13              Pass: 100%/1   | Total:  2m 26s | Avg:  2m 26s | Max:  2m 26s
    🟩 cxx_family
      🟩 Clang              Pass: 100%/2   | Total: 16m 46s | Avg:  8m 23s | Max:  8m 29s
      🟩 GCC                Pass: 100%/2   | Total:  8m 08s | Avg:  4m 04s | Max:  5m 42s
    🟩 gpu
      🟩 rtx2080            Pass: 100%/4   | Total: 24m 54s | Avg:  6m 13s | Max:  8m 29s
    🟩 jobs
      🟩 Test               Pass: 100%/4   | Total: 24m 54s | Avg:  6m 13s | Max:  8m 29s
    
  • 🟩 stdpar: Pass: 100%/4 | Total: 16m 00s | Avg: 4m 00s | Max: 4m 20s

    🟩 cpu
      🟩 amd64              Pass: 100%/2   | Total:  8m 25s | Avg:  4m 12s | Max:  4m 20s
      🟩 arm64              Pass: 100%/2   | Total:  7m 35s | Avg:  3m 47s | Max:  3m 48s
    🟩 ctk
      🟩 12.9               Pass: 100%/4   | Total: 16m 00s | Avg:  4m 00s | Max:  4m 20s
    🟩 cudacxx
      🟩 nvcc12.9           Pass: 100%/4   | Total: 16m 00s | Avg:  4m 00s | Max:  4m 20s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/4   | Total: 16m 00s | Avg:  4m 00s | Max:  4m 20s
    🟩 cxx
      🟩 NVHPC25.5          Pass: 100%/4   | Total: 16m 00s | Avg:  4m 00s | Max:  4m 20s
    🟩 cxx_family
      🟩 NVHPC              Pass: 100%/4   | Total: 16m 00s | Avg:  4m 00s | Max:  4m 20s
    🟩 gpu
      🟩 rtx2080            Pass: 100%/4   | Total: 16m 00s | Avg:  4m 00s | Max:  4m 20s
    🟩 jobs
      🟩 Build              Pass: 100%/4   | Total: 16m 00s | Avg:  4m 00s | Max:  4m 20s
    🟩 std
      🟩 17                 Pass: 100%/2   | Total:  7m 53s | Avg:  3m 56s | Max:  4m 05s
      🟩 20                 Pass: 100%/2   | Total:  8m 07s | Avg:  4m 03s | Max:  4m 20s
    

👃 Inspect Changes

Modifications in project?

Project
CCCL Infrastructure
CCCL Packaging
libcu++
CUB
+/- Thrust
CUDA Experimental
stdpar
python
CCCL C Parallel Library
Catch2Helper

Modifications in project or dependencies?

Project
CCCL Infrastructure
+/- CCCL Packaging
libcu++
+/- CUB
+/- Thrust
+/- CUDA Experimental
+/- stdpar
python
+/- CCCL C Parallel Library
+/- Catch2Helper

🏃‍ Runner counts (total jobs: 140)

# Runner
91 linux-amd64-cpu16
17 windows-amd64-cpu16
10 linux-arm64-cpu16
7 linux-amd64-gpu-rtx2080-latest-1
6 linux-amd64-gpu-rtxa6000-latest-1
5 linux-amd64-gpu-h100-latest-1
3 linux-amd64-gpu-rtx4090-latest-1
1 linux-amd64-gpu-l4-latest-1

@charan-003
Contributor Author

> Thanks a lot for improving the example. This is already much better. I have some concerns about using std:: algorithms, though: they only run on the host and will segfault with device memory.
>
> We need to use the equivalent thrust algorithms.

Thank you so much for the review.

Sure, let me work on that.

@charan-003 charan-003 requested a review from miscco August 29, 2025 03:46
Comment on lines 73 to 84
mutable unsigned int seed;

random_point_generator()
    : seed(0)
{}

__host__ __device__ point2d operator()() const
{
  thrust::default_random_engine rng(seed++);
  thrust::uniform_real_distribution<float> u01(0.0f, 1.0f);
  return point2d(u01(rng), u01(rng));
}
Contributor

@miscco miscco Aug 29, 2025

You really do not want to do this.

Initializing a random number generator is expensive!

You should hold the RNG as a member, initialize it on construction, and then only call it in the call operator.
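
A sketch of the suggested shape (assuming the point2d type from the surrounding example; in a real parallel call you would typically also seed or discard per element so threads do not produce identical sequences):

struct random_point_generator
{
  mutable thrust::default_random_engine rng;
  mutable thrust::uniform_real_distribution<float> u01;

  __host__ __device__ random_point_generator()
      : rng(12345)
      , u01(0.0f, 1.0f)
  {}

  __host__ __device__ point2d operator()() const
  {
    // the engine and distribution are constructed once; here we only draw
    return point2d(u01(rng), u01(rng));
  }
};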

Contributor Author

@charan-003 charan-003 Aug 29, 2025

Sorry for that.
I get it now, let me change that.

@charan-003 charan-003 requested a review from miscco August 29, 2025 07:27
Comment on lines 97 to 100
  bbox init = bbox(points[0], points[0]);

  // compute the bounding box for the point set
- bbox result = thrust::reduce(points.begin(), points.end(), first_point, bbox_union{});
+ bbox result = thrust::reduce(points.begin(), points.end(), init, bbox_union{});
Contributor

Why are those changes necessary?

Contributor Author

The variable rename isn't functionally necessary, just following naming conventions from the modernization patterns.

Contributor Author

Happy to keep first_point if preferred.

Contributor

Oh, I meant more why it's not bbox init{points[0], points[0]};

Contributor Author

You're right, the direct constructor syntax bbox init(points[0], points[0]); is cleaner.
Let me quickly update that.

@charan-003 charan-003 requested a review from miscco August 29, 2025 12:29
Contributor

@miscco miscco left a comment

Thanks a lot, this is looking so much better than the original examples 🎉

One final nitpick:

#include <thrust/random.h>
#include <thrust/reduce.h>

#include <algorithm>
Contributor

I believe those are no longer needed.

Contributor Author

True, let me remove them.

@charan-003
Contributor Author

Thanks for updating those files! I was looking into it but got busy with university, and when I tried to recreate some of those changes locally I wasn't able to get them working properly. I noticed you handled some of the backend abstraction issues and compilation context stuff that I hadn't considered in my modernization work. The discussion about when certain patterns are appropriate was really insightful. Is there a good way for me to learn more about these CUDA-specific design decisions so I can tackle similar issues better next time?

@miscco
Contributor

miscco commented Sep 5, 2025

> The discussion about when certain patterns are appropriate was really insightful. Is there a good way for me to learn more about these CUDA-specific design decisions so I can tackle similar issues better next time?

I don't think there is a definitive place to look.

Most of it comes down to which backend thrust uses. If it uses the CUDA backend, then we need to consider device memory and also annotate the respective functions appropriately for the CUDA compiler to generate device code.
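
A minimal sketch of that point (illustrative, not code from the PR): with the CUDA backend, functors passed to thrust algorithms must be callable from device code, so they carry __host__ __device__ annotations.

#include <thrust/device_vector.h>
#include <thrust/transform.h>

struct square
{
  __host__ __device__ int operator()(int x) const
  {
    return x * x;
  }
};

int main()
{
  thrust::device_vector<int> v = {1, 2, 3, 4};
  // Dispatches to the device system; without the __device__ annotation on
  // square::operator(), the CUDA compiler could not generate device code.
  thrust::transform(v.begin(), v.end(), v.begin(), square{});
}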

@charan-003
Contributor Author

Custom maximum functor for compatibility between different compilation environments:

  • nvcc command line compilation doesn't have cuda::maximum available
  • CMake build treats thrust::maximum as deprecated and requires cuda::maximum

This custom functor works with both compilation methods
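
A sketch of what such a functor could look like (hypothetical shape; the PR's actual definition may differ):

struct custom_maximum
{
  template <typename T>
  __host__ __device__ T operator()(const T& lhs, const T& rhs) const
  {
    // works under both plain nvcc invocations and the CMake build, without
    // relying on cuda::maximum or the deprecated thrust::maximum
    return lhs < rhs ? rhs : lhs;
  }
};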

@srinivasyadav18
Contributor

/ok to test 6e58d56

@github-actions

This comment has been minimized.

Contributor

@bernhardmgruber bernhardmgruber left a comment

Almost good to merge. A few more comments:

Comment on lines 60 to 64
  thrust::host_vector<int> host_data(N);
  for (size_t i = 0; i < host_data.size(); i++)
  {
-   data[i] = dist(rng);
+   host_data[i] = dist(rng);
  }
Contributor

This should use a range for:

for (auto& e : host_data)
    e = dist(rng);

Comment on lines 79 to 81
for (size_t i = 0; i < data.size(); i++)
{
  std::cout << data[i] << " ";
Contributor

Suggestion: Could also use a range for. We could do this as a follow up PR though. Probably applies to a lot more places.
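
A sketch of that range-for form (assuming data holds ints):

for (int x : data)
  std::cout << x << " ";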

thrust::default_random_engine rng(123456);
thrust::uniform_int_distribution<int> dist(0, 9);
for (size_t i = 0; i < v.size(); i++)
thrust::host_vector<thrust::pair<int, int>> host_data(v.size());
Contributor

Suggestion: as a follow-up PR, we should replace thrust::pair by cuda::std::pair.
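
A minimal sketch of the suggested follow-up shape (names are illustrative):

#include <cuda/std/utility>

#include <thrust/host_vector.h>

int main()
{
  // cuda::std::pair works in both host and device code and replaces the
  // legacy thrust::pair alias.
  thrust::host_vector<cuda::std::pair<int, int>> host_data(10);
  host_data[0] = cuda::std::pair<int, int>(1, 2);
}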


// print data
- for (int i = 0; i < R; i++)
+ for (size_t i = 0; i < static_cast<size_t>(R); i++)
Contributor

Again, why is the size_t beneficial here? Using int i would simplify this. Also on the next loop below.

Comment on lines 54 to 56
MyStruct s;
s.key = dist(rng);
h_structures[i] = s;
Contributor

Question: The code before did the same. Why is the new version an improvement?

@charan-003
Contributor Author

> Almost good to merge. A few more comments:

Sure, let me work on them.

Contributor

@bernhardmgruber bernhardmgruber left a comment

Last comments, then we are good I think.

endif()

# We do not want to explicitly include `host_device.h` if not needed, so force include the file for non CUDA targets
target_compile_options(${example_target} PRIVATE
Contributor

@miscco device_vector allocates memory on the current Thrust device system. If that is CUDA, it's CUDA device memory. If the device system is TBB, OMP or CPP, then a device_vector just behaves like a host vector. This is so Thrust can switch backends with the preprocessor.
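
A minimal illustration of that backend switch (illustrative, not from the PR): the device system is chosen at compile time via the THRUST_DEVICE_SYSTEM macro, e.g. -DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_OMP on the compiler command line.

#include <thrust/device_vector.h>

int main()
{
  // With the CUDA backend this allocates GPU memory; with the TBB, OMP, or
  // CPP device systems it behaves like a host-side vector.
  thrust::device_vector<int> v(100, 0);
}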

@bernhardmgruber
Contributor

/ok to test d171ad8

@github-actions
Contributor

🥳 CI Workflow Results

🟩 Finished in 2h 28m: Pass: 100%/159 | Total: 1d 19h | Max: 2h 19m | Hits: 97%/186166

See results here.

@bernhardmgruber bernhardmgruber merged commit da89394 into NVIDIA:main Sep 16, 2025
170 checks passed
@github-project-automation github-project-automation bot moved this from In Review to Done in CCCL Sep 16, 2025
@bernhardmgruber
Contributor

Thanks a lot for the contribution. Great work!

@charan-003
Contributor Author

> Thanks a lot for the contribution. Great work!

Thanks a lot for the support and guidance :))

@charan-003 charan-003 deleted the modernize-thrust-examples branch September 16, 2025 13:03
@xtofl xtofl left a comment

Thanks for taking this up! It's great to have exemplary code available in examples.

The term 'modern' evolves, of course; we have lots of C++ goodies we didn't have at the time this issue was created. These goodies can make the code more succinct and telling: generic lambdas, std::format, ...

I have barely started looking at this PR. Though I won't have time to go into deep detail, I want to add some suggestions showing what my take on 'modern' is; cf. further on.

// print the output
std::cout << "Tuple functor" << std::endl;
- for (int i = 0; i < 5; i++)
+ for (size_t i = 0; i < A.size(); i++)

Isn't the preferred form for (size_t i = 0; i != size(A); ++i) ?

Also,

  • is it possible to use iterators? (My C++ has been rusting for 5 years now)
  • let's not use std::endl unless needed (cf here)
  • can we use std::format to our advantage?
  • free functions improve encapsulation (cf. here)
for (auto it = make_zip_iterator(make_tuple(begin(A), begin(B), begin(C), begin(D)));
     it != make_zip_iterator(make_tuple(end(A), end(B), end(C), end(D)));
     ++it)
{
  std::cout << std::format("{} + {} * {} = {}\n", *it);
}

Maybe the make_zip_iterator(make_tuple(begin(A), ...)) can be extracted into a generic somehow, along the lines of

auto zip_begin(auto&... containers) {
  return make_zip_iterator(make_tuple(begin(containers)...));
}
auto zip_end(auto&... containers) {
  return make_zip_iterator(make_tuple(end(containers)...));
}

In which case the above simplifies further to

for (auto it = zip_begin(A, B, C, D); it != zip_end(A, B, C, D); ++it)
{
  std::cout << std::format("{} + {} * {} = {}\n", *it);
}

Contributor

Thank you for the feedback! You are always free to create a PR yourself or start a discussion.

> Isn't the preferred form for (size_t i = 0; i != size(A); ++i) ?

I have no preference here. The PR improved the situation by not using a magic number, which is good.

>   • is it possible to use iterators? (My C++ has been rusting for 5 years now)

Yes, but iterating 4 ranges at the same time using a zip may also be a bit over-engineered. Using an index is fine here IMO. Examples should be easy.

>   • let's not use std::endl unless needed (cf here)

Correct. Feel free to propose a PR to replace them by '\n'.

>   • can we use std::format to our advantage?

CCCL still supports C++17, but I don't see a blocker with using C++20 in examples only. I will start a discussion internally.

>   • free functions improve encapsulation (cf. here)

Again, for example code I have no preference here. I agree with this when writing library code.

> for (auto it = make_zip_iterator(make_tuple(begin(A), begin(B), begin(C), begin(D)));
>      it != make_zip_iterator(make_tuple(end(A), end(B), end(C), end(D)));
>      ++it)
> {
>   std::cout << std::format("{} + {} * {} = {}\n", *it);
> }

I think this does not increase readability or clarity of the example.

> Maybe the make_zip_iterator(make_tuple(begin(A), ...)) can be extracted into a generic somehow

We have that today, just construct the zip_iterator and let CTAD deduce the arguments:

zip_iterator(begin(A), begin(B), begin(C));

Should deduce zip_iterator<decltype(begin(A)), ...>. That only works with cuda::zip_iterator. For thrust, you can at least skip the make_tuple; we fixed that some time ago.
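
A short sketch of the thrust form (illustrative):

#include <thrust/host_vector.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/tuple.h>

int main()
{
  thrust::host_vector<int> A(4), B(4), C(4);
  // iterators can be passed directly; no make_tuple needed anymore
  auto first = thrust::make_zip_iterator(A.begin(), B.begin(), C.begin());
  auto last  = thrust::make_zip_iterator(A.end(), B.end(), C.end());
  for (auto it = first; it != last; ++it)
  {
    *it = thrust::make_tuple(1, 2, 3);
  }
}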

pciolkosz pushed a commit to pciolkosz/cccl that referenced this pull request Sep 24, 2025
* Modernize Thrust examples following PR NVIDIA#753 patterns

- Remove legacy include/host_device.h headers from 40 example files
- Replace manual element assignment with std::initializer_list
- Use range-based for loops where appropriate
- Apply STL algorithms (std::generate) with lambdas
- Use .size() instead of hardcoded array sizes
- Improve semantic naming and inline usage
- Maintain compatibility with current CUDA/Thrust version
- Avoid thrust::enumerate (not available in current version)
- Modernized to thrust::generate, cuda::std::distance, thrust::sequence where necessary

Files modernized:
arbitrary_transformation.cu, basic_vector.cu, bounding_box.cu,
bucket_sort2d.cu, constant_iterator.cu, counting_iterator.cu,
device_ptr.cu, discrete_voronoi.cu, dot_products_with_zip.cu,
expand.cu, histogram.cu, lambda.cu, lexicographical_sort.cu,
max_abs_diff.cu, minmax.cu, mode.cu, monte_carlo.cu,
monte_carlo_disjoint_sequences.cu, norm.cu, padded_grid_reduction.cu,
permutation_iterator.cu, raw_reference_cast.cu, remove_points2d.cu,
repeated_range.cu, saxpy.cu, scan_matrix_by_rows.cu,
simple_moving_average.cu, sort.cu, sorting_aos_vs_soa.cu,
stream_compaction.cu, sum_rows.cu, summary_statistics.cu,
summed_area_table.cu, tiled_range.cu, transform_input_output_iterator.cu,
transform_iterator.cu, transform_output_iterator.cu,
uninitialized_vector.cu, weld_vertices.cu, word_count.cu

Co-authored-by: Sai Charan <scharan@rostam1.rostam.cct.lsu.edu>
Co-authored-by: Michael Schellenberger Costa <miscco@nvidia.com>
Co-authored-by: Bernhard Manfred Gruber <bernhardmgruber@gmail.com>