Modernize Thrust examples #5670
Conversation
- Remove legacy include/host_device.h headers from 40 example files
- Replace manual element assignment with std::initializer_list
- Use range-based for loops where appropriate
- Apply STL algorithms (std::generate) with lambdas
- Use .size() instead of hardcoded array sizes
- Improve semantic naming and inline usage
- Maintain compatibility with current CUDA/Thrust version
- Avoid thrust::enumerate (not available in current version)

Files modernized: arbitrary_transformation.cu, basic_vector.cu, bounding_box.cu, bucket_sort2d.cu, constant_iterator.cu, counting_iterator.cu, device_ptr.cu, discrete_voronoi.cu, dot_products_with_zip.cu, expand.cu, histogram.cu, lambda.cu, lexicographical_sort.cu, max_abs_diff.cu, minmax.cu, mode.cu, monte_carlo.cu, monte_carlo_disjoint_sequences.cu, norm.cu, padded_grid_reduction.cu, permutation_iterator.cu, raw_reference_cast.cu, remove_points2d.cu, repeated_range.cu, saxpy.cu, scan_matrix_by_rows.cu, simple_moving_average.cu, sort.cu, sorting_aos_vs_soa.cu, stream_compaction.cu, sum_rows.cu, summary_statistics.cu, summed_area_table.cu, tiled_range.cu, transform_input_output_iterator.cu, transform_iterator.cu, transform_output_iterator.cu, uninitialized_vector.cu, weld_vertices.cu, word_count.cu
/ok to test 3b62cc1
miscco
left a comment
Thanks a lot for improving the examples. This is already much better. I have some concerns about using std:: algorithms, because they only run on the host and will segfault with device memory.
We need to use the equivalent Thrust algorithms.
🟨 CI finished in 2h 13m: Pass: 64%/140 | Total: 1d 01h | Avg: 10m 58s | Max: 2h 10m | Hits: 99%/69320
| Project | |
|---|---|
| CCCL Infrastructure | |
| CCCL Packaging | |
| libcu++ | |
| CUB | |
| +/- | Thrust |
| CUDA Experimental | |
| stdpar | |
| python | |
| CCCL C Parallel Library | |
| Catch2Helper |
Modifications in project or dependencies?
| Project | |
|---|---|
| CCCL Infrastructure | |
| +/- | CCCL Packaging |
| libcu++ | |
| +/- | CUB |
| +/- | Thrust |
| +/- | CUDA Experimental |
| +/- | stdpar |
| python | |
| +/- | CCCL C Parallel Library |
| +/- | Catch2Helper |
🏃 Runner counts (total jobs: 140)
| # | Runner |
|---|---|
| 91 | linux-amd64-cpu16 |
| 17 | windows-amd64-cpu16 |
| 10 | linux-arm64-cpu16 |
| 7 | linux-amd64-gpu-rtx2080-latest-1 |
| 6 | linux-amd64-gpu-rtxa6000-latest-1 |
| 5 | linux-amd64-gpu-h100-latest-1 |
| 3 | linux-amd64-gpu-rtx4090-latest-1 |
| 1 | linux-amd64-gpu-l4-latest-1 |
Thank you so much for the review. Sure, let me work on that.
thrust/examples/bounding_box.cu
Outdated
```cpp
mutable unsigned int seed;

random_point_generator()
    : seed(0)
{}

__host__ __device__ point2d operator()() const
{
  thrust::default_random_engine rng(seed++);
  thrust::uniform_real_distribution<float> u01(0.0f, 1.0f);
  return point2d(u01(rng), u01(rng));
}
```
You really do not want to do this.
Initializing a random number generator is expensive!
You should hold the RNG as a member, initialize it on construction, and then in the call operator only invoke it.
Sorry for that.
I get it now, let me change that.
thrust/examples/bounding_box.cu
Outdated
```diff
 bbox init = bbox(points[0], points[0]);

 // compute the bounding box for the point set
-bbox result = thrust::reduce(points.begin(), points.end(), first_point, bbox_union{});
+bbox result = thrust::reduce(points.begin(), points.end(), init, bbox_union{});
```
Why are those changes necessary?
The variable rename isn't functionally necessary, just following naming conventions from the modernization patterns.
happy to keep first_point if preferred.
Oh, I meant more: why is it not bbox init{points[0], points[0]};?
You're right, the direct constructor syntax bbox init(points[0], points[0]); is cleaner.
Let me quickly update that.
miscco
left a comment
Thanks a lot, this is looking so much better than the original examples 🎉
One final nitpick:
thrust/examples/sum_rows.cu
Outdated
```cpp
#include <thrust/random.h>
#include <thrust/reduce.h>

#include <algorithm>
```
I believe those are no longer needed.
True, let me remove them.
Thanks for updating those files! I was looking into it but got busy with university, and when I tried to recreate some of those changes locally I wasn't able to get them working properly. I noticed you handled some of the backend-abstraction issues and compilation-context details that I hadn't considered in my modernization work. The discussion about when certain patterns are appropriate was really insightful. Is there a good way for me to learn more about these CUDA-specific design decisions so I can tackle similar issues better next time?
I don't think there is a definitive place to look. Most of it comes down to which backend Thrust uses. If it uses the CUDA backend, then we need to consider device memory and also annotate the respective functions appropriately for the CUDA compiler to generate device code.
Custom maximum functor for compatibility between different compilation environments: this custom functor works with both compilation methods.
/ok to test 6e58d56
bernhardmgruber
left a comment
Almost good to merge. A few more comments:
```diff
 thrust::host_vector<int> host_data(N);
 for (size_t i = 0; i < host_data.size(); i++)
 {
-  data[i] = dist(rng);
+  host_data[i] = dist(rng);
 }
```
This should use a range for:

```cpp
for (auto& e : host_data)
  e = dist(rng);
```
thrust/examples/minmax.cu
Outdated
```cpp
for (size_t i = 0; i < data.size(); i++)
{
  std::cout << data[i] << " ";
```
Suggestion: Could also use a range for. We could do this as a follow up PR though. Probably applies to a lot more places.
thrust/examples/sort.cu
Outdated
```diff
 thrust::default_random_engine rng(123456);
 thrust::uniform_int_distribution<int> dist(0, 9);
-for (size_t i = 0; i < v.size(); i++)
+thrust::host_vector<thrust::pair<int, int>> host_data(v.size());
```
Suggestion: as a follow-up PR, we should replace thrust::pair by cuda::std::pair.
thrust/examples/sum_rows.cu
Outdated
```diff
 // print data
-for (int i = 0; i < R; i++)
+for (size_t i = 0; i < static_cast<size_t>(R); i++)
```
Again, why is the size_t beneficial here? Using int i would simplify this. Also on the next loop below.
```cpp
MyStruct s;
s.key = dist(rng);
h_structures[i] = s;
```
Question: The code before did the same. Why is the new version an improvement?
Sure, let me work on them.
bernhardmgruber
left a comment
Last comments, then we are good I think.
```cmake
endif()

# We do not want to explicitly include `host_device.h` if not needed, so force include the file for non CUDA targets
target_compile_options(${example_target} PRIVATE
```
@miscco device_vector allocates memory on the current Thrust device system. If that is CUDA, it's CUDA device memory. If the device system is TBB, OMP or CPP, then a device_vector just behaves like a host vector. This is so Thrust can switch backends with the preprocessor.
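For context, the device system is chosen at compile time via Thrust's documented THRUST_DEVICE_SYSTEM macro. A hedged sketch of building an example against a non-CUDA backend (the include paths and file name here are hypothetical; adjust to your checkout):

```sh
# Build a Thrust example with the OpenMP device backend instead of CUDA.
# With this setting, thrust::device_vector lives in host memory, so
# dereferencing its elements on the host is safe.
g++ -O2 -fopenmp -DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_OMP \
    -I/path/to/cccl/thrust -I/path/to/cccl/libcudacxx/include \
    basic_vector.cpp -o basic_vector
```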
/ok to test d171ad8
🥳 CI Workflow Results: 🟩 Finished in 2h 28m: Pass: 100%/159 | Total: 1d 19h | Max: 2h 19m | Hits: 97%/186166. See results here.
Thanks a lot for the contribution. Great work!
Thanks a lot for the support and guidance :))
xtofl
left a comment
Thanks for taking this up! It's great to have exemplary code available in examples.
The term 'modern' evolves, of course; we have lots of C++ goodies we didn't have at the time of creating this issue. These goodies can make the code more succinct and telling: generic lambdas, std::format, ...
I barely started looking at this PR. Though I won't have time to go into deep detail, I want to add some suggestions showing what my take on 'modern' is - cf. further on.
```diff
 // print the output
 std::cout << "Tuple functor" << std::endl;
-for (int i = 0; i < 5; i++)
+for (size_t i = 0; i < A.size(); i++)
```
Isn't the preferred form `for (size_t i = 0; i != size(A); ++i)`?
Also,
- is it possible to use iterators? (My C++ has been rusting for 5 years now)
- let's not use `std::endl` unless needed (cf. here)
- can we use `std::format` to our advantage?
- free functions improve encapsulation (cf. here)

```cpp
for (auto it = make_zip_iterator(make_tuple(begin(A), begin(B), begin(C), begin(D)));
     it != make_zip_iterator(make_tuple(end(A), end(B), end(C), end(D)));
     ++it)
{
  std::cout << std::format("{} + {} * {} = {}\n", *it);
}
```

Maybe the `make_zip_iterator(make_tuple(begin(A), ...))` can be extracted into a generic somehow, along the lines of

```cpp
auto zip_begin(auto... containers) {
  return make_zip_iterator(make_tuple(begin(containers)...));
}
auto zip_end(auto... containers) {
  return make_zip_iterator(make_tuple(end(containers)...));
}
```

In which case the above simplifies further to

```cpp
for (auto it = zip_begin(A, B, C, D); it != zip_end(A, B, C, D); ++it)
{
  std::cout << std::format("{} + {} * {} = {}\n", *it);
}
```
Thank you for the feedback! You are always free to create a PR yourself or start a discussion.
Isn't the preferred form `for (size_t i = 0; i != size(A); ++i)`?
I have no preference here. The PR improved the situation by not using a magic number, which is good.
- is it possible to use iterators? (My C++ has been rusting for 5 years now)
Yes, but iterating 4 ranges at the same time using a zip may also be a bit over-engineered. Using an index is fine here IMO. Examples should be easy.
- let's not use `std::endl` unless needed (cf. here)
Correct. Feel free to propose a PR to replace them by '\n'.
- can we use `std::format` to our advantage?
CCCL still supports C++17, but I don't see a blocker with using C++20 in examples only. I will start a discussion internally.
- free functions improve encapsulation (cf. here)
Again, for example code I have no preference here. I agree with this when writing library code.
```cpp
for (auto it = make_zip_iterator(make_tuple(begin(A), begin(B), begin(C), begin(D)));
     it != make_zip_iterator(make_tuple(end(A), end(B), end(C), end(D)));
     ++it)
{
  std::cout << std::format("{} + {} * {} = {}\n", *it);
}
```
I think this does not increase readability or clarity of the example.
Maybe the `make_zip_iterator(make_tuple(begin(A), ...))` can be extracted into a generic somehow, along the lines of
We have that today, just construct the zip_iterator and let CTAD deduce the arguments:

```cpp
zip_iterator(begin(A), begin(B), begin(C));
```

This should deduce `zip_iterator<decltype(begin(A)), ...>`. That only works with cuda::zip_iterator. For thrust, you can at least skip the make_tuple; we fixed that some time ago.
* Modernize Thrust examples following PR NVIDIA#753 patterns
  - Remove legacy include/host_device.h headers from 40 example files
  - Replace manual element assignment with std::initializer_list
  - Use range-based for loops where appropriate
  - Apply STL algorithms (std::generate) with lambdas
  - Use .size() instead of hardcoded array sizes
  - Improve semantic naming and inline usage
  - Maintain compatibility with current CUDA/Thrust version
  - Avoid thrust::enumerate (not available in current version)
  - Modernize to thrust::generate, cuda::std::distance, thrust::sequence where necessary

  Files modernized: arbitrary_transformation.cu, basic_vector.cu, bounding_box.cu, bucket_sort2d.cu, constant_iterator.cu, counting_iterator.cu, device_ptr.cu, discrete_voronoi.cu, dot_products_with_zip.cu, expand.cu, histogram.cu, lambda.cu, lexicographical_sort.cu, max_abs_diff.cu, minmax.cu, mode.cu, monte_carlo.cu, monte_carlo_disjoint_sequences.cu, norm.cu, padded_grid_reduction.cu, permutation_iterator.cu, raw_reference_cast.cu, remove_points2d.cu, repeated_range.cu, saxpy.cu, scan_matrix_by_rows.cu, simple_moving_average.cu, sort.cu, sorting_aos_vs_soa.cu, stream_compaction.cu, sum_rows.cu, summary_statistics.cu, summed_area_table.cu, tiled_range.cu, transform_input_output_iterator.cu, transform_iterator.cu, transform_output_iterator.cu, uninitialized_vector.cu, weld_vertices.cu, word_count.cu

  Co-authored-by: Sai Charan <scharan@rostam1.rostam.cct.lsu.edu>
  Co-authored-by: Michael Schellenberger Costa <miscco@nvidia.com>
  Co-authored-by: Bernhard Manfred Gruber <bernhardmgruber@gmail.com>
Following: NVIDIA/thrust#753