Replace CUB macros in more places #3930
Conversation
- int num_bits = CUB_MIN(RADIX_BITS, end_bit - current_bit);
+ // FIXME(bgruber): the following replacement changes SASS for cub.test.device_radix_sort_pairs.lid_0
+ // int num_bits = _CUDA_VSTD::min(+RADIX_BITS, end_bit - current_bit);
+ int num_bits = CUB_MIN(+RADIX_BITS, end_bit - current_bit);
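For readers following along: CUB_MIN is a function-like macro, while _CUDA_VSTD::min is a function template. Below is a minimal host-side sketch of the two forms being swapped; the macro is illustrative and std::min stands in for _CUDA_VSTD::min, so it is not the exact CUB/libcu++ code.

```cpp
#include <algorithm>

// Illustrative CUB_MIN-style macro (not the exact CUB definition): a plain
// ternary expression that copies its operands.
#define SKETCH_MIN(a, b) (((b) < (a)) ? (b) : (a))

int pick_num_bits(int radix_bits, int end_bit, int current_bit)
{
  int via_macro    = SKETCH_MIN(radix_bits, end_bit - current_bit); // expands to a ternary
  int via_function = std::min(radix_bits, end_bit - current_bit);   // binds const int& to both operands
  return via_macro == via_function ? via_function : -1;             // identical results for plain ints
}

int main()
{
  return pick_num_bits(8, 32, 28) == 4 ? 0 : 1;
}
```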
I am really curious about why we need +RADIX_BITS; it should already be an int, and so are end_bit and current_bit.
What does this do?
Suggested change:
- int num_bits = CUB_MIN(+RADIX_BITS, end_bit - current_bit);
+ const int num_bits = _CUDA_VSTD::min<int>(RADIX_BITS, end_bit - current_bit);
The unary plus is needed to produce a prvalue. Otherwise, RADIX_BITS is ODR-used. With your code I get:
error: identifier "cub::CUB_300000_SM_860::detail::radix_sort::AgentRadixSortHistogram< ::cub::CUB_300000_SM_860::AgentRadixSortHistogramPolicy<(int)128, (int)16, (int)1, unsigned int, (int)8> , (bool)1, unsigned int, unsigned int, ::cub::CUB_300000_SM_860::detail::identity_decomposer_t> ::RADIX_BITS" is undefined in device code
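The effect can be reproduced on the host with a plain static constexpr member. A minimal sketch, assuming RADIX_BITS is such a class-scope constant and using std::min as a stand-in for _CUDA_VSTD::min:

```cpp
#include <algorithm>

struct Agent
{
  static constexpr int RADIX_BITS = 8; // stand-in for the agent's class-scope constant

  static int num_bits(int end_bit, int current_bit)
  {
    // std::min(RADIX_BITS, ...) would bind a const int& to RADIX_BITS, i.e. ODR-use it;
    // in device code that reference needs RADIX_BITS to be available as an object there,
    // which is what the quoted "undefined in device code" error complains about.
    // +RADIX_BITS materializes a temporary int (a prvalue), so no reference to the
    // member is formed and only its value is used.
    return std::min(+RADIX_BITS, end_bit - current_bit);
  }
};

int main()
{
  return Agent::num_bits(/*end_bit=*/32, /*current_bit=*/28) == 4 ? 0 : 1;
}
```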
This is genius
> This is genius
It's just fighting C++ with C++.
🟨 CI finished in 1h 41m: Pass: 98%/93 | Total: 2d 14h | Avg: 40m 32s | Max: 1h 23m | Hits: 69%/133775
Project changes: +/- CUB
Modifications in project or dependencies: +/- CUB, +/- Thrust, +/- python, +/- CCCL C Parallel Library, +/- Catch2Helper
🏃 Runner counts (total jobs: 93)
| # | Runner |
|---|---|
| 66 | linux-amd64-cpu16 |
| 9 | windows-amd64-cpu16 |
| 6 | linux-amd64-gpu-rtxa6000-latest-1 |
| 4 | linux-arm64-cpu16 |
| 3 | linux-amd64-gpu-h100-latest-1 |
| 3 | linux-amd64-gpu-rtx4090-latest-1 |
| 2 | linux-amd64-gpu-rtx2080-latest-1 |
cub/cub/block/block_radix_rank.cuh
Outdated
  // Always at least one lane
- LOG_COUNTER_LANES = CUB_MAX((int(RADIX_BITS) - int(LOG_PACKING_RATIO)), 0),
+ LOG_COUNTER_LANES = _CUDA_VSTD::max((int(RADIX_BITS) - int(LOG_PACKING_RATIO)), 0),
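Worth noting: this replacement sits inside an enum definition, so the call has to be usable in a constant expression, which works because max has been constexpr since C++14. A small sketch of the pattern, with illustrative names and std::max standing in for _CUDA_VSTD::max:

```cpp
#include <algorithm>

// Sketch: computing an enumerator from other compile-time constants with a
// constexpr max, mirroring (but not copying) the block_radix_rank change.
struct RankSketch
{
  static constexpr int RADIX_BITS        = 5;
  static constexpr int LOG_PACKING_RATIO = 2;

  enum
  {
    // Always at least one lane; std::max is constexpr, so it is valid in
    // this constant-expression context.
    LOG_COUNTER_LANES = std::max(int(RADIX_BITS) - int(LOG_PACKING_RATIO), 0),
  };
};

static_assert(RankSketch::LOG_COUNTER_LANES == 3, "5 - 2 lanes of log-counters");

int main()
{
  return 0;
}
```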
suggestion: max<int>()
I was told by my mentor @MFHava a long time ago that I should not specialize function templates, but rely on overloading and template argument deduction. I follow this whenever it is easily possible. I don't know the rationale, though. But I don't see a benefit of providing the template argument explicitly here.
I'm also siding with using implicit template arguments.
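For what it's worth, the explicit template argument mainly matters when the operands have different types and deduction would otherwise fail. A small host-side sketch, with std::max standing in for _CUDA_VSTD::max:

```cpp
#include <algorithm>

int main()
{
  int  a = 3;
  long b = 7;

  // Same types: template argument deduction just works.
  int m1 = std::max(a, 4);

  // Different types: deduction fails (int vs long), so either cast one operand,
  // as the PR does with int(...), or provide the template argument explicitly.
  long m2 = std::max<long>(a, b);

  return (m1 == 4 && m2 == 7) ? 0 : 1;
}
```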
  bool is_num_passes_odd = num_passes & 1;
  int max_alt_passes = (num_passes * radix_bits) - num_bits;
- int alt_end_bit = CUB_MIN(end_bit, begin_bit + (max_alt_passes * alt_radix_bits));
+ int alt_end_bit = _CUDA_VSTD::min(end_bit, begin_bit + (max_alt_passes * alt_radix_bits));
To have more uniform code, should we use _CUDA_VSTD everywhere?
That's a question for @miscco. I personally have an easier time remembering ::cuda::std.
Using the macro is technically a bit safer, because it also includes our inline ABI namespace, which could in theory avoid ambiguities.
Hmm. I can try to use it more often. However, we should probably not use it in any interface, since we don't want it to pop up in the documentation.
It would be great to agree on one way to handle this and follow it throughout. Given @miscco's suggestion, I think we want to use _CUDA_VSTD within our implementation and ::cuda::std within interfaces?
Whatever we come up with, let's document the decision here: #2635
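To illustrate the inline-ABI-namespace point above, here is a purely hypothetical sketch; the namespace and macro names are made up and do not reflect libcu++'s actual definitions:

```cpp
// A library can place its symbols in a versioned inline namespace; a
// qualification macro can name that namespace exactly, while user-facing
// code spells only the outer namespace.
namespace mylib
{
inline namespace abi_v1 // versioned ABI namespace (illustrative)
{
template <class T>
constexpr const T& min(const T& a, const T& b)
{
  return b < a ? b : a;
}
} // inline namespace abi_v1
} // namespace mylib

// The macro spells out the inline namespace explicitly, so the call cannot
// become ambiguous even if another mylib::min were ever declared outside
// abi_v1; plain ::mylib::min relies on the inline namespace being transparent.
#define MYLIB_VSTD ::mylib::abi_v1

int main()
{
  int via_macro = MYLIB_VSTD::min(1, 2); // the _CUDA_VSTD-style spelling
  int via_name  = ::mylib::min(1, 2);    // the ::cuda::std-style spelling
  return via_macro == via_name ? 0 : 1;
}
```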
fa26ab3 to 1c15db7 (Compare)
🟨 CI finished in 51m 35s: Pass: 48%/93 | Total: 14h 50m | Avg: 9m 34s | Max: 50m 17s | Hits: 92%/51280
Project changes: +/- CUB, +/- Thrust
Modifications in project or dependencies: +/- CUB, +/- Thrust, +/- python, +/- CCCL C Parallel Library, +/- Catch2Helper
🏃 Runner counts (total jobs: 93)
| # | Runner |
|---|---|
| 66 | linux-amd64-cpu16 |
| 9 | windows-amd64-cpu16 |
| 6 | linux-amd64-gpu-rtxa6000-latest-1 |
| 4 | linux-arm64-cpu16 |
| 3 | linux-amd64-gpu-h100-latest-1 |
| 3 | linux-amd64-gpu-rtx4090-latest-1 |
| 2 | linux-amd64-gpu-rtx2080-latest-1 |
8657f2e to 97a3d2f (Compare)
97a3d2f to a9e94cc (Compare)
🟨 CI finished in 1h 03m: Pass: 94%/93 | Total: 16h 43m | Avg: 10m 47s | Max: 49m 06s | Hits: 94%/127854
Project changes: +/- CUB, +/- Thrust
Modifications in project or dependencies: +/- CUB, +/- Thrust, +/- python, +/- CCCL C Parallel Library, +/- Catch2Helper
🏃 Runner counts (total jobs: 93)
| # | Runner |
|---|---|
| 66 | linux-amd64-cpu16 |
| 9 | windows-amd64-cpu16 |
| 6 | linux-amd64-gpu-rtxa6000-latest-1 |
| 4 | linux-arm64-cpu16 |
| 3 | linux-amd64-gpu-h100-latest-1 |
| 3 | linux-amd64-gpu-rtx4090-latest-1 |
| 2 | linux-amd64-gpu-rtx2080-latest-1 |
🟩 CI finished in 1h 13m: Pass: 100%/93 | Total: 1d 20h | Avg: 28m 45s | Max: 1h 08m | Hits: 90%/133929
Project changes: +/- CUB, +/- Thrust
Modifications in project or dependencies: +/- CUB, +/- Thrust, +/- python, +/- CCCL C Parallel Library, +/- Catch2Helper
🏃 Runner counts (total jobs: 93)
| # | Runner |
|---|---|
| 66 | linux-amd64-cpu16 |
| 9 | windows-amd64-cpu16 |
| 6 | linux-amd64-gpu-rtxa6000-latest-1 |
| 4 | linux-arm64-cpu16 |
| 3 | linux-amd64-gpu-h100-latest-1 |
| 3 | linux-amd64-gpu-rtx4090-latest-1 |
| 2 | linux-amd64-gpu-rtx2080-latest-1 |
🟩 CI finished in 1h 13m: Pass: 100%/93 | Total: 21h 03m | Avg: 13m 35s | Max: 1h 12m | Hits: 94%/133929
Project changes: +/- CUB, +/- Thrust
Modifications in project or dependencies: +/- CUB, +/- Thrust, +/- python, +/- CCCL C Parallel Library, +/- Catch2Helper
🏃 Runner counts (total jobs: 93)
| # | Runner |
|---|---|
| 66 | linux-amd64-cpu16 |
| 9 | windows-amd64-cpu16 |
| 6 | linux-amd64-gpu-rtxa6000-latest-1 |
| 4 | linux-arm64-cpu16 |
| 3 | linux-amd64-gpu-h100-latest-1 |
| 3 | linux-amd64-gpu-rtx4090-latest-1 |
| 2 | linux-amd64-gpu-rtx2080-latest-1 |
No SASS change on cub.test.device_radix_sort_pairs.lid_0 for SM86
Split out of #3821. Contains changes related to radix sort and the computation of compile-time constants. Some changes to the tests are also included, since those cannot affect CUB's performance.