Adds support for large number of segments to DeviceSegmentedReduce
#3764
Conversation
🟨 CI finished in 1h 28m: Pass: 95%/90 | Total: 2d 13h | Avg: 40m 50s | Max: 1h 20m | Hits: 75%/125290

Modified projects or dependencies: CUB, Thrust, python, CCCL C Parallel Library, Catch2Helper

🏃 Runner counts (total jobs: 90)
| # | Runner |
|---|---|
| 65 | linux-amd64-cpu16 |
| 9 | windows-amd64-cpu16 |
| 6 | linux-amd64-gpu-rtxa6000-latest-1 |
| 4 | linux-arm64-cpu16 |
| 3 | linux-amd64-gpu-rtx4090-latest-1 |
| 2 | linux-amd64-gpu-rtx2080-latest-1 |
| 1 | linux-amd64-gpu-h100-latest-1 |
🟨 CI finished in 1h 38m: Pass: 96%/90 | Total: 2d 12h | Avg: 40m 16s | Max: 1h 16m | Hits: 75%/125436

Modified projects or dependencies: CUB, Thrust, python, CCCL C Parallel Library, Catch2Helper

🏃 Runner counts (total jobs: 90): unchanged from the previous run.
🟨 CI finished in 1h 05m: Pass: 97%/90 | Total: 14h 28m | Avg: 9m 38s | Max: 34m 34s | Hits: 94%/127036

Modified projects or dependencies: CUB, Thrust, python, CCCL C Parallel Library, Catch2Helper

🏃 Runner counts (total jobs: 90): unchanged from the previous run.
🟩 CI finished in 1h 30m: Pass: 100%/90 | Total: 15h 03m | Avg: 10m 02s | Max: 34m 34s | Hits: 95%/129482

Modified projects or dependencies: CUB, Thrust, python, CCCL C Parallel Library, Catch2Helper

🏃 Runner counts (total jobs: 90): unchanged from the previous run.
🟨 CI finished in 1h 30m: Pass: 96%/93 | Total: 2d 13h | Avg: 39m 28s | Max: 1h 15m | Hits: 74%/130146

Modified projects or dependencies: CUB, Thrust, python, CCCL C Parallel Library, Catch2Helper

🏃 Runner counts (total jobs: 93)
| # | Runner |
|---|---|
| 66 | linux-amd64-cpu16 |
| 9 | windows-amd64-cpu16 |
| 6 | linux-amd64-gpu-rtxa6000-latest-1 |
| 4 | linux-arm64-cpu16 |
| 3 | linux-amd64-gpu-h100-latest-1 |
| 3 | linux-amd64-gpu-rtx4090-latest-1 |
| 2 | linux-amd64-gpu-rtx2080-latest-1 |
🟨 CI finished in 1h 23m: Pass: 96%/93 | Total: 23h 11m | Avg: 14m 57s | Max: 1h 20m | Hits: 93%/130146

Modified projects or dependencies: CUB, Thrust, python, CCCL C Parallel Library, Catch2Helper

🏃 Runner counts (total jobs: 93): unchanged from the previous run.
🟨 CI finished in 1h 28m: Pass: 48%/93 | Total: 1d 19h | Avg: 28m 21s | Max: 1h 16m | Hits: 78%/80136

Modified projects or dependencies: CUB, Thrust, python, CCCL C Parallel Library, Catch2Helper

🏃 Runner counts (total jobs: 93): unchanged from the previous run.
Branch updated: 6ad4e02 → b1ad690 (compare)
🟨 CI finished in 1h 25m: Pass: 96%/93 | Total: 2d 12h | Avg: 39m 13s | Max: 1h 19m | Hits: 73%/133750

Modified projects or dependencies: CUB, Thrust, python, CCCL C Parallel Library, Catch2Helper

🏃 Runner counts (total jobs: 93): unchanged from the previous run.
🟨 CI finished in 1h 37m: Pass: 96%/93 | Total: 2d 15h | Avg: 41m 15s | Max: 1h 20m | Hits: 75%/133699

Modified projects or dependencies: CUB, Thrust, python, CCCL C Parallel Library, Catch2Helper

🏃 Runner counts (total jobs: 93): unchanged from the previous run.
A few notes on breaking c.parallel in this PR.

**Current state**

@elstehle added a streaming approach to support a large number of segments. The approach consists of processing up to … segments per invocation.

**Problem**

c.parallel type-erases all iterators and passes them as:

```cpp
struct indirect_arg_t {
  void* ptr;
  void* operator&() const {
    return ptr;
  }
};
```

Later, when we use the driver API to launch a kernel, we take the address of the indirect argument …

**Solution**

To unblock this PR, I'd suggest removing support for large problem sizes from c.parallel. We can SFINAE on the presence of … Supporting a large number of segments in segmented reduction on the c.parallel side is not trivial and should be addressed separately. Given that we'll reuse the solution to this problem in other algorithms, I think it's worth investing time into a small research effort on the c.parallel side. We'll likely need something like:

```cpp
// Pseudocode: ADVANCE and DEREF stand for the user-provided (numba.cuda-compiled)
// advance and dereference operations on the type-erased iterator state.
struct offset_iterator_t {
  user_provided_iterator_state_t state;
  int64_t offset;

  offset_iterator_t& operator+=(difference_type diff) {
    ADVANCE(state, diff);
    return *this;
  }

  reference operator*() const {
    offset_iterator_t it = *this;   // advance a copy by the accumulated offset
    ADVANCE(it.state, offset);
    return DEREF(it.state);
  }
};
```

Then, the indirect iterator type would increment the offset on the host, and we won't have to invoke device code for advance compiled by numba.cuda. This requires extending the user-provided state, which is a bit tricky, but we have a solution in …
🟨 CI finished in 1h 35m: Pass: 58%/93 | Total: 2d 00h | Avg: 31m 17s | Max: 1h 20m | Hits: 76%/86560

Modified projects or dependencies: CUB, Thrust, python, CCCL C Parallel Library, Catch2Helper

🏃 Runner counts (total jobs: 93): unchanged from the previous run.
🟩 CI finished in 1h 27m: Pass: 100%/93 | Total: 2d 16h | Avg: 41m 31s | Max: 1h 24m | Hits: 74%/134019

Modified projects or dependencies: CUB, Thrust, python, CCCL C Parallel Library, Catch2Helper

🏃 Runner counts (total jobs: 93): unchanged from the previous run.
@gevtushenko FYI, we have such an …
I'm afraid it won't work, since …
Adds support for large number of segments to DeviceSegmentedReduce (NVIDIA#3764)

* add support for large num segments on device level
* adds support for large number of segments on dispatch
* refactors offset iterator
* add tests for large number of segments
* fixes style
* renames offset iterator to snake case
* rely on ctad instead of factory function
* adds tests for more device interfaces
* use offset_input_iterator where applicable
* [skip-ci] addresses review comments
* fixes msvc implicit conversion warning
* drops debug print utilities
* removes argmin/max wrappers
* fixes style
* fixes include order
* fixes nvrtc
* expects user iterators to be advancable on the host
* drops redundant include
* adds workaround for c.parallel indirect_arg_t
* adds todo
* uses cuda::std traits
* adds missing exec space specifiers
Description
Closes #3242