Implement the new tuning API for DeviceTransform #6914

base: main

Conversation
Force-pushed 4244463 to 43feb21
c/parallel/src/transform.cu (outdated)
auto make_iterator_info(cccl_iterator_t input_it) -> cdt::iterator_info
{
  return {static_cast<int>(input_it.value_type.size),
          static_cast<int>(input_it.value_type.alignment),
          /* trivially_relocatable */ true, // TODO(bgruber): how to check this properly?
          input_it.type == CCCL_POINTER}; // TODO(bgruber): how to check this properly?
I would appreciate some cccl.c maintainer input here. How do I know whether the iterator's value type is trivially relocatable and whether the iterator is contiguous?
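For reference, with the full iterator type available on the C++ side both properties map onto existing Thrust traits; the difficulty is that the C API only sees a type-erased cccl_iterator_t. A rough sketch of the typed version, mirroring the snippet above (make_iterator_info_from_type is a hypothetical helper name):

#include <cuda/std/iterator>
#include <thrust/type_traits/is_contiguous_iterator.h>
#include <thrust/type_traits/is_trivially_relocatable.h>

// Hypothetical helper: builds the same info from a concrete iterator type
// instead of the type-erased cccl_iterator_t.
template <typename Iterator>
auto make_iterator_info_from_type() -> cdt::iterator_info
{
  using value_t = typename ::cuda::std::iterator_traits<Iterator>::value_type;
  return {static_cast<int>(sizeof(value_t)),
          static_cast<int>(alignof(value_t)),
          thrust::is_trivially_relocatable<value_t>::value,
          thrust::is_contiguous_iterator<Iterator>::value};
}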
std::unique_ptr<arch_policies<1>> rtp(static_cast<arch_policies<1>*>(build_ptr->runtime_policy)); // FIXME(bgruber): handle <2> as well
Is there any way in this function to distinguish whether we build the unary or binary transform?
Force-pushed fca1221 to 2aade5f
Force-pushed 2aade5f to 57cc332
_CCCL_API constexpr int get_block_threads_helper()
{
  if constexpr (ActivePolicy::algorithm == Algorithm::prefetch)

constexpr transform_arch_policy policy = ArchPolicies{}(::cuda::arch_id{CUB_PTX_ARCH / 10});
I hate the arcane / 10 here with a passion
I would love to call ::cuda::current_arch_id() but it's not constexpr on NVHPC by design.
#if _CCCL_HAS_CONCEPTS()
requires transform_policy_hub<ArchPolicies>
#endif // _CCCL_HAS_CONCEPTS()
Nitpick: I believe we should either use the concept emulation or plain SFINAE in C++17 too
Hmm. We could also static_assert, but ArchPolicies is already used in the kernel attributes before we reach the body, and a static_assert would only be evaluated in the device path.
How would I write that using concept emulation and have the concept check before the __launch_bounds__?
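For comparison, here is a self-contained sketch of the plain-SFINAE route in C++17 (hypothetical names, with a stand-in trait instead of the real transform_policy_hub): the constraint sits in a defaulted template parameter, so it is part of the kernel's declaration rather than only a static_assert in the body, and it also applies on the host pass.

#include <cuda/std/type_traits>

// stand-in for the real transform_policy_hub check
template <typename T>
inline constexpr bool is_policy_hub_v = ::cuda::std::is_class_v<T>;

template <typename ArchPolicies,
          ::cuda::std::enable_if_t<is_policy_hub_v<ArchPolicies>, int> = 0>
__global__ void __launch_bounds__(256) example_transform_kernel()
{
  // kernel body would pull the block size etc. from ArchPolicies here
}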
bool all_inputs_contiguous = true;
bool all_input_values_trivially_reloc = true;
bool can_memcpy_contiguous_inputs = true;
bool all_value_types_have_power_of_two_size = ::cuda::is_power_of_two(output.value_type_size);
for (const auto& input : inputs)
{
  all_inputs_contiguous &= input.is_contiguous;
  all_input_values_trivially_reloc &= input.value_type_is_trivially_relocatable;
  // the vectorized kernel supports mixing contiguous and non-contiguous iterators
  can_memcpy_contiguous_inputs &= !input.is_contiguous || input.value_type_is_trivially_relocatable;
  all_value_types_have_power_of_two_size &= ::cuda::is_power_of_two(input.value_type_size);
}
Nitpick: While the loop is technically more efficient, I believe it would improve readability if we did

const bool all_inputs_contiguous = ::cuda::std::all_of(inputs.begin(), inputs.end(), [](const auto& input) { return input.is_contiguous; });
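Spelled out for all four flags, the suggestion would look roughly like this (a sketch of the reviewer's proposal, assuming inputs and output keep the fields used above):

const bool all_inputs_contiguous = ::cuda::std::all_of(
  inputs.begin(), inputs.end(), [](const auto& input) { return input.is_contiguous; });
const bool all_input_values_trivially_reloc = ::cuda::std::all_of(
  inputs.begin(), inputs.end(), [](const auto& input) { return input.value_type_is_trivially_relocatable; });
// the vectorized kernel supports mixing contiguous and non-contiguous iterators
const bool can_memcpy_contiguous_inputs = ::cuda::std::all_of(
  inputs.begin(), inputs.end(), [](const auto& input) { return !input.is_contiguous || input.value_type_is_trivially_relocatable; });
const bool all_value_types_have_power_of_two_size =
  ::cuda::is_power_of_two(output.value_type_size)
  && ::cuda::std::all_of(inputs.begin(), inputs.end(),
                         [](const auto& input) { return ::cuda::is_power_of_two(input.value_type_size); });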
Force-pushed cb0fac5 to 1d14a3e
Force-pushed 1d14a3e to a661d8f
I see tiny changes in the generated SASS for … The fill kernel for … All kernels with a functor marked as … It feels a bit like the items per thread changed for the fill kernels.
They did. Before, we had a tuning policy for sm_120 that was not taken into account :D This PR now uses it.
I disabled the sm120 fill policy and now the only SASS diff for filling is on … which is a …
Found the final issue with the fill kernels. Disabled the vectorized tunings when we have input streams (they were tuned for output-only use cases). SASS of …
Force-pushed 1139c44 to c8b2ef6
Force-pushed b1d9947 to 1eb0a20
// for function: found previous definition of same function!"` when we pass a const& as template parameter (and the
// function template body contains a lambda). As a workaround, we pass the parts of the policy by value.
// TODO(bgruber): In C++20, we should just pass transform_arch_policy by value.
template < // const transform_arch_policy& Policy,
Is there a reason to pass transform_arch_policy by const& in the template arguments?
transform_arch_policy is not a literal type in C++17, so we can only pass a pointer or reference to a static constexpr instance of it. However, nvcc dies if I do this, so I had to pass the relevant members of transform_arch_policy instead.
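A minimal illustration of the constraint being described (names are made up for the example, this is not the PR's code):

struct policy_t // stand-in for transform_arch_policy
{
  int block_threads;
  int items_per_thread;
};

// C++20: a structural class type can be passed as a non-type template parameter by value
// template <policy_t Policy> __global__ void kernel();

// C++17: only a reference (or pointer) to an object with static storage duration works
inline constexpr policy_t some_policy{256, 4};
template <const policy_t& Policy> __global__ void kernel_by_ref(); // the const& form nvcc reportedly rejects
                                                                   // once the body contains a lambda

// workaround taken in the PR: pass the members the kernel needs individually
template <int BlockThreads, int ItemsPerThread> __global__ void kernel_by_members();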
// if we have to fall back to prefetching, use these values:
int prefetch_items_per_thread_no_input = 2;
int prefetch_min_items_per_thread = 1;
int prefetch_max_items_per_thread = 32;
Should this rather hold a prefetch_policy instead of the individual members?
I think I initially inherited from a prefetch_policy, but then designated initializers no longer work. Then I made it a prefetch_policy member, which was awkward again, because you had to write policy.prefetch_policy.block_threads to get the block threads for the vectorized policy in case the fallback is not needed. The current state avoids both issues, but I agree it's not nice.
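The three layouts being weighed, sketched with illustrative names (the fallback values are taken from the snippet above):

struct prefetch_policy_t
{
  int block_threads;
  int items_per_thread_no_input;
};

// (a) inheritance: designated initializers can only name direct non-static data
//     members, so the inherited prefetch fields can no longer be designated
// struct vectorized_policy : prefetch_policy_t { int load_store_word_size; };

// (b) member: works, but access becomes policy.prefetch_policy.block_threads
struct vectorized_policy_with_member
{
  prefetch_policy_t prefetch_policy;
  int load_store_word_size;
};

// (c) current state: flatten the fallback values into the policy itself
struct vectorized_policy_flat
{
  int block_threads;
  // if we have to fall back to prefetching, use these values:
  int prefetch_items_per_thread_no_input = 2;
  int prefetch_min_items_per_thread = 1;
  int prefetch_max_items_per_thread = 32;
};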
  store_size > 4 ? 128 : 256, 16, ::cuda::std::max(8 / store_size, 1) /* 64-bit instructions */};
}
// manually tuned fill on A100
if (arch >= ::cuda::arch_id::sm_90) // TODO(bgruber): this should be sm_80
Why is this then not changed? This code path cannot be taken currently
Because enabling it caused sudden compilation errors that I didn't understand and I had to make progress elsewhere. But this needs to be fixed before merging.
Force-pushed 53d0629 to 766a3ec
I pulled out the arch dispatching logic into #7093.
reference to local variable 'arch_seq' declared in enclosing function 'cub::detail::dispatch_arch'
Force-pushed 766a3ec to 3afcacf
Force-pushed 3afcacf to 7040c72
😬 CI Workflow Results: 🟥 Finished in 6h 00m. Pass: 94%/143 | Total: 8d 07h | Max: 6h 00m | Hits: 72%/178221
Fixes: #6919
cuda::__all_arch_ids and cuda::__is_specific_arch (#6916)
cub.bench.transform.babelstream.base
cub.test.device.transform.lid_0
__CUDA_ARCH_LIST__

Compile time of cub.test.device.transform.lid_0 using nvcc 13.1 and clang 20 for sm86, sm120 (TODO: outdated):
nvcc 13.1:
branch: 2m8.741s, 2m7.726s, 2m7.949s
main: 2m7.661s, 2m6.072s, 2m9.804s

Using clang 20 in CUDA mode:
branch: 1m40.627s, 1m40.675s, 1m40.912s
main: 1m39.273s, 1m39.669s, 1m39.835s