[SYCL][COMPAT] Launch kernels using the enqueue functions extensions#13642
[SYCL][COMPAT] Launch kernels using the enqueue functions extensions#13642AD2605 wants to merge 24 commits intointel:syclfrom
Conversation
…at_launch_w_properties
…_launch_w_properties
…_launch_w_properties
|
@AD2605 thanks a lot for this contribution. It's a useful addition, and it paves the way for eventually incorporating |
joeatodd
left a comment
There was a problem hiding this comment.
Hey @AD2605, thanks for this contribution. Unfortunately there's a lot of untested functionality in here. I would suggest for the sake of speed that you might want to make a new PR with only the subset dealing with KernelPropertiesStruct with some tests. We can park this PR for now and re-open it once we've looked at how best to introduce both kernel and launch properties together.
| * work groups per compute unit and maximum cluster size. | ||
| * Also provides quick utility structs using subgorup size 16 and 8 | ||
| * Utilizes the following extension - | ||
| * sycl_ext_oneapi_kernel_properties |
There was a problem hiding this comment.
In our README.md, there's a list of required SYCL extensions for SYCLcompat. Please could you add the 3 exts that this functionality depends on there?
Alternatively, if you are keen to introduce all this functionality, we can do so, so long as it's tested, and on the understanding that the API might change once we've reviewed the |
| template <auto KernelFunc, typename tuple, std::size_t... I> | ||
| __attribute__((always_inline)) inline void | ||
| run_kernel(tuple args, std::index_sequence<I...>) { | ||
| KernelFunc(std::get<I>(args)...); | ||
| } | ||
|
|
||
| template <auto KernelFunc, typename tuple> | ||
| __attribute__((always_inline)) inline void run_kernel(tuple args) { | ||
| auto indices = std::make_index_sequence<std::tuple_size_v<tuple>>{}; | ||
| run_kernel<KernelFunc>(args, indices); | ||
| } | ||
|
|
||
| template <auto KernelFunc, class KernelPropertiesStruct, bool UsesLocalMemory, | ||
| typename... Args> | ||
| struct KernelFunctor { | ||
| KernelFunctor(Args... args, char *local_mem_ptr = nullptr) | ||
| : argument_tuple(std::make_tuple(args...)), local_mem_ptr(local_mem_ptr) { | ||
| } | ||
|
|
||
| auto get(sycl_exp::properties_tag) { return kernel_properties; } | ||
|
|
||
| __attribute__((always_inline)) inline void | ||
| operator()(sycl::nd_item<3> it) const { | ||
| if constexpr (UsesLocalMemory) { | ||
| run_kernel<KernelFunc>( | ||
| std::tuple_cat(argument_tuple, std::make_tuple(local_mem_ptr))); | ||
| } else { | ||
| run_kernel<KernelFunc>(argument_tuple); | ||
| } | ||
| } | ||
|
|
||
| std::tuple<Args...> argument_tuple; | ||
| char *local_mem_ptr; | ||
| static constexpr auto kernel_properties = | ||
| KernelPropertiesStruct::kernel_properties; | ||
| }; | ||
| } // namespace detail |
There was a problem hiding this comment.
Instead of trying to wrap the kernel function and all the kernel attributes in this internal KernelFunctor struct, wouldn't it be simpler (and more flexible) to allow the caller to pass a KernelFunctor "like" struct as a parameter for the launch function? Something in the line of:
template <auto KernelFunctor, typename... Args>
launch(const sycl::nd_range<3> &launch_params, KernelFunctor kernelFunctor,
const sycl::queue &queue, Args... args) {
...
}
This should allow you to simplify this code a lot
There was a problem hiding this comment.
The launch which are not in the detail namespace, are the user facing launch's, which will be called by the user. Since the KernelFunctor struct is a requirement of the extension, I do not suppose it should be passed on to the user. Also I wanted to keep it similar to the current launch APIs,
Also, how would it make this more flexible, I did not get that part, so if you could please elaborate
There was a problem hiding this comment.
Users can implement the struct in whatever way they want and provide whatever list of kernel properties they need. They just have to maintain the signature of KernelFunctor so the launcher knows which methods to call. That's super flexible from the user's point of view.
In this PR, you're essentially asking the user for each individual piece of information in the KernelFunctor struct so you can build your own internal KernelFunctor. That's forcing you to define over 20 new launch functions to cover all the combinations.
If the user provides the KernelFunctor, you can mostly reuse the current launch API. Just add a new parameter (the kernelFunctor) and replace F (the function kernel) template parameter with KernelFunctor. The rest of the API remains the same.
There was a problem hiding this comment.
That's forcing you to define over 20 new launch functions to cover all the combinations.
Yeah that's true. I was just approaching from an ease of user standpoint, such that they have the least amount of work. But yes, I can change the approach and offload the KernelFunctor onto the user.
There was a problem hiding this comment.
I think this is a fair compromise: if the user wants to do more complicated stuff, they can be responsible for creating the KernelFunctor. Can you do this in the syclcompat::experimental namespace still, till we figure out the best long term stable solution?
joeatodd
left a comment
There was a problem hiding this comment.
Hey @AD2605 thanks for paring this PR back a bit. I think this could still be simpler, and that has the significant advantage of requiring fewer tests. Specifically I don't think you need:
- launch overloads taking
sycl::range<Dim>, sycl::range<Dim>args - launch overloads which don't take a PropertyList (though I appreciate why you added these)
I would strongly recommend moving those because:
- you won't then be obliged to write a load more tests
- we're likely to remove these when we move this out of
experimental.
Aside from that, I think this is coming together pretty well. You still need to ensure all your overloads are tested and documented.
Thanks for the contribution 👍
| LaunchTestWithArgs<T> ltt; | ||
| if (ltt.skip_) // Unsupported aspect | ||
| return; | ||
|
|
| T *h_a = (T *)syclcompat::malloc_host(ltt.memsize_ * sizeof(T)); | ||
| T *d_a = (T *)syclcompat::malloc(ltt.memsize_ * sizeof(T)); |
There was a problem hiding this comment.
ltt.memsize_ defines the size (in bytes) of local memory used by these tests. Here (and below) you are using it as number of elements.
| template <int Dim, auto KernelFunctor, typename... Args> | ||
| inline std::enable_if_t<std::is_invocable_v<decltype(KernelFunctor), Args...>, | ||
| sycl::event> | ||
| launch(sycl::range<Dim> global_range, sycl::range<Dim> local_range, , |
There was a problem hiding this comment.
What's going on here? Missing argument?
There was a problem hiding this comment.
I think this also implies you need to look again at the coverage your tests are providing.
|
|
||
| #if defined(SYCL_EXT_ONEAPI_KERNEL_PROPERTIES) && \ | ||
| defined(SYCL_EXT_ONEAPI_PROPERTIES) | ||
| // defined(SYCL_EXT_ONEAPI_ENQUEUE_FUNCTIONS) uncomment once |
There was a problem hiding this comment.
| // defined(SYCL_EXT_ONEAPI_ENQUEUE_FUNCTIONS) uncomment once | |
| // defined(SYCL_EXT_ONEAPI_ENQUEUE_FUNCTIONS) // FIXME(@intel/syclcompat-lib-reviewers): uncomment once |
| launch(const sycl::range<Dim> &global_range, | ||
| const sycl::range<Dim> &local_range, std::size_t local_memory_size, | ||
| const PropertyList &launch_properties, const Args &...args) { | ||
| return launch<KernelFunctor>( | ||
| ::syclcompat::detail::transform_nd_range( | ||
| sycl::nd_range<Dim>(global_range, local_range)), | ||
| local_memory_size, launch_properties, ::syclcompat::get_default_queue(), | ||
| args...); | ||
| } | ||
|
|
||
| template <int Dim, auto KernelFunctor, typename... Args> | ||
| inline std::enable_if_t<std::is_invocable_v<decltype(KernelFunctor), Args..., char *>, | ||
| sycl::event> | ||
| launch(const sycl::range<Dim> &global_range, | ||
| const sycl::range<Dim> &local_range, std::size_t local_memory_size, | ||
| const Args &...args) { | ||
| using PropertyList = decltype(detail::empty_property_list); | ||
| return launch<KernelFunctor, PropertyList>( | ||
| ::syclcompat::detail::transform_nd_range( | ||
| sycl::nd_range<Dim>(global_range, local_range)), | ||
| local_memory_size, detail::empty_property_list, args...); | ||
| } |
There was a problem hiding this comment.
Do you need these overloads which take 2 sycl::range<Dim>, sycl::range<Dim> args? Isn't it sufficient to have sycl::nd_range overload and dim3, dim3 overload?
| template <int Dim, auto KernelFunctor, typename PropertyList, typename... Args> | ||
| inline std::enable_if_t<std::is_invocable_v<decltype(KernelFunctor), Args...>, | ||
| sycl::event> | ||
| launch(sycl::range<Dim> global_range, sycl::range<Dim> local_range, | ||
| const PropertyList &launch_properties, const Args &...args) { | ||
| return launch<KernelFunctor>( | ||
| ::syclcompat::detail::transform_nd_range( | ||
| sycl::nd_range<3>(global_range, local_range)), | ||
| launch_properties, ::syclcompat::get_default_queue(), args...); | ||
| } | ||
|
|
||
| template <int Dim, auto KernelFunctor, typename... Args> | ||
| inline std::enable_if_t<std::is_invocable_v<decltype(KernelFunctor), Args...>, | ||
| sycl::event> | ||
| launch(sycl::range<Dim> global_range, sycl::range<Dim> local_range, , | ||
| const Args &...args) { | ||
| using PropertyList = decltype(detail::empty_property_list); | ||
| return launch<KernelFunctor, PropertyList>( | ||
| ::syclcompat::detail::transform_nd_range( | ||
| sycl::nd_range<Dim>(global_range, local_range)), | ||
| empty_properties_t, args...); | ||
| } | ||
|
|
There was a problem hiding this comment.
As above, do you need these overloads?
| using PropertyList = decltype(detail::empty_property_list); | ||
| return launch<KernelFunctor, PropertyList>( |
There was a problem hiding this comment.
The overloads you have provided which don't take a PropertyList and instead pass an empty one: these are nice because they reflect how the syclcompat::launch functions will work once we integrate this properly. However, for now they are just duplicating the equivalent syclcompat::launch API but without tests. If you don't want to bother adding more tests for equivalent APIs, I would suggest pulling out these overloads.
|
I think that there needs to be some kind of specialization that will call a new unified runtime function from the new UR cuda plugin extension I'm adding that calls cudaLaunchKernelExC with the cluster dimensions. I'm not sure if you've added this already somewhere in this code? e·g. is there/ do you plan to add an specialization of launch/parallel_for that can specialize for the argument that you have here: https://github.com/intel/llvm/pull/13594/files#diff-96a41bacbe4aca8737244a37e62f63c18fccd2274588d37c26ca421f2fb857a0R140 Thanks |
|
Hi @JackAKirk , thanks for having a look at this PR. I did a little digging after your comment, (I have not looked into implementing the UR Side) This would also mean one can launch a kernel with cluster as I did not know this parallel for overload existed. What I do not see however, is the overloads introduced in But yeah to answer your question, a new overload will not be required, but just a specialization of the parallel_for_impl which accepts the properties, and possibly a bug fix in the |
|
@JackAKirk, we're planning to overhaul the |
|
It looks like the most natural way to plumb it to UR would be to follow what happens for cooperative kernels, e.g. add a bool e.g. Line 311 in af65855 llvm/sycl/source/detail/scheduler/commands.cpp Line 2369 in af65855 piEnqueueKernelLaunch, for the UR kernel launch function urEnqueueKernelLaunch. I will be making an extension for a new UR function e.g. urEnqueueKernelLaunchCustom that calls cuLaunchKernelExC in the cuda adapter. There needs to be the logic like I described above to distinguish when a cluster size is passed such that urEnqueueKernelLaunchCustom is called instead, similar to how the MKernelIsCooperative bool is currently used.
|
One question I had was whether you can have cooperative kernels and set launch time cluster size at the same time. It turns out that you can. Whilst their interfaces are quite different, functionally
This would then resolve the issues raised, because all backends could switch to using the new "launch-time-kernel" UR interface that I will add, and the logic of dpc++ can generalize the |
|
Closing this for now as we went another way. |
To support launching kernels with compile time known kernel properties and runtime / compile time known launch properties, this PR adds new
launchoverloads in a newsyclcompat::experimentalnamespace, making use of the following 3 extensions -SYCL_EXT_ONEAPI_KERNEL_PROPERTIESSYCL_EXT_ONEAPI_PROPERTIESSYCL_EXT_ONEAPI_ENQUEUE_FUNCTIONS