Skip to content

Comments

[SYCL][COMPAT] Launch kernels using the enqueue functions extensions#13642

Closed
AD2605 wants to merge 24 commits intointel:syclfrom
AD2605:atharva/sycl_compat_launch_w_properties
Closed

[SYCL][COMPAT] Launch kernels using the enqueue functions extensions#13642
AD2605 wants to merge 24 commits intointel:syclfrom
AD2605:atharva/sycl_compat_launch_w_properties

Conversation

@AD2605
Copy link
Contributor

@AD2605 AD2605 commented May 2, 2024

To support launching kernels with compile time known kernel properties and runtime / compile time known launch properties, this PR adds new launch overloads in a new syclcompat::experimental namespace, making use of the following 3 extensions -

  • SYCL_EXT_ONEAPI_KERNEL_PROPERTIES
  • SYCL_EXT_ONEAPI_PROPERTIES
  • SYCL_EXT_ONEAPI_ENQUEUE_FUNCTIONS

@joeatodd
Copy link
Contributor

joeatodd commented May 6, 2024

@AD2605 thanks a lot for this contribution. It's a useful addition, and it paves the way for eventually incorporating kernel_properties into the main launch APIs. Detailed review to follow, but for now can I suggest that the code in launch_experimental.hpp could just be incorporated into launch.hpp directly, within the syclcompat::experimental namespace?

Copy link
Contributor

@joeatodd joeatodd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @AD2605, thanks for this contribution. Unfortunately there's a lot of untested functionality in here. I would suggest for the sake of speed that you might want to make a new PR with only the subset dealing with KernelPropertiesStruct with some tests. We can park this PR for now and re-open it once we've looked at how best to introduce both kernel and launch properties together.

* work groups per compute unit and maximum cluster size.
* Also provides quick utility structs using subgorup size 16 and 8
* Utilizes the following extension -
* sycl_ext_oneapi_kernel_properties
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In our README.md, there's a list of required SYCL extensions for SYCLcompat. Please could you add the 3 exts that this functionality depends on there?

@joeatodd
Copy link
Contributor

joeatodd commented May 6, 2024

@AD2605 thanks a lot for this contribution. It's a useful addition, and it paves the way for eventually incorporating kernel_properties into the main launch APIs. Detailed review to follow, but for now can I suggest that the code in launch_experimental.hpp could just be incorporated into launch.hpp directly, within the syclcompat::experimental namespace?

Alternatively, if you are keen to introduce all this functionality, we can do so, so long as it's tested, and on the understanding that the API might change once we've reviewed the launch API in general.

Comment on lines 54 to 90
template <auto KernelFunc, typename tuple, std::size_t... I>
__attribute__((always_inline)) inline void
run_kernel(tuple args, std::index_sequence<I...>) {
KernelFunc(std::get<I>(args)...);
}

template <auto KernelFunc, typename tuple>
__attribute__((always_inline)) inline void run_kernel(tuple args) {
auto indices = std::make_index_sequence<std::tuple_size_v<tuple>>{};
run_kernel<KernelFunc>(args, indices);
}

template <auto KernelFunc, class KernelPropertiesStruct, bool UsesLocalMemory,
typename... Args>
struct KernelFunctor {
KernelFunctor(Args... args, char *local_mem_ptr = nullptr)
: argument_tuple(std::make_tuple(args...)), local_mem_ptr(local_mem_ptr) {
}

auto get(sycl_exp::properties_tag) { return kernel_properties; }

__attribute__((always_inline)) inline void
operator()(sycl::nd_item<3> it) const {
if constexpr (UsesLocalMemory) {
run_kernel<KernelFunc>(
std::tuple_cat(argument_tuple, std::make_tuple(local_mem_ptr)));
} else {
run_kernel<KernelFunc>(argument_tuple);
}
}

std::tuple<Args...> argument_tuple;
char *local_mem_ptr;
static constexpr auto kernel_properties =
KernelPropertiesStruct::kernel_properties;
};
} // namespace detail
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of trying to wrap the kernel function and all the kernel attributes in this internal KernelFunctor struct, wouldn't it be simpler (and more flexible) to allow the caller to pass a KernelFunctor "like" struct as a parameter for the launch function? Something in the line of:

template <auto KernelFunctor, typename... Args>
launch(const sycl::nd_range<3> &launch_params, KernelFunctor kernelFunctor,
       const sycl::queue &queue, Args... args) { 
       ...
}

This should allow you to simplify this code a lot

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The launch which are not in the detail namespace, are the user facing launch's, which will be called by the user. Since the KernelFunctor struct is a requirement of the extension, I do not suppose it should be passed on to the user. Also I wanted to keep it similar to the current launch APIs,

Also, how would it make this more flexible, I did not get that part, so if you could please elaborate

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Users can implement the struct in whatever way they want and provide whatever list of kernel properties they need. They just have to maintain the signature of KernelFunctor so the launcher knows which methods to call. That's super flexible from the user's point of view.

In this PR, you're essentially asking the user for each individual piece of information in the KernelFunctor struct so you can build your own internal KernelFunctor. That's forcing you to define over 20 new launch functions to cover all the combinations.

If the user provides the KernelFunctor, you can mostly reuse the current launch API. Just add a new parameter (the kernelFunctor) and replace F (the function kernel) template parameter with KernelFunctor. The rest of the API remains the same.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's forcing you to define over 20 new launch functions to cover all the combinations.

Yeah that's true. I was just approaching from an ease of user standpoint, such that they have the least amount of work. But yes, I can change the approach and offload the KernelFunctor onto the user.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a fair compromise: if the user wants to do more complicated stuff, they can be responsible for creating the KernelFunctor. Can you do this in the syclcompat::experimental namespace still, till we figure out the best long term stable solution?

@Alcpz Alcpz changed the title [SYCL][COMPAT] Launch kernels using the enqueue functions exntesions [SYCL][COMPAT] Launch kernels using the enqueue functions extensions May 8, 2024
@AD2605 AD2605 temporarily deployed to WindowsCILock May 13, 2024 10:40 — with GitHub Actions Inactive
Copy link
Contributor

@joeatodd joeatodd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @AD2605 thanks for paring this PR back a bit. I think this could still be simpler, and that has the significant advantage of requiring fewer tests. Specifically I don't think you need:

  • launch overloads taking sycl::range<Dim>, sycl::range<Dim> args
  • launch overloads which don't take a PropertyList (though I appreciate why you added these)

I would strongly recommend moving those because:

  • you won't then be obliged to write a load more tests
  • we're likely to remove these when we move this out of experimental.

Aside from that, I think this is coming together pretty well. You still need to ensure all your overloads are tested and documented.

Thanks for the contribution 👍

LaunchTestWithArgs<T> ltt;
if (ltt.skip_) // Unsupported aspect
return;

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change

Comment on lines +360 to +361
T *h_a = (T *)syclcompat::malloc_host(ltt.memsize_ * sizeof(T));
T *d_a = (T *)syclcompat::malloc(ltt.memsize_ * sizeof(T));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ltt.memsize_ defines the size (in bytes) of local memory used by these tests. Here (and below) you are using it as number of elements.

template <int Dim, auto KernelFunctor, typename... Args>
inline std::enable_if_t<std::is_invocable_v<decltype(KernelFunctor), Args...>,
sycl::event>
launch(sycl::range<Dim> global_range, sycl::range<Dim> local_range, ,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's going on here? Missing argument?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this also implies you need to look again at the coverage your tests are providing.


#if defined(SYCL_EXT_ONEAPI_KERNEL_PROPERTIES) && \
defined(SYCL_EXT_ONEAPI_PROPERTIES)
// defined(SYCL_EXT_ONEAPI_ENQUEUE_FUNCTIONS) uncomment once
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// defined(SYCL_EXT_ONEAPI_ENQUEUE_FUNCTIONS) uncomment once
// defined(SYCL_EXT_ONEAPI_ENQUEUE_FUNCTIONS) // FIXME(@intel/syclcompat-lib-reviewers): uncomment once

Comment on lines 101 to 122
launch(const sycl::range<Dim> &global_range,
const sycl::range<Dim> &local_range, std::size_t local_memory_size,
const PropertyList &launch_properties, const Args &...args) {
return launch<KernelFunctor>(
::syclcompat::detail::transform_nd_range(
sycl::nd_range<Dim>(global_range, local_range)),
local_memory_size, launch_properties, ::syclcompat::get_default_queue(),
args...);
}

template <int Dim, auto KernelFunctor, typename... Args>
inline std::enable_if_t<std::is_invocable_v<decltype(KernelFunctor), Args..., char *>,
sycl::event>
launch(const sycl::range<Dim> &global_range,
const sycl::range<Dim> &local_range, std::size_t local_memory_size,
const Args &...args) {
using PropertyList = decltype(detail::empty_property_list);
return launch<KernelFunctor, PropertyList>(
::syclcompat::detail::transform_nd_range(
sycl::nd_range<Dim>(global_range, local_range)),
local_memory_size, detail::empty_property_list, args...);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you need these overloads which take 2 sycl::range<Dim>, sycl::range<Dim> args? Isn't it sufficient to have sycl::nd_range overload and dim3, dim3 overload?

Comment on lines 181 to 203
template <int Dim, auto KernelFunctor, typename PropertyList, typename... Args>
inline std::enable_if_t<std::is_invocable_v<decltype(KernelFunctor), Args...>,
sycl::event>
launch(sycl::range<Dim> global_range, sycl::range<Dim> local_range,
const PropertyList &launch_properties, const Args &...args) {
return launch<KernelFunctor>(
::syclcompat::detail::transform_nd_range(
sycl::nd_range<3>(global_range, local_range)),
launch_properties, ::syclcompat::get_default_queue(), args...);
}

template <int Dim, auto KernelFunctor, typename... Args>
inline std::enable_if_t<std::is_invocable_v<decltype(KernelFunctor), Args...>,
sycl::event>
launch(sycl::range<Dim> global_range, sycl::range<Dim> local_range, ,
const Args &...args) {
using PropertyList = decltype(detail::empty_property_list);
return launch<KernelFunctor, PropertyList>(
::syclcompat::detail::transform_nd_range(
sycl::nd_range<Dim>(global_range, local_range)),
empty_properties_t, args...);
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As above, do you need these overloads?

Comment on lines 92 to 93
using PropertyList = decltype(detail::empty_property_list);
return launch<KernelFunctor, PropertyList>(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The overloads you have provided which don't take a PropertyList and instead pass an empty one: these are nice because they reflect how the syclcompat::launch functions will work once we integrate this properly. However, for now they are just duplicating the equivalent syclcompat::launch API but without tests. If you don't want to bother adding more tests for equivalent APIs, I would suggest pulling out these overloads.

@JackAKirk
Copy link
Contributor

JackAKirk commented May 13, 2024

I think that there needs to be some kind of specialization that will call a new unified runtime function from the new UR cuda plugin extension I'm adding that calls cudaLaunchKernelExC with the cluster dimensions. I'm not sure if you've added this already somewhere in this code?

e·g. is there/ do you plan to add an specialization of launch/parallel_for that can specialize for the

properties  cluster_launch_property{ClusterRange(1, 2, 1)};

argument that you have here: https://github.com/intel/llvm/pull/13594/files#diff-96a41bacbe4aca8737244a37e62f63c18fccd2274588d37c26ca421f2fb857a0R140

Thanks

@AD2605
Copy link
Contributor Author

AD2605 commented May 14, 2024

Hi @JackAKirk , thanks for having a look at this PR.

I did a little digging after your comment, (I have not looked into implementing the UR Side)
So a parallel_for overload already exists here which accepts the property list , and the calls the overloaded parallel_for_impl. Over there, we can check if the property list contains the property ClusterRange and then call the UR function you are adding.

This would also mean one can launch a kernel with cluster as

cgh.parallel_for(nd_range(...), sycl::ext::oneapi::properties{ClusterRange(...)}, [=](nd_item<Dim>{}));

I did not know this parallel for overload existed. What I do not see however, is the overloads introduced in sycl_ext_oneapi_enqueue_functions calling this overload even when properties are mentioned, and even the tests added does not seem to test launch with properties ? (https://github.com/intel/llvm/pull/13512/files#diff-f6b7355d29c87088898f102554c5a82ed290c8261ab55c0c06adb3af7a9ac932)

But yeah to answer your question, a new overload will not be required, but just a specialization of the parallel_for_impl which accepts the properties, and possibly a bug fix in the sycl_ext_oneapi_enqueue_functions ?

@joeatodd
Copy link
Contributor

@JackAKirk, we're planning to overhaul the launch API prior to the 2025.0 release, largely in order to be able to accept whatever kernel and launch properties the user might specify in some kind of struct. So, assuming the cluster_launch_property can be used similarly to other launch properties, this shouldn't be a problem.

@JackAKirk
Copy link
Contributor

JackAKirk commented May 14, 2024

It looks like the most natural way to plumb it to UR would be to follow what happens for cooperative kernels, e.g. add a bool e.g. MImpl->MKernelIsCustom similar to
MImpl->MKernelIsCooperative

Result = enqueueImpKernel(
,along with the additional kernel parameters, then this logic eventually makes its way to this function
static pi_result SetKernelParamsAndLaunch(
where it is used to call the appropriate the pi wrapper function e.g. piEnqueueKernelLaunch, for the UR kernel launch function urEnqueueKernelLaunch. I will be making an extension for a new UR function e.g. urEnqueueKernelLaunchCustom that calls cuLaunchKernelExC in the cuda adapter. There needs to be the logic like I described above to distinguish when a cluster size is passed such that urEnqueueKernelLaunchCustom is called instead, similar to how the MKernelIsCooperative bool is currently used.

@JackAKirk
Copy link
Contributor

JackAKirk commented May 14, 2024

It looks like the most natural way to plumb it to UR would be to follow what happens for cooperative kernels, e.g. add a bool e.g. MImpl->MKernelIsCustom similar to MImpl->MKernelIsCooperative

Result = enqueueImpKernel(

,along with the additional kernel parameters, then this logic eventually makes its way to this function

static pi_result SetKernelParamsAndLaunch(

where it is used to call the appropriate the pi wrapper function e.g. piEnqueueKernelLaunch, for the UR kernel launch function urEnqueueKernelLaunch. I will be making an extension for a new UR function e.g. urEnqueueKernelLaunchCustom that calls cuLaunchKernelExC in the cuda adapter. There needs to be the logic like I described above to distinguish when a cluster size is passed such that urEnqueueKernelLaunchCustom is called instead, similar to how the MKernelIsCooperative bool is currently used.

One question I had was whether you can have cooperative kernels and set launch time cluster size at the same time. It turns out that you can. Whilst their interfaces are quite different, functionally cuLaunchCooperativeKernel is a subset of cuLaunchKernelEx. I think this is possibly going to provide a natural resolution of the issues I described above:

  • I imagine the intention of the CUDA api is that cuLaunchKernelEx will replace cuLaunchCooperativeKernel going forward, as a more general and extensible method of providing launch time configuration for cooperative kernels/ custom distributed shared memory "cluster group" config, or anything else.
  • We should sync with other backend stakeholders asap, but I think in turn it would be natural if the new UR api that maps to cuLaunchKernelEx could also eventually replace urEnqueueCooperativeKernelLaunch.

This would then resolve the issues raised, because all backends could switch to using the new "launch-time-kernel" UR interface that I will add, and the logic of dpc++ can generalize the MImpl->MKernelIsCooperative bool to something more general and appropriately named e.g. MImpl->MKernelIsLaunchTimeConfig

@joeatodd
Copy link
Contributor

joeatodd commented Jun 5, 2024

Closing this for now as we went another way.

@joeatodd joeatodd closed this Jun 5, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants