[SYCL][COMPAT] Launch kernels using the enqueue functions extensions by AD2605 · Pull Request #13642 · intel/llvm

AD2605 · 2024-05-02T15:41:47Z

To support launching kernels with compile-time-known kernel properties and runtime- or compile-time-known launch properties, this PR adds new launch overloads in a new syclcompat::experimental namespace, making use of the following three extensions:

SYCL_EXT_ONEAPI_KERNEL_PROPERTIES
SYCL_EXT_ONEAPI_PROPERTIES
SYCL_EXT_ONEAPI_ENQUEUE_FUNCTIONS


joeatodd · 2024-05-06T08:28:24Z

@AD2605 thanks a lot for this contribution. It's a useful addition, and it paves the way for eventually incorporating kernel_properties into the main launch APIs. Detailed review to follow, but for now can I suggest that the code in launch_experimental.hpp could just be incorporated into launch.hpp directly, within the syclcompat::experimental namespace?

joeatodd

Hey @AD2605, thanks for this contribution. Unfortunately there's a lot of untested functionality in here. I would suggest for the sake of speed that you might want to make a new PR with only the subset dealing with KernelPropertiesStruct with some tests. We can park this PR for now and re-open it once we've looked at how best to introduce both kernel and launch properties together.

joeatodd · 2024-05-06T08:30:57Z

sycl/include/syclcompat/kernel_properties.hpp

+ *    work groups per compute unit and maximum cluster size.
+ *    Also provides quick utility structs using subgroup size 16 and 8
+ *    Utilizes the following extension - 
+ *      sycl_ext_oneapi_kernel_properties


In our README.md, there's a list of required SYCL extensions for SYCLcompat. Please could you add the 3 exts that this functionality depends on there?

joeatodd · 2024-05-06T14:01:55Z

Alternatively, if you are keen to introduce all this functionality, we can do so, so long as it's tested, and on the understanding that the API might change once we've reviewed the launch API in general.

aacostadiaz · 2024-05-06T16:38:06Z

sycl/include/syclcompat/launch_experimental.hpp

+template <auto KernelFunc, typename tuple, std::size_t... I>
+__attribute__((always_inline)) inline void
+run_kernel(tuple args, std::index_sequence<I...>) {
+  KernelFunc(std::get<I>(args)...);
+}
+
+template <auto KernelFunc, typename tuple>
+__attribute__((always_inline)) inline void run_kernel(tuple args) {
+  auto indices = std::make_index_sequence<std::tuple_size_v<tuple>>{};
+  run_kernel<KernelFunc>(args, indices);
+}
+
+template <auto KernelFunc, class KernelPropertiesStruct, bool UsesLocalMemory,
+          typename... Args>
+struct KernelFunctor {
+  KernelFunctor(Args... args, char *local_mem_ptr = nullptr)
+      : argument_tuple(std::make_tuple(args...)), local_mem_ptr(local_mem_ptr) {
+  }
+
+  auto get(sycl_exp::properties_tag) { return kernel_properties; }
+
+  __attribute__((always_inline)) inline void
+  operator()(sycl::nd_item<3> it) const {
+    if constexpr (UsesLocalMemory) {
+      run_kernel<KernelFunc>(
+          std::tuple_cat(argument_tuple, std::make_tuple(local_mem_ptr)));
+    } else {
+      run_kernel<KernelFunc>(argument_tuple);
+    }
+  }
+
+  std::tuple<Args...> argument_tuple;
+  char *local_mem_ptr;
+  static constexpr auto kernel_properties =
+      KernelPropertiesStruct::kernel_properties;
+};
+} // namespace detail


Instead of wrapping the kernel function and all the kernel attributes in this internal KernelFunctor struct, wouldn't it be simpler (and more flexible) to let the caller pass a KernelFunctor-like struct as a parameter to the launch function? Something along the lines of:

template <typename KernelFunctor, typename... Args>
sycl::event launch(const sycl::nd_range<3> &launch_params, KernelFunctor kernelFunctor, const sycl::queue &queue, Args... args) { ... }

This should allow you to simplify this code a lot.

The launch overloads outside the detail namespace are the user-facing ones, which the user will call. Since the KernelFunctor struct is a requirement of the extension, I don't think it should be pushed onto the user. I also wanted to keep this similar to the current launch APIs.

Also, how would this make it more flexible? I didn't follow that part, so could you please elaborate?

Users can implement the struct in whatever way they want and provide whatever list of kernel properties they need. They just have to maintain the signature of KernelFunctor so the launcher knows which methods to call. That's super flexible from the user's point of view.

In this PR, you're essentially asking the user for each individual piece of information in the KernelFunctor struct so you can build your own internal KernelFunctor. That's forcing you to define over 20 new launch functions to cover all the combinations.

If the user provides the KernelFunctor, you can mostly reuse the current launch API. Just add a new parameter (the kernelFunctor) and replace F (the function kernel) template parameter with KernelFunctor. The rest of the API remains the same.
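The suggestion above can be illustrated with a minimal plain-C++ sketch. It uses no SYCL headers: `properties_tag`, `mock_properties`, `ScaleKernel`, and this `launch` are hypothetical stand-ins invented for illustration, and a serial loop takes the place of a real kernel enqueue. The point is only that one generic launcher can serve any functor that exposes `get(properties_tag)` and a call operator.

```cpp
#include <cstddef>

// Hypothetical stand-ins for the SYCL extension types; not the real API.
struct properties_tag {};
struct mock_properties {
  std::size_t sub_group_size;
};

// A user-written functor: it carries its own properties and its kernel body.
struct ScaleKernel {
  static constexpr mock_properties kernel_properties{16};

  constexpr auto get(properties_tag) const { return kernel_properties; }

  // Kernel body, invoked once per work-item index.
  void operator()(std::size_t idx, int *data, int factor) const {
    data[idx] *= factor;
  }
};

// One generic launcher covers every functor with this shape, so no extra
// overload is needed per combination of properties.
template <typename KernelFunctor, typename... Args>
mock_properties launch(std::size_t global_size, KernelFunctor functor,
                       Args... args) {
  for (std::size_t i = 0; i < global_size; ++i)
    functor(i, args...); // serial loop stands in for the device launch
  return functor.get(properties_tag{});
}
```

Calling `launch(4, ScaleKernel{}, data, 3)` on a four-element `int` array multiplies each element by 3 and returns the functor's own properties, which a real launcher would forward to the runtime instead.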

> That's forcing you to define over 20 new launch functions to cover all the combinations.

Yeah, that's true. I was approaching this from an ease-of-use standpoint, so that users have the least amount of work. But yes, I can change the approach and offload the KernelFunctor onto the user.

I think this is a fair compromise: if the user wants to do more complicated stuff, they can be responsible for creating the KernelFunctor. Can you do this in the syclcompat::experimental namespace still, till we figure out the best long term stable solution?


joeatodd

Hey @AD2605, thanks for paring this PR back a bit. I think this could still be simpler, which has the significant advantage of requiring fewer tests. Specifically, I don't think you need:

- launch overloads taking sycl::range<Dim>, sycl::range<Dim> args
- launch overloads which don't take a PropertyList (though I appreciate why you added these)

I would strongly recommend removing those because:

- you won't then be obliged to write a load more tests
- we're likely to remove these when we move this out of experimental.

Aside from that, I think this is coming together pretty well. You still need to ensure all your overloads are tested and documented.

Thanks for the contribution 👍

joeatodd · 2024-05-13T09:23:11Z

sycl/test-e2e/syclcompat/launch/launch.cpp

+  LaunchTestWithArgs<T> ltt;
+  if (ltt.skip_) // Unsupported aspect
+    return;
+



joeatodd · 2024-05-13T09:29:20Z

sycl/test-e2e/syclcompat/launch/launch.cpp

+  T *h_a = (T *)syclcompat::malloc_host(ltt.memsize_ * sizeof(T));
+  T *d_a = (T *)syclcompat::malloc(ltt.memsize_ * sizeof(T));


ltt.memsize_ defines the size (in bytes) of local memory used by these tests. Here (and below) you are using it as a number of elements.
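A small sketch of the distinction, assuming (as the comment states) that the size is already a byte count: multiplying it by `sizeof(T)` again over-allocates by a factor of `sizeof(T)`. The helper names `element_count` and `make_host_buffer` are hypothetical, and `std::vector` stands in for the syclcompat allocation calls.

```cpp
#include <cstddef>
#include <vector>

// Convert a size in bytes into a count of T elements (rounding down).
template <typename T>
constexpr std::size_t element_count(std::size_t size_in_bytes) {
  return size_in_bytes / sizeof(T);
}

// Allocate a host buffer occupying at most size_in_bytes bytes.
// Passing size_in_bytes directly as the element count would instead
// allocate sizeof(T) times more storage than intended.
template <typename T>
std::vector<T> make_host_buffer(std::size_t size_in_bytes) {
  return std::vector<T>(element_count<T>(size_in_bytes));
}
```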

joeatodd · 2024-05-13T10:38:26Z

sycl/include/syclcompat/launch_experimental.hpp

+template <int Dim, auto KernelFunctor, typename... Args>
+inline std::enable_if_t<std::is_invocable_v<decltype(KernelFunctor), Args...>,
+                        sycl::event>
+launch(sycl::range<Dim> global_range, sycl::range<Dim> local_range, ,


What's going on here? Missing argument?

I think this also implies you need to look again at the coverage your tests are providing.

joeatodd · 2024-05-13T10:43:03Z

sycl/include/syclcompat/launch_experimental.hpp

+
+#if defined(SYCL_EXT_ONEAPI_KERNEL_PROPERTIES) &&                              \
+    defined(SYCL_EXT_ONEAPI_PROPERTIES)
+// defined(SYCL_EXT_ONEAPI_ENQUEUE_FUNCTIONS) uncomment once


Suggested change

-// defined(SYCL_EXT_ONEAPI_ENQUEUE_FUNCTIONS) uncomment once
+// defined(SYCL_EXT_ONEAPI_ENQUEUE_FUNCTIONS) // FIXME(@intel/syclcompat-lib-reviewers): uncomment once

joeatodd · 2024-05-13T10:50:49Z

sycl/include/syclcompat/launch_experimental.hpp

+launch(const sycl::range<Dim> &global_range,
+       const sycl::range<Dim> &local_range, std::size_t local_memory_size,
+       const PropertyList &launch_properties, const Args &...args) {
+  return launch<KernelFunctor>(
+      ::syclcompat::detail::transform_nd_range(
+          sycl::nd_range<Dim>(global_range, local_range)),
+      local_memory_size, launch_properties, ::syclcompat::get_default_queue(),
+      args...);
+}
+
+template <int Dim, auto KernelFunctor, typename... Args>
+inline std::enable_if_t<std::is_invocable_v<decltype(KernelFunctor), Args..., char *>,
+                        sycl::event>
+launch(const sycl::range<Dim> &global_range,
+       const sycl::range<Dim> &local_range, std::size_t local_memory_size,
+       const Args &...args) {
+  using PropertyList = decltype(detail::empty_property_list);
+  return launch<KernelFunctor, PropertyList>(
+      ::syclcompat::detail::transform_nd_range(
+          sycl::nd_range<Dim>(global_range, local_range)),
+      local_memory_size, detail::empty_property_list, args...);
+}


Do you need these overloads that take two sycl::range<Dim> args? Isn't it sufficient to have the sycl::nd_range overload and the dim3, dim3 overload?
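For readers unfamiliar with the constraint used in the hunk above, here is a minimal standalone sketch of selecting an overload with `std::is_invocable_v`, depending on whether the kernel accepts a trailing `char *` for local memory. The kernels, the `scratch` buffer, and the `bool` return value are hypothetical and exist only to make the selection observable; this is not the syclcompat implementation.

```cpp
#include <cstddef>
#include <type_traits>

// Hypothetical kernels: one expects a trailing local-memory pointer.
inline void plain_kernel(int *out, int value) { *out = value; }
inline void local_mem_kernel(int *out, int value, char *local_mem) {
  *out = value + (local_mem != nullptr ? 1 : 0);
}

// Selected when the kernel is callable with exactly Args...
template <auto Kernel, typename... Args>
std::enable_if_t<std::is_invocable_v<decltype(Kernel), Args...>, bool>
launch(Args... args) {
  Kernel(args...);
  return false; // no local memory involved
}

// Selected when the kernel additionally expects a trailing char *.
template <auto Kernel, typename... Args>
std::enable_if_t<std::is_invocable_v<decltype(Kernel), Args..., char *>, bool>
launch(std::size_t local_memory_size, Args... args) {
  static char scratch[64]; // stand-in for device local memory
  Kernel(args..., local_memory_size <= sizeof(scratch) ? scratch : nullptr);
  return true; // local memory was passed to the kernel
}
```

Each additional parameter shape (two ranges, an nd_range, dim3 pairs) multiplies the number of such constrained overloads, which is why trimming the set also trims the tests needed.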

joeatodd · 2024-05-13T10:51:16Z

sycl/include/syclcompat/launch_experimental.hpp

+template <int Dim, auto KernelFunctor, typename PropertyList, typename... Args>
+inline std::enable_if_t<std::is_invocable_v<decltype(KernelFunctor), Args...>,
+                        sycl::event>
+launch(sycl::range<Dim> global_range, sycl::range<Dim> local_range,
+       const PropertyList &launch_properties, const Args &...args) {
+  return launch<KernelFunctor>(
+      ::syclcompat::detail::transform_nd_range(
+          sycl::nd_range<3>(global_range, local_range)),
+      launch_properties, ::syclcompat::get_default_queue(), args...);
+}
+
+template <int Dim, auto KernelFunctor, typename... Args>
+inline std::enable_if_t<std::is_invocable_v<decltype(KernelFunctor), Args...>,
+                        sycl::event>
+launch(sycl::range<Dim> global_range, sycl::range<Dim> local_range, ,
+       const Args &...args) {
+  using PropertyList = decltype(detail::empty_property_list);
+  return launch<KernelFunctor, PropertyList>(
+      ::syclcompat::detail::transform_nd_range(
+          sycl::nd_range<Dim>(global_range, local_range)),
+      empty_properties_t, args...);
+}
+


As above, do you need these overloads?

joeatodd · 2024-05-13T11:01:06Z

sycl/include/syclcompat/launch_experimental.hpp

+  using PropertyList = decltype(detail::empty_property_list);
+  return launch<KernelFunctor, PropertyList>(


The overloads you have provided which don't take a PropertyList and instead pass an empty one: these are nice because they reflect how the syclcompat::launch functions will work once we integrate this properly. However, for now they are just duplicating the equivalent syclcompat::launch API but without tests. If you don't want to bother adding more tests for equivalent APIs, I would suggest pulling out these overloads.
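The pattern being discussed, a convenience overload that omits the PropertyList and forwards an empty one, can be sketched in plain C++ as follows. `property_list`, `is_property_list`, `sub_group_size_16`, and the integer return value are hypothetical stand-ins chosen so the forwarding is observable; a real launcher would enqueue a kernel instead.

```cpp
#include <cstddef>
#include <type_traits>

// Hypothetical stand-in for a compile-time property list.
template <typename... Props> struct property_list {
  static constexpr std::size_t size = sizeof...(Props);
};
inline constexpr property_list<> empty_property_list{};
struct sub_group_size_16 {};

template <typename T> struct is_property_list : std::false_type {};
template <typename... P>
struct is_property_list<property_list<P...>> : std::true_type {};

// Primary overload: always takes a property list first.
// Returns (number of properties * 100 + number of kernel arguments).
template <typename PropertyList, typename... Args>
std::enable_if_t<is_property_list<PropertyList>::value, std::size_t>
launch(const PropertyList &, const Args &...) {
  return PropertyList::size * 100 + sizeof...(Args);
}

// Convenience overload: no property list given, so forward an empty one.
template <typename First, typename... Rest>
std::enable_if_t<!is_property_list<First>::value, std::size_t>
launch(const First &first, const Rest &...rest) {
  return launch(empty_property_list, first, rest...);
}
```

The second overload adds no behaviour of its own, which is the reviewer's point: it mirrors an existing API and still needs its own test coverage.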

JackAKirk · 2024-05-13T18:13:03Z

I think there needs to be some kind of specialization that calls a new Unified Runtime function, from the new UR CUDA plugin extension I'm adding, which calls cudaLaunchKernelExC with the cluster dimensions. I'm not sure whether you've already added this somewhere in this code.

E.g. is there, or do you plan to add, a specialization of launch/parallel_for for the

properties cluster_launch_property{ClusterRange(1, 2, 1)};

argument that you have here: https://github.com/intel/llvm/pull/13594/files#diff-96a41bacbe4aca8737244a37e62f63c18fccd2274588d37c26ca421f2fb857a0R140

Thanks

AD2605 · 2024-05-14T06:25:49Z

Hi @JackAKirk , thanks for having a look at this PR.

I did a little digging after your comment (I have not looked into implementing the UR side).
A parallel_for overload already exists here that accepts the property list and then calls the overloaded parallel_for_impl. There, we can check whether the property list contains the ClusterRange property and, if so, call the UR function you are adding.

This would also mean one can launch a kernel with a cluster as

cgh.parallel_for(nd_range(...), sycl::ext::oneapi::experimental::properties{ClusterRange(...)}, [=](nd_item<Dim>) {});

I did not know this parallel_for overload existed. What I do not see, however, is the overloads introduced in sycl_ext_oneapi_enqueue_functions calling this overload even when properties are passed, and the tests that were added do not seem to test launch with properties either (https://github.com/intel/llvm/pull/13512/files#diff-f6b7355d29c87088898f102554c5a82ed290c8261ab55c0c06adb3af7a9ac932).

So, to answer your question: a new overload will not be required, just a specialization of the parallel_for_impl that accepts the properties, and possibly a bug fix in sycl_ext_oneapi_enqueue_functions.
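The idea of checking inside parallel_for_impl whether the property list contains a cluster property can be sketched at compile time with `if constexpr`. Everything below is a hypothetical model in plain C++: `property_list`, `make_properties`, `cluster_range`, `other_property`, and `launch_path` are invented for illustration and are not the sycl::ext::oneapi::experimental types or the DPC++ internals.

```cpp
#include <cstddef>
#include <tuple>
#include <type_traits>

// Hypothetical properties.
struct cluster_range {
  std::size_t x, y, z;
};
struct other_property {
  int value;
};

// Hypothetical property list with a compile-time membership query.
template <typename... Props> struct property_list {
  std::tuple<Props...> values;

  template <typename P> static constexpr bool has_property() {
    return (std::is_same_v<P, Props> || ...);
  }
  template <typename P> const P &get_property() const {
    return std::get<P>(values);
  }
};

template <typename... Props>
property_list<Props...> make_properties(Props... props) {
  return {std::tuple<Props...>{props...}};
}

enum class launch_path { regular, cluster };

// Picks the launch path from the static type of the property list.
template <typename PropertyList>
launch_path parallel_for_impl(const PropertyList &props) {
  if constexpr (PropertyList::template has_property<cluster_range>()) {
    // A real implementation would pass these dimensions on to the runtime.
    (void)props.template get_property<cluster_range>();
    return launch_path::cluster;
  } else {
    (void)props;
    return launch_path::regular;
  }
}
```

Because the check is resolved at compile time, property lists without a cluster property never instantiate the cluster branch.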

joeatodd · 2024-05-14T07:55:43Z

@JackAKirk, we're planning to overhaul the launch API prior to the 2025.0 release, largely in order to be able to accept whatever kernel and launch properties the user might specify in some kind of struct. So, assuming the cluster_launch_property can be used similarly to other launch properties, this shouldn't be a problem.

JackAKirk · 2024-05-14T09:32:05Z

It looks like the most natural way to plumb this through to UR would be to follow what happens for cooperative kernels, e.g. add a bool such as MImpl->MKernelIsCustom, similar to MImpl->MKernelIsCooperative (llvm/sycl/source/handler.cpp, line 311 in af65855):

Result = enqueueImpKernel(

This flag, along with the additional kernel parameters, eventually makes its way to this function (llvm/sycl/source/detail/scheduler/commands.cpp, line 2369 in af65855):

static pi_result SetKernelParamsAndLaunch(

where it is used to call the appropriate PI wrapper function, e.g. piEnqueueKernelLaunch for the UR kernel launch function urEnqueueKernelLaunch. I will be making an extension for a new UR function, e.g. urEnqueueKernelLaunchCustom, that calls cuLaunchKernelExC in the CUDA adapter. Logic like that described above is needed to detect when a cluster size is passed, so that urEnqueueKernelLaunchCustom is called instead, similar to how the MKernelIsCooperative bool is currently used.
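A rough model of the dispatch described above, in plain C++. The struct and function are hypothetical; the strings are the entry-point names used in this thread (urEnqueueKernelLaunchCustom is only proposed here), and giving the cluster path priority when both flags are set is an assumption made for this sketch rather than settled behaviour.

```cpp
#include <optional>
#include <string>

// Hypothetical model of the per-launch state held by the handler.
struct cluster_dims {
  unsigned x, y, z;
};
struct kernel_launch_state {
  bool is_cooperative = false;
  std::optional<cluster_dims> cluster; // set only when a cluster size is passed
};

// Returns the name of the entry point this model would pick.
inline std::string select_launch_entry_point(const kernel_launch_state &s) {
  if (s.cluster.has_value())
    return "urEnqueueKernelLaunchCustom"; // proposed name; assumed priority
  if (s.is_cooperative)
    return "urEnqueueCooperativeKernelLaunch";
  return "urEnqueueKernelLaunch";
}
```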

JackAKirk · 2024-05-14T10:57:52Z

One question I had was whether you can have cooperative kernels and set a launch-time cluster size at the same time. It turns out that you can. While their interfaces are quite different, functionally cuLaunchCooperativeKernel is a subset of cuLaunchKernelEx. I think this could provide a natural resolution of the issues I described above:

I imagine the intention of the CUDA API is that cuLaunchKernelEx will replace cuLaunchCooperativeKernel going forward, as a more general and extensible way of providing launch-time configuration for cooperative kernels, custom distributed-shared-memory "cluster group" configuration, or anything else.
We should sync with other backend stakeholders as soon as possible, but in turn I think it would be natural for the new UR API that maps to cuLaunchKernelEx to eventually replace urEnqueueCooperativeKernelLaunch as well.

This would then resolve the issues raised, because all backends could switch to the new "launch-time-kernel" UR interface that I will add, and the DPC++ logic could generalize the MImpl->MKernelIsCooperative bool to something more general and appropriately named, e.g. MImpl->MKernelIsLaunchTimeConfig.

joeatodd · 2024-06-05T09:55:34Z

Closing this for now as we went another way.

AD2605 added 18 commits April 7, 2024 17:55

add support for selecting sub group size in syclcompat launch

09b854d

add tests for the added APIs and add missing SubgroupSize template pa…

addd77d

…rameter

Merge remote-tracking branch 'origin/sycl' into atharva/syclcompat_re…

eeccee7

…qd_sg_size

Merge remote-tracking branch 'upstream/sycl' into atharva/syclcompat_…

468592e

…reqd_sg_size

Merge remote-tracking branch 'upstream/sycl' into atharva/syclcompat_…

3bd3544

…reqd_sg_size

WIP2

200df9b

fix compilation

5783c39

Merge remote-tracking branch 'upstream/sycl' into atharva/syclcompat_…

4813da6

…reqd_sg_size

Merge remote-tracking branch 'upstream/sycl' into atharva/syclcompat_…

4e8b148

…reqd_sg_size

Merge remote-tracking branch 'upstream/sycl' into atharva/syclcompat_…

8548481

…reqd_sg_size

Merge branch 'atharva/syclcompat_reqd_sg_size' into atharva/sycl_comp…

68da2e9

…at_launch_w_properties

restore launch.hpp

adc93da

add more overloads, fix arg order, update tests

fa3b307

Merge remote-tracking branch 'upstream/sycl' into atharva/sycl_compat…

7050509

…_launch_w_properties

WIP changes

456aa5d

add missing overload causing compilation errors

f8af991

Merge remote-tracking branch 'upstream/sycl' into atharva/sycl_compat…

a38e031

…_launch_w_properties

add new line

7c4aef8

AD2605 requested a review from a team as a code owner May 2, 2024 15:41

AD2605 mentioned this pull request May 2, 2024

[SYCL][COMPAT] Add support for requesting sub group size in syclcompat::launch #13525

Closed


joeatodd suggested changes May 6, 2024

aacostadiaz reviewed May 6, 2024

review comments 1, without modifying tests

39c389e

Alcpz changed the title ~~[SYCL][COMPAT] Launch kernels using the enqueue functions exntesions~~ [SYCL][COMPAT] Launch kernels using the enqueue functions extensions May 8, 2024

AD2605 added 2 commits May 9, 2024 10:55

add missing decltype

828ba18

add further tests

0373fe0

AD2605 added 2 commits May 10, 2024 04:31

Merge remote-tracking branch 'upstream/sycl' into atharva/sycl_compat…

f7a5cb0

…_launch_w_properties

removed deleted header

6f4ecc6

joeatodd suggested changes May 13, 2024

compilation fixes

9ae1ded

JackAKirk mentioned this pull request May 15, 2024

[exp] Add first draft of launch attributes extension. oneapi-src/unified-runtime#1610

Closed

joeatodd closed this Jun 5, 2024
