[exp] Add first draft of launch attributes extension #1610
JackAKirk wants to merge 7 commits into oneapi-src:main
Conversation
Signed-off-by: JackAKirk <jack.kirk@codeplay.com>
also CC @0x12CC
OK I see now from https://github.com/oneapi-src/unified-runtime/blob/main/source/adapters/level_zero/kernel.cpp#L278
This means we only need two functions instead of three. This completes the first draft interface design. Signed-off-by: JackAKirk <jack.kirk@codeplay.com>
Closed in favour of #1643
Primary motivation for this work
The primary motivation of this extension is to provide a mechanism to launch a kernel using the cluster group property from this DPC++ extension:
https://github.com/intel/llvm/pull/13594/files#diff-96a41bacbe4aca8737244a37e62f63c18fccd2274588d37c26ca421f2fb857a0R140
We don't initially need any of the features from the above proposal apart from the actual kernel launch (for example, we don't need the cluster group class at all initially; we just need to pass the cluster size to the kernel launch).
This is one of the main features for NVIDIA GPUs (sm_90 onwards).
We want to support this feature ASAP, so other usages of this API (like cooperative groups) are not expected to be supported initially.
Considerations for other backends
However, I think the part of this work added in UR is relevant to other backends.
What I stated there didn't give the complete picture, because I think that the CUDA API I described there, cudaLaunchKernelExC (the equivalent driver API entry point is named cuLaunchKernelEx), can actually encompass (and replace) cudaLaunchCooperativeKernel. As a result, I think it would be good at this stage (rather than waiting for any potential issues when we get to the point I describe in this comment: intel/llvm#13642 (comment)) to get some feedback from Intel on this proposal, because it is likely to interact with DPC++ scheduler code that is not CUDA specific. Taking the interaction with how the DPC++ scheduler currently deals with cooperative groups into account is probably going to be necessary: see my discussion here: intel/llvm#13642 (comment)
A key question for Intel developers is whether they think it is a good idea to have urKernelSetLaunchConfigExp eventually replace urEnqueueCooperativeKernelLaunch, instead of having three different kernel launch APIs.

Outline of how this extension is expected to work
A first draft of the minimal set of UR APIs that we need to achieve the "Primary motivation for this work" is defined in this PR (note that I expect this to change, but the main idea is presented here for feedback; read also the committed .rst, etc.):
1. We need to be able to set the launch attribute:
(we need an equivalent of this CUDA code)
How this extension proposes we do it:
Note that one other native CUDA attribute is CU_LAUNCH_ATTRIBUTE_COOPERATIVE, which allows the possibility of launching cooperative kernels, as I mentioned before. Ideally we would confirm whether Intel thinks that the abstraction described here would similarly allow Intel hardware to launch cooperative kernels from the set of UR abstractions proposed here. exp_launch_attr_handle_t will have a backend-specific definition that will e.g. allow the CUDA adapter to call the native CUDA driver API code from above. Other backends could have their own implementations to deal with e.g. cooperative kernels or other future kernel config features.

2. Then, once we have set such an array of attributes, we need to use them to set the kernel config.
(CUDA code)
For the equivalent in UR, I propose something like:
Like exp_launch_attr_handle_t, ur_exp_launch_config_handle_t will have a backend-specific definition that will e.g. allow the CUDA adapter to call the native CUDA driver API code from above.

3. Then, to launch the kernel, we map closely to the native CUDA interface cuLaunchKernelEx. A UR kernel handle is more abstract, so there is one fewer function argument for kernel args, but we will eventually need more arguments for events etc. to deal with sycl::queues that are not in_order. Ignoring these details, which can be decided later, quite simply this is the basic idea:

(CUDA code)
Draft proposal for the UR equivalent:
Any feedback on the proposal is greatly appreciated. E.g. is there a preference to use exp_launch_attr_handle_t for kernel parameters like blockDim, which at the moment in UR we pass explicitly to urEnqueueKernel, or do we want to also pass these parameters explicitly to urKernelSetLaunchConfigExp, or even EnqueueKernelLaunchCustomExp? cc @joeatodd @AD2605 @mehdi-goli
exp_launch_attr_handle_tfor kernel parameters like blockDim that atm in UR we pass explicitly tourEnqueueKernel, or do we want to also pass these parameters explicitly tourKernelSetLaunchConfigExp, or evenEnqueueKernelLaunchCustomExp. cc @joeatodd @AD2605 @mehdi-goli