[exp] Add first draft of launch attributes extension.#1610

Closed
JackAKirk wants to merge 7 commits into oneapi-src:main from JackAKirk:cuda-kernel-launch-time-config

Conversation

@JackAKirk JackAKirk commented May 15, 2024

Primary motivation for this work

The primary motivation for this extension is to provide a mechanism to launch a kernel using the cluster group property from this DPC++ extension:
https://github.com/intel/llvm/pull/13594/files#diff-96a41bacbe4aca8737244a37e62f63c18fccd2274588d37c26ca421f2fb857a0R140
We don't initially need any of the features from the above proposal apart from the actual kernel launch (for example, we don't need the cluster group class at all initially; we just need to pass the cluster size to the kernel launch).
This is one of the main features for NVIDIA GPUs (sm_90 onwards).
We want to support this feature as soon as possible, so other usages of this API (such as cooperative groups) are not expected to be supported initially.

Considerations of other backends

However, I think the part of this work added to UR is relevant to other backends.

As a result, I think it would be good to get feedback from Intel on this proposal at this stage (rather than waiting for potential issues to surface when we reach the point I describe in this comment: intel/llvm#13642 (comment)), because it is likely to interact with DPC++ scheduler code that is not CUDA specific. Taking into account the interaction with how the DPC++ scheduler currently deals with cooperative groups will probably be necessary: see my discussion here intel/llvm#13642 (comment).
A key question for Intel developers is whether they think it is a good idea to have urKernelSetLaunchConfigExp eventually replace urEnqueueCooperativeKernelLaunch, instead of having three different kernel launch APIs.

Outline of how this extension is expected to work

A first draft of the minimal set of UR APIs that we need to achieve the "Primary motivation for this work" is defined in this PR (note that I expect this to change, but the main idea is presented here for feedback; please also read the committed .rst, etc.):

1. We need to be able to set the launch attribute (we need an equivalent of this CUDA code):

cudaLaunchAttribute launch_attribute[1];
launch_attribute[0].id = cudaLaunchAttributeClusterDimension;
launch_attribute[0].val.clusterDim.x = cluster_dims.x;
launch_attribute[0].val.clusterDim.y = cluster_dims.y;
launch_attribute[0].val.clusterDim.z = cluster_dims.z;

How this extension proposes we do it:

int clusterDims[3] = { 1024,10,1 };
exp_launch_attr_handle_t clusterAttr;
// set the native kernel launch attribute corresponding to attribute id UR_CLUSTER_DIMENSION
urKernelSetLaunchAttrExp(clusterAttr, UR_CLUSTER_DIMENSION, sizeof(int)*3, reinterpret_cast<void*>(clusterDims));

Note that another native CUDA attribute is CU_LAUNCH_ATTRIBUTE_COOPERATIVE, which allows the possibility of launching cooperative kernels, as I mentioned before. Ideally, we would confirm whether Intel thinks that the abstraction described here would similarly allow Intel hardware to launch cooperative kernels from the set of UR abstractions proposed here.

exp_launch_attr_handle_t will have a backend-specific definition that will, e.g., allow the CUDA adapter to call the native CUDA driver API code from above. Other backends could have their own implementations to deal with, e.g., cooperative kernels or other future kernel configuration features.

2. Then, once we have set such an array of attributes, we need to use them to set the kernel config (CUDA code):

cudaLaunchConfig_t launch_config;
launch_config.gridDim = {grid_dims.x, grid_dims.y, grid_dims.z};
launch_config.blockDim = {block_dims.x, block_dims.y, block_dims.z};
launch_config.dynamicSmemBytes = smem_size;
launch_config.stream = cuda_stream;
launch_config.attrs = launch_attribute;
launch_config.numAttrs = 1;

For the UR equivalent I propose something like:

ur_exp_launch_config_handle_t launchConfig;
urKernelSetLaunchConfigExp(launchConfig, &clusterAttr, 1); // TODO: we either need attributes for grid/block dim, stream (queue), etc., or extra explicit parameters for these at a minimum. (details)...

Like exp_launch_attr_handle_t, ur_exp_launch_config_handle_t will have a backend-specific definition that will, e.g., allow the CUDA adapter to call the native CUDA driver API code from above.

3. Then, to launch the kernel, we map closely to the native CUDA interface cuLaunchKernelEx. The UR kernel handle is more abstract, so there is one fewer function argument for kernel args, and we will eventually need more arguments for events etc. to deal with sycl::queues that are not in_order. Ignoring these details, which can be decided later, this is quite simply the basic idea (CUDA code):

cuLaunchKernelEx(&launch_config, kernel, kernel_params);

Draft proposal for the UR equivalent:

ur_kernel_handle_t kernel = getKernelFromDPCruntime();
EnqueueKernelLaunchCustomExp(launchConfig, kernel);

Any feedback on the proposal is greatly appreciated. E.g., is there a preference to use exp_launch_attr_handle_t for kernel parameters like blockDim that, at the moment, we pass explicitly to urEnqueueKernel in UR, or do we want to also pass these parameters explicitly to urKernelSetLaunchConfigExp, or even to EnqueueKernelLaunchCustomExp? cc @joeatodd @AD2605 @mehdi-goli

Signed-off-by: JackAKirk <jack.kirk@codeplay.com>
@github-actions github-actions bot added specification Changes or additions to the specification experimental Experimental feature additions/changes/specification labels May 15, 2024
@JackAKirk

also CC @0x12CC

@JackAKirk

OK, I see now from https://github.com/oneapi-src/unified-runtime/blob/main/source/adapters/level_zero/kernel.cpp#L278
that urEnqueueCooperativeKernelLaunchExp just calls urEnqueueKernelLaunch for Level Zero, which makes Level Zero support trivial.
I'm going to go ahead and absorb urKernelSetLaunchConfigExp into EnqueueKernelLaunchCustomExp, and make EnqueueKernelLaunchCustomExp take the same parameters as urEnqueueKernelLaunch plus the extra array of exp_launch_attr_handle_t and numAttrs. Then we don't need to expose ur_exp_launch_config_handle_t.
But if anyone at Intel still wants to give any input on this, that would be welcome. It might still be useful more generally as an extensible kernel launch function for backends other than CUDA.

This means we only need two functions instead of three.
This completes the first draft interface design.

Signed-off-by: JackAKirk <jack.kirk@codeplay.com>
@JackAKirk JackAKirk changed the title [exp] Add first draft of launch config extension. [exp] Add first draft of launch attributes extension. May 16, 2024
JackAKirk added 3 commits May 16, 2024 17:29
Signed-off-by: JackAKirk <jack.kirk@codeplay.com>
Signed-off-by: JackAKirk <jack.kirk@codeplay.com>
Signed-off-by: JackAKirk <jack.kirk@codeplay.com>
@github-actions github-actions bot added the loader Loader related feature/bug label May 17, 2024
Signed-off-by: JackAKirk <jack.kirk@codeplay.com>
@github-actions github-actions bot added the cuda CUDA adapter specific issues label May 17, 2024
Signed-off-by: JackAKirk <jack.kirk@codeplay.com>
@JackAKirk

Closed in favour of #1643
