[exp] Add first draft of launch attributes extension.#1610

Closed
JackAKirk wants to merge 7 commits into oneapi-src:main from JackAKirk:cuda-kernel-launch-time-config

Conversation

@JackAKirk JackAKirk commented May 15, 2024

Primary motivation for this work

The primary motivation for this extension is to provide a mechanism to launch a kernel using the cluster group property from this DPC++ extension:
https://github.com/intel/llvm/pull/13594/files#diff-96a41bacbe4aca8737244a37e62f63c18fccd2274588d37c26ca421f2fb857a0R140
We don't initially need any of the features from the above proposal apart from the actual kernel launch (for example, we don't need the cluster group class at all initially; we just need to pass the cluster size to the kernel launch).
This is one of the main features for NVIDIA GPUs (sm_90 onwards).
We want to support this feature as soon as possible, so other usages of this API (such as cooperative groups) are not expected to be supported initially.

Considerations of other backends

However, I think the part of this work added to UR is relevant to other backends.

As a result, I think it would be good to get feedback from Intel on this proposal at this stage (rather than waiting for potential issues to surface when we reach the point I describe in this comment: intel/llvm#13642 (comment)), because it is likely to interact with DPC++ scheduler code that is not CUDA specific. Taking into account the interaction with how the DPC++ scheduler currently deals with cooperative groups will probably be necessary: see my discussion here intel/llvm#13642 (comment).
A key question for Intel developers is whether they think it is a good idea to have urKernelSetLaunchConfigExp eventually replace urEnqueueCooperativeKernelLaunch, instead of having three different kernel launch APIs.

Outline of how this extension is expected to work

A first draft of the minimal set of UR APIs that we need to achieve the "Primary motivation for this work" is defined in this PR (note that I expect this to change, but the main idea is presented here for feedback; please also read the committed .rst, etc.):

1. We need to be able to set the launch attribute (we need an equivalent of this CUDA code):

cudaLaunchAttribute launch_attribute[1];
launch_attribute[0].id = cudaLaunchAttributeClusterDimension;
launch_attribute[0].val.clusterDim.x = cluster_dims.x;
launch_attribute[0].val.clusterDim.y = cluster_dims.y;
launch_attribute[0].val.clusterDim.z = cluster_dims.z;

How this extension proposes we do it:

int clusterDims[3] = { 1024,10,1 };
exp_launch_attr_handle_t clusterAttr;
// set the native kernel launch attribute corresponding to attribute id UR_CLUSTER_DIMENSION
urKernelSetLaunchAttrExp(clusterAttr, UR_CLUSTER_DIMENSION, sizeof(int)*3, reinterpret_cast<void*>(clusterDims));

Note that another native CUDA attribute is CU_LAUNCH_ATTRIBUTE_COOPERATIVE, which allows the possibility of launching cooperative kernels, as I mentioned before. Ideally, we would confirm whether Intel thinks that the abstraction described here would similarly allow Intel hardware to launch cooperative kernels from the set of UR abstractions proposed here.

exp_launch_attr_handle_t will have a backend-specific definition that will, e.g., allow the CUDA adapter to call the native CUDA driver API code from above. Other backends could have their own implementations to deal with, e.g., cooperative kernels or other future kernel configuration features.

2. Then, once we have set such an array of attributes, we need to use them to set the kernel config (CUDA code):

cudaLaunchConfig_t launch_config;
launch_config.gridDim = {grid_dims.x, grid_dims.y, grid_dims.z};
launch_config.blockDim = {block_dims.x, block_dims.y, block_dims.z};
launch_config.dynamicSmemBytes = smem_size;
launch_config.stream = cuda_stream;
launch_config.attrs = launch_attribute;
launch_config.numAttrs = 1;

For the UR equivalent I propose something like:

ur_exp_launch_config_handle_t launchConfig;
urKernelSetLaunchConfigExp(launchConfig, &clusterAttr, 1); // TODO: we either need attributes for grid/block dim, stream (queue), etc., or extra explicit parameters for these at a minimum. (details)...

Like exp_launch_attr_handle_t, ur_exp_launch_config_handle_t will have a backend-specific definition that will, e.g., allow the CUDA adapter to call the native CUDA driver API code from above.

3. Then, to launch the kernel, we map closely to the native CUDA interface cuLaunchKernelEx. The UR kernel handle is more abstract, so there is one fewer function argument for kernel args, and we will eventually need more arguments for events etc. to deal with sycl::queues that are not in_order. Ignoring these details, which can be decided later, this is quite simply the basic idea (CUDA code):

cuLaunchKernelEx(&launch_config, kernel, kernel_params);

Draft proposal for the UR equivalent:

ur_kernel_handle_t kernel = getKernelFromDPCruntime();
EnqueueKernelLaunchCustomExp(launchConfig, kernel);

Any feedback on the proposal is greatly appreciated. E.g., is there a preference to use exp_launch_attr_handle_t for kernel parameters like blockDim that, at the moment, we pass explicitly to urEnqueueKernel in UR, or do we want to also pass these parameters explicitly to urKernelSetLaunchConfigExp, or even to EnqueueKernelLaunchCustomExp? cc @joeatodd @AD2605 @mehdi-goli

Signed-off-by: JackAKirk <jack.kirk@codeplay.com>
@github-actions github-actions bot added specification Changes or additions to the specification experimental Experimental feature additions/changes/specification labels May 15, 2024
@JackAKirk

also CC @0x12CC

@JackAKirk

OK, I see now from https://github.com/oneapi-src/unified-runtime/blob/main/source/adapters/level_zero/kernel.cpp#L278
that urEnqueueCooperativeKernelLaunchExp just calls urEnqueueKernelLaunch for Level Zero, which makes Level Zero support trivial.
I'm going to go ahead and absorb urKernelSetLaunchConfigExp into EnqueueKernelLaunchCustomExp, and make EnqueueKernelLaunchCustomExp take the same parameters as urEnqueueKernelLaunch plus the extra array of exp_launch_attr_handle_t and numAttrs. Then we don't need to expose ur_exp_launch_config_handle_t.
But if anyone at Intel still wants to give any input on this, that would be welcome. It might still be useful more generally as an extensible kernel launch function for backends other than CUDA.

This means we only need two functions instead of three.
This completes the first draft interface design.

Signed-off-by: JackAKirk <jack.kirk@codeplay.com>
@JackAKirk JackAKirk changed the title [exp] Add first draft of launch config extension. [exp] Add first draft of launch attributes extension. May 16, 2024
JackAKirk added 3 commits May 16, 2024 17:29
Signed-off-by: JackAKirk <jack.kirk@codeplay.com>
Signed-off-by: JackAKirk <jack.kirk@codeplay.com>
Signed-off-by: JackAKirk <jack.kirk@codeplay.com>
@github-actions github-actions bot added the loader Loader related feature/bug label May 17, 2024
Signed-off-by: JackAKirk <jack.kirk@codeplay.com>
@github-actions github-actions bot added the cuda CUDA adapter specific issues label May 17, 2024
Signed-off-by: JackAKirk <jack.kirk@codeplay.com>
@JackAKirk

Closed in favour of #1643
