
[CUDA] use advise ACCESSED_BY for all devs in ur_context_handle_t for managed memory #1717

Closed

JackAKirk wants to merge 8 commits into oneapi-src:main from JackAKirk:cuda-malloc-managed-optimize

Conversation

@JackAKirk
Contributor

This change improves the performance of managed memory in the CUDA backend for all devices tested, by producing better default on-device caching of the managed memory. We have tested this using a range of use cases. The performance benefit increases with the number of devices, but can still be considerable when using only a single device if the managed memory is accessed frequently.
There appear to be no drawbacks to doing this by default (no use case was found where it led to a drop in performance), and the observed behaviour with respect to copies to and from the host now matches what we observe in the CUDA runtime API. If a device in the context never accesses the shared memory, this use case also does not lead to a drop in performance, even though we mark all devices in the context as ACCESSED_BY for this managed memory.
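
For reference, a minimal sketch of the allocation path with this change applied (identifier names and the exact driver calls here are assumptions for illustration, not the verbatim adapter code):

    // Allocate managed (unified) memory visible to the host and all devices.
    CUdeviceptr Ptr;
    UR_CHECK_ERROR(cuMemAllocManaged(&Ptr, Size, CU_MEM_ATTACH_GLOBAL));
    // Advise the driver that every device in the UR context will access this
    // allocation, so pages can stay resident per device rather than being
    // migrated on each access. Dev->get() is assumed to return the underlying
    // CUdevice.
    for (const auto &Dev : Context->getDevices()) {
      UR_CHECK_ERROR(cuMemAdvise(Ptr, Size, CU_MEM_ADVISE_SET_ACCESSED_BY,
                                 Dev->get()));
    }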

When allocating new managed memory.

Signed-off-by: JackAKirk <jack.kirk@codeplay.com>
@JackAKirk JackAKirk requested a review from a team as a code owner June 5, 2024 17:51
@JackAKirk JackAKirk requested a review from npmiller June 5, 2024 17:51
@github-actions github-actions bot added the cuda CUDA adapter specific issues label Jun 5, 2024
@JackAKirk JackAKirk changed the title from "[CUDA] use advise ACCESSED_BY for all devs in picontext for managed memory" to "[CUDA] use advise ACCESSED_BY for all devs in URcontext for managed memory" Jun 5, 2024
@JackAKirk JackAKirk changed the title from "[CUDA] use advise ACCESSED_BY for all devs in URcontext for managed memory" to "[CUDA] use advise ACCESSED_BY for all devs in ur_context_handle_t for managed memory" Jun 5, 2024
ScopedContext Active(Device);
UR_CHECK_ERROR(cuMemAllocManaged((CUdeviceptr *)ResultPtr, Size,
CU_MEM_ATTACH_GLOBAL));
for (const auto &device : Context->getDevices()) {
Contributor

nit: I don't claim to know anything about the coding style used here, but we have Context, Device, Size, and Err in this function, and we're introducing lower-case device here. Is it not inconsistent, and isn't it confusing to have two uses of "device" in the same function?

Contributor

Agreed. I think in general good to go for hDevice for the device arg and then Dev for the local device. But I don't think the device arg is actually needed? Maybe better to do:

ScopedContext Active(Context->getDevices()[0]);

And just remove the Device arg altogether

Contributor Author

Agreed. I think in general good to go for hDevice for the device arg and then Dev for the local device. But I don't think the device arg is actually needed? Maybe better to do:

ScopedContext Active(Context->getDevices()[0]);

And just remove the Device arg altogether

OK, I'll switch the naming. But I don't think the device param can be removed.

What is happening here is that a stream within the CUcontext of the Device associated with a runtime (SYCL) queue is allocating managed memory. In many cases we could probably ignore this prescription from the programmer and just use Context->getDevices()[0], and it wouldn't matter to the user. However, I think there are cases where it could.
Say, for example, an application from user A is on a node and sees all devices in the platform on that node (on Perlmutter there are 4 GPUs). If the Context corresponds to a SYCL default context, then all four GPUs will be in the context. The user provides device 1 as the device whose CUcontext they prescribe for allocating the managed memory, and user A only wants to use device 1. We mark ACCESSED_BY for all devices, but this isn't used for the others - not a big deal, it is cheap.
User B is using device 0 for a different application on the same node. I imagine that if user B is saturating that GPU, it might be a bad idea for us to break the prescription from user A and use device 0 instead.
Probably it doesn't make a big difference in practice, but hopefully the above shows there is a potential difference. If the interface provides a specific device, then I think it should be used as the "command device".

Contributor

@hdelan hdelan Jun 6, 2024

In that case we should also be calling setAccessedBy for the hDevice param, or asserting that hDevice is within the context
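
For illustration, the assert option might look roughly like this (the helper names and error code here are assumed, not taken from the adapter):

    // Hypothetical sketch: require the command device to be in the UR context.
    bool InContext = false;
    for (const auto &Dev : Context->getDevices())
      if (Dev == hDevice)
        InContext = true;
    UR_ASSERT(InContext, UR_RESULT_ERROR_INVALID_DEVICE);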

Contributor Author

OK I'll add an assert.

Contributor Author

Actually, I still think it would be better not to add asserts etc. and just leave it as it is, since I might have to loop over 8 GPUs, and in the future this number will probably increase. I think that if people don't use the default context with all devices, then it can be their responsibility to check that they are actually using one of the devices they said they wanted in the context. If they don't, the behaviour will just be how it was before, and it is unlikely that many users will encounter this situation.

However, I'll do whatever you prefer. I just find it hard to pick between the options, and if it is me picking, I'd keep it as it is now.

Contributor Author

I've just ended up calling the advise also via the commandDevice. This will lead to a duplication in normal circumstances, but I've checked you can advise twice without an error.
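
Roughly (names assumed), the allocation path now advises for the command device as well as each device in the context; the driver accepts the repeated advise for the same device:

    // Advise for the command device explicitly, covering the case where it is
    // not part of the UR context ...
    UR_CHECK_ERROR(cuMemAdvise(Ptr, Size, CU_MEM_ADVISE_SET_ACCESSED_BY,
                               Device->get()));
    // ... and for every device in the context; re-advising the command device
    // here is harmless (a duplicate advise is not an error).
    for (const auto &Dev : Context->getDevices())
      UR_CHECK_ERROR(cuMemAdvise(Ptr, Size, CU_MEM_ADVISE_SET_ACCESSED_BY,
                                 Dev->get()));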

Contributor

Yeah that sounds good to me.

Covers the case where the command device is outside the UR context.

Signed-off-by: JackAKirk <jack.kirk@codeplay.com>
JackAKirk added 2 commits June 7, 2024 07:54
Signed-off-by: JackAKirk <jack.kirk@codeplay.com>
Signed-off-by: JackAKirk <jack.kirk@codeplay.com>
@JackAKirk JackAKirk added the ready to merge Added to PR's which are ready to merge label Jun 10, 2024
@JackAKirk
Contributor Author

Closing in favour of #1774

@JackAKirk JackAKirk closed this Jun 19, 2024
