[CUDA] use advise ACCESSED_BY for all devs in ur_context_handle_t for managed memory #1717
JackAKirk wants to merge 8 commits into oneapi-src:main
When allocating new managed memory. Signed-off-by: JackAKirk <jack.kirk@codeplay.com>
source/adapters/cuda/usm.cpp
ScopedContext Active(Device);
UR_CHECK_ERROR(cuMemAllocManaged((CUdeviceptr *)ResultPtr, Size,
                                 CU_MEM_ATTACH_GLOBAL));
for (const auto &device : Context->getDevices()) {
nit: I don't claim to know anything about the coding style used here, but we have Context, Device, Size, and Err in this function, and we're introducing lower-case device here. Is it not inconsistent, and isn't it confusing to have two uses of "device" in the same function?
Agreed. In general I think it is good to use hDevice for the device arg and Dev for the local device. But I don't think the device arg is actually needed? Maybe better to do:
ScopedContext Active(Context->getDevices()[0]);
and just remove the Device arg altogether.
OK I'll switch the naming. But I don't think any device param can be removed.
What is happening here is that a stream within the CUcontext of the Device associated with a runtime (SYCL) queue is allocating managed memory. In many cases we could probably ignore this prescription from the programmer and just use the first device, Context->getDevices()[0], and it wouldn't matter to the user. However, I think it could matter.
Say, for example, an application from User A is on a node and sees all devices in the platform on that node (on Perlmutter there are 4 GPUs). If the Context corresponds to a SYCL default context, then all four GPUs will be in the context. User A provides device 1 as the device whose CUcontext should be used to allocate the managed memory, and only wants to use device 1. We mark ACCESSED_BY for all devices, but for the other devices this isn't used. That is not a big deal, since it is cheap.
User B is using device 0 for a different application on the same node. I imagine that if User B is saturating that GPU, it might be a bad idea for us to break the prescription from User A and use device 0 instead.
It probably doesn't make a big difference in practice, but hopefully the above shows that there is a potential difference. If the interface provides a specific device, then I think it should be used as the "command device".
In that case we should also be calling setAccessedBy for the hDevice param, or asserting that hDevice is within the context
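The membership check suggested here could be sketched roughly as below. This is only an illustration: `Device` and `isDeviceInContext` are hypothetical stand-ins, not names from the adapter, and the real code would work with the adapter's own device handles and the list returned by `Context->getDevices()`.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical stand-in for a device handle (assumption: the real adapter
// would compare its own device handle type instead).
struct Device {
  int Id;
};

// Returns true if Dev is one of the devices held by the context.
bool isDeviceInContext(const std::vector<Device *> &ContextDevices,
                       const Device *Dev) {
  return std::find(ContextDevices.begin(), ContextDevices.end(), Dev) !=
         ContextDevices.end();
}
```

An `assert(isDeviceInContext(...))` on the command device before allocating would then catch a device that was never added to the context, at the cost of one linear scan over the context's devices per allocation.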
OK I'll add an assert.
Actually, I still think it would be better not to add asserts and just leave it as it is, since I might have to loop over 8 GPUs, and in the future this number will probably increase. If people don't use the default context with all devices, then it can be their responsibility to check that the device they pass is actually one of the devices in the context. If it isn't, the behaviour will just be how it was before, and it is unlikely that many users will encounter this situation.
However, I'll do whatever you prefer. I just find it hard to pick between the options, and if it is me picking, I choose how I have it now.
I've ended up also calling the advise for the commandDevice. This will lead to a duplicate call in normal circumstances, but I've checked that you can advise twice without an error.
Yeah that sounds good to me.
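A minimal sketch of the approach agreed on in this thread: advise for every device in the context, then once more for the command device, accepting the duplicate. Everything here is hypothetical scaffolding (`Device`, `AdviseLog`, `adviseManagedAllocation`); the `setAccessedBy` stub only records which device it was called for, where the real adapter would presumably issue a CUDA driver call such as `cuMemAdvise` with `CU_MEM_ADVISE_SET_ACCESSED_BY`.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for a device handle.
struct Device {
  int Id;
};

// Records the device id of every advise call, in call order, so the
// resulting sequence can be inspected without a GPU.
struct AdviseLog {
  std::vector<int> DeviceIds;
};

// Stub (assumption): the real code would call something like
// cuMemAdvise(Ptr, Size, CU_MEM_ADVISE_SET_ACCESSED_BY, <CUdevice>).
void setAccessedBy(AdviseLog &Log, const Device &Dev) {
  Log.DeviceIds.push_back(Dev.Id);
}

// Advise for all devices in the context, then for the command device.
// If the command device is already in the context it is advised twice,
// which the thread above reports to be harmless.
void adviseManagedAllocation(AdviseLog &Log,
                             const std::vector<Device> &ContextDevices,
                             const Device &CommandDevice) {
  assert(!ContextDevices.empty() && "assumed: a context holds >= 1 device");
  for (const auto &Dev : ContextDevices)
    setAccessedBy(Log, Dev);
  setAccessedBy(Log, CommandDevice);
}
```

With a context of devices 0 and 1 and command device 1, the recorded sequence is 0, 1, 1; with a command device 5 outside the context it is 0, 1, 5, so the out-of-context case is covered without a membership scan.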
Covers case where command device outside the ur Context. Signed-off-by: JackAKirk <jack.kirk@codeplay.com>
Closing in favour of #1774
This change improves the performance of managed memory in the CUDA backend for all devices tested, because it results in better default on-device caching of the managed memory. We have tested this using a range of use cases. The performance benefit grows with the number of devices, but can still be considerable when using only a single device if the managed memory is accessed frequently.
There appear to be no drawbacks to doing this by default (no use case was found where it led to a drop in performance), and the observed behaviour with respect to copies to and from the host now matches what we observe in the CUDA runtime API. If a device in the picontext never accesses the shared memory, this also does not lead to a drop in performance, even though we mark all devices in the picontext as ACCESSED_BY for this managed memory.