-
Notifications
You must be signed in to change notification settings - Fork 769
[SYCL][CUDA][USM] Improve CUDA USM memory allocation functions #1577
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
13c2631
9c33360
92065df
6d54863
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -3396,17 +3396,29 @@ pi_result cuda_piEnqueueMemUnmap(pi_queue command_queue, pi_mem memobj, | |
pi_result cuda_piextUSMHostAlloc(void **result_ptr, pi_context context, | ||
pi_usm_mem_properties *properties, size_t size, | ||
pi_uint32 alignment) { | ||
// from empirical testing with CUDA 10.2 on a Tesla K40 | ||
static constexpr pi_uint32 max_alignment = 0x200; | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Trying to find any CUDA docs that specify the alignment but only finding queryable values for textures and the ominous "The allocated memory is suitably aligned for any kind of variable."... Looking at CUDA types I wonder if that means that 16-byte alignment is the minimum (as I guess needed to align a double4 - see https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#built-in-vector-types)? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
I asked NVIDIA about it, they opened a ticket internally to clarify the documentation, but no idea of the time scale. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Thank you for pordding them! 🤞 that they see the priority in it! |
||
|
||
// enforce a valid pointer to the allocated memory | ||
assert(result_ptr != nullptr); | ||
// check the the context is valid | ||
assert(context != nullptr); | ||
// check that the property list is empty | ||
assert(properties == nullptr); | ||
// check that the alignment is not larger than max_alignment, and is either 0 | ||
// or a power of 2 | ||
assert(alignment <= max_alignment && (alignment & (alignment - 1)) == 0); | ||
|
||
pi_result result = PI_SUCCESS; | ||
try { | ||
ScopedContext active(context); | ||
result = PI_CHECK_ERROR(cuMemAllocHost(result_ptr, size)); | ||
} catch (pi_result error) { | ||
result = error; | ||
} | ||
assert(reinterpret_cast<std::uintptr_t>(*result_ptr) % alignment == 0); | ||
// check that the result is suitable aligned | ||
assert((alignment == 0) || | ||
(reinterpret_cast<std::uintptr_t>(*result_ptr) % alignment == 0)); | ||
return result; | ||
} | ||
|
||
|
@@ -3416,18 +3428,31 @@ pi_result cuda_piextUSMDeviceAlloc(void **result_ptr, pi_context context, | |
pi_device device, | ||
pi_usm_mem_properties *properties, | ||
size_t size, pi_uint32 alignment) { | ||
// from empirical testing with CUDA 10.2 on a Tesla K40 | ||
static constexpr pi_uint32 max_alignment = 0x200; | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Should we make the constant global to share it between the different functions? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I was considering if host and device could have different default alignments - but given the lack of details, I guess we could use a global There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. That is a valid thought! |
||
|
||
// enforce a valid pointer to the allocated memory | ||
assert(result_ptr != nullptr); | ||
// check the the context is valid | ||
assert(context != nullptr); | ||
// check that the device is valid | ||
assert(device != nullptr); | ||
// check that the property list is empty | ||
assert(properties == nullptr); | ||
// check that the alignment is not larger than max_alignment, and is either 0 | ||
// or a power of 2 | ||
assert(alignment <= max_alignment && (alignment & (alignment - 1)) == 0); | ||
|
||
pi_result result = PI_SUCCESS; | ||
try { | ||
ScopedContext active(context); | ||
result = PI_CHECK_ERROR(cuMemAlloc((CUdeviceptr *)result_ptr, size)); | ||
} catch (pi_result error) { | ||
result = error; | ||
} | ||
assert(reinterpret_cast<std::uintptr_t>(*result_ptr) % alignment == 0); | ||
// check that the result is suitable aligned | ||
assert((alignment == 0) || | ||
(reinterpret_cast<std::uintptr_t>(*result_ptr) % alignment == 0)); | ||
return result; | ||
} | ||
|
||
|
@@ -3437,10 +3462,21 @@ pi_result cuda_piextUSMSharedAlloc(void **result_ptr, pi_context context, | |
pi_device device, | ||
pi_usm_mem_properties *properties, | ||
size_t size, pi_uint32 alignment) { | ||
// from empirical testing with CUDA 10.2 on a Tesla K40 | ||
static constexpr pi_uint32 max_alignment = 0x200; | ||
|
||
// enforce a valid pointer to the allocated memory | ||
assert(result_ptr != nullptr); | ||
// check the the context is valid | ||
assert(context != nullptr); | ||
// check that the device is valid | ||
assert(device != nullptr); | ||
// check that the property list is empty | ||
assert(properties == nullptr); | ||
// check that the alignment is not larger than max_alignment, and is either 0 | ||
// or a power of 2 | ||
assert(alignment <= max_alignment && (alignment & (alignment - 1)) == 0); | ||
|
||
pi_result result = PI_SUCCESS; | ||
try { | ||
ScopedContext active(context); | ||
|
@@ -3449,7 +3485,9 @@ pi_result cuda_piextUSMSharedAlloc(void **result_ptr, pi_context context, | |
} catch (pi_result error) { | ||
result = error; | ||
} | ||
assert(reinterpret_cast<std::uintptr_t>(*result_ptr) % alignment == 0); | ||
// check that the result is suitable aligned | ||
assert((alignment == 0) || | ||
(reinterpret_cast<std::uintptr_t>(*result_ptr) % alignment == 0)); | ||
return result; | ||
} | ||
fwyzard marked this conversation as resolved.
Show resolved
Hide resolved
|
||
|
||
|
@@ -3481,8 +3519,12 @@ pi_result cuda_piextUSMEnqueueMemset(pi_queue queue, void *ptr, pi_int32 value, | |
pi_uint32 num_events_in_waitlist, | ||
const pi_event *events_waitlist, | ||
pi_event *event) { | ||
// enforce that the queue is valid | ||
assert(queue != nullptr); | ||
assert(ptr != nullptr); | ||
// check that the pointer is valid | ||
if (ptr == nullptr) { | ||
romanovvlad marked this conversation as resolved.
Show resolved
Hide resolved
|
||
return PI_INVALID_VALUE; | ||
} | ||
CUstream cuStream = queue->get(); | ||
pi_result result = PI_SUCCESS; | ||
std::unique_ptr<_pi_event> event_ptr{nullptr}; | ||
|
@@ -3514,9 +3556,12 @@ pi_result cuda_piextUSMEnqueueMemcpy(pi_queue queue, pi_bool blocking, | |
pi_uint32 num_events_in_waitlist, | ||
const pi_event *events_waitlist, | ||
pi_event *event) { | ||
// enforce that the queue is valid | ||
assert(queue != nullptr); | ||
assert(dst_ptr != nullptr); | ||
assert(src_ptr != nullptr); | ||
// check that the source and destination pointers are valid | ||
if (dst_ptr == nullptr || src_ptr == nullptr) { | ||
romanovvlad marked this conversation as resolved.
Show resolved
Hide resolved
|
||
return PI_INVALID_VALUE; | ||
} | ||
CUstream cuStream = queue->get(); | ||
pi_result result = PI_SUCCESS; | ||
std::unique_ptr<_pi_event> event_ptr{nullptr}; | ||
|
@@ -3553,8 +3598,12 @@ pi_result cuda_piextUSMEnqueuePrefetch(pi_queue queue, const void *ptr, | |
pi_uint32 num_events_in_waitlist, | ||
const pi_event *events_waitlist, | ||
pi_event *event) { | ||
// enforce that the queue is valid | ||
assert(queue != nullptr); | ||
assert(ptr != nullptr); | ||
// check that the pointer is valid | ||
if (ptr == nullptr) { | ||
romanovvlad marked this conversation as resolved.
Show resolved
Hide resolved
|
||
return PI_INVALID_VALUE; | ||
} | ||
CUstream cuStream = queue->get(); | ||
pi_result result = PI_SUCCESS; | ||
std::unique_ptr<_pi_event> event_ptr{nullptr}; | ||
|
@@ -3589,8 +3638,12 @@ pi_result cuda_piextUSMEnqueuePrefetch(pi_queue queue, const void *ptr, | |
pi_result cuda_piextUSMEnqueueMemAdvise(pi_queue queue, const void *ptr, | ||
size_t length, int advice, | ||
pi_event *event) { | ||
// enforce that the queue is valid | ||
assert(queue != nullptr); | ||
assert(ptr != nullptr); | ||
// check that the pointer is valid | ||
if (ptr == nullptr) { | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Please, return assert. |
||
return PI_INVALID_VALUE; | ||
} | ||
// TODO implement a mapping to cuMemAdvise once the expected behaviour | ||
// of piextUSMEnqueueMemAdvise is detailed in the USM extension | ||
return cuda_piEnqueueEventsWait(queue, 0, nullptr, event); | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,4 +1,3 @@ | ||
// XFAIL: cuda | ||
// piextUSM*Alloc functions for CUDA are not behaving as described in | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This test is failing in our CI. Any ideas why? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I can have a look later today or tomorrow. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. +@bjoernknafla, just in case he might know if this is a known issue. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. That error could come from CUDA event handling needing a thread pool as the CUDA callbacks currently used cannot issue CUDA API calls - and in some cases the callback holds the last existing reference to an event. If I understand correctly, the the following PR is meant as a way towards a solution: #1471 Though the event problem does not fail tests, as the CUDA implementation "just" leaks the event in such cases and ignores the error... So it might be a different problem I am not aware off. |
||
// https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/USM/USM.adoc | ||
// https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/USM/cl_intel_unified_shared_memory.asciidoc | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please, fix LIT tests.