Skip to content

[SYCL][CUDA][USM] Improve CUDA USM memory allocation functions #1577

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 4 commits into from
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
12 changes: 7 additions & 5 deletions sycl/include/CL/sycl/detail/pi.h
Original file line number Diff line number Diff line change
Expand Up @@ -1298,7 +1298,8 @@ using pi_usm_migration_flags = _pi_usm_migration_flags;
/// \param context is the pi_context
/// \param pi_usm_mem_properties are optional allocation properties
/// \param size_t is the size of the allocation
/// \param alignment is the desired alignment of the allocation
/// \param alignment is the desired alignment of the allocation. 0 indicates no
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please, fix LIT tests.

/// requirements, and uses the backend default alignment.
pi_result piextUSMHostAlloc(void **result_ptr, pi_context context,
pi_usm_mem_properties *properties, size_t size,
pi_uint32 alignment);
Expand All @@ -1310,7 +1311,8 @@ pi_result piextUSMHostAlloc(void **result_ptr, pi_context context,
/// \param device is the device the memory will be allocated on
/// \param pi_usm_mem_properties are optional allocation properties
/// \param size_t is the size of the allocation
/// \param alignment is the desired alignment of the allocation
/// \param alignment is the desired alignment of the allocation. 0 indicates no
/// requirements, and uses the backend default alignment.
pi_result piextUSMDeviceAlloc(void **result_ptr, pi_context context,
pi_device device,
pi_usm_mem_properties *properties, size_t size,
Expand All @@ -1323,7 +1325,8 @@ pi_result piextUSMDeviceAlloc(void **result_ptr, pi_context context,
/// \param device is the device the memory will be allocated on
/// \param pi_usm_mem_properties are optional allocation properties
/// \param size_t is the size of the allocation
/// \param alignment is the desired alignment of the allocation
/// \param alignment is the desired alignment of the allocation. 0 indicates no
/// requirements, and uses the backend default alignment.
pi_result piextUSMSharedAlloc(void **result_ptr, pi_context context,
pi_device device,
pi_usm_mem_properties *properties, size_t size,
Expand All @@ -1340,8 +1343,7 @@ pi_result piextUSMFree(pi_context context, void *ptr);
/// \param queue is the queue to submit to
/// \param ptr is the ptr to memset
/// \param value is value to set. It is interpreted as an 8-bit value and the
/// upper
/// 24 bits are ignored
/// upper 24 bits are ignored
/// \param count is the size in bytes to memset
/// \param num_events_in_waitlist is the number of events to wait on
/// \param events_waitlist is an array of events to wait on
Expand Down
69 changes: 61 additions & 8 deletions sycl/plugins/cuda/pi_cuda.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -3396,17 +3396,29 @@ pi_result cuda_piEnqueueMemUnmap(pi_queue command_queue, pi_mem memobj,
pi_result cuda_piextUSMHostAlloc(void **result_ptr, pi_context context,
pi_usm_mem_properties *properties, size_t size,
pi_uint32 alignment) {
// from empirical testing with CUDA 10.2 on a Tesla K40
static constexpr pi_uint32 max_alignment = 0x200;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Trying to find any CUDA docs that specify the alignment but only finding queryable values for textures and the ominous "The allocated memory is suitably aligned for any kind of variable."...

Looking at CUDA types I wonder if that means that 16-byte alignment is the minimum (as I guess needed to align a double4 - see https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#built-in-vector-types)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Trying to find any CUDA docs that specify the alignment but only finding queryable values for textures and the ominous "The allocated memory is suitably aligned for any kind of variable."...

I asked NVIDIA about it, they opened a ticket internally to clarify the documentation, but no idea of the time scale.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for pordding them! 🤞 that they see the priority in it!


// enforce a valid pointer to the allocated memory
assert(result_ptr != nullptr);
// check the the context is valid
assert(context != nullptr);
// check that the property list is empty
assert(properties == nullptr);
// check that the alignment is not larger than max_alignment, and is either 0
// or a power of 2
assert(alignment <= max_alignment && (alignment & (alignment - 1)) == 0);

pi_result result = PI_SUCCESS;
try {
ScopedContext active(context);
result = PI_CHECK_ERROR(cuMemAllocHost(result_ptr, size));
} catch (pi_result error) {
result = error;
}
assert(reinterpret_cast<std::uintptr_t>(*result_ptr) % alignment == 0);
// check that the result is suitable aligned
assert((alignment == 0) ||
(reinterpret_cast<std::uintptr_t>(*result_ptr) % alignment == 0));
return result;
}

Expand All @@ -3416,18 +3428,31 @@ pi_result cuda_piextUSMDeviceAlloc(void **result_ptr, pi_context context,
pi_device device,
pi_usm_mem_properties *properties,
size_t size, pi_uint32 alignment) {
// from empirical testing with CUDA 10.2 on a Tesla K40
static constexpr pi_uint32 max_alignment = 0x200;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we make the constant global to share it between the different functions?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was considering if host and device could have different default alignments - but given the lack of details, I guess we could use a global static for the moment.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is a valid thought!


// enforce a valid pointer to the allocated memory
assert(result_ptr != nullptr);
// check the the context is valid
assert(context != nullptr);
// check that the device is valid
assert(device != nullptr);
// check that the property list is empty
assert(properties == nullptr);
// check that the alignment is not larger than max_alignment, and is either 0
// or a power of 2
assert(alignment <= max_alignment && (alignment & (alignment - 1)) == 0);

pi_result result = PI_SUCCESS;
try {
ScopedContext active(context);
result = PI_CHECK_ERROR(cuMemAlloc((CUdeviceptr *)result_ptr, size));
} catch (pi_result error) {
result = error;
}
assert(reinterpret_cast<std::uintptr_t>(*result_ptr) % alignment == 0);
// check that the result is suitable aligned
assert((alignment == 0) ||
(reinterpret_cast<std::uintptr_t>(*result_ptr) % alignment == 0));
return result;
}

Expand All @@ -3437,10 +3462,21 @@ pi_result cuda_piextUSMSharedAlloc(void **result_ptr, pi_context context,
pi_device device,
pi_usm_mem_properties *properties,
size_t size, pi_uint32 alignment) {
// from empirical testing with CUDA 10.2 on a Tesla K40
static constexpr pi_uint32 max_alignment = 0x200;

// enforce a valid pointer to the allocated memory
assert(result_ptr != nullptr);
// check the the context is valid
assert(context != nullptr);
// check that the device is valid
assert(device != nullptr);
// check that the property list is empty
assert(properties == nullptr);
// check that the alignment is not larger than max_alignment, and is either 0
// or a power of 2
assert(alignment <= max_alignment && (alignment & (alignment - 1)) == 0);

pi_result result = PI_SUCCESS;
try {
ScopedContext active(context);
Expand All @@ -3449,7 +3485,9 @@ pi_result cuda_piextUSMSharedAlloc(void **result_ptr, pi_context context,
} catch (pi_result error) {
result = error;
}
assert(reinterpret_cast<std::uintptr_t>(*result_ptr) % alignment == 0);
// check that the result is suitable aligned
assert((alignment == 0) ||
(reinterpret_cast<std::uintptr_t>(*result_ptr) % alignment == 0));
return result;
}

Expand Down Expand Up @@ -3481,8 +3519,12 @@ pi_result cuda_piextUSMEnqueueMemset(pi_queue queue, void *ptr, pi_int32 value,
pi_uint32 num_events_in_waitlist,
const pi_event *events_waitlist,
pi_event *event) {
// enforce that the queue is valid
assert(queue != nullptr);
assert(ptr != nullptr);
// check that the pointer is valid
if (ptr == nullptr) {
return PI_INVALID_VALUE;
}
CUstream cuStream = queue->get();
pi_result result = PI_SUCCESS;
std::unique_ptr<_pi_event> event_ptr{nullptr};
Expand Down Expand Up @@ -3514,9 +3556,12 @@ pi_result cuda_piextUSMEnqueueMemcpy(pi_queue queue, pi_bool blocking,
pi_uint32 num_events_in_waitlist,
const pi_event *events_waitlist,
pi_event *event) {
// enforce that the queue is valid
assert(queue != nullptr);
assert(dst_ptr != nullptr);
assert(src_ptr != nullptr);
// check that the source and destination pointers are valid
if (dst_ptr == nullptr || src_ptr == nullptr) {
return PI_INVALID_VALUE;
}
CUstream cuStream = queue->get();
pi_result result = PI_SUCCESS;
std::unique_ptr<_pi_event> event_ptr{nullptr};
Expand Down Expand Up @@ -3553,8 +3598,12 @@ pi_result cuda_piextUSMEnqueuePrefetch(pi_queue queue, const void *ptr,
pi_uint32 num_events_in_waitlist,
const pi_event *events_waitlist,
pi_event *event) {
// enforce that the queue is valid
assert(queue != nullptr);
assert(ptr != nullptr);
// check that the pointer is valid
if (ptr == nullptr) {
return PI_INVALID_VALUE;
}
CUstream cuStream = queue->get();
pi_result result = PI_SUCCESS;
std::unique_ptr<_pi_event> event_ptr{nullptr};
Expand Down Expand Up @@ -3589,8 +3638,12 @@ pi_result cuda_piextUSMEnqueuePrefetch(pi_queue queue, const void *ptr,
pi_result cuda_piextUSMEnqueueMemAdvise(pi_queue queue, const void *ptr,
size_t length, int advice,
pi_event *event) {
// enforce that the queue is valid
assert(queue != nullptr);
assert(ptr != nullptr);
// check that the pointer is valid
if (ptr == nullptr) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please, return assert.

return PI_INVALID_VALUE;
}
// TODO implement a mapping to cuMemAdvise once the expected behaviour
// of piextUSMEnqueueMemAdvise is detailed in the USM extension
return cuda_piEnqueueEventsWait(queue, 0, nullptr, event);
Expand Down
1 change: 0 additions & 1 deletion sycl/test/usm/allocator_vector.cpp
Original file line number Diff line number Diff line change
@@ -1,4 +1,3 @@
// XFAIL: cuda
// piextUSM*Alloc functions for CUDA are not behaving as described in
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can have a look later today or tomorrow.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+@bjoernknafla, just in case he might know if this is a known issue.

Copy link
Contributor

@bjoernknafla bjoernknafla May 7, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That error could come from CUDA event handling needing a thread pool as the CUDA callbacks currently used cannot issue CUDA API calls - and in some cases the callback holds the last existing reference to an event.

If I understand correctly, the the following PR is meant as a way towards a solution: #1471

Though the event problem does not fail tests, as the CUDA implementation "just" leaks the event in such cases and ignores the error... So it might be a different problem I am not aware off.

// https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/USM/USM.adoc
// https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/USM/cl_intel_unified_shared_memory.asciidoc
Expand Down
1 change: 0 additions & 1 deletion sycl/test/usm/allocator_vector_fail.cpp
Original file line number Diff line number Diff line change
@@ -1,4 +1,3 @@
// XFAIL: cuda
// piextUSM*Alloc functions for CUDA are not behaving as described in
// https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/USM/USM.adoc
// https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/USM/cl_intel_unified_shared_memory.asciidoc
Expand Down
1 change: 0 additions & 1 deletion sycl/test/usm/allocatorll.cpp
Original file line number Diff line number Diff line change
@@ -1,4 +1,3 @@
// XFAIL: cuda
// piextUSM*Alloc functions for CUDA are not behaving as described in
// https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/USM/USM.adoc
// https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/USM/cl_intel_unified_shared_memory.asciidoc
Expand Down
1 change: 0 additions & 1 deletion sycl/test/usm/badmalloc.cpp
Original file line number Diff line number Diff line change
@@ -1,5 +1,4 @@
// UNSUPPORTED: windows
// XFAIL: cuda
// piextUSM*Alloc functions for CUDA are not behaving as described in
// https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/USM/USM.adoc
// https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/USM/cl_intel_unified_shared_memory.asciidoc
Expand Down
1 change: 0 additions & 1 deletion sycl/test/usm/depends_on.cpp
Original file line number Diff line number Diff line change
@@ -1,4 +1,3 @@
// XFAIL: cuda
// piextUSM*Alloc functions for CUDA are not behaving as described in
// https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/USM/USM.adoc
// https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/USM/cl_intel_unified_shared_memory.asciidoc
Expand Down
1 change: 0 additions & 1 deletion sycl/test/usm/dmemll.cpp
Original file line number Diff line number Diff line change
@@ -1,4 +1,3 @@
// XFAIL: cuda
// piextUSM*Alloc functions for CUDA are not behaving as described in
// https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/USM/USM.adoc
// https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/USM/cl_intel_unified_shared_memory.asciidoc
Expand Down
1 change: 0 additions & 1 deletion sycl/test/usm/hmemll.cpp
Original file line number Diff line number Diff line change
@@ -1,4 +1,3 @@
// XFAIL: cuda
// piextUSM*Alloc functions for CUDA are not behaving as described in
// https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/USM/USM.adoc
// https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/USM/cl_intel_unified_shared_memory.asciidoc
Expand Down
1 change: 0 additions & 1 deletion sycl/test/usm/math.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -3,7 +3,6 @@
// RUN: %CPU_RUN_PLACEHOLDER %t.out

// REQUIRES: cpu
// XFAIL: cuda
// TODO: ptxas fatal : Unresolved extern function '_Z20__spirv_ocl_lgamma_rfPi'

#include <CL/sycl.hpp>
Expand Down
1 change: 0 additions & 1 deletion sycl/test/usm/memadvise.cpp
Original file line number Diff line number Diff line change
@@ -1,4 +1,3 @@
// XFAIL: cuda
// SYCL runtime and piextUSM*Alloc functions for CUDA not behaving as described
// in: https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/USM/USM.adoc
//
Expand Down
1 change: 0 additions & 1 deletion sycl/test/usm/memcpy.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -5,7 +5,6 @@
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
//
//===----------------------------------------------------------------------===//
// XFAIL: cuda
// piextUSM*Alloc functions for CUDA are not behaving as described in
// https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/USM/USM.adoc
// https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/USM/cl_intel_unified_shared_memory.asciidoc
Expand Down
1 change: 0 additions & 1 deletion sycl/test/usm/memset.cpp
Original file line number Diff line number Diff line change
@@ -1,4 +1,3 @@
// XFAIL: cuda
// piextUSM*Alloc functions for CUDA are not behaving as described in
// https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/USM/USM.adoc
// https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/USM/cl_intel_unified_shared_memory.asciidoc
Expand Down
1 change: 0 additions & 1 deletion sycl/test/usm/mixed.cpp
Original file line number Diff line number Diff line change
@@ -1,4 +1,3 @@
// XFAIL: cuda
// piextUSM*Alloc functions for CUDA are not behaving as described in
// https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/USM/USM.adoc
// https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/USM/cl_intel_unified_shared_memory.asciidoc
Expand Down
1 change: 0 additions & 1 deletion sycl/test/usm/mixed2.cpp
Original file line number Diff line number Diff line change
@@ -1,4 +1,3 @@
// XFAIL: cuda
// piextUSM*Alloc functions for CUDA are not behaving as described in
// https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/USM/USM.adoc
// https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/USM/cl_intel_unified_shared_memory.asciidoc
Expand Down
1 change: 0 additions & 1 deletion sycl/test/usm/mixed2template.cpp
Original file line number Diff line number Diff line change
@@ -1,4 +1,3 @@
// XFAIL: cuda
// piextUSM*Alloc functions for CUDA are not behaving as described in
// https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/USM/USM.adoc
// https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/USM/cl_intel_unified_shared_memory.asciidoc
Expand Down
1 change: 0 additions & 1 deletion sycl/test/usm/mixed_queue.cpp
Original file line number Diff line number Diff line change
@@ -1,4 +1,3 @@
// XFAIL: cuda
// piextUSM*Alloc functions for CUDA are not behaving as described in
// https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/USM/USM.adoc
// https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/USM/cl_intel_unified_shared_memory.asciidoc
Expand Down
1 change: 0 additions & 1 deletion sycl/test/usm/queue_wait.cpp
Original file line number Diff line number Diff line change
@@ -1,4 +1,3 @@
// XFAIL: cuda
// piextUSM*Alloc functions for CUDA are not behaving as described in
// https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/USM/USM.adoc
// https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/USM/cl_intel_unified_shared_memory.asciidoc
Expand Down
1 change: 0 additions & 1 deletion sycl/test/usm/smemll.cpp
Original file line number Diff line number Diff line change
@@ -1,4 +1,3 @@
// XFAIL: cuda
// piextUSM*Alloc functions for CUDA are not behaving as described in
// https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/USM/USM.adoc
// https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/USM/cl_intel_unified_shared_memory.asciidoc
Expand Down