[SYCL][CUDA][USM] Improve CUDA USM memory allocation functions #1577


Closed
fwyzard wants to merge 4 commits

Conversation

fwyzard
Contributor

@fwyzard fwyzard commented Apr 23, 2020

Allow memory allocations with 0 alignment, to signify no alignment requirements.

Return PI_INVALID_VALUE for memory operations on a nullptr, instead of failing with an assert.

@fwyzard fwyzard requested a review from smaslov-intel as a code owner April 23, 2020 13:47
@fwyzard
Contributor Author

fwyzard commented Apr 23, 2020

Fixes #1467?

@romanovvlad
Contributor

Strictly speaking alignment == 0 doesn't mean there is no requirement. It means the alignment should be equal to the largest type supported by OpenCL (CUDA in our case).
@jbrodman Could you please comment?

@jbrodman
Contributor

Strictly speaking alignment == 0 doesn't mean there is no requirement. It means the alignment should be equal to the largest type supported by OpenCL (CUDA in our case).
@jbrodman Could you please comment?

The intent of alignment == 0 is "just do the default thing" whatever that may be.

@fwyzard
Contributor Author

fwyzard commented Apr 23, 2020

In USM.adoc#aligned_alloc I see only

size_t alignment - specifies the byte alignment. Must be a valid alignment supported by the implementation

for the various allocation functions.

@fwyzard
Contributor Author

fwyzard commented Apr 23, 2020

Ah, in cl_intel_unified_shared_memory.asciidoc I found

alignment is the minimum alignment in bytes for the requested host allocation. It must be a power of two and must be equal to or smaller than the size of the largest data type supported by any OpenCL device in context. If alignment is 0, a default alignment will be used that is equal to the size of the largest data type supported by any OpenCL device in context.

So, should the functions in pi_cuda.cpp follow the same guidelines about valid values and error reporting?

For CUDA I could set the maximum alignment to 1024 (empirical value) or at least 128 (CUDA cache line size).
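The rule quoted from cl_intel_unified_shared_memory can be sketched as a small validation helper. This is a hedged illustration, not the actual pi_cuda.cpp code: the names `is_valid_usm_alignment`, `effective_alignment`, and `kMaxAlignment` are ours, and the 0x200 maximum is only the empirical value discussed in this thread.

```cpp
#include <cassert>
#include <cstdint>

// Empirical maximum from this thread (512 bytes); an assumption, not a
// documented CUDA guarantee.
constexpr std::uint32_t kMaxAlignment = 0x200;

// Valid per the quoted spec wording: a power of two no larger than the
// maximum supported alignment, or 0 to request the default.
inline bool is_valid_usm_alignment(std::uint32_t alignment) {
  if (alignment == 0)
    return true; // 0 means "use the default alignment"
  return (alignment & (alignment - 1)) == 0 && alignment <= kMaxAlignment;
}

// 0 maps to the default (largest supported) alignment.
inline std::uint32_t effective_alignment(std::uint32_t alignment) {
  return alignment == 0 ? kMaxAlignment : alignment;
}
```
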

@@ -3406,7 +3406,7 @@ pi_result cuda_piextUSMHostAlloc(void **result_ptr, pi_context context,
} catch (pi_result error) {
result = error;
}
assert(reinterpret_cast<std::uintptr_t>(*result_ptr) % alignment == 0);
assert((alignment == 0) or (reinterpret_cast<std::uintptr_t>(*result_ptr) % alignment == 0));
Contributor

Suggested change
assert((alignment == 0) or (reinterpret_cast<std::uintptr_t>(*result_ptr) % alignment == 0));
assert((alignment == 0) || (reinterpret_cast<std::uintptr_t>(*result_ptr) % alignment == 0));

"or" confuses a lot of people.
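For context: in standard C++, `or` is an alternative token for `||`, so the two spellings of the assert are semantically identical; the suggestion is about readability (and, historically, some compilers in non-conforming modes reject the alternative tokens). A minimal demonstration, with helper names of our own:

```cpp
#include <cassert>

// Both functions compute the same logical OR; only the token differs.
inline bool either_alt(bool a, bool b) { return a or b; }   // alternative token
inline bool either_std(bool a, bool b) { return a || b; }   // conventional spelling
```
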

Contributor Author

Mhm.. OK

Contributor

The primary reason to do this change is to avoid build issues. See #1501.

@fwyzard
Contributor Author

fwyzard commented Apr 23, 2020

Actually, after checking on a different machine, it looks like cuMemAlloc and cuMemAllocHost return memory aligned to 512 bytes (0x200).

@fwyzard
Contributor Author

fwyzard commented Apr 23, 2020

I took the opportunity to change most checks from asserts to returning the appropriate error value.

@fwyzard fwyzard changed the title [SYCL][CUDA][USM] Allow unaligned memory allocations [SYCL][CUDA][USM] Improve CUDA USM memory allocation functions Apr 23, 2020
@bader bader requested a review from romanovvlad April 23, 2020 15:58
@smaslov-intel
Contributor

Strictly speaking alignment == 0 doesn't mean there is no requirement. It means the alignment should be equal to the largest type supported by OpenCL (CUDA in our case).
@jbrodman Could you please comment?

The intent of alignment == 0 is "just do the default thing" whatever that may be.

@fwyzard , please add this in the comment about USM alignment in corresponding API in pi.h

@fwyzard
Contributor Author

fwyzard commented Apr 24, 2020

Does it look reasonable?

assert(context != nullptr);
assert(properties == nullptr);
// check that the context is valid
if (context == nullptr)
Contributor

Please, keep asserts. It's expected that the higher level passes right arguments, otherwise it is an internal bug.
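The reviewer's distinction can be sketched as follows. This is an illustrative mock, not the real PI API: arguments that upper SYCL layers guarantee are internal contracts and should trip an assert in debug builds, while conditions a caller can legitimately produce should be reported through an error code. The names `pi_result_t` and `usm_free_sketch` are our own.

```cpp
#include <cassert>
#include <cstddef>

enum pi_result_t { PI_SUCCESS = 0, PI_INVALID_VALUE = 1 };

pi_result_t usm_free_sketch(void *context, void *ptr) {
  // Internal contract: the higher level always passes a valid context,
  // so a null context is a bug in the runtime itself.
  assert(context != nullptr && "internal bug: context checked by upper layers");
  // Caller-visible condition: report it instead of aborting.
  if (ptr == nullptr)
    return PI_INVALID_VALUE;
  // ... release the allocation ...
  return PI_SUCCESS;
}
```
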

smaslov-intel
smaslov-intel previously approved these changes Apr 24, 2020
@fwyzard
Contributor Author

fwyzard commented Apr 24, 2020

Sorry, updated to apply the clang-format patch.

smaslov-intel
smaslov-intel previously approved these changes Apr 25, 2020
bader
bader previously approved these changes May 1, 2020
smaslov-intel
smaslov-intel previously approved these changes May 2, 2020
@bader bader requested a review from romanovvlad May 2, 2020 08:27
@fwyzard
Contributor Author

fwyzard commented May 6, 2020

@bader could you trigger the Lit_With_Cuda test?

@bader
Contributor

bader commented May 6, 2020

@tfzhu, @vladimirlaz, could you trigger the Lit_With_Cuda test for this PR, please?

assert(queue != nullptr);
assert(ptr != nullptr);
// check that the pointer is valid
if (ptr == nullptr) {
Contributor

Please, return assert.

@@ -3396,17 +3396,29 @@ pi_result cuda_piEnqueueMemUnmap(pi_queue command_queue, pi_mem memobj,
pi_result cuda_piextUSMHostAlloc(void **result_ptr, pi_context context,
pi_usm_mem_properties *properties, size_t size,
pi_uint32 alignment) {
// from empirical testing with CUDA 10.2 on a Tesla K40
static constexpr pi_uint32 max_alignment = 0x200;
Contributor

Trying to find any CUDA docs that specify the alignment but only finding queryable values for textures and the ominous "The allocated memory is suitably aligned for any kind of variable."...

Looking at CUDA types I wonder if that means that 16-byte alignment is the minimum (as I guess needed to align a double4 - see https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#built-in-vector-types)?

Contributor Author

Trying to find any CUDA docs that specify the alignment but only finding queryable values for textures and the ominous "The allocated memory is suitably aligned for any kind of variable."...

I asked NVIDIA about it, they opened a ticket internally to clarify the documentation, but no idea of the time scale.

Contributor

Thank you for prodding them! 🤞 that they see the priority in it!
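For what it's worth, the 0x200 figure discussed above can be measured empirically: the largest power of two dividing a returned address is that allocation's observed alignment. A small sketch, using plain `aligned_alloc` as a stand-in for cuMemAllocHost; the helper name is ours, not part of CUDA or PI.

```cpp
#include <cstdint>
#include <cstdlib>

// Largest power of two dividing the address, i.e. the observed alignment.
inline std::uintptr_t observed_alignment(const void *p) {
  auto addr = reinterpret_cast<std::uintptr_t>(p);
  return addr & (~addr + 1); // isolate the lowest set bit
}
```

For example, an address of 0x...200 reports 0x200, while 0x...300 reports only 0x100.
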

@@ -3416,18 +3428,31 @@ pi_result cuda_piextUSMDeviceAlloc(void **result_ptr, pi_context context,
pi_device device,
pi_usm_mem_properties *properties,
size_t size, pi_uint32 alignment) {
// from empirical testing with CUDA 10.2 on a Tesla K40
static constexpr pi_uint32 max_alignment = 0x200;
Contributor

Should we make the constant global to share it between the different functions?

Contributor Author

I was considering if host and device could have different default alignments - but given the lack of details, I guess we could use a global static for the moment.

Contributor

That is a valid thought!

@bjoernknafla
Contributor

Awesome to see the XFAILs go for USM!

@bjoernknafla
Contributor

Is this work related to #1603 (as that issue puzzles me)?

@fwyzard
Contributor Author

fwyzard commented May 7, 2020

Yes, as I added at the bottom there, most (all?) CUDA USM XFAILs are fixed by this PR.

My bad for not noticing it when I made the initial implementation: I had asserts disabled in my build, which hid most of the failures.

@@ -1,4 +1,3 @@
// XFAIL: cuda
// piextUSM*Alloc functions for CUDA are not behaving as described in
Contributor

Contributor Author

I can have a look later today or tomorrow.

Contributor

+@bjoernknafla, just in case he might know if this is a known issue.

Contributor

@bjoernknafla May 7, 2020

That error could come from CUDA event handling needing a thread pool as the CUDA callbacks currently used cannot issue CUDA API calls - and in some cases the callback holds the last existing reference to an event.

If I understand correctly, the following PR is meant as a way towards a solution: #1471

Though the event problem does not fail tests, as the CUDA implementation "just" leaks the event in such cases and ignores the error... So it might be a different problem I am not aware of.

@jbrodman
Contributor

Any update on where we are with this?

@jeffhammond
Contributor

I am trying to test this PR and had the crazy idea to rebase it on top of the latest sycl branch, which is taking forever. It would be really nice if the maintainers could deal with this, since #1467 and this PR have been open since April.

@romanovvlad romanovvlad self-requested a review August 26, 2020 13:50
@@ -1298,7 +1298,8 @@ using pi_usm_migration_flags = _pi_usm_migration_flags;
/// \param context is the pi_context
/// \param pi_usm_mem_properties are optional allocation properties
/// \param size_t is the size of the allocation
/// \param alignment is the desired alignment of the allocation
/// \param alignment is the desired alignment of the allocation. 0 indicates no
Contributor

Please, fix LIT tests.

@bader
Contributor

bader commented Apr 21, 2021

@fwyzard, I think USM memory allocation functions are working on CUDA. Can we close this pull request?

@fwyzard fwyzard changed the base branch from sycl to intel April 21, 2021 07:53
@fwyzard fwyzard dismissed stale reviews from smaslov-intel and bader April 21, 2021 07:53

The base branch was changed.

@fwyzard fwyzard changed the base branch from intel to sycl April 21, 2021 07:53
@fwyzard fwyzard requested a review from a team as a code owner April 21, 2021 07:53
@fwyzard
Contributor Author

fwyzard commented Apr 21, 2021

@bader unfortunately I won't have time to look back at this for the next weeks.
From a quick look here on GitHub it seems that some of the changes could still be useful - but it would take more work to rebase them, address the conflicts, retest, etc.

I'm OK with closing the PR and eventually resurrecting it when I can work on it again.

@fwyzard fwyzard closed this Apr 21, 2021
Labels: cuda CUDA back-end
9 participants