Add GDS-compatible allocator with 4k alignment. #3754
Conversation
Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>
CI MESSAGE: [4214574]: BUILD STARTED
CI MESSAGE: [4214574]: BUILD PASSED
@@ -32,6 +32,7 @@ void TestPoolResource(int num_iter) {
  test_host_resource upstream;
  {
    auto opt = default_host_pool_opts();
    opt.max_upstream_alignment = 32;  // force the use of overaligned upstream allocations
Why do we need that change?
The distribution of alignments in this test is 1-256. We need something smaller to test the codepath with overalignment. I want this to be explicit, in case the default value (which is 256) is changed.
OK, this is just for the test. NVM.
GDSAllocator::GDSAllocator() {
  // Currently, GPUDirect Storage can work only with memory allocated with cudaMalloc and
  // cuMemAlloc. Since DALI is transitioning to CUDA Virtual Memory Management for memory
  // allocation, we need a special allocator that's compatible with GDS.
How is that achieved? Is it sufficient to just use coalescing_free_tree
(as I understand it still uses the CUDA Virtual Memory Management)?
See the next line:
static auto upstream = std::make_shared<mm::cuda_malloc_memory_resource>();
cuda_malloc_memory_resource uses plain cudaMalloc.
char *block_end = block_start + blk_size;
assert(tail <= block_end);

if (blk_size != bytes) {
Suggested change:
- if (blk_size != bytes) {
+ if (blk_size > bytes) {
According to what the comment says?
Well, it can't be less :)
If anything, there could be an assert(blk_size > bytes); inside, but wouldn't that be overkill?
Actually, it's already indirectly tested as assert(tail <= block_end);
lock_guard guard(lock_);
free_list_.put(static_cast<char *>(new_block) + bytes, blk_size - bytes);
return new_block;

if (ret != block_start)
Suggested change:
- if (ret != block_start)
+ if (ret > block_start)
if (ret != block_start)
  free_list_.put(block_start, ret - block_start);

if (tail != block_end)
Suggested change:
- if (tail != block_end)
+ if (tail < block_end)
Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>
Category:
New feature (non-breaking change which adds functionality)
Description:
GDS is more efficient when the mapped memory is aligned to a 4 KiB boundary. It also doesn't work with memory allocated with cuMemCreate.
This PR adds a dedicated GDS memory pool, which is separate and used only for that purpose.
As a preparatory step, pool_resource was extended so that it can be informed about the maximum alignment supported by the upstream resource and work around that limitation when allocating upstream blocks.
Additional information:
Affected modules and functionalities:
Key points relevant for the review:
N/A
Checklist
Tests
Documentation
DALI team only
Requirements
REQ IDs: N/A
JIRA TASK: DALI-2667