Do not touch GPU 0 during ReleaseAll #14550
Conversation
src/storage/pooled_storage_manager.h
Outdated
@@ -54,10 +54,11 @@ class GPUPooledStorageManager final : public StorageManager {
   /*!
    * \brief Default constructor.
    */
-  GPUPooledStorageManager() {
+  GPUPooledStorageManager(int dev_id) {
Maybe pass the context instead? And call it creation_ or initial_context? And maybe add a short explanation to the parameters documentation?
Done.
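The constructor change suggested above — taking a full context rather than a bare device id, with the parameter documented — might look roughly like this. This is an illustrative sketch with a minimal stand-in `Context` type, not the real MXNet headers:

```cpp
#include <cassert>

// Hypothetical minimal stand-in for mxnet::Context (illustration only).
struct Context {
  int dev_id = 0;
};

// Sketch of the reviewed shape: the storage manager is constructed with
// the context it operates on, instead of a raw int dev_id.
class GPUPooledStorageManagerSketch {
 public:
  /*!
   * \brief Constructor.
   * \param initial_context context this storage manager allocates and
   *        frees on; used to populate handles released in ReleaseAll.
   */
  explicit GPUPooledStorageManagerSketch(Context initial_context)
      : initial_context_(initial_context) {}

  int device_id() const { return initial_context_.dev_id; }

 private:
  const Context initial_context_;
};
```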
@mxnet-label-bot add [Backend, pr-awaiting-review]
Please change dev_id_ to initial_context_ in the PR description. Otherwise, LGTM.
src/storage/pooled_storage_manager.h
Outdated
@@ -123,6 +126,8 @@ class GPUPooledStorageManager final : public StorageManager {
   int reserve_;
   // number of devices
   const size_t NDEV = 32;
+  // context used by this Storage Manager
+  Context initial_context_;
One more comment: make it const?
Done.
src/storage/pooled_storage_manager.h
Outdated
@@ -290,6 +299,8 @@ class GPUPooledRoundedStorageManager final : public StorageManager {
   size_t cut_off_;
   // percentage of reserved memory
   int reserve_;
+  // context used by this Storage Manager
+  Context initial_context_;
Same as above
Done.
The CI failure seems completely unrelated.
LGTM. Retriggered the CI for you and let's see.
This is great. I also encountered this bug! Thanks for the fix.
@mxnet-label-bot update [Backend, pr-awaiting-merge]
The PR has been merged. Thank you for the fix!
* Do not touch GPU 0 during ReleaseAll
* Fixing lint and fixes from review
* Fix
* Fixes from review
Description
Currently, during a call to ReleaseAll (either in the pooled allocator's destructor, or when memory is full and the allocator needs to release cached allocations), the memory handle is not fully populated: only ptr and size are set, not the context information. This causes the DirectFreeNoLock method to switch to GPU 0, since 0 is the dev_id of a default-constructed Context. In multi-process training (e.g. with Horovod) this can produce out-of-memory errors due to context creation on GPU 0, or it can fail outright and crash the application if exclusive mode is set on GPU 0.
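The failure mode can be illustrated with a simplified sketch. The `Context` and `Handle` types below are minimal hypothetical stand-ins (not MXNet's real classes): a default-constructed context carries dev_id 0, so freeing a handle whose ctx field was never populated directs the device switch at GPU 0 regardless of where the memory actually lives:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-ins for MXNet's Context and storage Handle,
// for illustration only.
struct Context {
  int dev_id = 0;  // a default-constructed context points at device 0
};

struct Handle {
  void* ptr = nullptr;
  size_t size = 0;
  Context ctx;  // before the fix, ReleaseAll left this default-constructed
};

// Returns the device a free would be issued on (stand-in for the
// cudaSetDevice call inside DirectFreeNoLock).
int DeviceTargetedByFree(const Handle& h) {
  return h.ctx.dev_id;
}
```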
This PR fixes that by adding an initial_context_ parameter to the pooled storage managers and populating the handle with it in the ReleaseAll method.

@apeforest @yuxihu FYI
Checklist
Essentials
Please feel free to remove inapplicable items for your PR.