Skip to content
This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Do not touch GPU 0 during ReleaseAll #14550

Merged
merged 4 commits into from
Mar 31, 2019
Merged

Conversation

ptrendx
Copy link
Member

@ptrendx ptrendx commented Mar 28, 2019

Description

Currently, during call to ReleaseAll (either in pooled allocator destructor or when the memory is full and the allocator needs to release cached allocations) the memory handle is not fully populated (only ptr and size are populated, not context information), which makes the DirectFreeNoLock method to switch to GPU 0 (since 0 is the default constructed Context's dev_id). In the case of multi process training (e.g. with Horovod) this could produce Out of Memory errors due to context creation on GPU 0, or it could outright fail, crashing the application, if the exclusive mode is set on GPU 0.

This PR fixes that by adding initial_context_ parameter to pooled storage managers and populating the handle with it in ReleaseAll method.

@apeforest @yuxihu FYI

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • Changes are complete (i.e. I finished coding on this PR)
  • Code is well-documented:
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • To the my best knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

@@ -54,10 +54,11 @@ class GPUPooledStorageManager final : public StorageManager {
/*!
* \brief Default constructor.
*/
GPUPooledStorageManager() {
GPUPooledStorageManager(int dev_id) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe pass the context instead? And call it creation_ or initial_context? And maybe add a short explanation to the parameters documentation?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@abhinavs95
Copy link
Contributor

@mxnet-label-bot add [Backend, pr-awaiting-review]

@marcoabreu marcoabreu added Backend Issues related to the backend of MXNet pr-awaiting-review PR is waiting for code review labels Mar 28, 2019
Copy link
Member

@yuxihu yuxihu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please change dev_id_ to initial_context_ in the PR description. Otherwise, LGTM.

@@ -123,6 +126,8 @@ class GPUPooledStorageManager final : public StorageManager {
int reserve_;
// number of devices
const size_t NDEV = 32;
// context used by this Storage Manager
Context initial_context_;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One more comment: make it const?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@@ -290,6 +299,8 @@ class GPUPooledRoundedStorageManager final : public StorageManager {
size_t cut_off_;
// percentage of reserved memory
int reserve_;
// context used by this Storage Manager
Context initial_context_;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@ptrendx
Copy link
Member Author

ptrendx commented Mar 29, 2019

The CI failure seems completely unrelated.

Copy link
Member

@yuxihu yuxihu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Retriggered the CI for you and let's see.

Copy link
Member

@anirudh2290 anirudh2290 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great. I also encountered this bug! Thanks for the fix.

@yuxihu
Copy link
Member

yuxihu commented Mar 29, 2019

@mxnet-label-bot update [Backend, pr-awaiting-merge]

@marcoabreu marcoabreu added pr-awaiting-merge Review and CI is complete. Ready to Merge and removed pr-awaiting-review PR is waiting for code review labels Mar 29, 2019
@wkcn wkcn merged commit 4075212 into apache:master Mar 31, 2019
@wkcn
Copy link
Member

wkcn commented Mar 31, 2019

The PR has been merged. Thank you for the fix!

ZhennanQin pushed a commit to ZhennanQin/incubator-mxnet that referenced this pull request Apr 3, 2019
* Do not touch GPU 0 during ReleaseAll

* Fixing lint and fixes from review

* Fix

* Fixes from review
nswamy pushed a commit that referenced this pull request Apr 5, 2019
* Do not touch GPU 0 during ReleaseAll

* Fixing lint and fixes from review

* Fix

* Fixes from review
haohuanw pushed a commit to haohuanw/incubator-mxnet that referenced this pull request Jun 23, 2019
* Do not touch GPU 0 during ReleaseAll

* Fixing lint and fixes from review

* Fix

* Fixes from review
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Backend Issues related to the backend of MXNet pr-awaiting-merge Review and CI is complete. Ready to Merge
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants