Less cudaGet/SetDevice calls in Gluon execution #13764
Conversation
@ptrendx Can you look into the failing CI builds?
Thanks for the fix! One question:
for (int i = 0; i < n; ++i) {
  device_store.SetDevice(gpus[i]);
  // Restores active device to what it was before EnableP2P
  mxnet::common::cuda::DeviceStore device_store(gpus[i]);
Is cudaGetDevice costly? This change would cause 2x cudaGetDevice calls.
This code is executed only during initialization, so I'm not concerned about its performance (to answer your question though - cudaGetDevice is slightly less costly than cudaSetDevice).
I made the change here only because the code then uses a real RAII guard instead of just a SetDevice call.
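The difference between an RAII guard and a bare set-device call can be sketched as follows. This is only an illustration, not the DeviceStore code from the PR: the `fake_cuda` namespace is a hypothetical stand-in for the CUDA runtime so the sketch builds without a GPU, and `DeviceGuardSketch`, `VisitDevicesWithGuard` and `VisitDevicesWithoutGuard` are names invented for this example.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-ins for the CUDA runtime (not the real API).
namespace fake_cuda {
static int current_device = 0;  // device that is "active" on this thread
static void GetDevice(int* dev) { *dev = current_device; }
static void SetDevice(int dev) { current_device = dev; }
}  // namespace fake_cuda

// Illustrative RAII guard: remembers the active device on construction,
// switches to the requested one, and restores the original on destruction.
class DeviceGuardSketch {
 public:
  explicit DeviceGuardSketch(int device) {
    fake_cuda::GetDevice(&restore_device_);
    fake_cuda::SetDevice(device);
  }
  ~DeviceGuardSketch() { fake_cuda::SetDevice(restore_device_); }

  // A guard owns the "restore" obligation, so copying it makes no sense.
  DeviceGuardSketch(const DeviceGuardSketch&) = delete;
  DeviceGuardSketch& operator=(const DeviceGuardSketch&) = delete;

 private:
  int restore_device_ = 0;
};

// Visits every GPU in turn and returns the device that is active afterwards.
int VisitDevicesWithGuard(const std::vector<int>& gpus) {
  for (int gpu : gpus) {
    DeviceGuardSketch guard(gpu);  // active device is gpu for this iteration
    assert(fake_cuda::current_device == gpu);
    // ... per-device setup would go here ...
  }  // the guard's destructor restores the previous device here
  return fake_cuda::current_device;
}

// Same loop with a bare set-device call: nothing restores the original device.
int VisitDevicesWithoutGuard(const std::vector<int>& gpus) {
  for (int gpu : gpus) fake_cuda::SetDevice(gpu);
  return fake_cuda::current_device;
}
```

Starting from device 0, the guarded loop leaves device 0 active when it returns, while the unguarded loop leaves whichever device was visited last. The guard pays for this with one extra get-device call per iteration, which is the cost the reviewer asked about.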
@ctcyang could you also take a look?
Not sure why the website check is showing as pending - it seems to have finished successfully in Details view.
Nice work! The only nitpick I have is that after these changes, the only place where cudaSetDevice is still used directly is: https://github.com/apache/incubator-mxnet/blob/e9a7aa42ec380d92b1623025d6434b8856724402/src/engine/threaded_engine_pooled.cc#L136
Could you change that to use this new API too?
LGTM
* Remove unnecessary cudaGetDevice/cudaSetDevice calls
* Fixes for the DeviceGuard
* Retrigger CI
* Fix for possible invalid device ordinal when using DeviceStore while driver is unloading
* Fix for RTC when the driver API call is the first call
* Added DeviceStore to pooled engine
Description
This PR reduces the number of cudaGetDevice/cudaSetDevice calls during Gluon execution.
Previously, on every call to allocate or free a buffer in StorageManager, DeviceStore would call cudaGetDevice once and cudaSetDevice twice (to get the current device, set the new device, and finally restore the original device), even if no actual allocation took place because the caching allocator served the request.
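The overhead described above can be illustrated with a small call-counting sketch. It is not the PR's implementation: `mock_cuda` is a hypothetical stand-in for the CUDA runtime, `PoolSketch` is a toy model of a caching allocator, and deferring the guard until a cache miss is just one plausible way to skip the calls, assumed here for illustration.

```cpp
#include <cassert>

// Hypothetical stand-ins for the CUDA runtime that count how often they run.
namespace mock_cuda {
static int current_device = 0;
static int get_calls = 0;
static int set_calls = 0;
static void GetDevice(int* dev) { *dev = current_device; ++get_calls; }
static void SetDevice(int dev) { current_device = dev; ++set_calls; }
static void ResetCounters() { get_calls = 0; set_calls = 0; }
}  // namespace mock_cuda

// Guard that always saves, switches, and restores: 1 get + 2 sets per scope.
class EagerGuard {
 public:
  explicit EagerGuard(int device) {
    mock_cuda::GetDevice(&restore_device_);
    mock_cuda::SetDevice(device);
  }
  ~EagerGuard() { mock_cuda::SetDevice(restore_device_); }
  EagerGuard(const EagerGuard&) = delete;
  EagerGuard& operator=(const EagerGuard&) = delete;

 private:
  int restore_device_ = 0;
};

// Toy caching allocator (assumption): tracks only how many buffers it can reuse.
struct PoolSketch {
  int device = 1;          // device this pool allocates on
  int cached_buffers = 0;  // buffers available for reuse

  // Wraps every request in the guard, even when the cache can serve it.
  // Returns true when the request was served from the cache.
  bool AllocAlwaysGuarded() {
    EagerGuard guard(device);
    if (cached_buffers > 0) { --cached_buffers; return true; }
    return false;  // a real device allocation would happen here
  }

  // Switches device only when the cache misses and a real allocation is needed.
  bool AllocGuardedOnMiss() {
    if (cached_buffers > 0) { --cached_buffers; return true; }
    EagerGuard guard(device);
    return false;  // a real device allocation would happen here
  }
};
```

With a warm cache, `AllocAlwaysGuarded` still issues three runtime calls per request, while `AllocGuardedOnMiss` issues none until the cache is empty.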