Add CUDA/HIP MultiGPU Event Polling #6284
Conversation
libs/core/async_cuda/include/hpx/async_cuda/detail/cuda_event_callback.hpp
@G-071 should we add a test that actually exercises this functionality, perhaps one that runs only on the DGX partition (e.g. enabled with a special CMake variable)?
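A rough sketch of what such a gated multi-GPU test might look like; the CMake option name (e.g. HPX_WITH_TESTS_MULTIGPU), the header names, the enable_user_polling helper, and the executor's event-mode constructor flag are assumptions here rather than the exact HPX API:

```cpp
// Hypothetical test, built only when a dedicated CMake option enables it.
#include <hpx/hpx_init.hpp>
#include <hpx/modules/async_cuda.hpp>

#include <cuda_runtime.h>

#include <vector>

int hpx_main()
{
    // enable CUDA event polling on the default thread pool (assumed helper)
    hpx::cuda::experimental::enable_user_polling poll("default");

    int device_count = 0;
    cudaGetDeviceCount(&device_count);

    std::vector<int*> buffers(device_count, nullptr);
    std::vector<hpx::future<void>> futures;
    for (int dev = 0; dev != device_count; ++dev)
    {
        cudaSetDevice(dev);
        cudaMalloc(&buffers[dev], 1024 * sizeof(int));

        // executor bound to device 'dev', using event-based polling
        hpx::cuda::experimental::cuda_executor exec(dev, true);

        // the stream argument is appended by the executor; the returned
        // future is backed by an event that must live on device 'dev'
        futures.push_back(hpx::async(
            exec, cudaMemsetAsync, buffers[dev], 0, 1024 * sizeof(int)));
    }

    // fails (or hangs) if events are not created/polled on the right device
    hpx::wait_all(futures);

    for (int dev = 0; dev != device_count; ++dev)
    {
        cudaSetDevice(dev);
        cudaFree(buffers[dev]);
    }
    return hpx::finalize();
}

int main(int argc, char* argv[])
{
    return hpx::init(argc, argv);
}
```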
```cpp
static cuda_event_pool& get_event_pool()
{
    static cuda_event_pool event_pool_;
    return event_pool_;
}
```
May I suggest that you move the (HPX_CORE_EXPORTed) implementation of this function into a source file? The rationale is that otherwise, on some platforms (Mac, Windows), each executable module (shared library, executable) that contains code calling this function will have its own instance of the event_pool_ variable.
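A minimal sketch of that suggestion, assuming the declaration stays in the existing header and the definition moves into a corresponding source file (file names are illustrative):

```cpp
// header (e.g. cuda_event.hpp): declaration only, exported from the core library
struct HPX_CORE_EXPORT cuda_event_pool
{
    static cuda_event_pool& get_event_pool();
    // ... push/pop of pooled cudaEvent_t objects ...
};

// source file (e.g. cuda_event.cpp): the function-local static now has exactly
// one instance for the whole process, no matter how many modules call it
cuda_event_pool& cuda_event_pool::get_event_pool()
{
    static cuda_event_pool event_pool_;
    return event_pool_;
}
```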
I have added a few changes/additions/fixes. Overall, this seems to work fine on the NVIDIA nodes I have tested on so far (using Octo-Tiger).
@G-071 could you please fix the inspect errors reported? After that I would be able to go ahead and merge the PR.
@hkaiser Done! I had just forgotten about the noascii check (which does not like my name due to the ß and thus complains about me being in the author list of a file), and I was missing a few headers.
bors merge
6280: Add TBB to HPX documentation in Migration Guide r=hkaiser a=dimitraka

6284: Add CUDA/HIP MultiGPU Event Polling r=hkaiser a=G-071

While the [cuda,hip]_executors within HPX support MultiGPU scenarios themselves (by using a device index and `cudaSetDevice`), the event polling actually does not, forcing users to use the less performant callback version instead! To fix this, we need to make the CUDA/HIP event pool aware on which device the events it reuses are created, and add an index to always push/pop events on the correct device. This PR implements this by adding multiple event stacks (one per device) to the event pool singleton. Furthermore, the PR adds an additional device_id parameter to the appropriate get_future calls that use events. To keep things compatible with code that used previous HPX versions, 0 is used as a default where this parameter was added. For each device, 128 events are added in the beginning by default - in case one wants to avoid this overhead when only a single GPU is used, simply set the environment variable CUDA_VISIBLE_DEVICES=0.

Co-authored-by: dimitraka <kadimitra@ece.auth.gr>
Co-authored-by: Gregor Daiss <Gregor.Daiss+git@gmail.com>
Co-authored-by: Gregor Daiß <G-071@users.noreply.github.com>
This PR was included in a batch that successfully built, but then failed to merge into master. It will not be retried. Additional information: Response status code: 422
{"message":"Changes must be made through a pull request.","documentation_url":"https://docs.github.com/articles/about-protected-branches"} |
While the [cuda,hip]_executors within HPX support MultiGPU scenarios themselves (by using a device index and `cudaSetDevice`), the event polling does not, forcing users to fall back to the less performant callback version instead. To fix this, the CUDA/HIP event pool needs to be aware of the device on which each of the events it reuses was created, and an index is needed to always push/pop events on the correct device.

This PR implements this by adding multiple event stacks (one per device) to the event pool singleton. Furthermore, the PR adds a device_id parameter to the appropriate get_future calls that use events. To stay compatible with code written for previous HPX versions, 0 is used as the default wherever this parameter was added. For each device, 128 events are created up front by default; to avoid this overhead when only a single GPU is used, simply set the environment variable CUDA_VISIBLE_DEVICES=0.
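For illustration, a hedged sketch of the per-device pooling described above; the class and member names are illustrative, not the exact HPX implementation (event cleanup is omitted):

```cpp
#include <cuda_runtime.h>

#include <cstddef>
#include <stack>
#include <vector>

class cuda_event_pool_sketch
{
public:
    static constexpr std::size_t initial_events_per_device = 128;

    cuda_event_pool_sketch()
    {
        int device_count = 0;
        cudaGetDeviceCount(&device_count);
        free_events_.resize(static_cast<std::size_t>(device_count));

        // pre-create 128 events per device; an event must be created while
        // the matching device is active
        for (int dev = 0; dev != device_count; ++dev)
        {
            cudaSetDevice(dev);
            for (std::size_t i = 0; i != initial_events_per_device; ++i)
            {
                cudaEvent_t event;
                cudaEventCreateWithFlags(&event, cudaEventDisableTiming);
                free_events_[dev].push(event);
            }
        }
    }

    // pop an event that was created on the given device (device_id defaults
    // to 0 to stay compatible with single-GPU callers)
    bool pop(cudaEvent_t& event, std::size_t device_id = 0)
    {
        if (free_events_[device_id].empty())
            return false;
        event = free_events_[device_id].top();
        free_events_[device_id].pop();
        return true;
    }

    // return an event to the stack of the device it was created on
    void push(cudaEvent_t event, std::size_t device_id = 0)
    {
        free_events_[device_id].push(event);
    }

private:
    std::vector<std::stack<cudaEvent_t>> free_events_;
};
```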