
Add CUDA/HIP MultiGPU Event Polling #6284

Merged: 27 commits, Jul 8, 2023

Conversation

@G-071 (Member) commented Jun 19, 2023

While the [cuda,hip]_executors within HPX support MultiGPU scenarios themselves (by using a device index and cudaSetDevice), the event polling does not, forcing users to fall back to the less performant callback version instead! To fix this, we need to make the CUDA/HIP event pool aware of which device the events it reuses were created on, and add a device index so events are always pushed/popped on the correct device.

This PR implements this by adding multiple event stacks (one per device) to the event pool singleton. Furthermore, the PR adds a device_id parameter to the get_future calls that use events. To stay compatible with code written for previous HPX versions, 0 is used as the default wherever this parameter was added. By default, 128 events are created per device during initialization; to avoid this overhead when only a single GPU is used, simply set the environment variable CUDA_VISIBLE_DEVICES=0.
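
To make the intended usage concrete, here is a minimal sketch (not part of the PR) that creates one event-polling executor per visible device, launches a small asynchronous operation on each, and waits for the resulting futures. The header paths, the enable_user_polling helper, and the cuda_executor constructor arguments are assumptions based on the hpx::cuda::experimental API and may differ between HPX versions:

```cpp
// Sketch only: one event-polling cuda_executor per device.
#include <hpx/hpx_main.hpp>
#include <hpx/async_cuda/cuda_executor.hpp>
#include <hpx/async_cuda/cuda_polling_helper.hpp>

#include <cuda_runtime.h>

#include <cstddef>
#include <vector>

int main()
{
    int device_count = 0;
    cudaGetDeviceCount(&device_count);

    // Register CUDA event polling on the default thread pool
    // (scoped helper; polling is disabled again on destruction).
    hpx::cuda::experimental::enable_user_polling poll("default");

    std::vector<hpx::future<void>> futures;
    std::vector<int*> buffers(device_count, nullptr);

    for (int device = 0; device < device_count; ++device)
    {
        cudaSetDevice(device);
        cudaMalloc(&buffers[device], sizeof(int));

        // Second argument selects event-based (true) vs. callback-based
        // (false) future completion (assumed constructor signature).
        hpx::cuda::experimental::cuda_executor exec(device, true);

        // The executor appends its stream to the argument list; the
        // returned future becomes ready once event polling observes the
        // operation as complete.
        futures.push_back(hpx::async(
            exec, cudaMemsetAsync, buffers[device], 0, sizeof(int)));
    }

    hpx::wait_all(futures);

    for (int device = 0; device < device_count; ++device)
    {
        cudaSetDevice(device);
        cudaFree(buffers[device]);
    }
    return 0;
}
```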

@hkaiser (Member) commented Jun 21, 2023

@G-071 should we add a test that actually exercises this functionality, perhaps run only on the DGX partition (e.g. enabled with a special CMake variable)?

hkaiser modified the milestones: 1.10.0, 1.9.1 (Jun 21, 2023)
Comment on lines 26 to 30
static cuda_event_pool& get_event_pool()
{
static cuda_event_pool event_pool_;
return event_pool_;
}
@hkaiser (Member) commented:

May I suggest that you move the (HPX_CORE_EXPORTed) implementation of this function into a source file? The rationale is that otherwise, on some platforms (Mac, Windows), each executable module (shared library, executable) that contains code calling this function will have its own instance of the event_pool_ variable.
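
For illustration, a minimal sketch of the suggested change, using hypothetical file names (cuda_event.hpp/cuda_event.cpp): the accessor is only declared in the header and defined in exactly one translation unit, so every module links against the same event_pool_ instance:

```cpp
// cuda_event.hpp (illustrative): declaration only, exported from the core
// library so all modules resolve to the same definition.
// HPX_CORE_EXPORT is HPX's symbol-export macro, shown here as-is.
struct cuda_event_pool
{
    HPX_CORE_EXPORT static cuda_event_pool& get_event_pool();
    // ... event push/pop interface ...
};

// cuda_event.cpp (illustrative): the single definition; the function-local
// static now lives in one translation unit instead of being duplicated
// per executable module.
cuda_event_pool& cuda_event_pool::get_event_pool()
{
    static cuda_event_pool event_pool_;
    return event_pool_;
}
```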

@G-071 (Member, Author) commented Jun 29, 2023

I have added a few changes/additions/fixes:

  • The definition of the method get_event_pool is now in cuda_event.cpp. @hkaiser : Is this what you had in mind?
  • I have added a test (cuda_multi_device_polling.cpp) which tests executors on different devices. This test queries the number of devices, creates an executor per device and runs a dummy kernel on each executor. If the multi-device polling does not work, the futures for those dummy kernel calls never become ready and the test times out. As it adapts to the number of devices present, it will test with all 8 GPUs on the DGX node. The test also verifies that detail::get_future_with_event uses the currently active device (previously set in the user code via cudaSetDevice) by default if no other device is given.
  • The thread that initializes the event pool now restores its original device ID afterwards (see the sketch after this list). Without this, the initializing thread ended up on a different device ID than the other threads (if multiple devices are present), causing problems in user code that only used one GPU. Originally this was only done for the initial creation of the event pools for all devices: events added on demand to a device pool later used the device ID passed by the user (either via detail::get_future_with_event directly or via the cuda executor). In other words, if we get a non-zero device ID, the user code already handles setting the device, and we simply use the passed device ID as-is (saving a bit of overhead by avoiding calls to cudaGetDevice and cudaSetDevice). Edit: On second thought, with 46b477a the save/restore is now also done when adding events on demand in pop. This makes it harder to use incorrectly, and the overhead seems negligible after all.
  • Format fixes.
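
As a rough illustration of the save/restore behavior described in the event-pool bullet above, here is a standalone sketch (not the PR's actual code; the per-device event count is taken from the description, the function and variable names are made up) of how the initializing thread can create per-device event stacks without changing the caller's active device:

```cpp
// Sketch: create event stacks for all devices, then restore the device
// that was active before initialization so user code is unaffected.
#include <cuda_runtime.h>

#include <stack>
#include <vector>

void init_event_pools(std::vector<std::stack<cudaEvent_t>>& pools,
    int events_per_device = 128)
{
    int original_device = 0;
    cudaGetDevice(&original_device);    // remember the caller's device

    int device_count = 0;
    cudaGetDeviceCount(&device_count);
    pools.resize(device_count);

    for (int device = 0; device < device_count; ++device)
    {
        cudaSetDevice(device);          // events are created on this device
        for (int i = 0; i < events_per_device; ++i)
        {
            cudaEvent_t event;
            cudaEventCreateWithFlags(&event, cudaEventDisableTiming);
            pools[device].push(event);
        }
    }

    cudaSetDevice(original_device);     // restore the original device ID
}
```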

Overall, this seems to work fine on the NVIDIA nodes I have tested it on so far (using Octo-Tiger).

@hkaiser (Member) commented Jul 6, 2023

@G-071 could you please fix the inspect errors reported? After that I would be able to go ahead and merge the PR.

@G-071 (Member, Author) commented Jul 6, 2023

@hkaiser Done! I had just forgotten about the noascii check (which does not like my name due to the ß and thus complains about me being in the author list of a file) and was missing a few headers.

@hkaiser (Member) commented Jul 8, 2023

bors merge

bors bot pushed a commit that referenced this pull request Jul 8, 2023
6280: Add TBB to HPX documentation in Migration Guide r=hkaiser a=dimitraka



6284: Add CUDA/HIP MultiGPU Event Polling r=hkaiser a=G-071

While the [cuda,hip]_executors within HPX support MultiGPU scenarios themselves (by using a device index and ```cudaSetDevice```), the event polling does not, forcing users to fall back to the less performant callback version instead! To fix this, we need to make the CUDA/HIP event pool aware of which device the events it reuses were created on, and add a device index so events are always pushed/popped on the correct device.

This PR implements this by adding multiple event stacks (one per device) to the event pool singleton. Furthermore, the PR adds a device_id parameter to the get_future calls that use events. To stay compatible with code written for previous HPX versions, 0 is used as the default wherever this parameter was added. By default, 128 events are created per device during initialization; to avoid this overhead when only a single GPU is used, simply set the environment variable CUDA_VISIBLE_DEVICES=0.

Co-authored-by: dimitraka <kadimitra@ece.auth.gr>
Co-authored-by: Gregor Daiss <Gregor.Daiss+git@gmail.com>
Co-authored-by: Gregor Daiß <G-071@users.noreply.github.com>
@bors (bot) commented Jul 8, 2023

This PR was included in a batch that successfully built, but then failed to merge into master. It will not be retried.

Additional information:

Response status code: 422
{"message":"Changes must be made through a pull request.","documentation_url":"https://docs.github.com/articles/about-protected-branches"}

hkaiser merged commit 75faae5 into STEllAR-GROUP:master on Jul 8, 2023