
Add CUDA/HIP MultiGPU Event Polling #6284

Merged: 27 commits, Jul 8, 2023

Conversation

@G-071 (Member) commented Jun 19, 2023

While the [cuda,hip]_executors within HPX support MultiGPU scenarios themselves (by using a device index and cudaSetDevice), the event polling does not, forcing users to fall back to the less performant callback version instead! To fix this, we need to make the CUDA/HIP event pool aware of which device the events it reuses were created on, and add a device index so events are always pushed/popped on the correct device.

This PR implements this by adding multiple event stacks (one per device) to the event pool singleton. Furthermore, the PR adds a device_id parameter to the get_future calls that use events. To stay compatible with code written for previous HPX versions, 0 is used as the default wherever this parameter was added. By default, 128 events are created per device during initialization; to avoid this overhead when only a single GPU is used, simply set the environment variable CUDA_VISIBLE_DEVICES=0.
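
To make the intended usage concrete, here is a minimal sketch (not part of the PR) that creates one event-polling executor per visible device, launches a small asynchronous operation on each, and waits for the resulting futures. The header paths, the enable_user_polling helper, and the cuda_executor constructor arguments are assumptions based on the hpx::cuda::experimental API and may differ between HPX versions:

```cpp
// Sketch only: one event-polling cuda_executor per device.
#include <hpx/hpx_main.hpp>
#include <hpx/async_cuda/cuda_executor.hpp>
#include <hpx/async_cuda/cuda_polling_helper.hpp>

#include <cuda_runtime.h>

#include <cstddef>
#include <vector>

int main()
{
    int device_count = 0;
    cudaGetDeviceCount(&device_count);

    // Register CUDA event polling on the default thread pool
    // (scoped helper; polling is disabled again on destruction).
    hpx::cuda::experimental::enable_user_polling poll("default");

    std::vector<hpx::future<void>> futures;
    std::vector<int*> buffers(device_count, nullptr);

    for (int device = 0; device < device_count; ++device)
    {
        cudaSetDevice(device);
        cudaMalloc(&buffers[device], sizeof(int));

        // Second argument selects event-based (true) vs. callback-based
        // (false) future completion (assumed constructor signature).
        hpx::cuda::experimental::cuda_executor exec(device, true);

        // The executor appends its stream to the argument list; the
        // returned future becomes ready once event polling observes the
        // operation as complete.
        futures.push_back(hpx::async(
            exec, cudaMemsetAsync, buffers[device], 0, sizeof(int)));
    }

    hpx::wait_all(futures);

    for (int device = 0; device < device_count; ++device)
    {
        cudaSetDevice(device);
        cudaFree(buffers[device]);
    }
    return 0;
}
```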

@hkaiser (Member) commented Jun 21, 2023

@G-071 should we add a test that actually exercises this functionality, perhaps run only on the DGX partition (e.g. enabled with a special CMake variable)?

hkaiser modified the milestones: 1.10.0, 1.9.1 (Jun 21, 2023)
Comment on lines 26 to 30
static cuda_event_pool& get_event_pool()
{
static cuda_event_pool event_pool_;
return event_pool_;
}
@hkaiser (Member) commented:

May I suggest that you move the (HPX_CORE_EXPORTed) implementation of this function into a source file? The rationale is that otherwise, on some platforms (Mac, Windows), each executable module (shared library, executable) that contains code calling this function will have its own instance of the event_pool_ variable.
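
For illustration, a minimal sketch of the suggested change, using hypothetical file names (cuda_event.hpp/cuda_event.cpp): the accessor is only declared in the header and defined in exactly one translation unit, so every module links against the same event_pool_ instance:

```cpp
// cuda_event.hpp (illustrative): declaration only, exported from the core
// library so all modules resolve to the same definition.
// HPX_CORE_EXPORT is HPX's symbol-export macro, shown here as-is.
struct cuda_event_pool
{
    HPX_CORE_EXPORT static cuda_event_pool& get_event_pool();
    // ... event push/pop interface ...
};

// cuda_event.cpp (illustrative): the single definition; the function-local
// static now lives in one translation unit instead of being duplicated
// per executable module.
cuda_event_pool& cuda_event_pool::get_event_pool()
{
    static cuda_event_pool event_pool_;
    return event_pool_;
}
```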

@G-071 (Member, Author) commented Jun 29, 2023

I have added a few changes/additions/fixes:

  • The definition of the method get_event_pool is now in cuda_event.cpp. @hkaiser : Is this what you had in mind?
  • I have added a test (cuda_multi_device_polling.cpp) which tests executors on different devices. This test queries the number of devices, creates an executor per device and runs a dummy kernel on each executor. If the multi-device polling does not work, the futures for those dummy kernel calls never become ready and the test times out. As it adapts to the number of devices present, it will test with all 8 GPUs on the DGX node. The test also verifies that detail::get_future_with_event uses the currently active device (previously set in the user code via cudaSetDevice) by default if no other device is given.
  • The thread that initializes the event pool now restores its original device ID afterwards (see the sketch after this list). Without this, the initializing thread ended up on a different device ID than the other threads (if multiple devices are present), causing problems in user code that only used one GPU. Originally this was only done for the initial creation of the event pools for all devices: events added on demand to a device pool later used the device ID passed by the user (either via detail::get_future_with_event directly or via the cuda executor). In other words, if we get a non-zero device ID, the user code already handles setting the device, and we simply use the passed device ID as-is (saving a bit of overhead by avoiding calls to cudaGetDevice and cudaSetDevice). Edit: On second thought, with 46b477a the save/restore is now also done when adding events on demand in pop. This makes it harder to use incorrectly, and the overhead seems negligible after all.
  • Format fixes.
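
As a rough illustration of the save/restore behavior described in the event-pool bullet above, here is a standalone sketch (not the PR's actual code; the per-device event count is taken from the description, the function and variable names are made up) of how the initializing thread can create per-device event stacks without changing the caller's active device:

```cpp
// Sketch: create event stacks for all devices, then restore the device
// that was active before initialization so user code is unaffected.
#include <cuda_runtime.h>

#include <stack>
#include <vector>

void init_event_pools(std::vector<std::stack<cudaEvent_t>>& pools,
    int events_per_device = 128)
{
    int original_device = 0;
    cudaGetDevice(&original_device);    // remember the caller's device

    int device_count = 0;
    cudaGetDeviceCount(&device_count);
    pools.resize(device_count);

    for (int device = 0; device < device_count; ++device)
    {
        cudaSetDevice(device);          // events are created on this device
        for (int i = 0; i < events_per_device; ++i)
        {
            cudaEvent_t event;
            cudaEventCreateWithFlags(&event, cudaEventDisableTiming);
            pools[device].push(event);
        }
    }

    cudaSetDevice(original_device);     // restore the original device ID
}
```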

Overall, this seems to work fine on the NVIDIA nodes I have tested it on so far (using Octo-Tiger).

@hkaiser (Member) commented Jul 6, 2023

@G-071 could you please fix the inspect errors reported? After that I would be able to go ahead and merge the PR.

@G-071 (Member, Author) commented Jul 6, 2023

@hkaiser Done! I had just forgotten about the noascii check (which does not like my name due to the ß and thus complains about me being in the author list of a file) and was missing a few headers.

@hkaiser (Member) commented Jul 8, 2023

bors merge

bors bot pushed a commit that referenced this pull request Jul 8, 2023
6280: Add TBB to HPX documentation in Migration Guide r=hkaiser a=dimitraka



6284: Add CUDA/HIP MultiGPU Event Polling r=hkaiser a=G-071

While the [cuda,hip]_executors within HPX support MultiGPU scenarios themselves (by using a device index and ```cudaSetDevice```), the event polling does not, forcing users to fall back to the less performant callback version instead! To fix this, we need to make the CUDA/HIP event pool aware of which device the events it reuses were created on, and add a device index so events are always pushed/popped on the correct device.

This PR implements this by adding multiple event stacks (one per device) to the event pool singleton. Furthermore, the PR adds a device_id parameter to the get_future calls that use events. To stay compatible with code written for previous HPX versions, 0 is used as the default wherever this parameter was added. By default, 128 events are created per device during initialization; to avoid this overhead when only a single GPU is used, simply set the environment variable CUDA_VISIBLE_DEVICES=0.

Co-authored-by: dimitraka <kadimitra@ece.auth.gr>
Co-authored-by: Gregor Daiss <Gregor.Daiss+git@gmail.com>
Co-authored-by: Gregor Daiß <G-071@users.noreply.github.com>
@bors (bot) commented Jul 8, 2023

This PR was included in a batch that successfully built, but then failed to merge into master. It will not be retried.

Additional information:

Response status code: 422
{"message":"Changes must be made through a pull request.","documentation_url":"https://docs.github.com/articles/about-protected-branches"}

hkaiser merged commit 75faae5 into STEllAR-GROUP:master on Jul 8, 2023