
Add support for batched tasks and for CUDA-aware communications #4

Open
bosilca wants to merge 14 commits into master from topic/better_gpu_support

Conversation

@bosilca bosilca commented Jul 26, 2024

The idea for the batching is the following:

  • task incarnations (aka BODY) can be marked with the "batch" property, allowing the runtime to provide the task with the entire list of ready tasks of the execution stream instead of just extracting the head.
  • this list of ready tasks is in fact a ring that the kernel can trim and divide into the batch and the rest. The remaining tasks are left in the ring, while the batch group is submitted for execution.
  • the kernel also needs to provide a callback through the gpu_task complete_stage, so that the runtime can call the specialized function able to complete all batched tasks (a sketch follows right after this list).
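
To make that concrete, here is a rough sketch of what a batched BODY could look like. Every name below (`my_gpu_task_t`, `ring_next`, `ring_detach`, `launch_fused_kernel`, `complete_batched_tasks`, `MAX_BATCH`) is a hypothetical stand-in, not the actual PaRSEC interface:

```c
/* Hypothetical sketch of a batched BODY; names and signatures are
 * illustrative only. The runtime hands the BODY the whole ring of ready
 * tasks; the BODY picks a batch, detaches it from the ring, launches one
 * fused kernel, and registers a completion callback for the batch. */
#define MAX_BATCH 32

static int batched_body(my_gpu_device_t *dev,
                        my_gpu_task_t *first_task,
                        my_gpu_stream_t *stream)
{
    my_gpu_task_t *batch[MAX_BATCH];
    int nb = 0;

    /* Walk the ring of ready tasks and pull out what fits in one batch. */
    my_gpu_task_t *t = first_task;
    do {
        batch[nb++] = t;
        t = ring_next(t);
    } while( t != first_task && nb < MAX_BATCH );

    /* Detach the selected tasks; whatever remains stays in the ring and
     * will be scheduled normally later. */
    ring_detach(&first_task, batch, nb);

    /* One kernel launch covering all tasks of the batch. */
    launch_fused_kernel(dev, stream, batch, nb);

    /* Tell the runtime how to complete every task of the batch. */
    batch[0]->complete_stage = complete_batched_tasks;
    return 0;
}
```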

The idea for the CUDA-aware communications is to use the task class data_affinity capability to locate the affinity of a task, and then allocate all incoming data for that task on the preferred location of that data. Thus, in order to take advantage of this capability, it is critical to hint the runtime about where to place data (parsec_advise_data_on_device). Note that only the first successor is checked for the affinity.
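
For reference, hinting the preferred location of a tile looks roughly like the snippet below. I am quoting `parsec_advise_data_on_device` and the `PARSEC_DEV_DATA_ADVICE_PREFERRED_DEVICE` advice from memory, so double-check the exact prototype and constant in the PaRSEC device headers:

```c
#include "parsec.h"
#include "parsec/mca/device/device.h"

/* Sketch: hint that one tile should preferably live on a given GPU, so the
 * communication thread can allocate the incoming copy directly there.
 * Prototype and advice constant quoted from memory. */
static void hint_tile_location(parsec_data_collection_t *dc,
                               int m, int n, int gpu_dev_index)
{
    parsec_data_t *tile = dc->data_of(dc, m, n);
    parsec_advise_data_on_device(tile, gpu_dev_index,
                                 PARSEC_DEV_DATA_ADVICE_PREFERRED_DEVICE);
}
```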

The communication thread allocates data on a GPU by bypassing the GPU allocator and calling directly into zone_malloc (which is now protected against concurrent access). Once tiles are allocated, they are incorporated into the GPU LRU lists by the first call involving them, so the GPU becomes aware of this data rather quickly. The issue is in the other direction: as PaRSEC has a tendency to keep temporary data in the cache, it will never refill the zone_malloc, leading to starvation of device memory for the communication thread. This is, I think, the root cause of the sporadic deadlocks I encounter.
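
As an illustration of the concurrency fix mentioned above (the real change lives in PaRSEC's zone allocator; the types and helpers below are hypothetical stand-ins), protecting the zone against simultaneous use by the communication thread and the GPU manager essentially means serializing the allocate/free paths:

```c
#include <pthread.h>
#include <stddef.h>

/* Illustration only: hypothetical zone type and internal helpers. */
typedef struct my_zone_s {
    pthread_mutex_t lock;
    /* ... existing zone bookkeeping (segments, free list, ...) ... */
} my_zone_t;

void *zone_carve_segment(my_zone_t *zone, size_t size);  /* hypothetical */
void  zone_release_segment(my_zone_t *zone, void *ptr);  /* hypothetical */

/* Serialize allocations so the communication thread and the GPU manager
 * thread can both carve device memory from the same zone. */
static void *my_zone_malloc(my_zone_t *zone, size_t size)
{
    pthread_mutex_lock(&zone->lock);
    void *ptr = zone_carve_segment(zone, size);
    pthread_mutex_unlock(&zone->lock);
    return ptr;
}

static void my_zone_free(my_zone_t *zone, void *ptr)
{
    pthread_mutex_lock(&zone->lock);
    zone_release_segment(zone, ptr);
    pthread_mutex_unlock(&zone->lock);
}
```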

I have started to implement a different approach, where GPU memory can be extracted from the LRUs by the communication thread. Part of the code is in this PR, but I did not have the time to integrate it nicely (a sketch of the idea follows).
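
The rough shape of that alternative, with made-up names and without the bookkeeping the real code would need, would be for the communication thread to reclaim clean, unused copies from the device LRU whenever the zone allocator runs dry:

```c
/* Hypothetical sketch of the fallback path: when the zone allocator is
 * exhausted, evict clean and unused copies from the device LRU and retry
 * the allocation. All names are illustrative. */
static void *alloc_on_device(my_device_t *dev, size_t size)
{
    void *ptr = my_zone_malloc(dev->zone, size);
    while( NULL == ptr ) {
        my_data_copy_t *victim = lru_pop_clean_unused(dev->lru);
        if( NULL == victim ) return NULL;     /* nothing left to reclaim */
        my_zone_free(dev->zone, victim->device_ptr);
        detach_copy_from_data(victim);        /* forget this copy on the device */
        ptr = my_zone_malloc(dev->zone, size);
    }
    return ptr;
}
```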

Anyway, last but not least, a word about performance. On a small setup (4 A100 GPUs, IB but no NVLink) I get:
no CUDA-aware: [****] TIME(s) 7.53502 : dpotrf PxQxg= 2 2 1 NB= 1024 N= 65536 : 12452.141424 gflops - ENQ&PROG&DEST 8.28404 : 11326.255177 gflops - ENQ 0.49800 - DEST 0.25101
CUDA-aware: [****] TIME(s) 3.80899 : dpotrf PxQxg= 2 2 1 NB= 1024 N= 65536 : 24633.106084 gflops - ENQ&PROG&DEST 4.56897 : 20535.733216 gflops - ENQ 0.51097 - DEST 0.24902
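
In other words, CUDA-aware communications roughly halve the time to solution on this configuration: 7.535 s down to 3.809 s, i.e. about a 1.98x speedup (12452 vs 24633 gflops).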

The idea is the following:
- task incarnations (aka BODY) can be marked with the "batch" property,
  allowing the runtime to provide the task with the entire list of ready
  tasks of the execution stream instead of just extracting the head.
- this list of ready tasks is in fact a ring that the kernel can trim
  and divide into the batch and the rest. The remaining tasks are left
  in the ring, while the batch group is submitted for execution.
- the kernel also needs to provide a callback through the gpu_task
  complete_stage, so that the runtime can call the specialized
  function able to complete all batched tasks.

Signed-off-by: George Bosilca <gbosilca@nvidia.com>
@bosilca bosilca self-assigned this Jul 26, 2024
bosilca added 5 commits July 26, 2024 00:48
Signed-off-by: George Bosilca <gbosilca@nvidia.com>
Signed-off-by: George Bosilca <gbosilca@nvidia.com>
The issue was that I forgot to clear the complete_stage after the
callback, so it got called multiple times during the different
completion stages of the task (completion of the execution and then,
later, completion of the d2h transfers).

Signed-off-by: George Bosilca <gbosilca@nvidia.com>
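
For context, the fix described above boils down to the usual one-shot callback pattern; the types and field names below are placeholders rather than the exact PaRSEC ones:

```c
/* Invoke the batch completion hook once, then clear it so the later
 * completion stages (e.g. after the d2h transfers) cannot call it again. */
static void run_complete_stage_once(my_gpu_device_t *dev,
                                    my_gpu_task_t **task,
                                    my_gpu_stream_t *stream)
{
    if( NULL != (*task)->complete_stage ) {
        (*task)->complete_stage(dev, task, stream);
        (*task)->complete_stage = NULL;   /* the fix: no re-invocation */
    }
}
```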
The count is first, then the sizeof.

Signed-off-by: George Bosilca <gbosilca@nvidia.com>
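
The diff is not shown here, but assuming the call in question is calloc-like, the convention being restored is the standard one:

```c
#include <stdlib.h>

/* calloc takes the element count first, then the element size. */
static double *alloc_zeroed(size_t count)
{
    return calloc(count, sizeof(double));   /* count, then sizeof        */
    /* not: calloc(sizeof(double), count)      the swapped argument order */
}
```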
Signed-off-by: George Bosilca <gbosilca@nvidia.com>
@bosilca bosilca force-pushed the topic/better_gpu_support branch from 194c274 to 88bf42e on July 28, 2024 20:20
bosilca added 2 commits July 29, 2024 09:35
Signed-off-by: George Bosilca <gbosilca@nvidia.com>
Signed-off-by: George Bosilca <gbosilca@nvidia.com>
@bosilca bosilca force-pushed the topic/better_gpu_support branch 2 times, most recently from 2eb2f04 to aedff44 on August 1, 2024 22:37
This allows checking whether the data can be sent and received directly
to and from GPU buffers.

Signed-off-by: George Bosilca <gbosilca@nvidia.com>
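
This is not necessarily how the commit performs the check, but for Open MPI-based builds the standard way to probe GPU-buffer (CUDA-aware) support at runtime is the MPI extension query:

```c
#include <mpi.h>
#if defined(OPEN_MPI)
#include <mpi-ext.h>   /* MPIX_CUDA_AWARE_SUPPORT, MPIX_Query_cuda_support() */
#endif

/* Returns 1 if the MPI library reports that it can send/receive directly
 * from GPU buffers, 0 otherwise (or if it cannot tell). */
static int mpi_is_cuda_aware(void)
{
#if defined(MPIX_CUDA_AWARE_SUPPORT) && MPIX_CUDA_AWARE_SUPPORT
    return MPIX_Query_cuda_support();
#else
    return 0;
#endif
}
```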
@bosilca bosilca force-pushed the topic/better_gpu_support branch from aedff44 to 022e468 on August 2, 2024 06:26
Signed-off-by: George Bosilca <gbosilca@nvidia.com>
@bosilca bosilca force-pushed the topic/better_gpu_support branch 2 times, most recently from 559e0d5 to 8975e0c on August 4, 2024 21:25
Signed-off-by: George Bosilca <gbosilca@nvidia.com>
@bosilca bosilca force-pushed the topic/better_gpu_support branch 4 times, most recently from 2c499e6 to 7c2c1a3 on August 6, 2024 04:03
@bosilca bosilca changed the title from "Add support for batched tasks." to "Add support for batched tasks and for CUDA-aware communications" on Aug 8, 2024
@@ -672,6 +672,17 @@ static char* dump_local_assignments( void** elem, void* arg )
if( dos > 0 ) {
string_arena_init(info->sa);
string_arena_add_string(info->sa, "const int %s = %s%s.value;", def->name, info->holder, def->name);
#if 0
@bosilca bosilca (Owner, Author) commented:
This is a leftover from another patch allowing typed locals. I will bring it in at a later date; in the meantime, I need to remove this code.

bosilca added 3 commits August 8, 2024 15:25
This is a multi-part patch that allows the CPU to prepare a data copy
mapped onto a device.

1. The first question is: how is such a device selected?

The allocation of such a copy happens well before the scheduler is
invoked for the task, in fact before the task is even ready. Thus, we
need to decide on the location of this copy based only on static
information, such as the task affinity. Therefore, this approach only
works for owner-compute types of tasks, where the task will be executed
on the device that owns the data used for the task affinity.

2. Pass the correct data copy across the entire system, instead of
   falling back to the data copy on device 0 (CPU memory).

Signed-off-by: George Bosilca <gbosilca@nvidia.com>
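
Conceptually, the selection described in point 1 above boils down to the sketch below; the names are hypothetical, since the actual change is spread across the runtime:

```c
/* Hypothetical sketch of the owner-compute device selection: look up the
 * data the task has affinity to (only the first successor is checked),
 * find which device owns it, and allocate the incoming copy on that
 * device; otherwise fall back to device 0 (CPU memory). */
static int preferred_device_for(const my_task_t *task)
{
    my_data_t *aff = task_affinity_data(task);   /* hypothetical lookup */
    if( NULL == aff ) return 0;                  /* device 0 = CPU memory */
    return aff->owner_device;
}
```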
Name the data_t allocated for temporaries, allowing developers to track
them through the execution. Add the keys to all outputs (tasks and
copies).

Signed-off-by: George Bosilca <gbosilca@nvidia.com>
Signed-off-by: George Bosilca <gbosilca@nvidia.com>
@bosilca bosilca force-pushed the topic/better_gpu_support branch from 7f6bdd5 to b998764 Compare August 8, 2024 22:26
@bosilca bosilca commented Sep 10, 2024
