
Add support for batched tasks and for CUDA-aware communications #4

Open
bosilca wants to merge 14 commits into master from topic/better_gpu_support

Conversation

@bosilca bosilca commented Jul 26, 2024

The idea for the batching is the following:

  • task incarnations (aka BODY) can be marked with the "batch" property, allowing the runtime to provide the task with the entire list of ready tasks of the execution stream instead of just extracting the head.
  • this list of ready tasks is in fact a ring that the kernel can trim and divide into the batch and the rest. The remaining tasks are left in the ring, while the batch group is submitted for execution.
  • the kernel also needs to provide a callback through the gpu_task complete_stage, so that the runtime can call the specialized function able to complete all batched tasks (a sketch follows right after this list).
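
To make that concrete, here is a rough sketch of what a batched BODY could look like. Every name below (`my_gpu_task_t`, `ring_next`, `ring_detach`, `launch_fused_kernel`, `complete_batched_tasks`, `MAX_BATCH`) is a hypothetical stand-in, not the actual PaRSEC interface:

```c
/* Hypothetical sketch of a batched BODY; names and signatures are
 * illustrative only. The runtime hands the BODY the whole ring of ready
 * tasks; the BODY picks a batch, detaches it from the ring, launches one
 * fused kernel, and registers a completion callback for the batch. */
#define MAX_BATCH 32

static int batched_body(my_gpu_device_t *dev,
                        my_gpu_task_t *first_task,
                        my_gpu_stream_t *stream)
{
    my_gpu_task_t *batch[MAX_BATCH];
    int nb = 0;

    /* Walk the ring of ready tasks and pull out what fits in one batch. */
    my_gpu_task_t *t = first_task;
    do {
        batch[nb++] = t;
        t = ring_next(t);
    } while( t != first_task && nb < MAX_BATCH );

    /* Detach the selected tasks; whatever remains stays in the ring and
     * will be scheduled normally later. */
    ring_detach(&first_task, batch, nb);

    /* One kernel launch covering all tasks of the batch. */
    launch_fused_kernel(dev, stream, batch, nb);

    /* Tell the runtime how to complete every task of the batch. */
    batch[0]->complete_stage = complete_batched_tasks;
    return 0;
}
```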

The idea for the CUDA-aware communications is to use the task class data_affinity capability to locate the affinity of a task, and then allocate all incoming data for that task on the preferred location of that data. Thus, in order to take advantage of this capability, it is critical to hint the runtime about where to place data (parsec_advise_data_on_device). Note that only the first successor is checked for the affinity.
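
For reference, hinting the preferred location of a tile looks roughly like the snippet below. I am quoting `parsec_advise_data_on_device` and the `PARSEC_DEV_DATA_ADVICE_PREFERRED_DEVICE` advice from memory, so double-check the exact prototype and constant in the PaRSEC device headers:

```c
#include "parsec.h"
#include "parsec/mca/device/device.h"

/* Sketch: hint that one tile should preferably live on a given GPU, so the
 * communication thread can allocate the incoming copy directly there.
 * Prototype and advice constant quoted from memory. */
static void hint_tile_location(parsec_data_collection_t *dc,
                               int m, int n, int gpu_dev_index)
{
    parsec_data_t *tile = dc->data_of(dc, m, n);
    parsec_advise_data_on_device(tile, gpu_dev_index,
                                 PARSEC_DEV_DATA_ADVICE_PREFERRED_DEVICE);
}
```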

The communication thread allocates data on a GPU by bypassing the GPU allocator and calling directly into zone_malloc (which is now protected against concurrent access). Once tiles are allocated, they are incorporated into the GPU LRU lists by the first call involving them, so the GPU becomes aware of this data rather quickly. The issue is in the other direction: as PaRSEC has a tendency to keep temporary data in the cache, it will never refill the zone_malloc, leading to starvation of device memory for the communication thread. This is, I think, the root cause of the sporadic deadlocks I encounter.
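
As an illustration of the concurrency fix mentioned above (the real change lives in PaRSEC's zone allocator; the types and helpers below are hypothetical stand-ins), protecting the zone against simultaneous use by the communication thread and the GPU manager essentially means serializing the allocate/free paths:

```c
#include <pthread.h>
#include <stddef.h>

/* Illustration only: hypothetical zone type and internal helpers. */
typedef struct my_zone_s {
    pthread_mutex_t lock;
    /* ... existing zone bookkeeping (segments, free list, ...) ... */
} my_zone_t;

void *zone_carve_segment(my_zone_t *zone, size_t size);  /* hypothetical */
void  zone_release_segment(my_zone_t *zone, void *ptr);  /* hypothetical */

/* Serialize allocations so the communication thread and the GPU manager
 * thread can both carve device memory from the same zone. */
static void *my_zone_malloc(my_zone_t *zone, size_t size)
{
    pthread_mutex_lock(&zone->lock);
    void *ptr = zone_carve_segment(zone, size);
    pthread_mutex_unlock(&zone->lock);
    return ptr;
}

static void my_zone_free(my_zone_t *zone, void *ptr)
{
    pthread_mutex_lock(&zone->lock);
    zone_release_segment(zone, ptr);
    pthread_mutex_unlock(&zone->lock);
}
```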

I have started to implement a different approach, where GPU memory can be extracted from the LRUs by the communication thread. Part of the code is in this PR, but I did not have the time to integrate it nicely (a sketch of the idea follows).
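
The rough shape of that alternative, with made-up names and without the bookkeeping the real code would need, would be for the communication thread to reclaim clean, unused copies from the device LRU whenever the zone allocator runs dry:

```c
/* Hypothetical sketch of the fallback path: when the zone allocator is
 * exhausted, evict clean and unused copies from the device LRU and retry
 * the allocation. All names are illustrative. */
static void *alloc_on_device(my_device_t *dev, size_t size)
{
    void *ptr = my_zone_malloc(dev->zone, size);
    while( NULL == ptr ) {
        my_data_copy_t *victim = lru_pop_clean_unused(dev->lru);
        if( NULL == victim ) return NULL;     /* nothing left to reclaim */
        my_zone_free(dev->zone, victim->device_ptr);
        detach_copy_from_data(victim);        /* forget this copy on the device */
        ptr = my_zone_malloc(dev->zone, size);
    }
    return ptr;
}
```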

Anyway, last but not least, a word about performance. On a small setup (4 A100 GPUs, IB but no NVLink) I get:
no CUDA-aware: [****] TIME(s) 7.53502 : dpotrf PxQxg= 2 2 1 NB= 1024 N= 65536 : 12452.141424 gflops - ENQ&PROG&DEST 8.28404 : 11326.255177 gflops - ENQ 0.49800 - DEST 0.25101
CUDA-aware: [****] TIME(s) 3.80899 : dpotrf PxQxg= 2 2 1 NB= 1024 N= 65536 : 24633.106084 gflops - ENQ&PROG&DEST 4.56897 : 20535.733216 gflops - ENQ 0.51097 - DEST 0.24902
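
In other words, CUDA-aware communications roughly halve the time to solution on this configuration: 7.535 s down to 3.809 s, i.e. about a 1.98x speedup (12452 vs 24633 gflops).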

The idea is the following:
- task incarnations (aka BODY) can be marked with the "batch" property,
  allowing the runtime to provide the task with the entire list of ready
  tasks of the execution stream instead of just extracting the head.
- this list of ready tasks is in fact a ring that the kernel can trim
  and divide into the batch and the rest. The remaining tasks are left
  in the ring, while the batch group is submitted for execution.
- the kernel also needs to provide a callback through the gpu_task
  complete_stage, so that the runtime can call the specialized
  function able to complete all batched tasks.

Signed-off-by: George Bosilca <gbosilca@nvidia.com>
@bosilca bosilca self-assigned this Jul 26, 2024
bosilca added 5 commits July 26, 2024 00:48
Signed-off-by: George Bosilca <gbosilca@nvidia.com>
Signed-off-by: George Bosilca <gbosilca@nvidia.com>
The issue was that I forgot to clear the complete_stage after the
callback, so it got called multiple times during the different
completion stages of the task (completion of the execution and then,
later, completion of the d2h transfers).

Signed-off-by: George Bosilca <gbosilca@nvidia.com>
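
For context, the fix described above boils down to the usual one-shot callback pattern; the types and field names below are placeholders rather than the exact PaRSEC ones:

```c
/* Invoke the batch completion hook once, then clear it so the later
 * completion stages (e.g. after the d2h transfers) cannot call it again. */
static void run_complete_stage_once(my_gpu_device_t *dev,
                                    my_gpu_task_t **task,
                                    my_gpu_stream_t *stream)
{
    if( NULL != (*task)->complete_stage ) {
        (*task)->complete_stage(dev, task, stream);
        (*task)->complete_stage = NULL;   /* the fix: no re-invocation */
    }
}
```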
The count is first, then the sizeof.

Signed-off-by: George Bosilca <gbosilca@nvidia.com>
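
The diff is not shown here, but assuming the call in question is calloc-like, the convention being restored is the standard one:

```c
#include <stdlib.h>

/* calloc takes the element count first, then the element size. */
static double *alloc_zeroed(size_t count)
{
    return calloc(count, sizeof(double));   /* count, then sizeof        */
    /* not: calloc(sizeof(double), count)      the swapped argument order */
}
```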
Signed-off-by: George Bosilca <gbosilca@nvidia.com>
@bosilca bosilca force-pushed the topic/better_gpu_support branch from 194c274 to 88bf42e on July 28, 2024 20:20
bosilca added 2 commits July 29, 2024 09:35
Signed-off-by: George Bosilca <gbosilca@nvidia.com>
Signed-off-by: George Bosilca <gbosilca@nvidia.com>
@bosilca bosilca force-pushed the topic/better_gpu_support branch 2 times, most recently from 2eb2f04 to aedff44 on August 1, 2024 22:37
This allows checking whether the data can be sent and received directly
to and from GPU buffers.

Signed-off-by: George Bosilca <gbosilca@nvidia.com>
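
This is not necessarily how the commit performs the check, but for Open MPI-based builds the standard way to probe GPU-buffer (CUDA-aware) support at runtime is the MPI extension query:

```c
#include <mpi.h>
#if defined(OPEN_MPI)
#include <mpi-ext.h>   /* MPIX_CUDA_AWARE_SUPPORT, MPIX_Query_cuda_support() */
#endif

/* Returns 1 if the MPI library reports that it can send/receive directly
 * from GPU buffers, 0 otherwise (or if it cannot tell). */
static int mpi_is_cuda_aware(void)
{
#if defined(MPIX_CUDA_AWARE_SUPPORT) && MPIX_CUDA_AWARE_SUPPORT
    return MPIX_Query_cuda_support();
#else
    return 0;
#endif
}
```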
@bosilca bosilca force-pushed the topic/better_gpu_support branch from aedff44 to 022e468 on August 2, 2024 06:26
Signed-off-by: George Bosilca <gbosilca@nvidia.com>
@bosilca bosilca force-pushed the topic/better_gpu_support branch 2 times, most recently from 559e0d5 to 8975e0c on August 4, 2024 21:25
Signed-off-by: George Bosilca <gbosilca@nvidia.com>
@bosilca bosilca force-pushed the topic/better_gpu_support branch 4 times, most recently from 2c499e6 to 7c2c1a3 on August 6, 2024 04:03
@bosilca bosilca changed the title from "Add support for batched tasks." to "Add support for batched tasks and for CUDA-aware communications" on Aug 8, 2024
@@ -672,6 +672,17 @@ static char* dump_local_assignments( void** elem, void* arg )
if( dos > 0 ) {
string_arena_init(info->sa);
string_arena_add_string(info->sa, "const int %s = %s%s.value;", def->name, info->holder, def->name);
#if 0
@bosilca bosilca (Owner, Author) commented:
This is a leftover from another patch allowing typed locals. I will bring it in at a later date; in the meantime, I need to remove this code.

bosilca added 3 commits August 8, 2024 15:25
This is a multi-part patch that allows the CPU to prepare a data copy
mapped onto a device.

1. The first question is: how is such a device selected?

The allocation of such a copy happens well before the scheduler is
invoked for the task, in fact before the task is even ready. Thus, we
need to decide on the location of this copy based only on static
information, such as the task affinity. Therefore, this approach only
works for owner-compute types of tasks, where the task will be executed
on the device that owns the data used for the task affinity.

2. Pass the correct data copy across the entire system, instead of
   falling back to the data copy on device 0 (CPU memory).

Signed-off-by: George Bosilca <gbosilca@nvidia.com>
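
Conceptually, the selection described in point 1 above boils down to the sketch below; the names are hypothetical, since the actual change is spread across the runtime:

```c
/* Hypothetical sketch of the owner-compute device selection: look up the
 * data the task has affinity to (only the first successor is checked),
 * find which device owns it, and allocate the incoming copy on that
 * device; otherwise fall back to device 0 (CPU memory). */
static int preferred_device_for(const my_task_t *task)
{
    my_data_t *aff = task_affinity_data(task);   /* hypothetical lookup */
    if( NULL == aff ) return 0;                  /* device 0 = CPU memory */
    return aff->owner_device;
}
```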
Name the data_t allocated for temporaries, allowing developers to track
them through the execution. Add the keys to all outputs (tasks and
copies).

Signed-off-by: George Bosilca <gbosilca@nvidia.com>
Signed-off-by: George Bosilca <gbosilca@nvidia.com>
@bosilca bosilca force-pushed the topic/better_gpu_support branch from 7f6bdd5 to b998764 Compare August 8, 2024 22:26
@bosilca bosilca commented Sep 10, 2024
