forked from ICLDisco/parsec
Add support for batched tasks and for CUDA-aware communications #4
Open
bosilca wants to merge 14 commits into master from topic/better_gpu_support
Conversation
The idea is the following:
- Task incarnations (aka BODY) can be marked with the "batch" property, allowing the runtime to provide the task with the entire list of ready tasks of the execution stream instead of just extracting the head.
- This list of ready tasks is in fact a ring, which the kernel can trim and divide into a batch and the rest. The rest of the tasks are left in the ring, while the batch group is submitted for execution.
- The kernel also needs to provide a callback in the gpu_task complete_stage, so that the runtime can call the specialized function able to complete all batched tasks.

Signed-off-by: George Bosilca <gbosilca@nvidia.com>
The issue was that I forgot to clear the complete_stage after the callback, so it was called multiple times during the different completion stages of the task (completion of the execution, and later completion of the D2H transfers). Signed-off-by: George Bosilca <gbosilca@nvidia.com>
The count comes first, then the sizeof. Signed-off-by: George Bosilca <gbosilca@nvidia.com>
bosilca force-pushed the topic/better_gpu_support branch from 194c274 to 88bf42e on July 28, 2024 at 20:20.
bosilca force-pushed the topic/better_gpu_support branch 2 times, most recently from 2eb2f04 to aedff44 on August 1, 2024 at 22:37.
This allows checking whether data can be sent and received directly to and from GPU buffers. Signed-off-by: George Bosilca <gbosilca@nvidia.com>
bosilca force-pushed the topic/better_gpu_support branch from aedff44 to 022e468 on August 2, 2024 at 06:26.
bosilca force-pushed the topic/better_gpu_support branch 2 times, most recently from 559e0d5 to 8975e0c on August 4, 2024 at 21:25.
bosilca force-pushed the topic/better_gpu_support branch 4 times, most recently from 2c499e6 to 7c2c1a3 on August 6, 2024 at 04:03.
On Aug 8, 2024, bosilca changed the title from "Add support for batched tasks." to "Add support for batched tasks and for CUDA-aware communications".
bosilca commented on Aug 8, 2024:
@@ -672,6 +672,17 @@ static char* dump_local_assignments( void** elem, void* arg )
     if( dos > 0 ) {
         string_arena_init(info->sa);
         string_arena_add_string(info->sa, "const int %s = %s%s.value;", def->name, info->holder, def->name);
+#if 0
This is a leftover from another patch allowing typed locals. I will bring it in at a later date; meanwhile I will need to remove this code.
This is a multi-part patch that allows the CPU to prepare a data copy mapped onto a device.
1. The first question is: how is such a device selected? The allocation of such a copy happens well before the scheduler is invoked for a task, in fact before the task is even ready. Thus, we need to decide on the location of this copy based only on static information, such as the task affinity. Therefore, this approach only works for owner-compute types of tasks, where the task will be executed on the device that owns the data used for the task affinity.
2. Pass the correct data copy across the entire system, instead of falling back to the data copy of device 0 (CPU memory).

Signed-off-by: George Bosilca <gbosilca@nvidia.com>
Name the data_t allocated for temporaries, allowing developers to track them through the execution. Add the keys to all outputs (tasks and copies). Signed-off-by: George Bosilca <gbosilca@nvidia.com>
bosilca force-pushed the topic/better_gpu_support branch from 7f6bdd5 to b998764 on August 8, 2024 at 22:26.
For now this PR remains here until all parts have been merged. The code has been split into 4 unrelated parts:
The idea for the batching is described in the first commit message above.

The idea for the CUDA-aware communications is to use the task class data_affinity capability to locate the affinity of a task, and then allocate all incoming data for that task in the preferred location of that data. Thus, to take advantage of this capability, it is critical to hint the runtime where to locate data (parsec_advise_data_on_device). Note that only the first successor is checked for the affinity.

The communication thread allocates data on a GPU by bypassing the GPU allocator and going directly through the zone_malloc (which is now protected against concurrent access). Once tiles are allocated they will be incorporated into the GPU LRUs by the first call involving them, so the GPU becomes aware of this data rather quickly. The issue is in the other direction: because PaRSEC tends to keep temporary data in the cache, it never refills the zone_malloc, leading to device-memory starvation for the communication thread. This is, I think, the root cause of the sporadic deadlocks I encounter.
I have started to implement a different approach, where GPU memory can be reclaimed from the LRUs by the communication thread. Part of the code is in this PR, but I did not have the time to integrate it nicely.
Anyway, last but not least, a word about performance. On a small system (4 A100 GPUs, IB but no NVLink) I get:

no CUDA-aware: [****] TIME(s) 7.53502 : dpotrf PxQxg= 2 2 1 NB= 1024 N= 65536 : 12452.141424 gflops - ENQ&PROG&DEST 8.28404 : 11326.255177 gflops - ENQ 0.49800 - DEST 0.25101
CUDA-aware: [****] TIME(s) 3.80899 : dpotrf PxQxg= 2 2 1 NB= 1024 N= 65536 : 24633.106084 gflops - ENQ&PROG&DEST 4.56897 : 20535.733216 gflops - ENQ 0.51097 - DEST 0.24902

That is roughly a 2x speedup (7.535 s down to 3.809 s) on dpotrf from the CUDA-aware communications.