asynchronous offload #7

Open
2 of 3 tasks
lucaparisi91 opened this issue Jun 26, 2024 · 5 comments

Comments

Collaborator

lucaparisi91 commented Jun 26, 2024

  • map dependencies
  • kernel dependencies ( openmp tasks ? )
  • openmp detach functionality
Compiler        Support   Notes
nvhpc 24.5      Yes       No overlapping with data transfers
clang 18.1.8    Yes
cce 16.0.1      Yes       Requires multiple CPU threads for concurrent launches
Collaborator Author

lucaparisi91 commented Jul 15, 2024

Async Map

#pragma omp target enter data map(to: a[:N]) nowait
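As a rough illustration of how an asynchronous map can be chained to a kernel through task dependencies, here is a minimal sketch. The function name `scale_on_device` and the doubling kernel are hypothetical and not taken from the repository. Each `nowait` construct becomes a deferred target task, and the `depend` clauses on `a[0:n]` order the copy-in, the kernel, and the copy-out.

```cpp
#include <vector>

// Hypothetical helper: doubles every element of v using three chained,
// deferred target tasks (copy in -> kernel -> copy out).
void scale_on_device(std::vector<double>& v) {
    double* a = v.data();
    const int n = static_cast<int>(v.size());

    // Deferred host-to-device copy; later tasks depend on a[0:n].
    #pragma omp target enter data map(to: a[0:n]) nowait depend(out: a[0:n])

    // Kernel: may start only after the copy above has completed.
    #pragma omp target teams distribute parallel for nowait depend(inout: a[0:n])
    for (int i = 0; i < n; ++i) {
        a[i] *= 2.0;
    }

    // Deferred device-to-host copy; waits for the kernel.
    #pragma omp target exit data map(from: a[0:n]) nowait depend(in: a[0:n])

    // The host blocks here until all three deferred tasks have finished.
    #pragma omp taskwait
}
```

Without an offload device, or when compiled without OpenMP, the same code runs on the host and gives the same result, which makes the dependency chain easy to check before profiling.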

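Since OpenMP 5.0 a task can be detached with the detach clause: the task body may return before the task counts as completed. The task completes only once its body has finished and omp_fulfill_event has been called on the associated event handle.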

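// Invoked by the HIP runtime once all prior work in the stream has finished.
void callback(hipStream_t stream, hipError_t status, void *cb_data) {
    // The handle is passed by value, so no task-private storage is referenced.
    omp_fulfill_event((omp_event_handle_t)(uintptr_t)cb_data);
}

omp_event_handle_t event;

#pragma omp task  // task A
{
}
#pragma omp task detach(event)  // task B
{
    do_stuff();
    // Fulfill the event from the stream callback instead of blocking here.
    hipStreamAddCallback(stream, callback, (void *)(uintptr_t)event, 0);
}
#pragma omp taskwait  // returns only after task B's event has been fulfilled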
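
The detach mechanism can be tried without a GPU. The sketch below is an assumption-laden stand-in, not code from the repository: a `std::thread` plays the role of the HIP stream callback, and the name `run_detached_task` is hypothetical. It needs a compiler with OpenMP 5.0 detach support; when compiled without OpenMP it falls back to a trivial serial path.

```cpp
#include <atomic>
#include <chrono>
#include <thread>
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical demo: returns the value observed right after taskwait.
int run_detached_task() {
    std::atomic<int> result{0};
    int observed = 0;
#ifdef _OPENMP
    omp_event_handle_t event;
    std::thread helper;  // stands in for the GPU stream callback
    #pragma omp parallel
    #pragma omp single
    {
        #pragma omp task detach(event) shared(result, helper)
        {
            // The task body returns immediately; the task itself stays
            // incomplete until the helper fulfills the event.
            helper = std::thread([&result, event] {
                std::this_thread::sleep_for(std::chrono::milliseconds(20));
                result.store(42);
                omp_fulfill_event(event);
            });
        }
        // Waits for the event, not just for the end of the task body.
        #pragma omp taskwait
        observed = result.load();
    }
    helper.join();
#else
    // No OpenMP: nothing is asynchronous, so just produce the value.
    result.store(42);
    observed = result.load();
#endif
    return observed;
}
```

If `taskwait` returned as soon as the task body ended, `observed` would still be 0. Seeing 42 indicates that completion was deferred until `omp_fulfill_event` was called.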

Collaborator Author

lucaparisi91 commented Aug 5, 2024

The main_single_transfer.cpp file demonstrates using tasking together with OpenMP offload.
It runs in parallel with both nvhpc 24.5 and clang 18.1.8. Below is a screenshot for NVIDIA nvhpc 24.5.
image

Collaborator Author

lucaparisi91 commented Aug 5, 2024

The main_multiple_transfer.cpp file demonstrates overlapping computation and memory transfer.
This partially works with clang 18.1.8. See the image below. Each task creates a kernel in a new GPU thread.
image

The NVIDIA compiler also creates different streams. However, there does not seem to be any overlap between the running kernels, or between the kernels and the memory transfers, as the screenshot below shows.
image

Collaborator Author

lucaparisi91 commented Aug 15, 2024

On A2, with cce 16.0.1, main_single_transfer.cpp runs in serial. However, the kernels do run concurrently when multiple CPU threads are used. Below is a screenshot for 4 threads with main_single_transfer_multiple_threads.cpp. The number of threads is equal to the number of compilers.

image
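
For reference, launching from several host threads can look roughly like the sketch below. This is an assumption about the general pattern, not the contents of main_single_transfer_multiple_threads.cpp; the name `add_one_per_thread` and the chunking scheme are hypothetical. Each host thread maps its own disjoint slice and issues its own target region, so the runtime sees independent launches.

```cpp
#include <vector>

// Hypothetical helper: adds 1 to every element, with one offloaded kernel
// launched per host thread over a disjoint chunk of the array.
void add_one_per_thread(std::vector<float>& v, int num_chunks) {
    float* a = v.data();
    const int n = static_cast<int>(v.size());
    const int chunk = (n + num_chunks - 1) / num_chunks;  // ceiling division

    // One host thread per chunk; each thread launches its own kernel.
    #pragma omp parallel for num_threads(num_chunks) schedule(static, 1)
    for (int c = 0; c < num_chunks; ++c) {
        const int begin = c * chunk;
        const int end = (begin + chunk < n) ? begin + chunk : n;
        if (begin >= end) continue;  // nothing left for this chunk

        // Synchronous target region: concurrency comes from the host threads.
        #pragma omp target teams distribute parallel for map(tofrom: a[begin:end - begin])
        for (int i = begin; i < end; ++i) {
            a[i] += 1.0f;
        }
    }
}
```

Whether the kernels actually overlap on the device depends on the compiler runtime and the hardware, so a profiler trace is still needed to confirm it.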
