[PTI-SDK] Device / context-based buffers instead of thread-based buffers #54
Even with one OpenMP thread, we can run into the issue where the event order is somehow mixed up or timestamps make no sense. [...]
[Score-P - 1] src/adapters/level0/scorep_level0_event_device.c:157: [428180] PTI-SDK Kernel 56 __omp_offloading_10303_1be4c1a__Z4main_l9 @ :0
Start = 1704810304039967747 -> End = 1704810304040177642 | Append = 1704810304039961263 | Submit = 1704810304039967747
[Score-P - 1] src/adapters/level0/scorep_level0_event_device.c:157: [428180] PTI-SDK Kernel 57 __omp_offloading_10303_1be4c1a__Z4main_l9 @ :0
Start = 1704810304040164963 -> End = 1704810304040168296 | Append = 1704810304040157397 | Submit = 1704810304040164963
[Score-P] src/measurement/scorep_location_management.c:455: Fatal: Bug 'timestamp < location->last_timestamp': Wrong timestamp order on location 2: 1704810304040177642 (last recorded) > 1704810304040164963 (current). This might be an indication of thread migration. Please pin your threads. Using a SCOREP_TIMER different from tsc might also help.
Since those two kernels run on the same context, device, and command queue, shouldn't it be impossible to get this result?
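To make the problem in the log above concrete, here is a small sketch that plugs in the four Start/End values of kernels 56 and 57. The struct and function names are made up for illustration; this is not Score-P or PTI-SDK code.

```cpp
#include <cstdint>

// Hypothetical helper type: one kernel's device-side execution interval.
struct KernelInterval {
    std::uint64_t start_ns;
    std::uint64_t end_ns;
};

// True if `next` begins before `prev` has ended. Writing "end of prev" and
// then "start of next" to a location that requires non-decreasing timestamps
// would then go backwards in time.
bool StartsBeforePreviousEnds(const KernelInterval& prev,
                              const KernelInterval& next) {
    return next.start_ns < prev.end_ns;
}
```

With the values from the log, kernel 57 starts (…40164963) before kernel 56 ends (…40177642), which is exactly the pair of timestamps named in the fatal error.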
RE: Per-thread buffer. On applications with multiple threads, wouldn't operating on the buffer on a per-thread basis reduce the amount of synchronization required to insert records into the buffer? This simplifies the code and potentially increases the performance of the SDK (less time spent holding a lock). I do understand that the records can and will be out of order in a lot of cases. However, are you suggesting we should guarantee the order in which records are returned to the user? If we can guarantee that the device / context is not shared across threads, I could see a similar thing working. However, we don't have a way of determining that from the queue. Maybe a per-command-list buffer could work?
L0 says it's valid to have multiple host threads sharing the same command queue: https://spec.oneapi.io/level-zero/latest/core/PROG.html#command-queues
Maybe a command list could work?
But we would be restricted from accessing the buffer outside of an operation on the command list. Maybe you could flush buffers before and after the operation? Or maybe we could introduce our own "thread id" that monotonically increases and can be used as the key in another container that guarantees an order when the buffers are flushed (like std::map)?
RE: timestamp issue. Will look into it more, thanks!
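A rough sketch of the "monotonically increasing id plus std::map" idea, to show what ordering it would and would not give. All names here (Record, OrderedBuffers, CreateBuffer, FlushAll) are hypothetical and do not exist in PTI-SDK.

```cpp
#include <atomic>
#include <cstdint>
#include <map>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical record type; real PTI-SDK view records look different.
struct Record {
    std::string name;
};

class OrderedBuffers {
public:
    // Every new buffer receives the next value of a global counter as its key.
    std::uint64_t CreateBuffer() {
        const std::uint64_t id = next_id_.fetch_add(1, std::memory_order_relaxed);
        std::lock_guard<std::mutex> lock(mutex_);
        buffers_[id];  // default-constructs an empty buffer for this id
        return id;
    }

    void Insert(std::uint64_t id, Record record) {
        std::lock_guard<std::mutex> lock(mutex_);
        buffers_.at(id).push_back(std::move(record));
    }

    // std::map iterates in ascending key order, so buffers are flushed in the
    // order in which they were created, independent of insertion timing.
    std::vector<Record> FlushAll() {
        std::lock_guard<std::mutex> lock(mutex_);
        std::vector<Record> out;
        for (auto& entry : buffers_) {
            for (auto& record : entry.second) {
                out.push_back(std::move(record));
            }
        }
        buffers_.clear();
        return out;
    }

private:
    std::atomic<std::uint64_t> next_id_{0};
    std::mutex mutex_;
    std::map<std::uint64_t, std::vector<Record>> buffers_;
};
```

Note that this only fixes the order in which buffers are handed out at flush time. It does not order the records inside or across buffers by timestamp.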
Thanks for the feedback. You're absolutely right that recording in a per-thread buffer increases the performance of the SDK and reduces overhead significantly. My biggest concern with per-thread buffers, though, is that one thread's buffer might contain a single event for one device and only dispatch it at the very end of program execution, for example. Since tool developers (ideally) do not want to leave this event out, they need to think about how to handle the other events in the meantime. The logical solution would be to store all other events somewhere, but this increases memory demands and may require evaluation to wait until the end of program execution before the events can be processed. Switching to buffers based on contexts, devices, or command queues / command lists increases the overhead, but improves the situation. At the command queue level, I would expect to see the lowest additional overhead. It also matches the level at which we write events to our internal locations. We could still run into ordering issues when multiple threads write to the buffer. But this is the case for CUPTI as well, where the documentation mentions:
Tool developers would have to take this into account and temporarily store the events per CUPTI stream until all gaps in the event IDs have been filled by outstanding buffers. But since we're working at the stream (CUPTI) / command queue (Level Zero) level, those events should be closer together and would not prevent other command queues from being processed, since they use distinct buffers. I guess that command lists would work as well. If we receive a flush after a few command lists are finished, we could be sure that all events are there and would only need to sort them. This could be done either by timestamps, or the
I hope that this is understandable 😄
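The gap-filling approach described above (hold events per stream or command queue until the missing IDs have arrived) could look roughly like this on the tool side. This is only a sketch: it assumes each event carries a per-queue sequence number that starts at 0 and increases by one, which is an assumption made for illustration and not something PTI-SDK is known to provide. Event and QueueReorderer are made-up names.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical event: `id` is an assumed per-queue sequence number.
struct Event {
    std::uint64_t id;
    std::uint64_t start_ns;
};

// One instance per command queue (or CUPTI stream).
class QueueReorderer {
public:
    // Takes the events of one delivered buffer and returns every event that
    // can now be released in id order. Events behind a gap stay pending.
    std::vector<Event> Deliver(const std::vector<Event>& buffer) {
        for (const Event& e : buffer) {
            pending_[e.id] = e;
        }
        std::vector<Event> ready;
        auto it = pending_.begin();
        while (it != pending_.end() && it->first == next_id_) {
            ready.push_back(it->second);
            it = pending_.erase(it);
            ++next_id_;
        }
        return ready;
    }

    std::size_t PendingCount() const { return pending_.size(); }

private:
    std::uint64_t next_id_ = 0;               // next id that may be released
    std::map<std::uint64_t, Event> pending_;  // events waiting behind a gap
};
```

Because every queue has its own reorderer, a gap on one queue only delays that queue, which is the advantage over thread-based buffers argued for above.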
Signed-off-by: Yao, Yi <yi.yao@intel.com>
Understood! I think this is a valid configuration option for PTI, and your views on floating records are helpful. Maybe we can even have fewer floating records by using the new queue_id provided by the compiler? We're putting it in our backlog and will keep this issue open until we decide if/when we should implement this.
Hi @Thyre. Thank you for submitting this issue. We implemented an (at least partial) fix in 9482485. At least for now, it fixes the issue that you reported above (when device operation records submitted by different threads were found in one buffer).
Thanks a lot for the update @jfedorov. I'll check how these changes can improve using PTI-SDK for Score-P.
Device / context-based buffers instead of thread-based buffers
While continuing to evaluate how we may be able to use PTI-SDK for support of Level Zero as an adapter in Score-P, I've run into the following issue:
Right now, PTI-SDK collects events for different kinds of activities on accelerators, which can be enabled through ptiViewSetCallbacks. At some point during program execution, the implemented buffer_request function will be called. If requested or when a buffer is full, the SDK may dispatch a callback for buffer evaluation. This is totally fine. However, I noticed a detail that significantly complicates the handling of programs using multiple threads to dispatch events.
To illustrate the issue, we can look at the following (very simple) OpenMP offload program:
We have eight threads working in parallel on a single accelerator. This does work and events are correctly captured by PTI-SDK. Now, let's look at how they are captured.
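For orientation, a program of this shape (several host threads, each launching a small target region on the same device) could look roughly like the sketch below. It is an illustration written for this purpose and not the attached reproducer; the function name RunOffloadWorld is made up.

```cpp
#include <vector>

// Hypothetical stand-in for a "many threads, one accelerator" program.
// Each host thread offloads a tiny target region to the default device.
int RunOffloadWorld(int num_threads) {
    std::vector<int> results(num_threads, 0);
    int* out = results.data();

    #pragma omp parallel for num_threads(num_threads)
    for (int t = 0; t < num_threads; ++t) {
        int value = 0;
        // One kernel launch per host thread; `t` is copied to the device.
        #pragma omp target map(tofrom: value)
        {
            value = t + 1;
        }
        out[t] = value;  // written on the host after the kernel finished
    }

    int sum = 0;
    for (int v : results) {
        sum += v;
    }
    return sum;
}
```

Calling RunOffloadWorld(8) gives eight submitting threads. Built without OpenMP support, the pragmas are ignored and the loop simply runs on the host, so the offload behavior only shows up with an offload-capable compiler and device.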
How PTI-SDK PoC currently captures events
Events can generally be found in view_handler.h. For simplicity, we focus on MemCopyEvent, but others follow the same principle.
At the end of the event method, a call to Instance().InsertRecord(...) is made. This is a templated method with the following code:
Note the way we determine the buffer. This is done through the unique id of the thread writing the event. In the parallel OpenMP region, this is the executing thread. Looking further at how the buffers are implemented, we end up here:
using ViewBufferTable = ThreadSafeHashTable<KeyT, ViewBuffer>;
This means that events are stored in a buffer and accessed through a hash table with the thread id as the key.
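As a simplified picture of this scheme (not the actual PTI-SDK code), a table keyed by the writing thread's id could be sketched as follows. Record, ThreadKeyedBuffers, and the id 670105 are made up for illustration; only 670104 appears in the output discussed below.

```cpp
#include <cstdint>
#include <mutex>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical record: carries the id of the thread that submitted the kernel.
struct Record {
    std::string kernel_name;
    std::uint32_t kernel_thread_id;
};

// Simplified stand-in for a thread-safe hash table of per-thread buffers.
class ThreadKeyedBuffers {
public:
    // The key is the id of the thread that WRITES the record. Nothing ties
    // it to the kernel_thread_id stored inside the record.
    void InsertRecord(std::uint32_t writer_thread_id, Record record) {
        std::lock_guard<std::mutex> lock(mutex_);
        table_[writer_thread_id].push_back(std::move(record));
    }

    // Returns a copy so the caller can inspect it without holding the lock.
    std::vector<Record> BufferOf(std::uint32_t thread_id) const {
        std::lock_guard<std::mutex> lock(mutex_);
        auto it = table_.find(thread_id);
        return it == table_.end() ? std::vector<Record>{} : it->second;
    }

private:
    mutable std::mutex mutex_;
    std::unordered_map<std::uint32_t, std::vector<Record>> table_;
};
```

If thread 670104 ends up writing a record for a kernel that thread 670105 submitted, that record lands in 670104's buffer. Whether and why this happens inside the SDK is not shown by this sketch; it only illustrates that the key and the record content are independent.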
What the current implementation does
Regardless of the devices, contexts, and command queues being used by a thread, events are stored on a per-thread basis. This can cause issues if tools require events to be written in a certain way. In Score-P, for example, we require our locations (where we store our events) to write events in timestamp order. With PTI-SDK, however, this is quite difficult. Let's look at the output of the example above with some interface:
The output is pretty large, but shows something odd. The following entry can be found in the buffer for Kernel Thread Id = 670104, even though the event is from another Kernel Thread Id:
If we evaluate the first buffer first and then the second one, we will end up with timestamp errors coming from Score-P, since 1704727757057692365 (first event of second buffer) < 1704727757061632913 (wrong event in first buffer).
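The ordering problem can be checked with the two timestamps quoted above. This is a minimal sketch with made-up helper names; it models each buffer as a plain list of timestamps.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

using Timestamp = std::uint64_t;

// True if writing the timestamps in this order never goes backwards in time.
bool IsNonDecreasing(const std::vector<Timestamp>& ts) {
    return std::is_sorted(ts.begin(), ts.end());
}

// Evaluates the buffers one after another, in the order they were delivered.
std::vector<Timestamp> Concatenate(
    const std::vector<std::vector<Timestamp>>& buffers) {
    std::vector<Timestamp> all;
    for (const auto& buffer : buffers) {
        all.insert(all.end(), buffer.begin(), buffer.end());
    }
    return all;
}

// Collects everything first and sorts afterwards. This only helps once all
// buffers that may contain events for the location have been delivered.
std::vector<Timestamp> MergeSorted(
    const std::vector<std::vector<Timestamp>>& buffers) {
    std::vector<Timestamp> all = Concatenate(buffers);
    std::sort(all.begin(), all.end());
    return all;
}
```

With the first buffer holding 1704727757061632913 and the second starting at 1704727757057692365, buffer-by-buffer evaluation is not in timestamp order, while the merged sequence is. The cost is that the tool has to hold on to the events until every relevant buffer has arrived.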
The issue
From my understanding, each thread will execute events on a separate command queue, if possible. My question here is: Is it possible that command queues are used by multiple threads at the same time?
In general, I am a bit skeptical about using thread ids as the key. If a buffer is not completely filled, but contains events for a context, device, or command queue, and is only flushed at the end of the program, performance tools need to keep all events from the entire program execution in memory. Otherwise, an event might be missed or cause other issues.
For the behavior shown above, there seem to be events stored incorrectly, as I wouldn't expect to see a thread id for another thread in that buffer.
Side note
It seems like this isn't the only issue with multiple threads. When running the program multiple times, I've also run into the following error:
Reproducer
You can use the following code to reproduce the issue:
pti_sdk_openmp_world.zip
To run the example, use the following command:
Environment