Handling of host events with Level Zero PTI-SDK PoC
This issue is more of a discussion, so that we know how potential host events will be handled in PTI-SDK and can prepare accordingly.
What is the current situation in the PTI-SDK PoC (as of December 19th 2023)
PTI-SDK PoC offers a simple interface for potential tools. Simply said, tools can register two callback functions, where one returns a buffer upon request and the second one is called when this buffer is being flushed.
Tools can enable or disable individual parts of this interface to look only at the operations of interest. From our point of view, these kinds will be interesting.
We decided against SYCL and OpenCL, since we already have an adapter for OpenCL and would prefer a standardised SYCL adapter at some point.
During the buffer flush event, we receive information about the device, queue and context. This is enough to reconstruct our internal structure and write events.
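The two-callback buffer pattern described above can be sketched roughly as follows. This is only an illustration: `MockTracer`, `Tool`, `set_callbacks`, `record` and `flush` are hypothetical names invented for this sketch and are not the PTI-SDK API. Events are reduced to a single 64-bit id, and the buffer is kept tiny so that several flushes happen.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical stand-in for a buffer-based tracing library.
// None of these names come from PTI-SDK.
class MockTracer {
public:
    using RequestFn  = std::function<void(std::uint8_t** buffer, std::size_t* size)>;
    using CompleteFn = std::function<void(std::uint8_t* buffer, std::size_t used)>;

    // The tool registers both callbacks before any event is recorded.
    void set_callbacks(RequestFn request, CompleteFn complete) {
        request_  = std::move(request);
        complete_ = std::move(complete);
    }

    // Store one event; ask the tool for a fresh buffer when the current one is full.
    void record(std::uint64_t event_id) {
        if (buffer_ == nullptr || used_ + sizeof(event_id) > size_) {
            flush();
            request_(&buffer_, &size_);
            if (buffer_ == nullptr || size_ < sizeof(event_id)) {
                buffer_ = nullptr;  // no usable buffer: drop the event
                return;
            }
        }
        std::memcpy(buffer_ + used_, &event_id, sizeof(event_id));
        used_ += sizeof(event_id);
    }

    // Hand the current buffer (if any) back to the tool.
    void flush() {
        if (buffer_ != nullptr) {
            complete_(buffer_, used_);
        }
        buffer_ = nullptr;
        size_   = 0;
        used_   = 0;
    }

private:
    RequestFn     request_;
    CompleteFn    complete_;
    std::uint8_t* buffer_ = nullptr;
    std::size_t   size_   = 0;
    std::size_t   used_   = 0;
};

// Tiny on purpose: room for two events per buffer.
constexpr std::size_t kBufferSize = 2 * sizeof(std::uint64_t);

// Hypothetical tool side: owns the buffers and decodes them on flush.
struct Tool {
    std::vector<std::vector<std::uint8_t>> buffers;
    std::vector<std::uint64_t>             seen;

    void attach(MockTracer& tracer) {
        tracer.set_callbacks(
            [this](std::uint8_t** buffer, std::size_t* size) {
                buffers.emplace_back(kBufferSize);
                *buffer = buffers.back().data();
                *size   = buffers.back().size();
            },
            [this](std::uint8_t* buffer, std::size_t used) {
                for (std::size_t off = 0; off + sizeof(std::uint64_t) <= used;
                     off += sizeof(std::uint64_t)) {
                    std::uint64_t id = 0;
                    std::memcpy(&id, buffer + off, sizeof(id));
                    seen.push_back(id);
                }
            });
    }
};
```

Note that the tool only learns about an event once the buffer holding it is handed back, which may be long after the event occurred.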
There's one issue from our side however... right now, we are not able to write a profile or trace successfully. This can be reduced to a single issue: There are no host events (with PTI-SDK only)!
How Score-P handles accelerators in other adapters
I'm mostly working on development for our OpenMP adapter, including support for OpenMP offloading, but will try to explain it as best as I can.
Score-P includes several adapters for accelerator libraries, including ROCprofiler/ROCtracer, CUPTI and (in development) OpenMP offload. All those adapters follow a similar principle to PTI-SDK PoC. There is some kind of buffer where events are being stored. At some point, this buffer is flushed and we can write events to locations based on streams, contexts and so on.
For this, devices need to be known before we're writing the events. Especially OpenMP offload is tricky, since events arrive on threads not known by Score-P (essentially helper threads). Here, libraries diverge a bit, but offer the same idea in principle: Callbacks that are triggered on the host.
OpenMP offload takes the simplest approach. At some point a device will need to be initialized and we get an ompt_callback_device_initialize callback with all required information. For CUPTI, we register a callback via cuptiSubscribe, for ROCtracer we use roctracer_enable_op_callback. When a callback is invoked, we try to find the context/stream and create our internal structures if they are not found.
In the case of PTI-SDK PoC, there is no such thing (yet). There are only events in a buffer related to the devices. All host events would need to be registered through the low-level Level0 interface, which seems counterintuitive.
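The "find the context/stream, create it if missing" step can be sketched as below. This is a simplified illustration and not Score-P code: `LocationRegistry`, `Location` and `get_or_create` are hypothetical names, and contexts and streams are reduced to plain integer handles. A mutex is included on the assumption that callbacks may arrive on several threads.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <mutex>
#include <string>
#include <utility>

// Hypothetical internal structure that events are later written to.
struct Location {
    std::string name;
};

// Hypothetical registry, meant to be called from a host-side callback.
class LocationRegistry {
public:
    // Return the location for (context, stream), creating it on first use.
    Location& get_or_create(std::uint64_t context, std::uint64_t stream) {
        std::lock_guard<std::mutex> lock(mutex_);
        const auto key = std::make_pair(context, stream);
        auto it = locations_.find(key);
        if (it == locations_.end()) {
            Location loc{"context " + std::to_string(context) +
                         " stream " + std::to_string(stream)};
            it = locations_.emplace(key, std::move(loc)).first;
        }
        // References to std::map elements stay valid after later insertions.
        return it->second;
    }

    std::size_t size() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return locations_.size();
    }

private:
    mutable std::mutex mutex_;
    std::map<std::pair<std::uint64_t, std::uint64_t>, Location> locations_;
};
```

Because the host callback fires before any device event for that stream is flushed, the location already exists by the time buffered events need to be written.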
Questions
Will PTI-SDK handle any kind of host events, similar to CUPTI, ROCtracer and other frameworks?
In the current state, tool developers would need to implement parts of both the Level0 interface and PTI-SDK to get a functional adapter. To be honest, this is still easier than implementing everything with Level0 alone. If that's the plan going forward, there should be at least a short guide on how to implement things. The examples in this repository can be overwhelming to look at. The Tools Programming Guide here doesn't help either, especially since the API Tracing, which would be the most interesting section for us, is being deprecated. The new (?) interface can instead be found hidden in the Level0 repository (see here).
How will those host events be delivered to the tool?
Looking at _pti_view_kind, I fear that we will receive host events the same way we get accelerator events: in a buffer, at some point during program execution. Simply put, this will not work for our tool, since we require the events of a location to be added in timestamp order. PTI-SDK would be the exception here, with all other APIs delivering host events at the time they occur.
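A minimal sketch of why buffered delivery clashes with the timestamp-order requirement. `OrderedLocation` is a hypothetical stand-in for a per-location event writer that only accepts non-decreasing timestamps; it is not the actual Score-P interface.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical per-location writer that enforces timestamp order.
class OrderedLocation {
public:
    // Accept the event only if it is not older than the last one written.
    bool write(std::uint64_t timestamp) {
        if (timestamp < last_timestamp_) {
            return false;  // would break the ordering of this location
        }
        last_timestamp_ = timestamp;
        ++written_;
        return true;
    }

    std::size_t written() const { return written_; }

private:
    std::uint64_t last_timestamp_ = 0;
    std::size_t   written_        = 0;
};
```

If a host thread has already written events at timestamps 10 and 30 to its location, a host event with timestamp 20 that only shows up later in a flushed buffer can no longer be added to that location.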
Hello @Thyre ,
thank you very much for your feedback!
If I understood you correctly, you are asking about callback APIs like the ones CUPTI has.
If so, we have this reasonably high on our list, although at the moment it is not the first priority compared with, for example, overhead and start/stop.
A question for you: are you expecting such callbacks for both Level0 and SYCL runtime APIs? Or, for a reasonable time interval, would callbacks for a low-level API such as Level0 suffice?
Having callbacks for Level0 only would be sufficient as we want to focus on that first.
For SYCL, we hope that the standard gets a standardized tools API at some point in the future which isn't the case yet AFAIK.