diff --git a/sycl/doc/SYCLInstrumentationUsingXPTI.md b/sycl/doc/SYCLInstrumentationUsingXPTI.md
new file mode 100644
index 0000000000000..7c983a336d3da
--- /dev/null
+++ b/sycl/doc/SYCLInstrumentationUsingXPTI.md
@@ -0,0 +1,251 @@
# SYCL Instrumentation

Any language or programming paradigm must provide mechanisms to correlate a
developer's use of the language with the debug and performance traces for that
use. A lightweight tracing framework (XPTI) was developed to enable this for
SYCL and is the primary mechanism employed to produce debug and performance
traces.

> **NOTE:** For additional information on the XPTI framework, please refer to
> the [Framework Documentation](https://github.com/intel/llvm/tree/sycl/xptifw/doc/XPTI_Framework.md)
> for API usage and framework performance data.

This document outlines the use of this framework API in the SYCL runtime
library. The primary concept enabled by this framework is the generation of a
unique 64-bit ID, referred to as the Universal ID (UID), for every public
language entry point into the library. This allows tools and other helpers in
the software stack to correlate debug and performance data by tagging it with
the 64-bit UID. The framework also provides the ability to propagate this UID
all the way to the driver layers for the target device, so data from lower
layers and hardware can be correlated easily.

The XPTI concepts in use here are:

1. Tracepoints - define all the points in a software layer that we want to
   instrument or trace. The trace point is used to generate the UID.
2. Notification - allows the software layer to communicate the trace
   information to listeners/subscribers.
3. Callback - implemented by subscribers to specific events to capture the
   trace information.

The SYCL runtime layer defines the tracepoints and notifies the information
about any given tracepoint to the registered subscribers. These tracepoints are
enabled in meaningful locations of the runtime to provide semantic information
about the developer's use of the language, such as the relationships that form
asynchronous task graphs or constructs such as barriers that are introduced
while waiting on events.

## Instrumentation Trace Points

This section documents all the places in the SYCL runtime that have been
instrumented to capture the asynchronous task graphs created by the runtime.
The task graphs are captured as graphs, nodes and edges:

> - The graph encapsulates all of the disjoint task graphs generated by the
>   application.
> - The nodes capture the operations that are performed, such as kernel
>   executions or memory transfers.
> - The edges represent dependence relationships; the representation mimics
>   control flow rather than a dependence graph. The source node of an edge
>   must complete before the target node can begin execution.

All code changes that enable this instrumentation are guarded by the
`XPTI_ENABLE_INSTRUMENTATION` macro, and the CMake files have been updated to
expose this as an option, enabled by default, in `llvm/sycl/CMakeLists.txt`:

```cmake
...
# Create a soft option for enabling or disabling the instrumentation
# of the SYCL runtime
option(SYCL_ENABLE_XPTI_TRACING "Enable tracing of SYCL constructs" ON)
```
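When the option is enabled, the instrumentation compiles to calls into the XPTI
framework; when it is disabled, the hooks compile away entirely. Below is a
simplified sketch of the guard pattern, with an illustrative function rather
than actual runtime code; the header path and `xptiTraceEnabled()` follow the
XPTI framework documentation and should be treated as assumptions here.

```cpp
#ifdef XPTI_ENABLE_INSTRUMENTATION
#include "xpti/xpti_trace_framework.h"
#endif

void someRuntimeOperation() {
#ifdef XPTI_ENABLE_INSTRUMENTATION
  // Notify subscribers only when instrumentation is compiled in and a
  // subscriber has enabled tracing for this run.
  if (xptiTraceEnabled()) {
    // ... create or look up the trace point event and notify subscribers ...
  }
#endif
  // The regular runtime work proceeds unchanged when tracing is off.
}
```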
### The Graph

Any SYCL application can submit command groups to any active queue during the
lifetime of the application. Each submission is handled by the runtime and the
asynchronous task graphs are updated to reflect the new submission. This may be
as simple as adding a new node to the task graph, or it may add multiple nodes,
where one node represents the computation and the others the dependent memory
transfers.

To model this, we create a global graph for every application instantiation,
and all kernel executions in the application are added as nodes in this global
graph. In the SYCL runtime there is no obvious location where the creation of
the global graph can be inserted, as many objects are instantiated statically.
Currently, we embed the graph creation in the plugin interface (PI) layer
`initialize()` call. In this call, we perform two operations:

1. Initialize all listeners and create a trace event to represent the graph.
   This is done in `sycl/include/CL/sycl/detail/pi.cpp`.
2. Send a `graph_create` event to all subscribers. This notification is sent
   only once.

### The Nodes

The command group lambdas are captured and encapsulated in a `Command` object.
This object is evaluated for dependencies on data/memory or external OpenCL
events, and an asynchronous task graph is built by mapping all these
dependencies before the command is enqueued on the device. In order to capture
the command groups (nodes) and the dependencies (edges), the base class
`Command` and any derived classes that are of interest are instrumented.

In this section, we discuss the instrumentation of the `Command` object in two
parts: (1) the changes made to capture end-user source code details for
language constructs, and (2) the instrumentation that captures the relevant
metadata.

1. In order to capture end-user source code information, we have implemented
   the `cl::sycl::detail::code_location` class, which uses the builtin
   functions provided by the compiler. Equivalent builtins are currently
   unavailable on Windows, and a separate cross-platform implementation might
   be used in the future. Until then, the Windows implementation will always
   report `unknown_file`, `unknown_func` and a line number of 0 for the source
   file, function name and line number. We handle this case while processing
   this information.

   The source information of a language construct, such as the source file,
   function name, line number and column number, allows us to determine whether
   a `Command` that was previously created for a construct is being created
   again. In such cases, we do not emit a `node_create` event, but we bump up
   the instance count that records the number of instances created. Secondly,
   the source information allows us to associate a unique ID with the source
   location and propagate it all the way to the driver, if possible. This
   allows us to associate a kernel event with a source location at all times.
   All instrumentation that identifies a command object of a given type and
   emits the `node_create` event is located in `emitInstrumentationData()`,
   which must be implemented by all derived classes.

   To capture this source location information, the public methods in the
   queue class, such as `queue.submit()`, `queue.parallel_for()` and
   `queue.wait()`, take a default argument that records the location of the
   line in the caller that makes the call to `queue.submit()`,
   `queue.parallel_for()`, etc. These changes are present in `queue.hpp` and
   `ordered_queue.hpp`; the pattern is sketched below.
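   A rough sketch (not the actual runtime implementation) of how such a
   default argument captures the caller's location, assuming a compiler that
   provides `__builtin_FILE()`, `__builtin_FUNCTION()` and `__builtin_LINE()`:

   ```cpp
   // Simplified stand-in for cl::sycl::detail::code_location.
   struct code_location {
     static constexpr code_location
     current(const char *File = __builtin_FILE(),
             const char *Func = __builtin_FUNCTION(),
             unsigned long Line = __builtin_LINE()) noexcept {
       return code_location{File, Func, Line};
     }
     const char *File;
     const char *Func;
     unsigned long Line;
   };

   // Because the default argument is evaluated at the call site, Loc describes
   // the caller's file, function and line rather than submit() itself.
   void submit(/* command group function object, */
               const code_location &Loc = code_location::current()) {
     // Forward Loc to the CommandGroup / instrumentation metadata.
   }
   ```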
   The default arguments for all public functions are guarded by
   `#ifdef SYCL_INSTRUMENTATION_METADATA`, which is currently enabled by
   default.

   The location information, when captured, is propagated all the way to the
   `CommandGroup` object. So, for every `CommandGroup` object, we have the
   corresponding source location in end-user code where the command group is
   submitted to the queue. This metadata is propagated by the instrumentation
   to the subscribers of the stream.

2. The base `Command` class and all derived classes are instrumented to
   capture the relevant information for each command object, and a
   `node_create` event is generated.

### The Node Instance

Once a command object is created, it is enqueued on the device for execution.
To capture the execution of this node instance, we instrument the `enqueue()`
method to determine the cost of the computation or memory-related kernel. As
the commands are enqueued, the enqueue method emits a pair of `task_begin` and
`task_end` events that capture the duration of the enqueued command. For
commands that are asynchronous, this pair of events captures just the kernel
submission; the actual execution of the command on the device is tracked
through the `cl_event` returned by the enqueue operation. In the case of host
kernel execution or commands that are synchronous, the cost is measured
directly.

In the case of a command being submitted to an OpenCL device, we capture the
event of the submitted kernel and propagate it to the subscriber tool. It is up
to the tool to register a callback for the completion of this event and close
the task opened for the command object.

### The Edges

As discussed in the previous section, the command groups submitted to the
device queues form nodes in the asynchronous task graphs created by the SYCL
runtime. In addition to these nodes, and based on the memory references
(through accessors or USM pointers), additional `allocate`, `copy` and
`release` nodes are created; these are necessary for the computation kernels
to run. The computation kernel has dependencies on the memory objects, and
these dependencies are recorded as `event`s; in our model we represent them as
edges between the dependent nodes.

Tools monitoring the event stream can then capture the asynchronous task graph
as it is being built. As dependencies are added to a command object, the
instrumentation emits these dependencies as `edge_create` events. Each
`edge_create` event encapsulates the two command objects that have a dependency
through this edge. The source object of the edge must complete execution before
the target object of the edge can begin execution.

To capture this part of the graph, the `Command::addDep` methods are
instrumented to create the trace points and notify all subscribers.

The `Release` command, as implemented in the SYCL runtime, has a reference to
the memory object, but no explicit dependencies are created for it. To model
the edges correctly, we instrument the `waitForRecordToFinish` method in the
`Scheduler`, where the release operation waits on all the dependent operations
to complete, and capture the edges there.

This concludes all the changes that were made to the SYCL runtime to support
tracing.
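Concretely, each of the instrumentation points described above follows the same
emission path: build a payload from the source information, turn it into a
trace event, and notify subscribers. The sketch below illustrates that flow
using the XPTI calls described in the framework documentation; the stream name,
event type and user data are illustrative assumptions rather than the exact
runtime code.

```cpp
#include "xpti/xpti_trace_framework.h"

#include <cstdint>

// Illustrative only: emit a node_create notification for a hypothetical
// command object, loosely mirroring what emitInstrumentationData() does.
void emitNodeCreateSketch(const char *FuncName, const char *FileName, int Line,
                          void *CommandObj) {
  if (!xptiTraceEnabled())
    return;

  uint8_t StreamID = xptiRegisterStream("sycl");
  // The payload (source location) is what the universal ID is derived from.
  xpti::payload_t Payload(FuncName, FileName, Line, /*column*/ 0, CommandObj);
  uint64_t Instance = 0;
  xpti::trace_event_data_t *Event = xptiMakeEvent(
      FuncName, &Payload, (uint16_t)xpti::trace_event_type_t::algorithm,
      xpti::trace_activity_type_t::active, &Instance);
  if (Event)
    xptiNotifySubscribers(StreamID,
                          (uint16_t)xpti::trace_point_type_t::node_create,
                          /*parent*/ nullptr, Event, Instance,
                          /*user_data*/ "command_group_node");
}
```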
The next section describes the XPTI streams that allow applications and
runtimes to efficiently capture, record and emit trace notifications for
important events during the run.

# Documentation of SYCL tracepoints

## XPTI Stream Domain

Traces belong to a named stream, and a stream constitutes a domain of data. The
XPTI framework allows the instrumentation logic to define a stream and
associate traces with that stream. A stream also defines the protocol to be
observed to decipher the data at the receiving end. The XPTI API defines the
notion of a trace point that includes an event, a trace point type and a
notification.

- The **event** consists of a payload that describes the event (`source file`,
  `function name`, `line number` and/or a `code pointer`), a `unique_id` that
  is used to identify the event, a `global user data field` and a mechanism to
  record `metadata` associated with the event. The `unique_id` is generated
  from the payload, so if the trace point is visited multiple times, it maps to
  the same `unique_id`, and this allows us to determine the number of instances
  of a trace point.

- The **trace point type** defines the type of notification that is being
  emitted for the trace point. There are many commonly occurring trace point
  types that are predefined by the framework, but a stream can extend this set
  through the extension APIs provided. A subscriber must explicitly register a
  callback for each trace point type that is of interest to it. If no
  subscribers are registered for a stream or a trace point type, then traces
  will not be emitted. A given trace point event may be used to emit multiple
  traces with different trace point types.

- The **notification** emits the trace to all subscribers of the stream domain
  that have a callback registered for the given trace point type. The stream
  can attach per-instance user data during this notification call, which *must*
  be guaranteed to be valid for the duration of the notification call.

This document will outline the protocol for the streams of data generated by
the SYCL runtime.
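A subscriber attaches to these streams by implementing the framework's
collector entry points and registering callbacks for the trace point types it
cares about. The following sketch illustrates the registration model using the
entry points and callback signature described in the XPTI framework
documentation; treat it as an outline of a collector rather than a canonical
implementation.

```cpp
#include "xpti/xpti_trace_framework.h"

#include <cstdio>
#include <string_view>

// Invoked for every notification this collector registered for.
static void traceCallback(uint16_t TraceType, xpti::trace_event_data_t *Parent,
                          xpti::trace_event_data_t *Event, uint64_t Instance,
                          const void *UserData) {
  (void)Parent;
  (void)Event;
  std::printf("trace type %u, instance %llu, user_data %p\n",
              (unsigned)TraceType, (unsigned long long)Instance, UserData);
}

// Called by the framework once for every stream that gets initialized.
XPTI_CALLBACK_API void xptiTraceInit(unsigned int MajorVersion,
                                     unsigned int MinorVersion,
                                     const char *VersionStr,
                                     const char *StreamName) {
  uint8_t StreamID = xptiRegisterStream(StreamName);
  if (std::string_view(StreamName) == "sycl") {
    xptiRegisterCallback(StreamID,
                         (uint16_t)xpti::trace_point_type_t::node_create,
                         traceCallback);
    xptiRegisterCallback(StreamID,
                         (uint16_t)xpti::trace_point_type_t::edge_create,
                         traceCallback);
  } else if (std::string_view(StreamName) == "sycl.pi") {
    xptiRegisterCallback(StreamID,
                         (uint16_t)xpti::trace_point_type_t::function_begin,
                         traceCallback);
    xptiRegisterCallback(StreamID,
                         (uint16_t)xpti::trace_point_type_t::function_end,
                         traceCallback);
  }
}

// Called by the framework when a stream is finished.
XPTI_CALLBACK_API void xptiTraceFinish(const char *StreamName) {
  (void)StreamName;
}
```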
## SYCL Stream `"sycl.pi"` Notification Signatures

| Trace Point Type | Parameter Description | Metadata |
| ---------------- | ---------------------- | -------- |
| `function_begin` | • **trace_type**: `xpti::trace_point_type_t::function_begin` that marks the beginning of a function <br/> • **parent**: Event ID created for all functions in the `sycl.pi` layer. <br/> • **event**: `nullptr` - since the stream of data just captures functions being called. <br/> • **instance**: Unique ID to allow the correlation of the `function_begin` event with the `function_end` event. <br/> • **user_data**: Name of the function being called, sent in as `const char *` | None |
| `function_end` | • **trace_type**: `xpti::trace_point_type_t::function_end` that marks the end of a function <br/> • **parent**: Event ID created for all functions in the `sycl.pi` layer. <br/> • **event**: `nullptr` - since the stream of data just captures functions being called. <br/> • **instance**: Unique ID to allow the correlation of the `function_begin` event with the `function_end` event. This value is guaranteed to be the same value received by the trace event for the corresponding `function_begin`. <br/> • **user_data**: Name of the function being called, sent in as `const char *` | None |
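Because `function_begin` and `function_end` carry the same `instance` value and
pass the function name through `user_data`, a subscriber can pair the two
notifications to time individual plugin interface calls. A small sketch of such
a callback, assuming the callback signature shown earlier and ignoring
thread-safety for brevity:

```cpp
#include "xpti/xpti_trace_framework.h"

#include <chrono>
#include <cstdio>
#include <unordered_map>

// Start times keyed by the per-call instance ID that is shared between the
// function_begin and function_end notifications.
static std::unordered_map<uint64_t, std::chrono::steady_clock::time_point>
    GStartTimes;

static void piTimingCallback(uint16_t TraceType, xpti::trace_event_data_t *,
                             xpti::trace_event_data_t *, uint64_t Instance,
                             const void *UserData) {
  const char *FuncName = static_cast<const char *>(UserData);
  if (TraceType == (uint16_t)xpti::trace_point_type_t::function_begin) {
    GStartTimes[Instance] = std::chrono::steady_clock::now();
  } else if (TraceType == (uint16_t)xpti::trace_point_type_t::function_end) {
    auto It = GStartTimes.find(Instance);
    if (It == GStartTimes.end())
      return;
    auto Us = std::chrono::duration_cast<std::chrono::microseconds>(
                  std::chrono::steady_clock::now() - It->second)
                  .count();
    std::printf("%s took %lld us\n", FuncName, (long long)Us);
    GStartTimes.erase(It);
  }
}
```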
## SYCL Stream `"sycl"` Notification Signatures

All trace point types in bold provide semantic information about the graph, nodes, edges and the topology of the asynchronous task graphs created by the runtime.

| Trace Point Type | Parameter Description | Metadata |
| ---------------- | ---------------------- | -------- |
| **`graph_create`** | • **trace_type**: `xpti::trace_point_type_t::graph_create` that marks the creation of an asynchronous graph. <br/> • **parent**: `nullptr` <br/> • **event**: The global asynchronous graph object ID. All other graph related events, such as node and edge creation, will always use this ID as the parent ID. <br/> • **instance**: Unique ID related to the event, but not a correlation ID as there are other events to correlate to. <br/> • **user_data**: `nullptr` <br/> • The SYCL runtime will always have one instance of a graph object with many disjoint subgraphs that get created during the execution of an application. | None |
| **`node_create`** | • **trace_type**: `xpti::trace_point_type_t::node_create` that marks the creation of a node in the graph, which could be a computational kernel or memory operation. <br/> • **parent**: The global graph event that is created during the `graph_create` event. <br/> • **event**: The unique ID that identifies the data parallel compute operation or memory operation. <br/> • **instance**: Unique ID related to the event, but not a correlation ID as there are other events to correlate to. <br/> • **user_data**: Command type that has been submitted through the command group handler, which could be one of `command_group_node`, `memory_transfer_node`, `memory_allocation_node`, `sub_buffer_creation_node`, `memory_deallocation_node`, `host_acc_create_buffer_lock_node`, `host_acc_destroy_buffer_release_node`, combined with the address of the command group object and represented as a string [`const char *`] <br/> • The SYCL runtime will always have one instance of a graph object with many disjoint subgraphs that get created during the execution of an application. | • Computational Kernels: `sycl_device`, `kernel_name`, `from_source`, `sym_function_name`, `sym_source_file_name`, `sym_line_no` <br/> • Memory operations: `memory_object`, `offset`, `access_range`, `allocation_type`, `copy_from`, `copy_to` |
| **`edge_create`** | • **trace_type**: `xpti::trace_point_type_t::edge_create` that marks the creation of an edge between two nodes in the graph. <br/> • **parent**: The global graph event that is created during the `graph_create` event. <br/> • **event**: The unique ID that identifies the dependence relationship between two operations. <br/> • **instance**: Unique ID related to the event, but not a correlation ID as there are other events to correlate to. <br/> • **user_data**: `nullptr` <br/> • Edges capture dependence relationships between computations, or between computations and memory operations. | `access_mode`, `memory_object`, `event` |
| `task_begin` | • **trace_type**: `xpti::trace_point_type_t::task_begin` that marks the beginning of a task belonging to one of the nodes in the graph. When the trace event is for a kernel executing on a device other than the CPU, this `task_begin` and the corresponding `task_end` mark the submit call. To track the execution of the kernel on the device, the `signal` trace point must be monitored to get the kernel event handle from which the execution statistics can be gathered. <br/> • **parent**: The global graph event that is created during the `graph_create` event. <br/> • **event**: The event ID will reflect the ID of the computation or memory operation kernel, which would be one of the nodes in the graph. <br/> • **instance**: Instance ID for the task that can be used to correlate it with the corresponding `task_end` trace event. <br/> • **user_data**: `nullptr` | Same metadata defined for the node the trace task belongs to. |
| `task_end` | • **trace_type**: `xpti::trace_point_type_t::task_end` that marks the end of a task belonging to one of the nodes in the graph. The specific task instance can be tracked through the instance ID parameter, which helps correlate the `task_end` with the corresponding `task_begin`. <br/> • **parent**: The global graph event that is created during the `graph_create` event. <br/> • **event**: The event ID will reflect the ID of the computation or memory operation kernel, which would be one of the nodes in the graph. <br/> • **instance**: Instance ID for the task that can be used to correlate it with the corresponding `task_begin` trace event. <br/> • **user_data**: `nullptr` | Same metadata defined for the node the trace task belongs to. |
| `signal` | • **trace_type**: `xpti::trace_point_type_t::signal` that marks an event containing the `event` handle of a kernel executing on a device. <br/> • **parent**: The global graph event that is created during the `graph_create` event. <br/> • **event**: The event ID will reflect the ID of the computation or memory operation kernel, which would be one of the nodes in the graph. <br/> • **instance**: Instance ID for the task for which the signal has been generated. <br/> • **user_data**: Address of the kernel event that is returned by the device, so the progress of the execution can be tracked. | Same metadata defined for the node the trace task belongs to. |
| `wait_begin` | • **trace_type**: `xpti::trace_point_type_t::wait_begin` that marks the beginning of the wait on an `event` <br/> • **parent**: `nullptr` <br/> • **event**: The event ID will reflect the ID of the command group object submission that created this event, or the graph event if the event is an external event. <br/> • **instance**: Unique ID to allow the correlation of the `wait_begin` event with the `wait_end` event. <br/> • **user_data**: String indicating `event.wait` and the address of the event, sent in as `const char *` <br/> • Tracing `event.wait()` or `event.wait_and_throw()` will capture the wait on the action represented by the event object, which could be the execution of a kernel, the completion of a memory operation, etc. | None |
| `wait_end` | • **trace_type**: `xpti::trace_point_type_t::wait_end` that marks the end of the wait on an `event` <br/> • **parent**: `nullptr` <br/> • **event**: The event ID will reflect the ID of the command group object submission that created this event, or the graph event if the event is an external event. <br/> • **instance**: Unique ID to allow the correlation of the `wait_begin` event with the `wait_end` event. <br/> • **user_data**: String indicating `event.wait` and the address of the event, sent in as `const char *` | None |
| `wait_begin` | • **trace_type**: `xpti::trace_point_type_t::wait_begin` that marks the beginning of the wait on an `event` <br/> • **parent**: `nullptr` <br/> • **event**: The event ID will reflect the ID of the command group object submission that created this event, or a new event created from the combination of the string "queue.wait" and the address of the event. <br/> • **instance**: Unique ID to allow the correlation of the `wait_begin` event with the `wait_end` event. <br/> • **user_data**: String indicating `queue.wait` and the address of the event, sent in as `const char *` <br/> • Tracing `queue.wait()` or `queue.wait_and_throw()` will capture the wait on the action represented by the event object embedded in the command group handler, which could be the execution of a kernel, the completion of a memory operation, etc. All wait events contain metadata that indicates the SYCL device on which the corresponding operation has been submitted. If the event is from a command group handler, then the source location information is available as well. | **`sycl_device`**, `sym_function_name`, `sym_source_file_name`, `sym_line_no` |
| `wait_end` | • **trace_type**: `xpti::trace_point_type_t::wait_end` that marks the end of the wait on an `event` <br/> • **parent**: `nullptr` <br/> • **event**: The event ID will reflect the ID of the command group object submission that created this event, or a new event created from the combination of the string "queue.wait" and the address of the event. <br/> • **instance**: Unique ID to allow the correlation of the `wait_begin` event with the `wait_end` event. <br/> • **user_data**: String indicating `queue.wait` and the address of the event as `const char *` | **`sycl_device`**, `sym_function_name`, `sym_source_file_name`, `sym_line_no` |
| `barrier_begin` | • **trace_type**: `xpti::trace_point_type_t::barrier_begin` that marks the beginning of a barrier while enqueuing a command group object <br/> • **parent**: The global graph event that is created during the `graph_create` event. <br/> • **event**: The event ID will reflect the ID of the command group object that has encountered a barrier during the enqueue operation. <br/> • **instance**: Unique ID to allow the correlation of the `barrier_begin` event with the `barrier_end` event. <br/> • **user_data**: String indicating `enqueue.barrier` and the reason for the barrier, as a `const char *` <br/> • The reason for the barrier could be one of `Buffer locked by host accessor`, `Blocked by host task` or `Unknown reason`. | • Computational Kernels: `sycl_device`, `kernel_name`, `from_source`, `sym_function_name`, `sym_source_file_name`, `sym_line_no` <br/> • Memory operations: `memory_object`, `offset`, `access_range`, `allocation_type`, `copy_from`, `copy_to` |
| `barrier_end` | • **trace_type**: `xpti::trace_point_type_t::barrier_end` that marks the end of the barrier that is encountered during enqueue. <br/> • **parent**: The global graph event that is created during the `graph_create` event. <br/> • **event**: The event ID will reflect the ID of the command group object that has encountered a barrier during the enqueue operation. <br/> • **instance**: Unique ID to allow the correlation of the `barrier_begin` event with the `barrier_end` event. <br/> • **user_data**: String indicating `enqueue.barrier` and the reason for the barrier, as a `const char *` <br/> • The reason for the barrier could be one of `Buffer locked by host accessor`, `Blocked by host task` or `Unknown reason`. | • Computational Kernels: `sycl_device`, `kernel_name`, `from_source`, `sym_function_name`, `sym_source_file_name`, `sym_line_no` <br/> • Memory operations: `memory_object`, `offset`, `access_range`, `allocation_type`, `copy_from`, `copy_to` |
diff --git a/sycl/doc/index.rst b/sycl/doc/index.rst
index 0caadaf117923..f32e12a53715e 100644
--- a/sycl/doc/index.rst
+++ b/sycl/doc/index.rst
@@ -38,4 +38,5 @@ Developing oneAPI DPC++ Compiler
    SharedLibraries
    MultiTileCardWithLevelZero
    OptionalDeviceFeatures
+   SYCLInstrumentationUsingXPTI