diff --git a/sycl/doc/SYCLInstrumentationUsingXPTI.md b/sycl/doc/SYCLInstrumentationUsingXPTI.md
new file mode 100644
index 0000000000000..7c983a336d3da
--- /dev/null
+++ b/sycl/doc/SYCLInstrumentationUsingXPTI.md
@@ -0,0 +1,251 @@
# SYCL Instrumentation

Any language or programming paradigm must provide mechanisms to correlate a
developer's use of the language with the debug and performance traces for that
use. A lightweight tracing framework (XPTI) was developed to enable this for
SYCL and is the primary mechanism employed to enable debug and performance
traces.

> **NOTE:** For additional information on the XPTI framework, please refer to
> the [Framework Documentation](https://github.com/intel/llvm/tree/sycl/xptifw/doc/XPTI_Framework.md)
> for API use and framework performance data.

This document outlines the use of this framework API in the SYCL runtime
library. The primary concept enabled by this framework is the generation of a
unique 64-bit ID, referred to as the Universal ID (UID), for every public
language entry point into the library. This allows tools and other helpers in
the software stack to correlate debug and performance data by tagging it with
the 64-bit UID. The framework also provides the ability to propagate this UID
all the way to the driver layers for the target device, so data from lower
layers and hardware can be correlated easily.

The XPTI concepts in use here are:

1. Trace points - define all the points in a software layer we want to
instrument or trace. The trace point is used to generate the UID.
2. Notification - allows the software layer to communicate the trace
information to listeners/subscribers.
3. Callback - implemented by subscribers to specific events to capture the
trace information.

The SYCL runtime layer defines the trace points and notifies registered
subscribers with the information about any given trace point. These trace
points are enabled in meaningful locations of the runtime to provide semantic
information about the developer's use of the language. This includes
information such as the relationships that form asynchronous task graphs, or
other constructs, such as barriers, that are introduced while waiting on
events.

## Instrumentation Trace Points

This section documents all the places in the SYCL runtime that have been
instrumented to capture the asynchronous task graphs created by the runtime.
The task graphs are captured as a graph, nodes, and edges:

> - The graph encapsulates all of the disjoint task graphs generated by the
application.
> - The nodes capture the operations that are performed, such as kernel
executions or memory transfers.
> - The edges represent dependence relationships; their orientation mimics
control flow as opposed to a dependence graph: the source node of an edge
must complete before the target node can begin execution.

All code changes to enable this are guarded by the
`XPTI_ENABLE_INSTRUMENTATION` macro, and the CMake files have been updated to
expose this as an option that is enabled by default. This change is under
`llvm/sycl/CMakeLists.txt`.

```cmake
...
# Create a soft option for enabling or disabling the instrumentation
# of the SYCL runtime
option(SYCL_ENABLE_XPTI_TRACING "Enable tracing of SYCL constructs" ON)
```
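Before walking through the individual trace points, it may help to see the
receiving end of these notifications. The following is a minimal sketch of an
XPTI subscriber, built as a shared library that the dispatcher loads at
startup. It follows the entry points and callback signature described in the
framework documentation linked above; the exact header path, the stream name
`"sycl"`, and the printed output are illustrative assumptions, not shipped
code.

```cpp
// Minimal XPTI subscriber sketch (a hedged example, not shipped code).
#include "xpti_trace_framework.h" // header name/path may differ per install

#include <cstdint>
#include <iostream>
#include <string_view>

// Callback invoked on every node_create notification for the stream.
static void nodeCallback(uint16_t trace_type, xpti::trace_event_data_t *parent,
                         xpti::trace_event_data_t *event, uint64_t instance,
                         const void *user_data) {
  if (event)
    std::cout << "node_create: UID=" << event->unique_id
              << " instance=" << instance << "\n";
}

// Called by the dispatcher once per stream when the stream is initialized.
XPTI_CALLBACK_API void xptiTraceInit(unsigned int major_version,
                                     unsigned int minor_version,
                                     const char *version_str,
                                     const char *stream_name) {
  // The stream name checked here is an assumption for illustration.
  if (std::string_view(stream_name) == "sycl") {
    uint8_t stream_id = xptiRegisterStream(stream_name);
    xptiRegisterCallback(
        stream_id,
        static_cast<uint16_t>(xpti::trace_point_type_t::node_create),
        nodeCallback);
  }
}

// Called once per stream when the stream is torn down.
XPTI_CALLBACK_API void xptiTraceFinish(const char *stream_name) {}
```

The framework documentation describes loading such a subscriber through
environment variables (for example, `XPTI_TRACE_ENABLE=1`,
`XPTI_FRAMEWORK_DISPATCHER=<path to dispatcher>`, and
`XPTI_SUBSCRIBERS=<path to this library>`); consult that documentation for the
authoritative mechanism.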
### The Graph

Any SYCL application can submit command groups to any active queue during the
lifetime of the application. Each submission is handled by the runtime, and
the asynchronous task graphs are updated to reflect the new submission. This
may be as simple as adding a new node to the task graph, or adding multiple
nodes to the graph, where one of the nodes represents the computation and the
others the dependent memory transfers.

To model this, we create a global graph for every application instantiation,
and all kernel executions in the application are added as nodes in this
global graph. In the SYCL runtime, there is no obvious location where the
creation of the global graph can be inserted, as many objects are
instantiated statically. Currently, we embed the graph creation in the
plugin interface (PI) layer `initialize()` call. In this call, we
perform two operations:

1. Initialize all listeners and create a trace event to represent the graph.
This is done in `sycl/include/CL/sycl/detail/pi.cpp`.
2. Send a `graph_create` event to all subscribers. This notification
will only be sent once.

### The Nodes

The command group lambdas are captured and encapsulated in a `Command`
object. This object is evaluated for dependencies on data/memory or external
OpenCL events, and an asynchronous task graph is built by mapping all these
dependencies before it is enqueued on the device. In order to capture the
command groups (nodes) and the dependencies (edges), the base class
`Command` and any derived classes that are of interest are instrumented.

In this section, we discuss the instrumentation of the `Command` object in two
parts: (1) the changes made to capture end-user source code details for
language constructs, and (2) the instrumentation that handles capturing the
relevant metadata.

1. In order to capture end-user source code information, we have implemented
the `cl::sycl::detail::code_location` class, which uses the builtin functions
in the compiler. However, equivalent implementations are unavailable on
Windows, and a separate cross-platform implementation might be used in the
future. To mitigate this, the Windows implementation will always report
`unknown_file`, `unknown_func` and a line number of 0 for the source
file, function name and line number. We handle this case while processing
this information.

   The source information of a language construct, such as the source file,
   function name, line number and column number, allows us to determine if a
   `Command` that was previously created for a construct is being created
   again. In such cases, we will not emit a `node_create` event, but we
   will bump up the instance count recording the number of instances
   created. Secondly, the source information allows us to associate a unique
   ID with the source location and propagate it all the way to the driver,
   if possible. This allows us to associate a kernel event with a source
   location at all times. All instrumentation that identifies a command
   object of a given type and emits the `node_create` event is located in
   the `emitInstrumentationData()` method, which must be implemented by all
   derived classes.

   To enable this source location information, we start by enabling the
   public methods in the queue class, such as `queue.submit()`,
   `queue.parallel_for()`, `queue.wait()`, etc., to include a default
   argument that captures the source location information. This default
   argument records the location of the line in the caller that makes the
   call to `queue.submit()`, `queue.parallel_for()`, etc. (see the sketch
   after this list). These changes are present in `queue.hpp` and
   `ordered_queue.hpp`. The default arguments for all public functions are
   guarded by `#ifdef SYCL_INSTRUMENTATION_METADATA`, which is currently
   enabled by default.

   The location information, when captured, is propagated all the way to the
   `CommandGroup` object. So, for every `CommandGroup` object, we
   will have the corresponding source location in end-user code where the
   command group is submitted to the queue. This metadata is propagated by
   the instrumentation to the subscribers of the stream.

2. The base `Command` class and all derived classes are instrumented to
capture the relevant information for each command object, and a `node_create`
event is generated.
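The default-argument capture described in item 1 relies on the compiler
substituting the call site's location when the argument is omitted. Below is a
simplified, self-contained sketch of the pattern; the struct is reduced from
the real `cl::sycl::detail::code_location`, and `submit()` here is a
hypothetical stand-in free function, not the actual queue method.

```cpp
#include <iostream>

// Reduced stand-in for cl::sycl::detail::code_location. The real class also
// records the column and falls back to "unknown_file"/"unknown_func"/0 on
// compilers (e.g. MSVC) that lack these builtins, as noted above.
struct code_location {
  static constexpr code_location
  current(const char *File = __builtin_FILE(),
          const char *Func = __builtin_FUNCTION(),
          unsigned long Line = __builtin_LINE()) {
    return {File, Func, Line};
  }
  const char *File;
  const char *Func;
  unsigned long Line;
};

// A default argument is evaluated at the call site, so CodeLoc records where
// the *user* called submit(), not where submit() is declared.
void submit(const code_location &CodeLoc = code_location::current()) {
  std::cout << CodeLoc.File << ":" << CodeLoc.Line << " (" << CodeLoc.Func
            << ")\n";
}

int main() {
  submit(); // prints this file, this line, and "main"
}
```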
### The Node instance

Once a command object is created, it is enqueued on the device for
execution. To capture the execution of this node instance, we instrument the
`enqueue()` method to determine the cost of this computation or memory
related kernel. As the commands are enqueued, the enqueue method emits a
pair of `task_begin` and `task_end` events that
capture the duration of the enqueued command. For commands that are
asynchronous, the pair of events captures just the kernel submission; the
actual execution of the command on the device is tracked through the
`cl_event` returned by the enqueue operation. In the case of host kernel
execution or commands that are synchronous, the cost is measured directly.

In the case of a command being submitted to an OpenCL device, we capture
the event of the submitted kernel and propagate it to the subscriber tool.
It is up to the tool to register a callback for this event's completion and
close the task opened for the command object.

### The Edges

As discussed in the previous section, the command groups submitted to the
device queues form nodes in the asynchronous task graphs created by
the SYCL runtime. In addition to these nodes, based on the memory references
(through accessors or USM pointers), additional `allocate`,
`copy` and `release` nodes are created; they are necessary for the
computation kernels to run. The computation kernel has dependencies on the
memory objects, and these dependencies are recorded as `event`s; in
our model we represent them as edges between the dependent nodes.

Tools monitoring the event stream can then capture the asynchronous
task graph as it is being built. As dependencies are added to a command
object, the instrumentation emits these dependencies as `edge_create`
events. Each of these `edge_create` events encapsulates the two command
objects that have a dependency through this edge. The source object of this
edge event must complete execution before the target object of the
edge can begin execution.

To capture this part of the graph, the `Command::addDep` methods of
the command object are instrumented to create the trace points and notify
all subscribers.

The `Release` command, as implemented in the SYCL runtime, has a
reference to the memory object, but no explicit dependencies are created. To
model the edges correctly, we instrument the `waitForRecordToFinish` method
in the `Scheduler`, where the release operation waits on all the
dependent operations to complete, and capture the edges there.

This concludes all the changes that were made to the SYCL runtime to support
tracing. The next section describes the XPTI framework facilities that allow
applications and runtimes to efficiently capture, record and emit trace
notifications for important events during the run.
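Taken together, the emission side of the pattern described above looks roughly
like the following sketch. It uses the API names given in the XPTI framework
documentation (`xptiTraceEnabled`, `xptiRegisterStream`, `xptiMakeEvent`,
`xptiNotifySubscribers`), but the payload constructor arguments, the stream
name, and the helper function itself are illustrative assumptions rather than
the runtime's actual code.

```cpp
#include <cstdint>

#ifdef XPTI_ENABLE_INSTRUMENTATION
#include "xpti_trace_framework.h" // header name/path may differ per install
#endif

// Hypothetical helper showing how a node_create notification could be
// emitted; in the runtime the equivalent logic lives in the Command
// classes' emitInstrumentationData() implementations.
void emitNodeCreate(const char *Name, const char *File, int Line) {
#ifdef XPTI_ENABLE_INSTRUMENTATION
  if (!xptiTraceEnabled())
    return; // no subscribers registered; skip all instrumentation overhead

  uint8_t StreamID = xptiRegisterStream("sycl"); // stream name is an assumption
  // The payload (source location) is hashed into the 64-bit UID, so
  // revisiting the same construct yields the same unique_id with a
  // bumped instance count.
  xpti::payload_t Payload(Name, File, Line, /*column*/ 0, /*code ptr*/ nullptr);
  uint64_t Instance = 0;
  xpti::trace_event_data_t *Event = xptiMakeEvent(
      Name, &Payload,
      static_cast<uint16_t>(xpti::trace_event_type_t::algorithm),
      xpti::trace_activity_type_t::active, &Instance);
  if (Event)
    xptiNotifySubscribers(
        StreamID, static_cast<uint16_t>(xpti::trace_point_type_t::node_create),
        /*parent*/ nullptr, Event, Instance, /*user data*/ nullptr);
#endif
}
```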
# Documentation of SYCL tracepoints

## XPTI Stream Domain

Traces belong to a named stream, and this constitutes a domain of data. The
XPTI framework allows the instrumentation logic to define a stream and
associate the traces with the stream. A stream also defines the protocol to be
observed to decipher the data at the receiving end. The XPTI API defines the
notion of a trace point that includes an event, a trace point type and a
notification.

- The **event** consists of a payload that describes the event (`source file`,
  `function name`, `line number` and/or a `code pointer`), a `unique_id` that
  is used to identify the event, a `global user data field` and a mechanism to
  record `metadata` associated with the event. The `unique_id` is generated
  from the payload, so if the trace point is visited multiple times, it
  produces the same `unique_id`, and this allows us to determine the number of
  instances of a trace point.

- The **trace point type** defines the type of notification that is being
  emitted for the trace point. There are many commonly occurring trace point
  types that are predefined by the framework, but a stream can extend this
  set through the extension APIs provided. A subscriber must explicitly
  register a callback for each trace point type that is of interest to the
  subscriber. If no subscribers are registered for a stream or a trace point
  type, then traces will not be emitted. A given trace point event may be used
  to emit multiple traces with different trace point types.

- The **notification** emits the trace to all subscribers of the stream domain
  that have a callback registered for the given trace point type. The stream
  can attach per-instance user data during this notification call, and that
  data *must* be guaranteed to be valid for the duration of the notification
  call.

This document outlines the protocol for the streams of data being generated
by the SYCL runtime.

## SYCL Stream `"sycl.pi"` Notification Signatures

| Trace Point Type | Parameter Description | Metadata |
| ---------------- | --------------------- | -------- |
| `function_begin` |