diff --git a/sycl/doc/design/DeviceIf.md b/sycl/doc/design/DeviceIf.md index 69d960af171c6..d82655fcfff33 100644 --- a/sycl/doc/design/DeviceIf.md +++ b/sycl/doc/design/DeviceIf.md @@ -337,4 +337,1011 @@ in that compilation phase. ## Phase 2 -TBD. +The second phase supports both AOT and JIT modes, so this design replaces the +"phase 1" design described above. + +This design requires changes to the headers, a new LLVM IR pass that runs at +the start of the pipeline, changes to the LLVM IR aspect propagation pass, +changes to the `sycl-post-link` tool, and changes to the DPC++ runtime. The +bullets that follow are a high-level overview of the design. After the +overview, each step is described in more detail. + +* The extension API allows the application to conditionally call a C++ callable + object based on whether the device has certain aspects or a certain + architecture. The callable is frequently a lambda expression. + +* The header files use template metaprogramming which emits LLVM IR like this + pseudocode for each of these callables: + + ``` + call void @call_if_on_device_conditionallyXXX(%callablethis, N1, N2, ...) + + define void @call_if_on_device_conditionallyXXX(%callablethis, %n1, %n2, ...) + call void @CallableXXX(%callablethis) + } + ``` + + where `%callablethis` is the "this" pointer for the application's callable + object. The parameters `N1`, `N2`, etc. are integer constants that define a + simple expression language which tells the set of aspects or architectures + that the device must have in order to enable the call. + +* A new IR pass which runs before any optimizations translates this IR to + pseudocode that looks like: + + ``` + call void @call_if_on_device_conditionallyXXX(@CallableXXX, %callablethis, N1, N2, ...) + + declare void @call_if_on_device_conditionallyXXX(%callable, %callablethis, %n1, %n2, ...) + ``` + + Note that the body of the function `@call_if_on_device_conditionallyXXX` has + been removed, so there is no longer any call to `@CallableXXX`. The IR can + therefore be safely optimized without risk of inlining any of the callable's + statements at its call site and without any danger of optimizing away the + constants `N1`, `N2`, etc. + +* After all of the device code IR has been optimized and linked together, the + `sycl-post-link` tool does several things: + + * It translates the call site IR to look like: + + ``` + call void @CallableXXX(%callablethis) + ``` + + * If the target is SPIR-V: + + * It moves the definition of `@CallableXXX` and its entire call tree to a + separate "add on" device image which defines this function as an exported + symbol. + + * It creates a second "add on" device image that defines `@CallableXXX` + with an empty body. + + * It creates a property set with information from `N1`, `N2`, etc., telling + the condition under which the callable should be called. + + * If the target is for AOT: + + * It uses the device configuration file to determine if the condition + expressed by `N1`, `N2`, etc. will be true for this device. + + * If the condition is true, it includes the definition of `@CallableXXX` + and its entire call tree in the device image. + + * If the condition is false, it deletes the call to `@CallableXXX`. + +* The logic for loading a SPIR-V module in the DPC++ runtime is changed. When + the module has associated "add on" device images, the following happens: + + * The runtime queries the device's aspects and architecture and then + evaluates the metadata in the property set for the "add on" device image + to determine whether the associated callable should be called. + + * If the call should be made, the runtime online-links the "add on" device + image that has the application's callable. If the call should not be made, + it online-links the "add on" device image with an empty body. + +### Definition of terms + +The remaining sections provide a more detailed description of the changes in +each component. That description uses the following terms: + +* _Conditional Action_: The application's callable object, which is + conditionally called based on the device's aspects or architecture. The + overview above refers to this as `@CallableXXX`. + +* _Conditional Caller_: A helper function that represents a conditional call to + the _Conditional Action_. The overview above refers to this as the + `@call_if_on_device_conditionallyXXX` function. + +* _Condition Expression_: A sequence of integers representing a boolean + expression tree composed of device aspects, device architectures, and the + the logical operators AND, OR, and NOT. This expression tree determines + whether the _Conditional Caller_ calls the _Conditional Action_. + +### Changes to the device headers + +The metaprogramming in the device headers is best described in code. The end +result of all this metaprogramming are the calls to the templated function +`call_if_on_device_conditionally`, which is the _Conditional Caller_. This +function can be identified later in IR passes by its attribute +`sycl-call-if-on-device-conditionally`. + +There are three important parts to this function: + +* The body of the _Conditional Caller_ contains a single call, which is a + direct call to the _Conditional Action_. + +* The first parameter to the _Conditional Caller_ is the "this" pointer to the + _Conditional Action_. + +* The subsequent parameters to the _Conditional Caller_ are the literal integer + values that define the _Condition Expression_. This expression language is + described in the comments of the code snippet below. + +``` +namespace sycl::ext::oneapi::experimental { +namespace detail { + +// Call the callable object "fn" only when this code runs on a device which +// has a certain set of aspects or a particular architecture. +// +// Condition is a parameter pack of int's that define a simple expression +// language which tells the set of aspects or architectures that the device +// must have in order to enable the call. See the "Condition*" values below. +template +[[__sycl_detail__::add_ir_attributes_function("sycl-call-if-on-device-conditionally", true)]] +void call_if_on_device_conditionally(T fn, Condition...) { + fn(); +} + +// The "Condition" parameter pack above is a sequence of int's that define an +// expression tree. Each node represents a boolean subexpression: +// +// ConditionAspect - Next int is a value from "enum aspect". The +// subexpression is true if the device has this +// aspect. +// ConditionArchitecture - Next int is a value from "enum architecture". The +// subexpression is true if the device has this +// architecture. +// ConditionNot - Next int is the root of another subexpression S1. +// This subexpression is true if S1 is false. +// ConditionAnd - Next int is the root of another subexpression S1. +// The int following that subexpression is the root +// of another subexpression S2. This subexpression +// is true if both S1 and S2 are true. +// ConditionOr - Next int is the root of another subexpression S1. +// The int following that subexpression is the root +// of another subexpression S2. This subexpression +// is true if either S1 or S2 are true. +// +// These values are stored in the application's executable, so they are +// effectively part of the ABI. Therefore, any change to an existing value +// is an ABI break. +// +// There is no programmatic reason for the values to be negative. They are +// negative only by convention to make it easier for humans to distinguish them +// from aspect or architecture values (which are positive). +static constexpr int ConditionAspect = -1; +static constexpr int ConditionArchitecture = -2; +static constexpr int ConditionNot = -3; +static constexpr int ConditionAnd = -4; +static constexpr int ConditionOr = -5; + +// Metaprogramming helper to construct a ConditionAnd expression for a sequence +// of aspects. "ConditionAllAspectsBuilder::seq" is an +// "std::integer_sequence" representing the expression. +template +struct ConditionAllAspectsBuilder; + +template +struct ConditionAllAspectsBuilder { + template + static auto append(std::integer_sequence) { + return std::integer_sequence{}; + } + using rest = typename ConditionAllAspectsBuilder::seq; + static constexpr int asp = static_cast(Aspect); + using seq = decltype( + append(rest{})); +}; + +template +struct ConditionAllAspectsBuilder { + static constexpr int asp = static_cast(Aspect); + using seq = std::integer_sequence; +}; + +// Metaprogramming helper to construct a ConditionOr expression for a sequence +// of architectures. "ConditionAnyArchitectureBuilder::seq" is an +// "std::integer_sequence" representing the expression. +template +struct ConditionAnyArchitectureBuilder; + +template +struct ConditionAnyArchitectureBuilder { + template + static auto append(std::integer_sequence) { + return std::integer_sequence{}; + } + using rest = typename ConditionAnyArchitectureBuilder::seq; + static constexpr int arch = static_cast(Arch); + using seq = decltype( + append(rest{})); +}; + +template +struct ConditionAnyArchitectureBuilder { + static constexpr int arch = static_cast(Arch); + using seq = std::integer_sequence; +}; + +// Metaprogramming helper to construct a ConditionNot expression. +// ConditionNotBuilder::seq" is an "std::integer_sequence" representing +// the expression. +template +struct ConditionNotBuilder { + template + static auto append(std::integer_sequence) { + return std::integer_sequence{}; + } + using rest = typename Exp::seq; + using seq = decltype(append(rest{})); +}; + +// Metaprogramming helper to construct a ConditionAnd expression. +// "ConditionAndBuilder::seq" is an "std::integer_sequence" +// representing the expression. +template +struct ConditionAndBuilder { + template + static auto append(std::integer_sequence, + std::integer_sequence) { + return std::integer_sequence{}; + } + using rest1 = typename Exp1::seq; + using rest2 = typename Exp2::seq; + using seq = decltype(append(rest1{}, rest2{})); +}; + +// Metaprogramming helper to construct a ConditionOr expression. +// "ConditionOrBuilder::seq" is an "std::integer_sequence" +// representing the expression. +template +struct ConditionOrBuilder { + template + static auto append(std::integer_sequence, + std::integer_sequence) { + return std::integer_sequence{}; + } + using rest1 = typename Exp1::seq; + using rest2 = typename Exp2::seq; + using seq = decltype(append(rest1{}, rest2{})); +}; + + +// Helper function to call call_if_on_device_conditionally() while converting +// the "std::integer_sequence" for a condition expression into individual +// arguments of type int. +template +void call_if_on_device_conditionally_helper( + T fn, std::integer_sequence) { + call_if_on_device_conditionally(fn, Is...); +} + +// Helper object used to implement "else_if_device_has" and "otherwise". +// The "MakeCall" template parameter tells whether a previous clause in the +// "if-elseif-elseif ..." chain was true. When "MakeCall" is false, some +// previous clause was true, so none of the subsequent +// "else_if_device_has" or "otherwise" member functions should call the +// user's function. +template +class if_device_has_helper { + public: + template>> + auto else_if_device_has(T fn) { + using make_call_if = ConditionAndBuilder< + MakeCallIf, + ConditionAllAspectsBuilder + >; + using make_else_call_if = ConditionAndBuilder< + MakeCallIf, + ConditionNotBuilder> + >; + + using cond = typename make_call_if::seq; + call_if_on_device_conditionally_helper(fn, cond{}); + return if_device_has_helper{}; + } + + template + void otherwise(T fn) { + using cond = typename MakeCallIf::seq; + call_if_on_device_conditionally_helper(fn, cond{}); + } +}; + +// Same sort of helper object for "else_if_architecture_is". +template +class if_architecture_is_helper { + public: + template>> + auto else_if_architecture_is(T fn) { + using make_call_if = ConditionAndBuilder< + MakeCallIf, + ConditionAnyArchitectureBuilder + >; + using make_else_call_if = ConditionAndBuilder< + MakeCallIf, + ConditionNotBuilder> + >; + + using cond = typename make_call_if::seq; + call_if_on_device_conditionally_helper(fn, cond{}); + return if_architecture_is_helper{}; + } + + template + void otherwise(T fn) { + using cond = typename MakeCallIf::seq; + call_if_on_device_conditionally_helper(fn, cond{}); + } +}; + +} // namespace detail + +template +static auto if_device_has(T fn) { + using make_call_if = detail::ConditionAllAspectsBuilder; + using make_else_call_if = detail::ConditionNotBuilder; + + using cond = typename make_call_if::seq; + detail::call_if_on_device_conditionally_helper(fn, cond{}); + return detail::if_device_has_helper{}; +} + +template +static auto if_architecture_is(T fn) { + using make_call_if = detail::ConditionAnyArchitectureBuilder; + using make_else_call_if = detail::ConditionNotBuilder; + + using cond = typename make_call_if::seq; + detail::call_if_on_device_conditionally_helper(fn, cond{}); + return detail::if_architecture_is_helper{}; +} + +} // namespace sycl::ext::oneapi::experimental +``` + +### New IR pass + +A new LLVM IR pass performs some simple transformations on each of the +_Conditional Caller_ functions. This pass must run before any other passes +that perform optimizations across call boundaries, and it must be run before +the [SYCLPropagateAspectsUsagePass][3] pass. The following transformations +are performed for each _Conditional Caller_ function _FCaller_: + +[3]: <./OptionalDeviceFeatures.md#new-llvm-ir-pass-to-propagate-aspect-usage> + +* The pass scans the body of function _FCaller_ looking for a `call` + instruction. The structure of the header file should ensure that there is + exactly one such instruction. The callee should be a literal (i.e. not a + pointer) because SYCL forbids indirect function calls in device code. If + DPC++ is run in a mode that allows function pointers, we should still forbid + the callable object parameter to `if_device_has` or `if_architecture_is` from + being a function pointer. This callee is the _Conditional Action_ function + _FAction_. + +* For each call to _FCaller_, the pass adds a new parameter at the beginning of + the parameter list, which is the literal function name _FAction_. The + definition of function _FCaller_ is also changed, adding a new formal + parameter whose type is pointer-to-function. + +* The pass deletes the body of _FCaller_ and changes the function definition to + a function declaration. This requires changing the function from internal + linkage to external linkage. + +* Since the function now has external linkage, its name must be unique across + all translation units. The pass therefore renames the function to + `call_if_on_device_conditionally_GUID_N`, where `GUID` comes from the + `-fsycl-unique-prefix` option that is passed to clang and `N` is `1` for the + first function _FCaller_, `2` for the second function _FCaller_, etc. + +At the end of this IR pass, the call sites to the _Conditional Caller_ +functions look like this, and each _Conditional Caller_ function is a function +declaration (not definition). + +``` +call void @call_if_on_device_conditionally_GUID_N(@CallableXXX, %callablethis, + N1, N2, ...) +``` + +After this pass completes, the IR can be optimized by other passes. There is +no danger that the body of the _Conditional Action_ will be inlined into the +_Conditional Caller_ because these call sites have been deleted. There is also +no danger that the parameters to the _Conditional Caller_ calls will be +optimized away because _Conditional Caller_ function is now defined externally. + +Note also that the _Conditional Action_ functions cannot be optimized away +because their addresses are passed to an external function. + +### Normalization of condition expressions + +Although it is not required for correctness, it is beneficial to rewrite the +_Condition Expressions_ at the _Conditional Caller_ call sites in a normalized +form as an additional part of the new IR pass. Normalizing these expressions +causes equivalent expressions to have the same representation. For example, +the condition (_ASP1_ AND _ASP2_) is equivalent to (_ASP2_ AND _ASP1_). If +later IR passes think these are different expressions, it will lead to +unnecessary device code splits. + +TODO: Describe algorithm to normalize expressions. + +### Changes to the aspect propagation pass + +The IR pass [SYCLPropagateAspectsUsagePass][3] propagates the set of aspects +used by each device function up the static call tree to the containing kernel. +We intentionally run this pass after the new pass described above because we do +not want aspects from the application's _Conditional Action_ functions to be +unconditionally propagated to the kernel. These functions are called +conditionally, so the aspects they use are also conditional. + +This IR pass is changed to perform the following additional aspect +propagations: + +* Aspects used by each _Conditional Action_ function (and by functions it + calls) are propagated up to the definition of the _Conditional Action_, and + a `!sycl_used_aspects` metadata is added to the definition of that function. + +* Each call to a _Conditional Caller_ function is considered to conditionally + use the aspects in the associated _Conditional Action_ function. The + condition is the _Condition Expression_ at the call site. + +* Aspect usage can be propagated through nested _Conditional Caller_ function + calls. To illustrate, consider the following example: + + * A _Conditional Caller_ named _FCallerA_ has the _Conditional Action_ named + _FActionA_ and the _Condition Expression_ named _ExpA_. + * The function _FActionA_ calls a different _Conditional Caller_ named + _FCallerB_. + * The call to _FCallerB_ has _Conditional Action_ named _FActionB_ and + _Condition Expression_ named _ExpB_. + + When this happens, aspects used by _FActionA_ have the condition _ExpA_. + Aspects used by _FActionB_ have the condition (_ExpA_ AND _ExpB_). + +* Each kernel or exported device function is decorated with a + `!sycl_used_aspects` metadata telling the aspects that it unconditionally + uses and a `!sycl_conditionally_used_aspects` telling the aspects that it + conditionally uses. See [the section][sec-cond-meta] below for a description + of this second metadata. + +TODO: Describe interaction with the `-fsycl-fixed-targets` command line option. + +### Format of the `!sycl_conditionally_used_aspects` metadata +[sec-cond-meta]: <#format-of-the-sycl_conditionally_used_aspects-metadata> + +This metadata node has the following format: + +``` +define void @foo() !sycl_conditionally_used_aspects !0 {} +!0 = !{!1, !2, ...} ; Each operand is one (condition, aspects) pair +!1 = !{!3, !4} ; First operand is condition, second is aspects +!3 = !{i32 N1, i32 N2, ...} ; Condition is the series of integers in the + ; "Condition Expression" +!4 = !{i32 A1, i32 A2, ...} ; Aspects is a series of aspect numbers +!2 = !{!5, !6} ; Next (condition, aspects) pair +... +``` + +### Simplification of conditionally used aspects + +As described above, the [SYCLPropagateAspectsUsagePass][3] pass adds metadata +nodes named `!sycl_conditionally_used_aspects` to express the aspects that are +conditionally used by each kernel or exported device function. Although it is +not required for correctness, it is beneficial to simplify this metadata by +removing "uninteresting" conditionally used aspects. Doing this can reduce +the amount of device code splitting that occurs later. + +To illustrate, consider an example where the _condition_ is "fp16 == true" and +the _aspects_ is "fp16". In such a case, the conditional aspect usage is +uninteresting because any device where "fp16 == true" will definitely support +the "fp16" aspect. + +In general, a conditionally used aspect (_ExpA_, _A_) can be removed whenever +we can prove that all devices where _ExpA_ is true will always have aspect _A_. +The following algorithm is easy to implement and will catch most of the common +cases: + +* For each (_condition_, _aspects_) pair: + * If _condition_ is an AND expression of several aspects _ASP1_ AND _ASP2_ + ..., remove all occurrences of _ASP1_, _ASP2_, etc. from the _aspects_ + list. + * If _condition_ is an OR expression of several architectures _ARCH1_ OR + _ARCH2_ ..., look at each aspect _ASP_ in the _aspects_ list. Using the + device configuration file, determine if every device architecture _ARCH1_, + _ARCH2_, etc. has aspect _ASP_. If so, remove _ASP_ from the _aspects_ + list. + +After this simplification, some of the (_condition_, _aspects_) pairs may have +an empty set of aspects. When this happens, remove the pair entirely from +`!sycl_conditionally_used_aspects`. This may result in a +`!sycl_conditionally_used_aspects` metadata with no pairs. If this happens, +remove the `!sycl_conditionally_used_aspects` metadata entirely. + +### Changes to the `sycl-post-link` tool (non-AOT) + +DPC++ already invokes `sycl-post-link` separately for AOT vs. non-AOT targets. +When there are both AOT and non-AOT targets specified on the command line, +DPC++ invokes the entire device compiler toolchain (including `sycl-post-link`) +multiple times. Prior to this design, `sycl-post-link` performs the same steps +regardless of the target, but that changes with this design. This section +describes the behavior of `sycl-post-link` when the target is not AOT. + +The device code split algorithm operates normally except for the following +changes: + +* When determining whether two kernels can be placed in the same device image, + the `!used_aspects` must be the same and the + `!sycl_conditionally_used_aspects` must be the same (the same set of + conditions and the same set of conditionally used aspects). + +* When traversing the static call tree of a kernel to form the device image, + calls to the _Conditional Caller_ functions are handled specially. These + calls have a parameter which is a function pointer to the _Conditional + Action_ function. The device code split algorithm does *not* follow these + function pointers when constructing the device image. Thus, the _Conditional + Action_ functions are not included in the device image (unless there is some + call other than from the _Conditional Caller_ functions). + +After device images are split, do the following for each device image: + +* Iterate over each call to a _Conditional Caller_ function. These calls have + the following form: + + ``` + call void @call_if_on_device_conditionally_GUID_N( + @CallableXXX, %callablethis, N1, N2, ...) + ``` + + In the description below, we refer to the _Conditional Action_ + (`@CallableXXX`) as _Action_ and the _Condition Expression_ (`N1`, `N2`, + etc.) as _Condition_. + + * Maintain a table associating three pieces of information: + + * The _Condition_. + * The _Action_ function. + * The set of aspects used by the _Action_ function and its static call tree + (i.e. from the `!sycl_used_aspects` metadata). + * A generation number (described below). + + * If _Action_ is not in the table and if the device image does not already + contain the function _Action_, add it to the table with _Condition_ and set + the generation number to 1. (The only time when the device image already + contains _Action_ is when there is some call to _Action_ other than from a + _Conditional Caller_ function.) + + * If _Action_ is in the table with the same _Condition_, get the generation + number from that table entry. + + * Otherwise, one of two weird things has happened. Either there are two (or + more) calls to _Action_ with different conditions, or there is also a call + to _Action_ from someplace other than a _Conditional Caller_. Both cases + are handled the same way. Choose a generation number such that the name + @CallableXXX.GENERATION is globally unique. Add a new entry to the table + with _Condition_ and that generation number. Duplicate the function + @CallableXXX as @CallableXXX.GENERATION. + + * Change the call to this (if the generation number is 1): + + ``` + call void @CallableXXX(%callablethis) + ``` + + or to this (if the generation number is something else): + + ``` + call void @CallableXXX.GENERATION(%callablethis) + ``` + + * Remember that the device code split algorithm did not include the + definition of the _Action_ function in the image. Therefore, this function + is declared (not defined) with external linkage. + +* The LLVM IR bitcode for this device image is now ready to be emitted. We + call this the _Main Image_. + +* Using the association table described earlier, separate the _Conditional + Action_ functions into groups such that all functions in the same group share + the same _Condition_ and the same set of used aspects. For each such group: + + * Construct a new device image that contains the definition of each + _Conditional Action_ function from the group along with the static call + tree of those functions. This code may make additional calls to the + _Conditional Caller_ functions, passing pointers to yet more _Conditional + Action_ functions. Do not follow these function pointers when constructing + the static call tree. + + * Construct another new device image that contains stub definitions of each + _Conditional Action_ function in the group. Each stub defines the function + with an empty body. + + * The definitions of the the _Conditional Action_ functions in these two + images must have external linkage. We call these new device images the + _Add On Images_. + + * Each pair of _Add On Images_ (i.e. the one with the real function + definitions and the one with the stub definitions) is assigned a unique + name. By convention this is just an integer in string form (e.g. "1"). + + * Create a "SYCL/add on image" property set for each _Add On Image_, as + specified below in the section [New property sets][sec-prop-sets]. + + * Since the _Add On Image_ with the real function definitions may itself + contain calls to a _Conditional Caller_ function, this _Add On Image_ is + added to the list off all device images, such that it is processed as a + _Main Image_ by this algorithm. + +* Add a property set "SYCL/add on images" to the _Main Image_ with one entry + for each pair of _Add On Images_ described above. Each of these properties + contains the associated _Condition Expression_. See the section + [New property sets][sec-prop-sets] for details of this property set. + +* The association table is cleared before processing the next device image. + +After each device image is processed by this algorithm, it goes through the +normal processing of aspects as described in the [Optional Device Features][4] +design, which includes generation of the "SYCL/device requirements" property +set. Note that the `!sycl_conditionally_used_aspects` are not included in this +property set. + +[4]: <./OptionalDeviceFeatures.md#error-checking-for-sycl_external-functions> + +### Changes to the `sycl-post-link` tool (AOT) + +The behavior of the `sycl-post-link` tool is different when it is invoked for +one or more AOT targets. The following occurs for each AOT target before +device code is split into images: + +* Iterate over each call to a _Conditional Caller_ function. These calls have + the following form: + + ``` + call void @call_if_on_device_conditionally_GUID_N( + @CallableXXX, %callablethis, N1, N2, ...) + ``` + + * Using the device configuration file, determine if the _Condition + Expression_ is true for this target. + + * If the condition is true, change the call site into a direct call to the + _Conditional Action_ function like so: + + ``` + call void @CallableXXX(%callablethis) + ``` + + * If the condition is false, remove the call site entirely. + +* Because we did not previously propagate the `!sycl_used_aspects` from the + _Conditional Action_ functions up to their containing kernels, this must be + done now. For each call to a _Conditional Action_ function, add its + `!sycl_used_aspects` set to each kernel that is a predecessor in the static + call tree. + +After this processing is complete, the device code is split into images in the +normal way, as described in the [Optional Device Features][5] design. + +[5]: <./OptionalDeviceFeatures.md#changes-to-the-device-code-split-algorithm> + +### New property sets +[sec-prop-sets]: <#new-property-sets> + +When the `sycl-post-link` tool runs in non-AOT mode, it needs to communicate +information to the DPC++ runtime about the _Add On Images_ that contain the +_Conditional Action_ functions. We do this by adding two new property sets. + +The _Main Image_ contains a property set named "SYCL/add on images" which +contains one property for each of its _Add On Images_. The name of each +property is a unique identifier for the _Add On Image_, which by convention is +just an integer in string form (e.g "1"). The value of the property has type +`PI_PROPERTY_TYPE_BYTE_ARRAY` containing a series of `int32` values `N1`, `N2`, +etc. telling the _Condition Expression_ for the _Conditional Actions_ in this +_Add On Image_. The property's size (which is always divisible by 4) tells the +number of `int32` integers in this property. + +``` +[SYCL/add on images] +"1": N1, N2, N3, ... +"2": N1, N2, N3, ... +``` + +Each _Add On Image_ contains a property set named "SYCL/add on image" which +contains just one property. The name of the property is the unique identifier +for this _Add On Image_ (e.g. "1"). The value of the property has type +`PI_PROPERTY_TYPE_UINT32`. The value is `1` if the _Add On Image_ contains the +real function definitions, and it is `0` if the _Add On Image_ contains the +stub definitions. + +``` +[SYCL/add on image] +"1": 1 +``` + +### Changes to the DPC++ runtime + +The DPC++ runtime requires changes to identify and dynamically link the _Add On +Images_ when necessary. + +When the application submits a kernel to a device, the runtime must already +find the device images from shared libraries that define exported device +functions as described in [Device Code Dynamic Linking][6]. This algorithm is +extended to look also for _Add On Images_. + +If the _Main Image_ contains the "SYCL/add on images" property set, the runtime +does the following for each property in that set: + +* The name of the property is the ID for a pair of _Add On Images_. The + runtime searches the other device images looking at their "SYCL/add on image" + property sets in order to find the _Add On Images_ with the same ID. There + should be two such images: a primary image and a stub image. + +* The value of the property is the _Condition Expression_ gating the + _Conditional Actions_ in this _Add On Image_. The runtime evaluates this + expression for the device to which the kernel is being submitted. + +* The runtime builds a set of _Add On Images_ to link against. If the + expression is true, the runtime adds the primary _Add On Image_ to this set. + If the expression is false, the runtime adds the stub _Add On Image_ to this + set. + +* The selected _Add On Image_ may also contain a "SYCL/add on images" property + set naming yet more _Add On Image_ pairs. If so, the runtime adds these + properties to the iteration set, causing their _Add On Images_ to be found + also. + +Once this completes, the runtime computes the union of the +"SYCL/device requirements" property sets from the _Main Image_, all the _Add On +Images_, and any images from shared libraries defining exported functions. The +runtime then uses this combined set of device requirements to check if the +kernel is compatible with the device as described in the +[Optional Device Features][7] design. + +If the kernel is compatible, the runtime creates a `pi_program` by compiling +and linking all these device images together. + +[6]: <./SharedLibraries.md> +[7]: <./OptionalDeviceFeatures.md#changes-to-the-dpc-runtime> + + +## Alternate design for non-AOT SPIR-V targets + +This is an alternate design that we can consider for SPIR-V targets that are +not AOT compiled. This design proposes a [SPIR-V extension][8] that allows +code to be conditionally selected at JIT compilation time by setting the value +of a specialization constant. The design then uses this feature to select the +_Conditional Actions_ when the program runs by setting specialization +constants. + +Most of the design in this alternative is the same as the primary design +proposed above. The only differences are in the `sycl-post-link` tool, the +property sets, and the DPC++ runtime. + +[8]: <./spirv-extensions/SPV_INTEL_spec_conditional.asciidoc> + +### Changes to the `sycl-post-link` tool (non-AOT) + +The device code split algorithm must still be changed in the same way as the +primary design above describes. After device code is split into images, the +following happens for each image. + +(This part of the design is not very detailed yet and needs to be fleshed out.) + +We need to compute the _Condition Expression_ that gates each of the device +functions. This can be done by traversing the static call tree starting at the +kernel functions. Each kernel function has an empty condition, indicating that +the kernel functions are not gated by any condition. The following rules are +used to determine the condition for each remaining function _F_ in the call +tree: + +* If all of the callers of _F_ have no condition, then _F_ has no condition. + +* Otherwise, the condition for _F_ is the union of the conditions of all of + its callers (i.e _C1_ OR _C2_ ...). + +* A call to a _Conditional Caller_ propagates the _Condition Expression_ to + the _Conditional Action_ function as follows. Consider a function _F_ that + has condition _CF_. Function _F_ calls a _Conditional Caller_ with condition + _CAction_ and a _Conditional Action_ function _G_. The condition for + function _G_ is (_CF_ AND _CAction_). + +Iterate over each call to a _Conditional Caller_ function. These calls have +the following form: + +``` +call void @call_if_on_device_conditionally_GUID_N( + @CallableXXX, %callablethis, N1, N2, ...) +``` + +Change each such call site into a direct call to the _Conditional Action_ +function like so: + +``` +call void @CallableXXX(%callablethis) +``` + +And also record a _Condition Expression_ that gates the call site, which is +the same _Condition Expression_ from the _Conditional Caller_. + +We now have a gating condition for some of the device functions and for some of +the call sites. Each of these conditions must be converted into a SPIR-V +specialization constant expression. The leaves of this expression are +**OpSpecConstantFalse** instructions representing either an aspect or an +architecture. The other nodes in this expression are **OpSpecConstantOp** +instructions representing the AND, OR, and NOT operations in the expression. + +Although it is not necessary for correctness, it is advantageous to optimize +the specialization constant expressions to eliminate duplicate expression +nodes. In particular, it is advantageous to eliminate duplicate leaf nodes +because this will reduce the number of elements in the +"SYCL/special specialization constants" property set (described below), and +thus reduce the amount of work the runtime needs to do when it loads a device +image. + +When generating SPIR-V from the LLVM IR, the following must happen: + +* Each function definition that has a gating condition _C_ must have its + **OpFunction** instruction decorated with **ConditionalINTEL**. + +* Each function call that has a gating condition _C_ (i.e. a + _Conditional Caller_) must have its **OpCall** instruction decorated with + **ConditionalINTEL**. + +* Any capability that is used only from functions decorated with + **ConditionalINTEL** must result in **OpConditionalCapabilityINTEL** rather + than **OpCapability**. + +* Any **OpTypeXXX** instruction that is referenced only from functions + decorated with **ConditionalINTEL** must also be decorated with + **ConditionalINTEL**. + +The `sycl-post-link` tool must also generate the new property set named +"SYCL/special specialization constants", and it must generate new properties +in the "SYCL/device requirements" property set. These properties are +described in the [section below][sec-alt-prop-sets]. + +The "SYCL/special specialization constants" property set is constructed from +the leaf nodes of the SPIR-V specialization constant expressions described +above. Leaf nodes that represent aspects are added to the "aspects" property, +and leaf nodes that represent architectures are added to the "architectures" +property + +The device code split algorithm ensures that all kernels in the same device +image have the same value for their `!sycl_conditionally_used_aspects` +metadata. The value of this metadata can be used to create the +"conditional-aspects-_N_" and "condition-_N_" properties. + +### New properties and property sets +[sec-alt-prop-sets]: <#new-properties-and-property-sets> + +When the `sycl-post-link` tool runs in non-AOT mode, it needs to communicate +information to the DPC++ runtime about the specialization constants that must +be set in order to select the _Conditional Action_ functions. We do this by +adding a new property set named "SYCL/special specialization constants". There +are two properties in this set, one named "aspects" and another named +"architectures". The value of each has type `PI_PROPERTY_TYPE_BYTE_ARRAY` +containing a series of `uint32` values. The property's size (which is always +divisible by 4) tells the number of `uint32` integers. + +The property "aspects" is a series of pairs (_SpecID_, _aspect_), where +_SpecID_ is the ID of an **OpSpecConstantFalse** instruction representing the +aspect whose numerical value in `enum aspect` is _aspect_. The property +"architectures" is a series of pairs (_SpecID_, _arch_), where _SpecID_ is the +ID of an **OpSpecConstantFalse** instruction representing the architecture +whose numerical value in `enum architecture` is _arch_. + +``` +[SYCL/special specialization constants] +"aspects": SpecID1, aspect1, SpecID2, aspect2, ... +"architectures": SpecID3, arch1, SpecID4, arch2, ... +``` + +The `sycl-post-link` tool must also communicate information to the DPC++ +runtime about the aspects that are conditionally used in an image. We do this +by adding new properties to the existing "SYCL/device requirements" property +set. These properties have names of the form "condition-_N_" and +"conditional-aspects-_N_", where _N_ is an integer in string form. The value +of a "condition-_N_" property has type `PI_PROPERTY_TYPE_BYTE_ARRAY` containing +a series of `int32` values representing a _Condition Expression_. The value +of a "conditional-aspects-_N_" property has type `PI_PROPERTY_TYPE_BYTE_ARRAY` +containing a series of `uint32` values representing aspect values. In both +cases, the property's size (which is always divisible by 4) tells the number +of values. + +These properties are grouped in pairs, matched by the value of _N_. For +example, "condition-0" is matched with "conditional-aspects-0", etc. The +"conditional-aspects-_N_" property tells the set of aspects that the image uses +when the device satisfies the condition expressed in "condition-_N_". + +``` +[SYCL/device requirements] +"condition-0": N1, N2, N3, ... +"conditional-aspects-0": aspect1, aspect2, ... +"condition-1": N4, N5, N6, ... +"conditional-aspects-1": aspect3, aspect4, ... +``` + +### Changes to the DPC++ runtime + +The DPC++ runtime changes in several ways. Before creating a `pi_program` from +a device image, the runtime examines the "SYCL/device requirements" property +set to see if the aspects used by the image are compatible with the device. +This logic is extended to make use of the "conditional-aspects-_N_" properties. +The runtime evaluates each _Condition Expression_ in "condition-_N_" and adds +the associated aspects to the requirements if the expression is true. Existing +code in the DPC++ runtime then checks that the device has all of these aspects +before attempting to create a `pi_program` from the device image. + +After creating the `pi_program`, the runtime examines the aspect and +architecture values in the "SYCL/special specialization constants" property +set. Whenever the device has the aspect or architecture, the runtime calls +`piextProgramSetSpecializationConstant` to set the associated _SpecID_ to true. +Setting these specialization constants causes the _Conditional Actions_ to be +selected according to the device capabilities. + +Ideally, each backend UMD (e.g. Level Zero, OpenCL) would support the proposed +[SPV\_INTEL\_spec\_conditional][8] extension. However, if these backends +cannot be updated in time, the runtime can implement a preprocessing step as a +SPIR-V to SPIR-V translator that does the code specialization. The output of +this translator can then be sent to the backend UMD even if it does not support +SPV\_INTEL\_spec\_conditional. + + +## Alternate design for AOT targets + +This is an alternate design that we can consider for targets that are AOT +compiled. This alternate design is orthogonal to the alternate design +presented above for non-AOT targets, so we could implement either alternate +design on its own or both together. + +In any AOT design, we need to resolve the _Conditional Actions_ at some point +in the AOT compilation. In the primary design, this is done in +`sycl-post-link`. This has both advantages and disadvantages. The +disadvantage is that we may lose some LLVM IR optimizations because the +optimization passes (which run before `sycl-post-link`) operate on the code +before the _Conditional Actions_ are resolved. As an example, the optimizer +won't be able to inline the body of these _Conditional Action_ functions at +their call sites. We think this will be mitigated for Intel GPU targets, +though, because `ocloc` (which runs after `sycl-post-link`) can perform these +optimizations. + +As an alternative design, we could instead resolve the _Conditional Actions_ in +an LLVM IR pass that happens much earlier in the pipeline. This would allow +the LLVM optimization passes to better optimize the code because the +_Conditional Action_ functions will either be direct calls or they will be +entirely eliminated by the time the optimizer passes run. If we do this, it +makes sense to resolve the _Conditional Actions_ sometime before the +[SYCLPropagateAspectsUsagePass][3] runs. The algorithm would be similar to +what `sycl-post-link` does in the primary design: + +* Iterate over each call to a _Conditional Caller_ function. These calls have + the following form: + + ``` + call void @call_if_on_device_conditionally_GUID_N( + @CallableXXX, %callablethis, N1, N2, ...) + ``` + +* Using the device configuration file, determine if the _Condition + Expression_ is true for this target. + +* If the condition is true, change the call site into a direct call to the + _Conditional Action_ function like so: + + ``` + call void @CallableXXX(%callablethis) + ``` + +* If the condition is false, remove the call site entirely. + +The SYCLPropagateAspectsUsagePass will then propagate aspects used by the +remaining _Conditional Actions_ up to the kernel functions as +`!sycl_used_aspects` metadata (i.e. unconditionally used aspects). + +The disadvantage to this design is that it increases compilation time when +there are multiple targets. Once the _Conditional Actions_ are resolved, +the LLVM IR is now specialized for one particular AOT target. If the user has +asked to compile for multiple targets, we need to split the IR prior to +resolving the _Conditional Actions_ and run the LLVM IR pipeline separately for +each copy of the IR. + +It would be possible to use this alternate design for some AOT targets but not +others. For example, we may decide that `ocloc` can mitigate any lost +optimizations for Intel GPU targets, so the primary design may make sense for +those targets. However, we might decide that other targets (e.g. Nvidia) work +better with the alternate design. diff --git a/sycl/doc/design/spirv-extensions/SPV_INTEL_spec_conditional.asciidoc b/sycl/doc/design/spirv-extensions/SPV_INTEL_spec_conditional.asciidoc new file mode 100644 index 0000000000000..83e02e65e45dc --- /dev/null +++ b/sycl/doc/design/spirv-extensions/SPV_INTEL_spec_conditional.asciidoc @@ -0,0 +1,244 @@ +:extension_name: SPV_INTEL_spec_conditional +:capability_name: SpecConditionalINTEL +:capability_token: TBD +:conditional_token: TBD +:conditional_capability_token: TBD +:conditional_name: ConditionalINTEL +:conditional_capability_name: OpConditionalCapabilityINTEL + +{extension_name} +================ + +== Name Strings + +{extension_name} + +== Contact + +To report problems with this extension, please open a new issue at: + +https://github.com/intel/llvm + +== Contributors + +// spell-checker: disable +- Greg Lueck, Intel + +// spell-checker: enable + +== Notice + +Copyright (c) 2023 Intel Corporation. All rights reserved. + +== Status + +Draft + +The Overview section should be expanded and a description of the validation +rules should be added. + +== Version + +[width="40%",cols="25,25"] +|======================================== +| Last Modified Date | {docdate} +| Revision | 1 +|======================================== + +== Dependencies + +This extension is written against the SPIR-V Specification, Version 1.6 Revision +2. + +This extension requires SPIR-V 1.0. + +== Overview + +This extension provides the ability to conditionally exclude (or specialize) +code from a SPIR-V module based on the value of a specialization constant. The +primary motivation for this extension is to allow offload kernels to make use +of special hardware features that are available on some devices while also +providing a fallback path when the code runs on a device that does not support +those features. However, this extension could be useful in other execution +environments also. + +The granularity of code specialization is primary at the function boundary, +which is convenient because tools generating SPIR-V may need to outline each +version of the code into separate functions anyways. + +== Extension Name + +To use this extension within a SPIR-V module, the appropriate *OpExtension* must +be present in the module: + +[subs="attributes"] +---- +OpExtension "{extension_name}" +---- + +== New Capabilities + +This extension introduces new capabilities: + +[subs="attributes"] +---- +{capability_name} +---- + +== New Decorations + +Decorations added under the *{capability_name}* capability: + +[subs="attributes"] +---- +{conditional_name} +---- + +== New Instructions + +Instructions added under the *{capability_name}* capability: + +[subs="attributes"] +---- +{conditional_capability_name} +---- + +== Token Number Assignments + +[width="40%"] +[cols="70%,30%"] +[grid="rows"] +|==== +|*{capability_name}* | {capability_token} +|*{conditional_name}* | {conditional_token} +|*{conditional_capability_name}* | {conditional_capability_token} +|==== + +== Modifications to the SPIR-V Specification, Version 1.6 + +=== Logical Layout of a Module + +Modify section 2.4, Logical Layout of a Module, changing item 1 in the layout +list to say: + +1. All *OpCapability* and *{conditional_capability_name}* instructions. If + there are any *{conditional_capability_name}* instructions in this section, + they must come after the *OpCapability* for *{capability_name}*. + +=== Specialization + +Modify section 2.12, Specialization, adding the following rules to the +specialization algorithm: + +* If the 'Condition' for an *{conditional_capability_name}* instruction is a + non-specialization constant: +** If the 'Condition' is *false*, the *{conditional_capability_name}* + instruction is removed. +** If the 'Condition' is *true*, the *{conditional_capability_name}* + instruction is replaced with *OpCapability* with the same capability. + (Note this is not a replace-in-place operation because the instructions have + different sizes.) + +* If the 'Condition' for an *{conditional_name}* decoration is a + non-specialization constant whose value is *true*, the decoration itself is + removed. If the 'Condition' is a non-specialization constant whose value is + *false*, the following happens for the decorated instruction: +** If the decorated instruction is not *OpFunction*, the instruction is + removed. All decorations for the instruction are also removed, and any + *OpName* or *OpMemberName* referencing the instruction are removed. +** If the decorated instruction is *OpFunction*, the function and all of the + instructions in its body are removed. All decorations for the function and + the instructions in its body are removed. If the *OpFunction* has an + associated *OpEntryPoint*, that is also removed. Any *OpName* or + *OpMemberName* that references the *OpFunction* or any of the instructions + in its body are also removed. + +=== Capabilities + +Modify Section 3.31, Capability, adding rows to the Capability table: + +-- +[options="header"] +|==== +2+^| Capability ^| Implicitly Declares +| {capability_token} | *{capability_name}* +| +|==== +-- + +=== Decorations + +Modify Section 3.20, Decoration, adding rows to the Decoration table: + +-- +[cols="1,5,2,2",options="header"] +|==== +2+^| Decoration ^| Extra Operands | Enabling Capabilities + +// --- ROW BREAK --- +| {conditional_token} +a| +*{conditional_name}* + +May be applied only to *OpFunction*, *OpFunctionCall*, global (module scope) +*OpVariable*, type declarations (*OpTypeXXX*), or constant instructions +(*OpConstantXXX* or *OpSpecConstantXXX*). The 'Condition' must be the result +of a specialization constant of scalar 'Boolean type'. + +Indicates that the decorated instruction must be removed during specialization +if the value of the specialization constant identified by 'Condition' is +*false*. The Specialization section describes the rules for removing +instructions decorated this way. + +| '' + +'Condition' +| *{capability_name}* + +|==== +-- + +=== Instructions + +Add to Section 3.42.5, Mode-Setting Instructions: + +[cols="1,1,2*3",width="100%"] +|===== +3+|[[OpConditionalCapability]]*{conditional_capability_name}* + + + +Declare a capability that is conditionally used by this module, depending on +the value of a specialization constant. + +The 'Condition' must be the result of a specialization constant of scalar +'Boolean type'. + +The 'Conditional Capability' is used by this module only if the specialization +constant identified by 'Condition' is *true*. + +1+|Capability: + +*{capability_name}* +1+| 3 | {conditional_capability_token} +| '' + +'Condition' +| 'Capability' + +'Conditional Capability' +|===== + +== Issues + +1. Does it make sense to place *{conditional_capability_name}* instructions in + the same layout section as *OpCapability*? This means they will come before + the *OpExtension* that declares this extension. The alternative is to place + them later in the layout, but this could make specialization harder. The + nice thing about the current position is that the specializer can just + replace *{conditional_capability_name}* with *OpCapability* without changing + its position. ++ +*UNRESOLVED* + +== Revision History + +[cols="5,15,15,70"] +[grid="rows"] +[options="header"] +|======================================== +|Rev|Date|Author|Changes +|1|2023-04-19|Greg Lueck|Initial draft +|========================================