From a27bcbf4676c708a2a97a19dd031952805e9cc0c Mon Sep 17 00:00:00 2001 From: Mariya Podchishchaeva Date: Fri, 29 Jan 2021 18:43:21 +0300 Subject: [PATCH 01/18] [Doc] Add design doc for shared device libraries feature --- sycl/doc/SharedLibraries.md | 276 ++++++++++++++++++ .../images/LD-preload-shared-libraries.svg | 1 + sycl/doc/images/ODR-shared-libraries.svg | 1 + 3 files changed, 278 insertions(+) create mode 100755 sycl/doc/SharedLibraries.md create mode 100755 sycl/doc/images/LD-preload-shared-libraries.svg create mode 100755 sycl/doc/images/ODR-shared-libraries.svg diff --git a/sycl/doc/SharedLibraries.md b/sycl/doc/SharedLibraries.md new file mode 100755 index 0000000000000..72649d8f4f605 --- /dev/null +++ b/sycl/doc/SharedLibraries.md @@ -0,0 +1,276 @@ +# Shared DPC++ libraries + +This document describes purpose and design of Shared DPC++ libraries feature. + +## Background +Sometimes users want to provide *device* functions via shared libraries. +Simple source example: +``` +// App: + +CGH.parallel_for(/* ... */ { + library_function(); +}); + + +// Shared library: +SYCL_EXTERNAL void library_function() { + // do something +} +``` +It is possible to manually create `sycl::program` in both app and shared +library, then use `link` SYCL API to get a single program and launch kernels +using it. But it is not user-friendly and it is very different from regular +C/C++ workflow. + +The main purpose of this feature is to provide a mechanism which allows to +provide *device* functions via shared libraries and works as close as possible +to regular shared libraries. + +## Requrements: +User's code is compiled into a shared library which consists of some host API, +device code and device API (`SYCL_EXTERNAL` functions). The library is linked to +a user's application which also contains some device code and performs +computations using DPC++/SYCL. +For this combination the following statements must be true: + +- `SYCL_EXTERNAL` functions from library can be called (directly or indirectly) + from device code of the application. +- Function pointers taken in application should work inside the library. +- Specific code changes are not required, i.e. the mechanism of linking works + as close as possible to regular shared libraries. + +## Design +The overall idea is simple: + +- Each device image is supplied with an information about exported and imported + symbols using device image properties +- DPC++ RT performs *device images collection* task by grouping all device + images required to execute a kernel based on the list of exports/imports + - Besides symbol names, additional attributes are taken into account (like + device image format: SPIR-V or device asm) +- Actual linking is performed by underlying backend (OpenCL/L0/etc.) + +Next sections describe details of changes in each component. + +### DPC++ front-end changes + +DPC++ front-end generates `module-id` attribute on each `SYCL_EXTERNAL` function. +It was generated only on kernels earlier. There are two reasons to start +generating this attribute on `SYCL_EXTERNAL` functions: + +- Later in pipeline, this attribute will be used by `sycl-post-link` tool to + separate `SYCL_EXTERNAL` functions from non-`SYCL_EXTERNAL` functions with + external linkage. +- `module-id` attribute also contains information about source file where the + function comes from. This information will be used to perform device code + split on device images that contain only exported functions. + +### sycl-post-link changes + +`sycl-post-link` performs 3 important tasks: +- Arranges `SYCL_EXTERNAL` functions into a separate device image(s) +- Supplies device images containing exports with an information about exported + symbols +- Supplies each device image with an information about imported symbols + +`sycl-post-link` outlines `SYCL_EXTERNAL` functions with all their reachable +dependencies (functions with definitions called from `SYCL_EXTERNAL` ones) +into a separate device image(s) in order to create minimal self-contained +device images that can be linked from the user's app. There are several +notable moments though. + +If a `SYCL_EXTERNAL` function is used within a kernel defined in a shared +library, it will be duplicated: one instance will be stored in the kernel's +device image and the function won't exported from this device image, while the +other will be stored in a special device image for other `SYCL_EXTERNAL` +functions and will be marked as exported there. Such duplication is need for +two reasons: +- We aim to make device images with kernels self-contained so no JIT linker + invocations would be needed if we have definitions of all called functions. + Also note that if AOT is requested, it would be impossible to link anything + at runtime. +- We could export `SYCL_EXTERNAL` functions from device images with kernels, + but it would mean that when user's app calls `SYCL_EXTERNAL` function, it has + to link a whole kernel and all its dependencies - not only it increases the + amount of unnecessary linked code, but might also lead to build errors if the + kernel uses some features, which are not supported by target device (and they + are not used in the `SYCL_EXTERNAL` function). +Besides separating `SYCL_EXTERNAL` functions from kernels, they can be further +split into separate device images if device code split is requested. This is +done by grouping them using `module-id` attribute. Non-`SYCL_EXTERNAL` functions +used by `SYCL_EXTERNAL` functions with different `module-id` attributes are +copied to device images corresponding to those `SYCL_EXTERNAL` functions +to make them self-contained +In case one `SYCL_EXTERNAL` function uses another `SYCL_EXTERNAL` function +with different `module-id` attribute, the second one is not copied to the +device image with the first function, but dependency between those device images +is recorder instead. + +After `SYCL_EXTERNAL` functions are arranged into a separate device image(s), +all non-`SYCL_EXTERNAL` functions are internalized to avoid multiple definition +errors during runtime linking. +Device images with `SYCL_EXTERNAL` functions will also have a list of names +of exported functions. + +**NOTE**: If device code split is enabled, it seems reasonable to perform +exports arrangement before device code split procedure. + +In orger to collect information about imported symbols `sycl-post-link` looks +through LLVM IR and for each declared but not defined symbol records its name, +except the following cases: +- Declarations with `__` prefix in demangled name are not recorded as imported + functions + - Declarations with `__spirv_*` prefix should not be recorded as dependencies + since they represent SPIR-V operations and will be transformed to SPIR-V + instructions during LLVM->SPIR-V translation. +- Based on some attributes which could be defined later + - This is needed to have possibility to call device-specific builtins not + starting with `__` by forward-declaring them in DPC++ code + +**NOTE**: If device code split is enabled, imports collection is performed after +split and it is performed on splitted images. + +All collected information is attached to a device image via properties +mechanism. + +Each device image is supplied with an array of property sets: +``` +struct pi_device_binary_struct { +... + // Array of property sets + pi_device_binary_property_set PropertySetsBegin; + pi_device_binary_property_set PropertySetsEnd; +}; + +``` +Each property set is represent by the following struct: +``` +// Named array of properties. +struct _pi_device_binary_property_set_struct { + char *Name; // the name + pi_device_binary_property PropertiesBegin; // array start + pi_device_binary_property PropertiesEnd; // array end +}; +``` +It contains name of property set and array of properties. Each property is +represented by the following struct: +``` +struct _pi_device_binary_property_struct { + char *Name; // null-terminated property name + void *ValAddr; // address of property value + uint32_t Type; // _pi_property_type + uint64_t ValSize; // size of property value in bytes +}; +``` + +List of imported symbols is represented as a single property set with name +`ImportedSymbols` recorded in the `Name` field of property set. +Each property in this set holds name of the particular imported symbol recorded +in the `Name` field of the property. +List of exported symbols is represented in the same way, except the +corresponding set has the name `ExportedSymbols`. + +### DPC++ runtime changes + +DPC++ RT performs *device images collection* task by grouping all device +images required to execute a kernel based on the list of exports/imports and +links them together using PI API. + +Given that all exports will be arranged to a separate device images without +kernels it is reasonable to store device images with exports in a separate data +structure. + +## Corner cases and limitations + +It is not guaranteed that behaviour of host shared libraries and device shared +libraries will always match. There are several cases when it can occur, the +next sections will cover details of such cases. + +### ODR violations + +C++ standard defines One Definition Rule as: +> Every program shall contain exactly one definition of every non-inline + function or variable that is odr-used in that program outside of a discarded + statement; no diagnostic required. + The definition can appear explicitly in the program, it can be found in the + standard or a user-defined library, or (when appropriate) it is implicitly + defined. + + +Here is an example: + +![ODR violation](images/ODR-shared-libraries.svg) + +Both libraries libB and libC provide two different definitions of function +`b()`, so this example illustrates ODR violation. Technically this case has +undefined behaviour, however it is possible to run and compile this example on +Linux and Windows. Whereas on Linux only function `b()` from library libB is +called, on Windows both versions of function `b()` are used. +Most of backends online linkers act like static linkers, i.e. just merge +device images with each other, so it is not possible to correctly imitate +Windows behaviour in device code linking because attempts to do it will result +in multiple definition errors. + +Given that, it is not guaranteed that behaviour of shared host libraries and +shared device libraries will always match in case of such ODR violations. + +#### LD_PRELOAD + +Another way to violate ODR is `LD_PRELOAD` environment variable on Linux. It +allows to load specified shared library before any other shared libraries so it +will be searched for symbols before other shared libraries. It allows to +substitute functions from regular shared libraries by functions from preloaded +library. +Device code registration is implemented using global constructors. Order of +global constructors calling is not defined across different translation units, +so with current design of device shared libraries and device code registration +mechanism it is not possible to understand which device code comes from +preloaded library and which comes from regular shared libraries. + +Here is an example: + +![LD_PRELOAD](images/LD-preload-shared-libraries.svg) + +"libPreload" library is preloaded using `LD_PRELOAD` environment variable. +In this example, device code from "libPreload" might be registered after +device code from "libA". + +To implement basic support, for each device image we can record name of the +library where this device image comes from and parse content of `LD_PRELOAD` +environment variable to choose the proper images. However such implementation +will only allow to substitute a whole device image and not an arbitrary +function (unless it is the only function in a device image), because partial +substitution will cause multiple definition errors during runtime linking. + +### Run-time libraries loading + +It is possible to load shared library during run-time. Both Linux and Windows +provide a way to do so (for example `dlopen()` on Linux or `LoadLibrary` on +Windows). +In case run-time loading is used to load some shared library, the symbols from +this shared library do not appear in the namespace of the main program. It means +that even though shared library is loaded successfully in run-time, it is not +possible to use symbols from it directly. The symbols from run-time loaded +library can be accessed by address which can be obtained using corresponding +OS-dependent API (for example `dlsym()` on Linux). + +The problem here is that even though symbols from run-time loaded shared +library are not part of application's namespace, the library is loaded through +standard mechanism, i.e. its global constructors are invoked which means that +device code from it is registered, so it is not possible to +understand whether device code comes from run-time loaded library or not. +If such run-time loaded library exports device symbols and they +somehow match with symbols that actually directly used in device code +somewhere, it is possible that symbols from run-time loaded library +will be unexpectedly used. + +To resolve this problem we need to ensure that device code registered from +run-time loaded library appears at the end of symbols search list, however +having that device code registration is triggered by global constructors, it +doesn't seem possible. + +One more possible mitigation would be to record name of the library from which +each symbol should be imported, but it still won't resolve all potential +issues with run-time library loading, because user can load the library with the +same name as one of the explicitly linked libraries. diff --git a/sycl/doc/images/LD-preload-shared-libraries.svg b/sycl/doc/images/LD-preload-shared-libraries.svg new file mode 100755 index 0000000000000..a70103cb47da8 --- /dev/null +++ b/sycl/doc/images/LD-preload-shared-libraries.svg @@ -0,0 +1 @@ +Applicationa();b();libPreloadvoida(){}libAvoida(){}void b() {} \ No newline at end of file diff --git a/sycl/doc/images/ODR-shared-libraries.svg b/sycl/doc/images/ODR-shared-libraries.svg new file mode 100755 index 0000000000000..3ec8a7dde077b --- /dev/null +++ b/sycl/doc/images/ODR-shared-libraries.svg @@ -0,0 +1 @@ +Applicationa();b();libAvoid a() { b(); }libBvoid b() { //2 }libCvoid b() { // 1 } \ No newline at end of file From 6fe222d7c4350d2917a5236649b8aca130660bbc Mon Sep 17 00:00:00 2001 From: Mariya Podchishchaeva Date: Thu, 18 Feb 2021 10:24:37 +0300 Subject: [PATCH 02/18] Apply suggestions from code review Co-authored-by: kbobrovs Co-authored-by: Alexey Sachkov --- sycl/doc/SharedLibraries.md | 10 ++++++---- 1 file changed, 6 insertions(+), 4 deletions(-) diff --git a/sycl/doc/SharedLibraries.md b/sycl/doc/SharedLibraries.md index 72649d8f4f605..b2630327a855a 100755 --- a/sycl/doc/SharedLibraries.md +++ b/sycl/doc/SharedLibraries.md @@ -68,7 +68,7 @@ generating this attribute on `SYCL_EXTERNAL` functions: ### sycl-post-link changes -`sycl-post-link` performs 3 important tasks: +To support dynamic device linkage, `sycl-post-link` performs 3 main tasks: - Arranges `SYCL_EXTERNAL` functions into a separate device image(s) - Supplies device images containing exports with an information about exported symbols @@ -110,8 +110,9 @@ is recorder instead. After `SYCL_EXTERNAL` functions are arranged into a separate device image(s), all non-`SYCL_EXTERNAL` functions are internalized to avoid multiple definition errors during runtime linking. -Device images with `SYCL_EXTERNAL` functions will also have a list of names -of exported functions. +Device images with `SYCL_EXTERNAL` functions will also get a list of names +of exported functions attached to them through device image properties +(described below). **NOTE**: If device code split is enabled, it seems reasonable to perform exports arrangement before device code split procedure. @@ -124,7 +125,8 @@ except the following cases: - Declarations with `__spirv_*` prefix should not be recorded as dependencies since they represent SPIR-V operations and will be transformed to SPIR-V instructions during LLVM->SPIR-V translation. -- Based on some attributes which could be defined later +- Based on some attributes (which could be defined later) we may want to avoid + listing some functions as imported ones - This is needed to have possibility to call device-specific builtins not starting with `__` by forward-declaring them in DPC++ code From 604909c787b6ad86faa72b5f004c1b1a9fab836b Mon Sep 17 00:00:00 2001 From: Mariya Podchishchaeva Date: Thu, 18 Feb 2021 10:31:07 +0300 Subject: [PATCH 03/18] Apply minor comment, fix a typo --- sycl/doc/SharedLibraries.md | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/sycl/doc/SharedLibraries.md b/sycl/doc/SharedLibraries.md index b2630327a855a..1279e1e68fa17 100755 --- a/sycl/doc/SharedLibraries.md +++ b/sycl/doc/SharedLibraries.md @@ -96,12 +96,12 @@ two reasons: amount of unnecessary linked code, but might also lead to build errors if the kernel uses some features, which are not supported by target device (and they are not used in the `SYCL_EXTERNAL` function). -Besides separating `SYCL_EXTERNAL` functions from kernels, they can be further -split into separate device images if device code split is requested. This is -done by grouping them using `module-id` attribute. Non-`SYCL_EXTERNAL` functions -used by `SYCL_EXTERNAL` functions with different `module-id` attributes are -copied to device images corresponding to those `SYCL_EXTERNAL` functions -to make them self-contained +Besides separating `SYCL_EXTERNAL` functions from kernels, `sycl-post-link` +can also distribute those functions into separate device images if device code +split is requested. This is done by grouping them using `module-id` attribute. +Non-`SYCL_EXTERNAL` functions used by `SYCL_EXTERNAL` functions with different +`module-id` attributes are copied to device images corresponding to those +`SYCL_EXTERNAL` functions to make them self-contained. In case one `SYCL_EXTERNAL` function uses another `SYCL_EXTERNAL` function with different `module-id` attribute, the second one is not copied to the device image with the first function, but dependency between those device images @@ -117,7 +117,7 @@ of exported functions attached to them through device image properties **NOTE**: If device code split is enabled, it seems reasonable to perform exports arrangement before device code split procedure. -In orger to collect information about imported symbols `sycl-post-link` looks +In order to collect information about imported symbols `sycl-post-link` looks through LLVM IR and for each declared but not defined symbol records its name, except the following cases: - Declarations with `__` prefix in demangled name are not recorded as imported From df953fceaa8111250625145a9c8087f9ba43313a Mon Sep 17 00:00:00 2001 From: Mariya Podchishchaeva Date: Fri, 5 Mar 2021 18:50:12 +0300 Subject: [PATCH 04/18] Rename the feature, add runtime section --- sycl/doc/SharedLibraries.md | 149 ++++++++++++++++++++++++++++++------ 1 file changed, 125 insertions(+), 24 deletions(-) diff --git a/sycl/doc/SharedLibraries.md b/sycl/doc/SharedLibraries.md index 1279e1e68fa17..28674f88a4aaa 100755 --- a/sycl/doc/SharedLibraries.md +++ b/sycl/doc/SharedLibraries.md @@ -1,9 +1,11 @@ -# Shared DPC++ libraries +# Dynamic linking of device code -This document describes purpose and design of Shared DPC++ libraries feature. +This document describes purpose and design of dynamic linking of device code +feature. ## Background -Sometimes users want to provide *device* functions via shared libraries. +Sometimes users want to link device code dynamically at run time. One possible +use case for such linkage - providing device functions via shared libraries. Simple source example: ``` // App: @@ -23,30 +25,50 @@ library, then use `link` SYCL API to get a single program and launch kernels using it. But it is not user-friendly and it is very different from regular C/C++ workflow. +Another possible scenario - use functions defined in pre-compiled device image +provided by user. Example: +``` +// a.cpp +SYCL_EXTERNAL void foo(); +... +parallel_for([]() { foo(); }); + +// b.cpp +/*no SYCL_EXTERNAL*/ void foo() { ... } +``` +We have a `SYCL_EXTERNAL` function `foo` called from a kernel, but the +application defined only host version of this function. Then user adds device +image with definition of `foo` to the fat object via special option. + The main purpose of this feature is to provide a mechanism which allows to -provide *device* functions via shared libraries and works as close as possible -to regular shared libraries. +link device code dynamically at runtime. ## Requrements: -User's code is compiled into a shared library which consists of some host API, -device code and device API (`SYCL_EXTERNAL` functions). The library is linked to -a user's application which also contains some device code and performs -computations using DPC++/SYCL. +User's device code that consists of some device API (`SYCL_EXTERNAL` functions), +is compiled into some form and it is not linked statically with device code of +application. It can be a shared library that contains some device code or a +separate device image supplied with property information. This code is linked +dynamically at run time with device code of a user's application in order to +resolve dependencies. For this combination the following statements must be true: -- `SYCL_EXTERNAL` functions from library can be called (directly or indirectly) - from device code of the application. -- Function pointers taken in application should work inside the library. +- `SYCL_EXTERNAL` functions defined in dynamically linked code can be called + (directly or indirectly) from device code of the application. +- Function pointers taken in application ashould work inside the dynamically + linked code. - Specific code changes are not required, i.e. the mechanism of linking works as close as possible to regular shared libraries. ## Design -The overall idea is simple: - -- Each device image is supplied with an information about exported and imported - symbols using device image properties -- DPC++ RT performs *device images collection* task by grouping all device - images required to execute a kernel based on the list of exports/imports +The overall idea: + +- Each device image is supplied with a list of imported symbol names + through device image properties mechanism +- `SYCL_EXTERNAL` functions are arranged into separate device images supplied + with a list of exported symbol names +- Before compiling a device image DPC++ RT will check if device image has a list + of imported symbols and if it has, then RT will search for device images which + define required symbols using lists of exported symbols. - Besides symbol names, additional attributes are taken into account (like device image format: SPIR-V or device asm) - Actual linking is performed by underlying backend (OpenCL/L0/etc.) @@ -108,7 +130,8 @@ device image with the first function, but dependency between those device images is recorder instead. After `SYCL_EXTERNAL` functions are arranged into a separate device image(s), -all non-`SYCL_EXTERNAL` functions are internalized to avoid multiple definition +all non-`SYCL_EXTERNAL` functions and `SYCL_EXTERNAL` functions left in device +images with kernels marked with internal linkage to avoid multiple definition errors during runtime linking. Device images with `SYCL_EXTERNAL` functions will also get a list of names of exported functions attached to them through device image properties @@ -167,11 +190,11 @@ struct _pi_device_binary_property_struct { ``` List of imported symbols is represented as a single property set with name -`ImportedSymbols` recorded in the `Name` field of property set. +`SYCL/imported symbols` recorded in the `Name` field of property set. Each property in this set holds name of the particular imported symbol recorded in the `Name` field of the property. List of exported symbols is represented in the same way, except the -corresponding set has the name `ExportedSymbols`. +corresponding set has the name `SYCL/exported symbols`. ### DPC++ runtime changes @@ -179,9 +202,87 @@ DPC++ RT performs *device images collection* task by grouping all device images required to execute a kernel based on the list of exports/imports and links them together using PI API. -Given that all exports will be arranged to a separate device images without -kernels it is reasonable to store device images with exports in a separate data -structure. +#### Device images collection + +DPC++ Runtime class named ProgramManager stores device images using following +data structure: +``` +/// Keeps all available device executable images added via \ref addImages. +/// Organizes the images as a map from a kernel set id to the vector of images +/// containing kernels from that set. +/// Access must be guarded by the \ref Sync::getGlobalLock() +std::unordered_map>> + m_DeviceImages; + +using StrToKSIdMap = std::unordered_map; +/// Maps names of kernels from a specific OS module (.exe .dll) to their set +/// id (the sets are disjoint). +std::unordered_map m_SymbolSets; +``` +Assume each device image represents some combination of symbols and different +device images may contain only exactly the same or not overlapping combination +of symbols. If it is not so, there can be two cases: + - Symbols are the same. In this case it doesn't matter which device image is + taken to use duplicated symbol + - Symbols are not the same. In this case ODR violation takes place, such + situation leads to undefined behaviour. For more details refer to + [ODR violations](#ODR-violations) section. + +Each combination of symbols is assigned with an Id number - symbol set Id. +A combination of symbols can exist in different formats (i.e. SPIR-V/AOT +compiled binary and etc). +`m_DeviceImages` maps an Id number to an array with device images which represent +the same combination of symbols in different formats. +`m_SymbolSets` contains mapping from symbol name to symbol set Id for each OS +module (.exe/.so/.dll). +`std::unordered_map` allows to search and access its elements with constant-time +complexity. + +Before compilation of device image to execute a kernel RT checks if the image +contains any import information in its properies and if it does, then RT +performs device images collection in order to resolve dependencies. + +Ids of all needed symbol sets are found. This is done by iterating through +`m_SymbolSets` map, i.e. iterating through all available OS modules without +predefined order and searching for first unresolved symbol in list of imports +set of target device image. Once device image that contains first symbol is +met, remaining exported symbols are checked in found image and if +they match some imported symbols then these matched symbols will be marked as +resolved. The procedure repeats until all imported symbols are resolved. +For each found symbol set Id program cache is checked in case if +necessary set of `SYCL_EXTERNAL` functions has been compiled and if it is true, +then compiled device image will be re-used for linking. +Otherwise device image containing required symbols set will be compiled and +stored in cache. + +#### Program caching + +Existing support for device code caching is re-used to cache programs created +from device images with SYCL external functions and linked device images with +imports information. + +##### In-memory cache + +Programs that contain only `SYCL_EXTERNAL` functions will be cached only in +compiled state, so they can be linked with other programs during dependency +resolution. + +The existing mechanism of caching is not changed for programs with +imports information. They are stored in cache after they compiled and linked +with programs that provide their dependencies. To identify linked programs +Id of "main" set of symbols (i.e. the one which actually contain kernels) will +be used. + +##### Persistent cache + +The documented approach to persistent cache needs to be expanded in presence +of dynamic linking support. One of the identifiers for built image hash is +hash made out of device image used as input for the JIT compilation. +In case when "main" image have imports information, device image hash should be +created from all device images that are necessary to build it, i.e. hash out +of "main" device image and set of 'SYCL_EXTERNAL'-only images that define all +symbols imported by "main device image. ## Corner cases and limitations From 702e1a48ab8a706e6acb737a5cfa1929387ab6a2 Mon Sep 17 00:00:00 2001 From: Mariya Podchishchaeva Date: Fri, 5 Mar 2021 19:04:31 +0300 Subject: [PATCH 05/18] Apply suggestions from code review --- sycl/doc/SharedLibraries.md | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/sycl/doc/SharedLibraries.md b/sycl/doc/SharedLibraries.md index 28674f88a4aaa..b7dc535661c1c 100755 --- a/sycl/doc/SharedLibraries.md +++ b/sycl/doc/SharedLibraries.md @@ -77,14 +77,14 @@ Next sections describe details of changes in each component. ### DPC++ front-end changes -DPC++ front-end generates `module-id` attribute on each `SYCL_EXTERNAL` function. +DPC++ front-end generates `sycl-module-id` attribute on each `SYCL_EXTERNAL` function. It was generated only on kernels earlier. There are two reasons to start generating this attribute on `SYCL_EXTERNAL` functions: - Later in pipeline, this attribute will be used by `sycl-post-link` tool to separate `SYCL_EXTERNAL` functions from non-`SYCL_EXTERNAL` functions with external linkage. -- `module-id` attribute also contains information about source file where the +- `sycl-module-id` attribute also contains information about source file where the function comes from. This information will be used to perform device code split on device images that contain only exported functions. @@ -122,10 +122,10 @@ Besides separating `SYCL_EXTERNAL` functions from kernels, `sycl-post-link` can also distribute those functions into separate device images if device code split is requested. This is done by grouping them using `module-id` attribute. Non-`SYCL_EXTERNAL` functions used by `SYCL_EXTERNAL` functions with different -`module-id` attributes are copied to device images corresponding to those +`sycl-module-id` attributes are copied to device images corresponding to those `SYCL_EXTERNAL` functions to make them self-contained. In case one `SYCL_EXTERNAL` function uses another `SYCL_EXTERNAL` function -with different `module-id` attribute, the second one is not copied to the +with different `sycl-module-id` attribute, the second one is not copied to the device image with the first function, but dependency between those device images is recorder instead. @@ -169,7 +169,7 @@ struct pi_device_binary_struct { }; ``` -Each property set is represent by the following struct: +Each property set is represented by the following struct: ``` // Named array of properties. struct _pi_device_binary_property_set_struct { From 60054b18c22b45f469ecdbda5a8979bcd996301f Mon Sep 17 00:00:00 2001 From: Mariya Podchishchaeva Date: Wed, 24 Mar 2021 16:37:12 +0300 Subject: [PATCH 06/18] Apply suggestions from code review Co-authored-by: vladimirlaz --- sycl/doc/SharedLibraries.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/sycl/doc/SharedLibraries.md b/sycl/doc/SharedLibraries.md index b7dc535661c1c..5d4936ede1911 100755 --- a/sycl/doc/SharedLibraries.md +++ b/sycl/doc/SharedLibraries.md @@ -43,7 +43,7 @@ image with definition of `foo` to the fat object via special option. The main purpose of this feature is to provide a mechanism which allows to link device code dynamically at runtime. -## Requrements: +## Requirements: User's device code that consists of some device API (`SYCL_EXTERNAL` functions), is compiled into some form and it is not linked statically with device code of application. It can be a shared library that contains some device code or a @@ -54,7 +54,7 @@ For this combination the following statements must be true: - `SYCL_EXTERNAL` functions defined in dynamically linked code can be called (directly or indirectly) from device code of the application. -- Function pointers taken in application ashould work inside the dynamically +- Function pointers taken in application should work inside the dynamically linked code. - Specific code changes are not required, i.e. the mechanism of linking works as close as possible to regular shared libraries. From 459730b9151a9b219e8ac649a8943e36df472fd6 Mon Sep 17 00:00:00 2001 From: Mariya Podchishchaeva Date: Fri, 9 Apr 2021 17:52:47 +0300 Subject: [PATCH 07/18] Do not separate SYCL_EXTERNAL functions from kernels --- sycl/doc/SharedLibraries.md | 299 ++++++++++++++++++++++++------------ 1 file changed, 204 insertions(+), 95 deletions(-) diff --git a/sycl/doc/SharedLibraries.md b/sycl/doc/SharedLibraries.md index 28674f88a4aaa..ee895cc239619 100755 --- a/sycl/doc/SharedLibraries.md +++ b/sycl/doc/SharedLibraries.md @@ -6,19 +6,29 @@ feature. ## Background Sometimes users want to link device code dynamically at run time. One possible use case for such linkage - providing device functions via shared libraries. -Simple source example: +Example: ``` -// App: - -CGH.parallel_for(/* ... */ { - library_function(); -}); +// app.cpp +SYCL_EXTERNAL int LibDeviceFunc(int i); +class KernelName; +/* ... */ +Q.submit([&](cl::sycl::handler &CGH) { +CGH.parallel_for(/* ... */ [=](sycl::item i) { + out[i] = LibDeviceFunc(i); +}); /* ... */ +std::cout << out[i] << “ “; + +// lib.cpp +int SYCL_EXTERNAL LibDeviceFunc(int i) { + return i * 2; +} +// Commands +clang++ -fsycl lib.cpp -shared -o helpers.so +clang++ -fsycl app.cpp -lhelpers -o a.out +./a.out +Output: 0 2 4 6… -// Shared library: -SYCL_EXTERNAL void library_function() { - // do something -} ``` It is possible to manually create `sycl::program` in both app and shared library, then use `link` SYCL API to get a single program and launch kernels @@ -31,7 +41,9 @@ provided by user. Example: // a.cpp SYCL_EXTERNAL void foo(); ... -parallel_for([]() { foo(); }); +Q.submit([&](cl::sycl::handler &CGH) { +CGH.parallel_for([]() { foo(); }); +}); // b.cpp /*no SYCL_EXTERNAL*/ void foo() { ... } @@ -43,7 +55,7 @@ image with definition of `foo` to the fat object via special option. The main purpose of this feature is to provide a mechanism which allows to link device code dynamically at runtime. -## Requrements: +## Requirements: User's device code that consists of some device API (`SYCL_EXTERNAL` functions), is compiled into some form and it is not linked statically with device code of application. It can be a shared library that contains some device code or a @@ -54,7 +66,7 @@ For this combination the following statements must be true: - `SYCL_EXTERNAL` functions defined in dynamically linked code can be called (directly or indirectly) from device code of the application. -- Function pointers taken in application ashould work inside the dynamically +- Function pointers taken in application should work inside the dynamically linked code. - Specific code changes are not required, i.e. the mechanism of linking works as close as possible to regular shared libraries. @@ -62,10 +74,8 @@ For this combination the following statements must be true: ## Design The overall idea: -- Each device image is supplied with a list of imported symbol names - through device image properties mechanism -- `SYCL_EXTERNAL` functions are arranged into separate device images supplied - with a list of exported symbol names +- Each device image is supplied with a list of imported and exported symbol + names through device image properties mechanism - Before compiling a device image DPC++ RT will check if device image has a list of imported symbols and if it has, then RT will search for device images which define required symbols using lists of exported symbols. @@ -77,72 +87,72 @@ Next sections describe details of changes in each component. ### DPC++ front-end changes -DPC++ front-end generates `module-id` attribute on each `SYCL_EXTERNAL` function. -It was generated only on kernels earlier. There are two reasons to start -generating this attribute on `SYCL_EXTERNAL` functions: - -- Later in pipeline, this attribute will be used by `sycl-post-link` tool to - separate `SYCL_EXTERNAL` functions from non-`SYCL_EXTERNAL` functions with - external linkage. -- `module-id` attribute also contains information about source file where the - function comes from. This information will be used to perform device code - split on device images that contain only exported functions. +Now during device code split process `SYCL_EXTERNAL` functions are +considered as entry points (as well as kernels). +For this purpose DPC++ front-end generates `module-id` attribute on each +`SYCL_EXTERNAL` function. ### sycl-post-link changes -To support dynamic device linkage, `sycl-post-link` performs 3 main tasks: -- Arranges `SYCL_EXTERNAL` functions into a separate device image(s) +To support dynamic linking of device code , `sycl-post-link` performs 2 main +tasks: - Supplies device images containing exports with an information about exported symbols -- Supplies each device image with an information about imported symbols - -`sycl-post-link` outlines `SYCL_EXTERNAL` functions with all their reachable -dependencies (functions with definitions called from `SYCL_EXTERNAL` ones) -into a separate device image(s) in order to create minimal self-contained -device images that can be linked from the user's app. There are several -notable moments though. - -If a `SYCL_EXTERNAL` function is used within a kernel defined in a shared -library, it will be duplicated: one instance will be stored in the kernel's -device image and the function won't exported from this device image, while the -other will be stored in a special device image for other `SYCL_EXTERNAL` -functions and will be marked as exported there. Such duplication is need for -two reasons: +- Supplies device images with an information about imported symbols + +In addition, `SYCL_EXTERNAL` functions as well as kernels are considered as entry +points during device code split. +If device code split is enabled `SYCL_EXTERNAL` functions defined in shared +libraries and used within it can be duplicated. +Example: +``` +// Shared library + +// A.cpp +SYCL_EXTERNAL int LibDeviceFunc(int i) { + return i * 2; +} + +// B.cpp +class LibKernel; +/* ... */ +Q.submit([&](cl::sycl::handler &CGH) { +CGH.parallel_for(/* ... */ [=](sycl::item i) { + out[i] = LibDeviceFunc(i); +} /* ... */ +``` +And if user requested per-source device code split, then for this shared library +`sycl-post-link` will create two device images and both of them will define +`LibDeviceFunc` function. However `LibDeviceFunc` won't be exported from device +image that corresponds to source file `B.cpp` and it will be exported only from +device image that corresponds to source file where `LibDeviceFunc` was defined, +i.e. `A.cpp`. + +Such duplication is needed for two reasons: - We aim to make device images with kernels self-contained so no JIT linker invocations would be needed if we have definitions of all called functions. - Also note that if AOT is requested, it would be impossible to link anything - at runtime. - We could export `SYCL_EXTERNAL` functions from device images with kernels, but it would mean that when user's app calls `SYCL_EXTERNAL` function, it has - to link a whole kernel and all its dependencies - not only it increases the - amount of unnecessary linked code, but might also lead to build errors if the - kernel uses some features, which are not supported by target device (and they - are not used in the `SYCL_EXTERNAL` function). -Besides separating `SYCL_EXTERNAL` functions from kernels, `sycl-post-link` -can also distribute those functions into separate device images if device code -split is requested. This is done by grouping them using `module-id` attribute. -Non-`SYCL_EXTERNAL` functions used by `SYCL_EXTERNAL` functions with different -`module-id` attributes are copied to device images corresponding to those -`SYCL_EXTERNAL` functions to make them self-contained. -In case one `SYCL_EXTERNAL` function uses another `SYCL_EXTERNAL` function -with different `module-id` attribute, the second one is not copied to the -device image with the first function, but dependency between those device images -is recorder instead. + to link a whole kernel and all its dependencies - so we leave a possibility + for user to arrange code on per-source basis. -After `SYCL_EXTERNAL` functions are arranged into a separate device image(s), -all non-`SYCL_EXTERNAL` functions and `SYCL_EXTERNAL` functions left in device -images with kernels marked with internal linkage to avoid multiple definition -errors during runtime linking. -Device images with `SYCL_EXTERNAL` functions will also get a list of names -of exported functions attached to them through device image properties -(described below). +Non-`SYCL_EXTERNAL` functions used by `SYCL_EXTERNAL` functions are copied to +device images corresponding to those `SYCL_EXTERNAL` functions to make them +self-contained. +In case one `SYCL_EXTERNAL` function uses another `SYCL_EXTERNAL` function +with different value in `sycl-module-id` attribute, the second one is not copied +to the device image with the first function, but dependency between those device +images is recorded instead. -**NOTE**: If device code split is enabled, it seems reasonable to perform -exports arrangement before device code split procedure. +After device code split, all non-`SYCL_EXTERNAL` functions and copied +`SYCL_EXTERNAL` functions left in device images with kernels marked with +internal linkage to avoid multiple definition errors during runtime linking. +After that `sycl-post-link` records list of names of exported functions, i.e. +functions with `sycl-module-id` attribute and external linkage. In order to collect information about imported symbols `sycl-post-link` looks -through LLVM IR and for each declared but not defined symbol records its name, -except the following cases: +through LLVM IR and for each declared but not defined symbol and records its +name, except the following cases: - Declarations with `__` prefix in demangled name are not recorded as imported functions - Declarations with `__spirv_*` prefix should not be recorded as dependencies @@ -154,7 +164,7 @@ except the following cases: starting with `__` by forward-declaring them in DPC++ code **NOTE**: If device code split is enabled, imports collection is performed after -split and it is performed on splitted images. +split and it is performed on separated images. All collected information is attached to a device image via properties mechanism. @@ -239,40 +249,139 @@ module (.exe/.so/.dll). `std::unordered_map` allows to search and access its elements with constant-time complexity. -Before compilation of device image to execute a kernel RT checks if the image -contains any import information in its properies and if it does, then RT +Before compilation of device image, to execute a kernel RT checks if the image +contains any import information in its properties and if it does, then RT performs device images collection in order to resolve dependencies. -Ids of all needed symbol sets are found. This is done by iterating through +Ids of all needed symbol sets are found by iterating through `m_SymbolSets` map, i.e. iterating through all available OS modules without predefined order and searching for first unresolved symbol in list of imports -set of target device image. Once device image that contains first symbol is -met, remaining exported symbols are checked in found image and if +of target device image. Once device image that contains first symbol is +met, remaining exported symbols are checked in found image. If they match some imported symbols then these matched symbols will be marked as -resolved. The procedure repeats until all imported symbols are resolved. -For each found symbol set Id program cache is checked in case if -necessary set of `SYCL_EXTERNAL` functions has been compiled and if it is true, -then compiled device image will be re-used for linking. -Otherwise device image containing required symbols set will be compiled and -stored in cache. +resolved. The procedure repeats until all imported symbols are marked as +resolved. #### Program caching -Existing support for device code caching is re-used to cache programs created -from device images with SYCL external functions and linked device images with -imports information. +Existing support for device code caching can be re-used to cache +dynamically linked programs with slight changes. ##### In-memory cache -Programs that contain only `SYCL_EXTERNAL` functions will be cached only in -compiled state, so they can be linked with other programs during dependency -resolution. +The existing mechanism of caching can be re-used in presence of dynamic +linking. Example: +``` +// Application +SYCL_EXTERNAL void LibFunc1(); +SYCL_EXTERNAL void LibFunc2(); + +Q.submit([&](cl::sycl::handler &CGH) { +CGH.parallel_for( ... ) +}); // 1. Device Image is compiled and linked into a program and saved in cache + // 2. Prepared program is used to enqueue kernel + +Q.submit([&](cl::sycl::handler &CGH) { +handler.parallel_for([] { LibFunc1(); }); // Prepared program is used to enqueue kernel +}); + +// Library +SYCL_EXTERNAL void LibFunc1() { +// ... +} + +``` +In current cache structure the programs map's key consists of four components: +kernel set id, specialization constants values, the device this program is built +for, build options id. In this example Id of kernel set where application's +kernels can be used to access program cache. However when shared library +defines kernels and these kernels are run by the application unchanged cache +structure may lead to double compilation of the same code. Example: +``` +// Application +SYCL_EXTERNAL void LibFunc(); + +Q.submit([&](cl::sycl::handler &CGH) { +handler.parallel_for([] { LibFunc(); }); // Device image for library is compiled + // and linked together with device + // image for application, i.e. + // LibFunc1 and ExternalKernel exist + // in prepared state +}); +// ... +EnqueueLibraryKernel(Q); // If cache mechanism is not changed, this line will + // lead to second compilation of ExternalKernel and + // LibFunc1 + +// Library +SYCL_EXTERNAL void LibFunc1() { +// ... +} + +EnqueueLibraryKernel(queue) { + queue.submit(parallel_for(...)); +} +``` +Such case can be optimized by bringing nesting into cache keys structure. +Kernel set id can be found for each kernel using its name and OS module it is +coming from. In presence of dynamic linking resulting program can be combined +out of device images which come from different OS modules. So, it should be +possible to find needed program by kernel name and any OS module that was +involved in this program. The new mapping structure is: +``` +{kernel name} => + {OSModuleHandle, spec const, opts, dev} => program +``` +I.e. each kernel name is mapped to a set of tuples that consists of OS module, +spec constant values, JIT compiler options, device. Then concrete tuple is +mapped to a program object. +Example: +``` +// Application +// OSModule = 1 + +SYCL_EXTERNAL void LibFunc(); +queue.submit(parallel_for( ... )); + +Q.submit([&](cl::sycl::handler &CGH) { +CGH.parallel_for([] { LibFunc(); }); +}); + +EnqueueLibraryKernel(q); + +// Library +// OSModule = 2 + +SYCL_EXTERNAL lib1_func(); + +EnqueueLibraryKernel(queue) { + queue.submit(parallel_for(...)); +} + +Program cache will have the following structure: + "InternalKernel" => + {1, ...} => program 1 + "ExternalKernel" => + {1, ...} => program 1 + {2, ...} => program 1 +``` +However the library code will be compiled twice if kernel from the library +was enqueued before kernels from the application, i.e. in such case: +``` +// Application +SYCL_EXTERNAL void LibFunc(); + +EnqueueLibraryKernel(Q); // First, library code is compiled alone since it + // doesn't have any dependencies +// ... +Q.submit([&](cl::sycl::handler &CGH) { +handler.parallel_for([] { LibFunc(); }); // Second, library code is compiled + // and linked together with code of + // the application +}); +``` -The existing mechanism of caching is not changed for programs with -imports information. They are stored in cache after they compiled and linked -with programs that provide their dependencies. To identify linked programs -Id of "main" set of symbols (i.e. the one which actually contain kernels) will -be used. +The program caching mechanism is re-used without changes. ##### Persistent cache @@ -281,8 +390,8 @@ of dynamic linking support. One of the identifiers for built image hash is hash made out of device image used as input for the JIT compilation. In case when "main" image have imports information, device image hash should be created from all device images that are necessary to build it, i.e. hash out -of "main" device image and set of 'SYCL_EXTERNAL'-only images that define all -symbols imported by "main device image. +of "main" device image and set of images that define all +symbols imported by "main" device image. ## Corner cases and limitations From 93e202c1c02133eb843eafe652b1f52c4df7b0df Mon Sep 17 00:00:00 2001 From: Mariya Podchishchaeva Date: Mon, 12 Apr 2021 18:57:48 +0300 Subject: [PATCH 08/18] Apply suggestions from code review Co-authored-by: Alexey Sachkov Co-authored-by: sergei <57672082+s-kanaev@users.noreply.github.com> --- sycl/doc/SharedLibraries.md | 46 ++++++++++++++++++------------------- 1 file changed, 23 insertions(+), 23 deletions(-) diff --git a/sycl/doc/SharedLibraries.md b/sycl/doc/SharedLibraries.md index dbc2db30da1cc..1e892e0b71296 100755 --- a/sycl/doc/SharedLibraries.md +++ b/sycl/doc/SharedLibraries.md @@ -50,7 +50,7 @@ CGH.parallel_for([]() { foo(); }); ``` We have a `SYCL_EXTERNAL` function `foo` called from a kernel, but the application defined only host version of this function. Then user adds device -image with definition of `foo` to the fat object via special option. +image with definition of `foo` to the fat object via special compiler option (like `-fsycl-add-targets`). The main purpose of this feature is to provide a mechanism which allows to link device code dynamically at runtime. @@ -59,7 +59,7 @@ link device code dynamically at runtime. User's device code that consists of some device API (`SYCL_EXTERNAL` functions), is compiled into some form and it is not linked statically with device code of application. It can be a shared library that contains some device code or a -separate device image supplied with property information. This code is linked +separate device image supplied with properties attached. This code is linked dynamically at run time with device code of a user's application in order to resolve dependencies. For this combination the following statements must be true: @@ -121,7 +121,7 @@ CGH.parallel_for(/* ... */ [=](sycl::item i) { out[i] = LibDeviceFunc(i); } /* ... */ ``` -And if user requested per-source device code split, then for this shared library +If user requested per-source device code split, then for this shared library `sycl-post-link` will create two device images and both of them will define `LibDeviceFunc` function. However `LibDeviceFunc` won't be exported from device image that corresponds to source file `B.cpp` and it will be exported only from @@ -138,7 +138,7 @@ Such duplication is needed for two reasons: Non-`SYCL_EXTERNAL` functions used by `SYCL_EXTERNAL` functions are copied to device images corresponding to those `SYCL_EXTERNAL` functions to make them -self-contained. +self-contained - in the same way as it is done when splitting kernels across device images. In case one `SYCL_EXTERNAL` function uses another `SYCL_EXTERNAL` function with different value in `sycl-module-id` attribute, the second one is not copied to the device image with the first function, but dependency between those device @@ -151,13 +151,13 @@ After that `sycl-post-link` records list of names of exported functions, i.e. functions with `sycl-module-id` attribute and external linkage. In order to collect information about imported symbols `sycl-post-link` looks -through LLVM IR and for each declared but not defined symbol and records its +through LLVM IR and for each declared but not defined symbol records its name, except the following cases: - Declarations with `__` prefix in demangled name are not recorded as imported functions - Declarations with `__spirv_*` prefix should not be recorded as dependencies - since they represent SPIR-V operations and will be transformed to SPIR-V - instructions during LLVM->SPIR-V translation. + since they represent SPIR-V operations and will be transformed to SPIR-V + instructions during LLVM->SPIR-V translation. - Based on some attributes (which could be defined later) we may want to avoid listing some functions as imported ones - This is needed to have possibility to call device-specific builtins not @@ -214,7 +214,7 @@ links them together using PI API. #### Device images collection -DPC++ Runtime class named ProgramManager stores device images using following +DPC++ Runtime class named `ProgramManager` stores device images using following data structure: ``` /// Keeps all available device executable images added via \ref addImages. @@ -234,10 +234,10 @@ Assume each device image represents some combination of symbols and different device images may contain only exactly the same or not overlapping combination of symbols. If it is not so, there can be two cases: - Symbols are the same. In this case it doesn't matter which device image is - taken to use duplicated symbol + taken to use duplicated symbol - Symbols are not the same. In this case ODR violation takes place, such - situation leads to undefined behaviour. For more details refer to - [ODR violations](#ODR-violations) section. + situation leads to undefined behaviour. For more details refer to + [ODR violations](#ODR-violations) section. Each combination of symbols is assigned with an Id number - symbol set Id. A combination of symbols can exist in different formats (i.e. SPIR-V/AOT @@ -302,19 +302,19 @@ structure may lead to double compilation of the same code. Example: SYCL_EXTERNAL void LibFunc(); Q.submit([&](cl::sycl::handler &CGH) { -handler.parallel_for([] { LibFunc(); }); // Device image for library is compiled - // and linked together with device - // image for application, i.e. - // LibFunc1 and ExternalKernel exist - // in prepared state + handler.parallel_for([] { LibFunc(); }); // Device image for library is compiled + // and linked together with device + // image for application, i.e. + // LibFunc1 and ExternalKernel exist + // in prepared state }); // ... EnqueueLibraryKernel(Q); // If cache mechanism is not changed, this line will // lead to second compilation of ExternalKernel and - // LibFunc1 + // LibFunc // Library -SYCL_EXTERNAL void LibFunc1() { +SYCL_EXTERNAL void LibFunc() { // ... } @@ -344,7 +344,7 @@ SYCL_EXTERNAL void LibFunc(); queue.submit(parallel_for( ... )); Q.submit([&](cl::sycl::handler &CGH) { -CGH.parallel_for([] { LibFunc(); }); + CGH.parallel_for([] { LibFunc(); }); }); EnqueueLibraryKernel(q); @@ -352,7 +352,7 @@ EnqueueLibraryKernel(q); // Library // OSModule = 2 -SYCL_EXTERNAL lib1_func(); +SYCL_EXTERNAL LibFunc(); EnqueueLibraryKernel(queue) { queue.submit(parallel_for(...)); @@ -375,9 +375,9 @@ EnqueueLibraryKernel(Q); // First, library code is compiled alone since it // doesn't have any dependencies // ... Q.submit([&](cl::sycl::handler &CGH) { -handler.parallel_for([] { LibFunc(); }); // Second, library code is compiled - // and linked together with code of - // the application + handler.parallel_for([] { LibFunc(); }); // Second, library code is compiled + // and linked together with code of + // the application }); ``` From b8fb778e2a245adb4c3c8e3cacea8e5e244b8fff Mon Sep 17 00:00:00 2001 From: Mariya Podchishchaeva Date: Tue, 13 Apr 2021 12:21:06 +0300 Subject: [PATCH 09/18] Apply suggestions from code review Co-authored-by: kbobrovs --- sycl/doc/SharedLibraries.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/sycl/doc/SharedLibraries.md b/sycl/doc/SharedLibraries.md index 1e892e0b71296..63ee01dd99a98 100755 --- a/sycl/doc/SharedLibraries.md +++ b/sycl/doc/SharedLibraries.md @@ -6,7 +6,7 @@ feature. ## Background Sometimes users want to link device code dynamically at run time. One possible use case for such linkage - providing device functions via shared libraries. -Example: +The example below shows how device function `LibDeviceFunc` can be dynamically linked to a SYCL app: ``` // app.cpp SYCL_EXTERNAL int LibDeviceFunc(int i); @@ -35,8 +35,8 @@ library, then use `link` SYCL API to get a single program and launch kernels using it. But it is not user-friendly and it is very different from regular C/C++ workflow. -Another possible scenario - use functions defined in pre-compiled device image -provided by user. Example: +Another possible scenario - use functions defined in a pre-compiled device image +provided by the user. Example: ``` // a.cpp SYCL_EXTERNAL void foo(); @@ -53,7 +53,7 @@ application defined only host version of this function. Then user adds device image with definition of `foo` to the fat object via special compiler option (like `-fsycl-add-targets`). The main purpose of this feature is to provide a mechanism which allows to -link device code dynamically at runtime. +link device code dynamically at runtime, such as in the scenarios above. ## Requirements: User's device code that consists of some device API (`SYCL_EXTERNAL` functions), From 850b94ffc452045f9fc0ff7f4d6df2ccec0125c3 Mon Sep 17 00:00:00 2001 From: Mariya Podchishchaeva Date: Tue, 13 Apr 2021 13:54:08 +0300 Subject: [PATCH 10/18] Apply review feedback - Reduce low-level details of RT section - Add info how cache items are created - Name examples - Apply direct comments --- sycl/doc/SharedLibraries.md | 144 ++++++++++++++++++------------------ 1 file changed, 74 insertions(+), 70 deletions(-) diff --git a/sycl/doc/SharedLibraries.md b/sycl/doc/SharedLibraries.md index 63ee01dd99a98..5dfc3efdab4d8 100755 --- a/sycl/doc/SharedLibraries.md +++ b/sycl/doc/SharedLibraries.md @@ -6,7 +6,8 @@ feature. ## Background Sometimes users want to link device code dynamically at run time. One possible use case for such linkage - providing device functions via shared libraries. -The example below shows how device function `LibDeviceFunc` can be dynamically linked to a SYCL app: +The example below shows how device function `LibDeviceFunc` can be dynamically +linked to a SYCL app: ``` // app.cpp SYCL_EXTERNAL int LibDeviceFunc(int i); @@ -39,18 +40,19 @@ Another possible scenario - use functions defined in a pre-compiled device image provided by the user. Example: ``` // a.cpp -SYCL_EXTERNAL void foo(); +SYCL_EXTERNAL void LibDeviceFunc(); ... Q.submit([&](cl::sycl::handler &CGH) { -CGH.parallel_for([]() { foo(); }); +CGH.parallel_for([]() { LibDeviceFunc(); }); }); // b.cpp -/*no SYCL_EXTERNAL*/ void foo() { ... } +/*no SYCL_EXTERNAL*/ void LibDeviceFunc() { ... } ``` We have a `SYCL_EXTERNAL` function `foo` called from a kernel, but the application defined only host version of this function. Then user adds device -image with definition of `foo` to the fat object via special compiler option (like `-fsycl-add-targets`). +image with definition of `foo` to the fat object via special compiler option +(like `-fsycl-add-targets`). The main purpose of this feature is to provide a mechanism which allows to link device code dynamically at runtime, such as in the scenarios above. @@ -64,23 +66,36 @@ dynamically at run time with device code of a user's application in order to resolve dependencies. For this combination the following statements must be true: -- `SYCL_EXTERNAL` functions defined in dynamically linked code can be called - (directly or indirectly) from device code of the application. -- Function pointers taken in application should work inside the dynamically - linked code. +The presented dynamic device code linkage mechanism must: + +- Allow to represent the actual dynamically linked code as a device binary image + which can be: + - Embedded into a host shared object by standard SYCL compiler driver + invocation + - Embedded into a host binary or shared object using manual invocations of + SYCL tools such as `clang-offload-wrapper` and linker +- Must not assume the actual format of the device code - e.g. that it is SPIR-V + or native device binary +- Provide automatic runtime resolution of `SYCL_EXTERNAL` function references + within the SYCL app to their definitions (if found) within any suitable + dynamically linked device binary image +- Support pointers to `SYCL_EXTERNAL` functions across the dynamic linkage + boundaries within the device code - taking a pointer, call through a pointer. - Specific code changes are not required, i.e. the mechanism of linking works - as close as possible to regular shared libraries. + as close as possible to host shared libraries. ## Design The overall idea: - Each device image is supplied with a list of imported and exported symbol names through device image properties mechanism -- Before compiling a device image DPC++ RT will check if device image has a list - of imported symbols and if it has, then RT will search for device images which - define required symbols using lists of exported symbols. +- Before JIT-compiling a device image DPC++ RT will check if device image has a + list of imported symbols and if it has, then RT will search for device images + which define required symbols using lists of exported symbols. - Besides symbol names, additional attributes are taken into account (like device image format: SPIR-V or device asm) + - No logical binding between host module and export/import lists, i.e. + resolution is performed w/o regard to containing host modules - Actual linking is performed by underlying backend (OpenCL/L0/etc.) Next sections describe details of changes in each component. @@ -211,56 +226,40 @@ corresponding set has the name `SYCL/exported symbols`. DPC++ RT performs *device images collection* task by grouping all device images required to execute a kernel based on the list of exports/imports and links them together using PI API. +This native device image is then added to the cache to avoid symbol resolution, +compilation, and linking for any future attempts to invoke kernels from this +device image. -#### Device images collection +#### Device images collection and linking -DPC++ Runtime class named `ProgramManager` stores device images using following -data structure: -``` -/// Keeps all available device executable images added via \ref addImages. -/// Organizes the images as a map from a kernel set id to the vector of images -/// containing kernels from that set. -/// Access must be guarded by the \ref Sync::getGlobalLock() -std::unordered_map>> - m_DeviceImages; - -using StrToKSIdMap = std::unordered_map; -/// Maps names of kernels from a specific OS module (.exe .dll) to their set -/// id (the sets are disjoint). -std::unordered_map m_SymbolSets; -``` -Assume each device image represents some combination of symbols and different -device images may contain only exactly the same or not overlapping combination -of symbols. If it is not so, there can be two cases: - - Symbols are the same. In this case it doesn't matter which device image is - taken to use duplicated symbol - - Symbols are not the same. In this case ODR violation takes place, such - situation leads to undefined behaviour. For more details refer to - [ODR violations](#ODR-violations) section. - -Each combination of symbols is assigned with an Id number - symbol set Id. -A combination of symbols can exist in different formats (i.e. SPIR-V/AOT -compiled binary and etc). -`m_DeviceImages` maps an Id number to an array with device images which represent -the same combination of symbols in different formats. -`m_SymbolSets` contains mapping from symbol name to symbol set Id for each OS -module (.exe/.so/.dll). -`std::unordered_map` allows to search and access its elements with constant-time -complexity. - -Before compilation of device image, to execute a kernel RT checks if the image -contains any import information in its properties and if it does, then RT -performs device images collection in order to resolve dependencies. - -Ids of all needed symbol sets are found by iterating through -`m_SymbolSets` map, i.e. iterating through all available OS modules without -predefined order and searching for first unresolved symbol in list of imports -of target device image. Once device image that contains first symbol is +Device images collection and linking is performed by DPC++ Runtime class named +`ProgramManager`. + +When the program manager gets a request to JIT compile a device image(program) +it examines its list of imported symbols and finds device images which exports +those symbols, then links requested device image and images found together. + +All needed device images are found by iterating through all available OS modules +without predefined order and searching for first unresolved symbol in list of +imports of target device image. Once device image that contains first symbol is met, remaining exported symbols are checked in found image. If they match some imported symbols then these matched symbols will be marked as resolved. The procedure repeats until all imported symbols are marked as -resolved. +resolved. In case all available device images are viewed, but some imported +symbols remain unresolved, exception will be thrown. + +The following assumption is made: each device image represents some combination +of defined symbols (kernels or `SYCL_EXTERNAL` functions) and different +device images may contain only exactly the same or not overlapping combination +of defined symbols. If this assumption is not correct, there can be two cases: + - Same symbols have the same definitions. In this case it doesn't matter which + device image is taken to use duplicated symbol + - Same symbols have different definitions. In this case ODR violation takes + place, such situation leads to undefined behaviour. For more details refer + to [ODR violations](#ODR-violations) section. + +So, it is valid to pick the met first device image which defines required symbol +during search. #### Program caching @@ -270,11 +269,11 @@ dynamically linked programs with slight changes. ##### In-memory cache The existing mechanism of caching can be re-used in presence of dynamic -linking. Example: +linking. Example of code when caching mechanism is successfully re-used for +dynamically linked code: ``` // Application -SYCL_EXTERNAL void LibFunc1(); -SYCL_EXTERNAL void LibFunc2(); +SYCL_EXTERNAL void LibFunc(); Q.submit([&](cl::sycl::handler &CGH) { CGH.parallel_for( ... ) @@ -282,11 +281,11 @@ CGH.parallel_for( ... ) // 2. Prepared program is used to enqueue kernel Q.submit([&](cl::sycl::handler &CGH) { -handler.parallel_for([] { LibFunc1(); }); // Prepared program is used to enqueue kernel +handler.parallel_for([] { LibFunc(); }); // Prepared program is used to enqueue kernel }); // Library -SYCL_EXTERNAL void LibFunc1() { +SYCL_EXTERNAL void LibFunc() { // ... } @@ -296,7 +295,8 @@ kernel set id, specialization constants values, the device this program is built for, build options id. In this example Id of kernel set where application's kernels can be used to access program cache. However when shared library defines kernels and these kernels are run by the application unchanged cache -structure may lead to double compilation of the same code. Example: +structure may lead to double compilation of the same code. Example of code +that leads to double compilation of library code: ``` // Application SYCL_EXTERNAL void LibFunc(); @@ -333,9 +333,13 @@ involved in this program. The new mapping structure is: {OSModuleHandle, spec const, opts, dev} => program ``` I.e. each kernel name is mapped to a set of tuples that consists of OS module, -spec constant values, JIT compiler options, device. Then concrete tuple is -mapped to a program object. -Example: +spec constant values, JIT compiler options and device. Then concrete tuple is +mapped to a program object. Several tuples can be mapped to a same program +object, they are created during process of compilation and symbols resolution +for concrete device image. When some program is made through linking of several +device images that come from different OS modules, for each OS module in cache +will be created a tuple with corresponding OS module id. +Example of modified cache structure when dynamic linking is involved: ``` // Application // OSModule = 1 @@ -381,7 +385,7 @@ Q.submit([&](cl::sycl::handler &CGH) { }); ``` -The program caching mechanism is re-used without changes. +The kernel caching mechanism is re-used without changes. ##### Persistent cache @@ -396,8 +400,8 @@ symbols imported by "main" device image. ## Corner cases and limitations It is not guaranteed that behaviour of host shared libraries and device shared -libraries will always match. There are several cases when it can occur, the -next sections will cover details of such cases. +libraries will always match. There are several cases when behaviours don't match, +the next sections will cover details of such cases. ### ODR violations From d176e1c640477d0067af754814628983587e6730 Mon Sep 17 00:00:00 2001 From: Mariya Podchishchaeva Date: Thu, 15 Apr 2021 11:29:18 +0300 Subject: [PATCH 11/18] Apply review feedback --- sycl/doc/SharedLibraries.md | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/sycl/doc/SharedLibraries.md b/sycl/doc/SharedLibraries.md index 5dfc3efdab4d8..b49ed2fffedfb 100755 --- a/sycl/doc/SharedLibraries.md +++ b/sycl/doc/SharedLibraries.md @@ -74,6 +74,7 @@ The presented dynamic device code linkage mechanism must: invocation - Embedded into a host binary or shared object using manual invocations of SYCL tools such as `clang-offload-wrapper` and linker + - Loaded into memory via special API - Must not assume the actual format of the device code - e.g. that it is SPIR-V or native device binary - Provide automatic runtime resolution of `SYCL_EXTERNAL` function references @@ -97,6 +98,8 @@ The overall idea: - No logical binding between host module and export/import lists, i.e. resolution is performed w/o regard to containing host modules - Actual linking is performed by underlying backend (OpenCL/L0/etc.) + - Underlying backend is the backend used by the SYCL RT to perform JIT + compilation of the program with symbols that need dynamic resolution. Next sections describe details of changes in each component. @@ -257,6 +260,9 @@ of defined symbols. If this assumption is not correct, there can be two cases: - Same symbols have different definitions. In this case ODR violation takes place, such situation leads to undefined behaviour. For more details refer to [ODR violations](#ODR-violations) section. + - The situation when two device images of different formats define the same + symbols with two different definitions is not considered as ODR violation. + In this case the suitable device image will be picked. So, it is valid to pick the met first device image which defines required symbol during search. From fbb67d1e7fb8ddb2b9884f08a62079658265b0be Mon Sep 17 00:00:00 2001 From: Mariya Podchishchaeva Date: Mon, 19 Apr 2021 16:45:38 +0300 Subject: [PATCH 12/18] Apply suggestions from code review Co-authored-by: Alexey Sachkov --- sycl/doc/SharedLibraries.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/sycl/doc/SharedLibraries.md b/sycl/doc/SharedLibraries.md index b49ed2fffedfb..a367c34ead31c 100755 --- a/sycl/doc/SharedLibraries.md +++ b/sycl/doc/SharedLibraries.md @@ -253,7 +253,7 @@ symbols remain unresolved, exception will be thrown. The following assumption is made: each device image represents some combination of defined symbols (kernels or `SYCL_EXTERNAL` functions) and different -device images may contain only exactly the same or not overlapping combination +device images either contain exactly the same symbols or not overlapping list of defined symbols. If this assumption is not correct, there can be two cases: - Same symbols have the same definitions. In this case it doesn't matter which device image is taken to use duplicated symbol From 7b7aa6604b40b8eaf2ae06914a699256b3b46c93 Mon Sep 17 00:00:00 2001 From: Mariya Podchishchaeva Date: Mon, 19 Apr 2021 19:56:49 +0300 Subject: [PATCH 13/18] Add PI section --- sycl/doc/SharedLibraries.md | 50 +++++++++++++++++++++++++++++++++---- 1 file changed, 45 insertions(+), 5 deletions(-) diff --git a/sycl/doc/SharedLibraries.md b/sycl/doc/SharedLibraries.md index a367c34ead31c..4a7fad8d552e2 100755 --- a/sycl/doc/SharedLibraries.md +++ b/sycl/doc/SharedLibraries.md @@ -233,6 +233,44 @@ This native device image is then added to the cache to avoid symbol resolution, compilation, and linking for any future attempts to invoke kernels from this device image. +#### DPC++ runtime plugin interface (PI) changes + +Before creating a program the function `piextDeviceSelectBinary` is used to +choose the most appropriate device image. It is possible that not all backends +have possibility to link device images of particular format at run-time. So, +in presence of dynamic linking the `piextDeviceSelectBinary` function should be +extended, so it chooses the appropriate device image +based on additional attributes of device images, such as: +- List of imports (if present) +- Device image format +- Runtime linking support in corresponding backend + +Example: the backend doesn't have support of native binaries linking at +run-time but linking of SPIR-V device images is supported, +the AOT-compiled device image with required kernel have imports +information attached which effectively means that this device image needs runtime +linking, but since native binaries linking is not supported, the image with +SPIR-V format will be chosen (or an error emitted if there is no other device +images). +To link several device images together `piProgramLink` API will be used. +Depending on concrete plugin implementation and set of device image formats that +can be linked at run-time, `piProgramLink` API may receive device images in +different states as inputs (including SPIR-V and native code) with a limitation +that all inputs should have the same format. + +##### Support of runtime linking in backends + +- The initial design will support dynamic linking of device code in SPIR-V + format on OpenCL backend: + - OpenCL plugin will use the existing OpenCL `clLinkProgram()` API to online + link the SPIR-V modules together. +- The initial design will support dynamic linking of device code in native code + format on the Level Zero backend: + - L0 plugin will use the existing Level Zero `zeModuleDynamicLink()` API to do + the linking. + +In the future support may be extended to different formats. + #### Device images collection and linking Device images collection and linking is performed by DPC++ Runtime class named @@ -244,11 +282,13 @@ those symbols, then links requested device image and images found together. All needed device images are found by iterating through all available OS modules without predefined order and searching for first unresolved symbol in list of -imports of target device image. Once device image that contains first symbol is -met, remaining exported symbols are checked in found image. If -they match some imported symbols then these matched symbols will be marked as -resolved. The procedure repeats until all imported symbols are marked as -resolved. In case all available device images are viewed, but some imported +imports of target device image. During search device image format is taken +into account, i.e. only device images that have the same format as target device +image will be considered as suitable ones. Once suitable device image that +contains first symbol is met, remaining exported symbols are checked in found +image. If they match some imported symbols then these matched symbols will be +marked as resolved. The procedure repeats until all imported symbols are marked +as resolved. In case all available device images are viewed, but some imported symbols remain unresolved, exception will be thrown. The following assumption is made: each device image represents some combination From 9f2b787fdf609f1d81e48eb397262a773e4ed094 Mon Sep 17 00:00:00 2001 From: Mariya Podchishchaeva Date: Tue, 20 Apr 2021 11:31:31 +0300 Subject: [PATCH 14/18] Mention required L0 API --- sycl/doc/SharedLibraries.md | 1 + 1 file changed, 1 insertion(+) diff --git a/sycl/doc/SharedLibraries.md b/sycl/doc/SharedLibraries.md index 4a7fad8d552e2..8f0340bfda6dd 100755 --- a/sycl/doc/SharedLibraries.md +++ b/sycl/doc/SharedLibraries.md @@ -264,6 +264,7 @@ that all inputs should have the same format. format on OpenCL backend: - OpenCL plugin will use the existing OpenCL `clLinkProgram()` API to online link the SPIR-V modules together. + - The design requires a new Level Zero API to online link SPIR-V modules. - The initial design will support dynamic linking of device code in native code format on the Level Zero backend: - L0 plugin will use the existing Level Zero `zeModuleDynamicLink()` API to do From b496ef4d88a5af788bc55102fb104e8fa230b856 Mon Sep 17 00:00:00 2001 From: Mariya Podchishchaeva Date: Thu, 22 Apr 2021 13:44:49 +0300 Subject: [PATCH 15/18] Apply review feedback --- sycl/doc/SharedLibraries.md | 21 +++++++++++---------- 1 file changed, 11 insertions(+), 10 deletions(-) diff --git a/sycl/doc/SharedLibraries.md b/sycl/doc/SharedLibraries.md index 8f0340bfda6dd..016920973e635 100755 --- a/sycl/doc/SharedLibraries.md +++ b/sycl/doc/SharedLibraries.md @@ -49,20 +49,21 @@ CGH.parallel_for([]() { LibDeviceFunc(); }); // b.cpp /*no SYCL_EXTERNAL*/ void LibDeviceFunc() { ... } ``` -We have a `SYCL_EXTERNAL` function `foo` called from a kernel, but the +We have a `SYCL_EXTERNAL` function `LibDeviceFunc` called from a kernel, but the application defined only host version of this function. Then user adds device -image with definition of `foo` to the fat object via special compiler option -(like `-fsycl-add-targets`). +image with definition of `LibDeviceFunc` to the fat object via special compiler +option (like `-fsycl-add-targets`). -The main purpose of this feature is to provide a mechanism which allows to -link device code dynamically at runtime, such as in the scenarios above. +The main purpose of this feature is to provide a user-friendly mechanism which +allows to link device code dynamically at runtime, such as in the scenarios +above. ## Requirements: User's device code that consists of some device API (`SYCL_EXTERNAL` functions), is compiled into some form and it is not linked statically with device code of -application. It can be a shared library that contains some device code or a +application. It can be a shared library with embedded device image or a separate device image supplied with properties attached. This code is linked -dynamically at run time with device code of a user's application in order to +dynamically at run time with device image of a user's application in order to resolve dependencies. For this combination the following statements must be true: @@ -75,8 +76,8 @@ The presented dynamic device code linkage mechanism must: - Embedded into a host binary or shared object using manual invocations of SYCL tools such as `clang-offload-wrapper` and linker - Loaded into memory via special API -- Must not assume the actual format of the device code - e.g. that it is SPIR-V - or native device binary +- Allow different format for device code - e.g. it can be SPIR-V or native + device binary - Provide automatic runtime resolution of `SYCL_EXTERNAL` function references within the SYCL app to their definitions (if found) within any suitable dynamically linked device binary image @@ -94,7 +95,7 @@ The overall idea: list of imported symbols and if it has, then RT will search for device images which define required symbols using lists of exported symbols. - Besides symbol names, additional attributes are taken into account (like - device image format: SPIR-V or device asm) + device image format: SPIR-V or native device binary) - No logical binding between host module and export/import lists, i.e. resolution is performed w/o regard to containing host modules - Actual linking is performed by underlying backend (OpenCL/L0/etc.) From 03f6b9fdc0bac5895e105106144b58e88bbfa05f Mon Sep 17 00:00:00 2001 From: Mariya Podchishchaeva Date: Fri, 23 Apr 2021 17:52:00 +0300 Subject: [PATCH 16/18] Incorporate review feedback In addition, link programs, not images. --- sycl/doc/SharedLibraries.md | 118 ++++++++++++++++++++++-------------- 1 file changed, 73 insertions(+), 45 deletions(-) diff --git a/sycl/doc/SharedLibraries.md b/sycl/doc/SharedLibraries.md index 016920973e635..5f7798c51f996 100755 --- a/sycl/doc/SharedLibraries.md +++ b/sycl/doc/SharedLibraries.md @@ -31,6 +31,14 @@ clang++ -fsycl app.cpp -lhelpers -o a.out Output: 0 2 4 6… ``` +The first invocation of `clang++` driver will create a "fat" shared library +which contains both host code and device code. The second invocation of +`clang++` driver will create a "fat" application binary that also contains +embedded device code. Host part of the application and library will be linked +automatically by standard C++ toolchain and system linker, while linking of +device part of the application and library requires new functionality which is +described in this document. + It is possible to manually create `sycl::program` in both app and shared library, then use `link` SYCL API to get a single program and launch kernels using it. But it is not user-friendly and it is very different from regular @@ -69,13 +77,24 @@ For this combination the following statements must be true: The presented dynamic device code linkage mechanism must: -- Allow to represent the actual dynamically linked code as a device binary image - which can be: - - Embedded into a host shared object by standard SYCL compiler driver - invocation - - Embedded into a host binary or shared object using manual invocations of - SYCL tools such as `clang-offload-wrapper` and linker - - Loaded into memory via special API +- Allow to link device code represented as device binary image dynamically at + runtime with other device binary images. In order to use this functionality + the user can create and supply device binary image to DPC++ Runtime library + via following ways: + - Create a "fat" shared library by standard SYCL compiler driver invocation + - Supply host binary or shared object with device binary image using manual + invocations of SYCL tools such as `clang-offload-wrapper` and linker + - Load device binary image into memory via dlopen-like API + - This is a TODO item, since SYCL standard doesn't define such API yet. + Example how such API may look like: + ``` + // suppose, mylib.spv defines SYCL_EXTERNAL function foo, then this call: + device_image img = device_dlopen("mylib.spv"); + // will make foo available for dynamic symbol resolution. If any subsequent + // JIT compilations try to compile device code with external reference to + // foo, it can now be resolved following the resolution mechanism described + // in this doc, and JIT compilation will succeed. + ``` - Allow different format for device code - e.g. it can be SPIR-V or native device binary - Provide automatic runtime resolution of `SYCL_EXTERNAL` function references @@ -91,16 +110,21 @@ The overall idea: - Each device image is supplied with a list of imported and exported symbol names through device image properties mechanism -- Before JIT-compiling a device image DPC++ RT will check if device image has a - list of imported symbols and if it has, then RT will search for device images - which define required symbols using lists of exported symbols. +- In order to create a program executable from device image DPC++ RT will check + if this device image has a list of imported symbols and if it has, then RT + will search for device images which define required symbols using lists of + exported symbols. - Besides symbol names, additional attributes are taken into account (like device image format: SPIR-V or native device binary) - No logical binding between host module and export/import lists, i.e. resolution is performed w/o regard to containing host modules +- All found device images are used to create program objects and then these + programs are linked together. - Actual linking is performed by underlying backend (OpenCL/L0/etc.) - - Underlying backend is the backend used by the SYCL RT to perform JIT - compilation of the program with symbols that need dynamic resolution. + - Underlying backend is the backend used by DPC++ RT to create program + from device binary image, perform JIT compilation (if required for chosen + device image format) and linking with other programs in order to resolve + symbols. Next sections describe details of changes in each component. @@ -157,7 +181,8 @@ Such duplication is needed for two reasons: Non-`SYCL_EXTERNAL` functions used by `SYCL_EXTERNAL` functions are copied to device images corresponding to those `SYCL_EXTERNAL` functions to make them -self-contained - in the same way as it is done when splitting kernels across device images. +self-contained - in the same way as it is done when splitting kernels across +device images. In case one `SYCL_EXTERNAL` function uses another `SYCL_EXTERNAL` function with different value in `sycl-module-id` attribute, the second one is not copied to the device image with the first function, but dependency between those device @@ -228,45 +253,46 @@ corresponding set has the name `SYCL/exported symbols`. ### DPC++ runtime changes DPC++ RT performs *device images collection* task by grouping all device -images required to execute a kernel based on the list of exports/imports and -links them together using PI API. -This native device image is then added to the cache to avoid symbol resolution, -compilation, and linking for any future attempts to invoke kernels from this -device image. +images required to execute a kernel based on the list of exports/imports, creates +programs using collected images and links them together using PI API. +Resulting program is then added to the cache to avoid repetition of symbol +resolution, compilation, and linking processes for any future attempts to invoke +kernels defined by this program. #### DPC++ runtime plugin interface (PI) changes Before creating a program the function `piextDeviceSelectBinary` is used to -choose the most appropriate device image. It is possible that not all backends -have possibility to link device images of particular format at run-time. So, -in presence of dynamic linking the `piextDeviceSelectBinary` function should be -extended, so it chooses the appropriate device image -based on additional attributes of device images, such as: +choose the most appropriate device image. Device image may have SPIR-V or native +binary code format. +It is possible that not all backends have possibility to link programs made from +device images of some format at runtime. So, in presence of dynamic linking the +`piextDeviceSelectBinary` function should be extended, so it chooses the +appropriate device image based on additional attributes of device images, such +as: - List of imports (if present) - Device image format - Runtime linking support in corresponding backend Example: the backend doesn't have support of native binaries linking at -run-time but linking of SPIR-V device images is supported, -the AOT-compiled device image with required kernel have imports -information attached which effectively means that this device image needs runtime -linking, but since native binaries linking is not supported, the image with -SPIR-V format will be chosen (or an error emitted if there is no other device -images). +run-time but linking of SPIR-V is supported, the AOT-compiled device image with +required kernel have imports information attached which effectively means that +this device image needs runtime linking, but since native binaries linking is +not supported, the image with SPIR-V format will be chosen (or an error emitted +if there is no other device images). To link several device images together `piProgramLink` API will be used. Depending on concrete plugin implementation and set of device image formats that -can be linked at run-time, `piProgramLink` API may receive device images in -different states as inputs (including SPIR-V and native code) with a limitation -that all inputs should have the same format. +can be linked at run-time, `piProgramLink` API may receive programs made from +device images in different formats as inputs (including SPIR-V and native code) +with a limitation that used images should have the same format. ##### Support of runtime linking in backends -- The initial design will support dynamic linking of device code in SPIR-V +- The initial implementation will support dynamic linking of device code in SPIR-V format on OpenCL backend: - OpenCL plugin will use the existing OpenCL `clLinkProgram()` API to online link the SPIR-V modules together. - The design requires a new Level Zero API to online link SPIR-V modules. -- The initial design will support dynamic linking of device code in native code +- The initial implementation will support dynamic linking of device code in native code format on the Level Zero backend: - L0 plugin will use the existing Level Zero `zeModuleDynamicLink()` API to do the linking. @@ -275,12 +301,13 @@ In the future support may be extended to different formats. #### Device images collection and linking -Device images collection and linking is performed by DPC++ Runtime class named -`ProgramManager`. +Device images collection and linking of programs is performed by DPC++ Runtime +class named `ProgramManager`. -When the program manager gets a request to JIT compile a device image(program) -it examines its list of imported symbols and finds device images which exports -those symbols, then links requested device image and images found together. +When the program manager gets a request to create a program object using device +image, it examines its list of imported symbols and finds device images which +export those symbols, then program manager creates programs for each required +device image and links them all together. All needed device images are found by iterating through all available OS modules without predefined order and searching for first unresolved symbol in list of @@ -325,7 +352,7 @@ SYCL_EXTERNAL void LibFunc(); Q.submit([&](cl::sycl::handler &CGH) { CGH.parallel_for( ... ) -}); // 1. Device Image is compiled and linked into a program and saved in cache +}); // 1. Program is compiled, linked and saved in cache // 2. Prepared program is used to enqueue kernel Q.submit([&](cl::sycl::handler &CGH) { @@ -350,9 +377,9 @@ that leads to double compilation of library code: SYCL_EXTERNAL void LibFunc(); Q.submit([&](cl::sycl::handler &CGH) { - handler.parallel_for([] { LibFunc(); }); // Device image for library is compiled + handler.parallel_for([] { LibFunc(); }); // Device code for library is compiled // and linked together with device - // image for application, i.e. + // code for application, i.e. // LibFunc1 and ExternalKernel exist // in prepared state }); @@ -385,8 +412,9 @@ spec constant values, JIT compiler options and device. Then concrete tuple is mapped to a program object. Several tuples can be mapped to a same program object, they are created during process of compilation and symbols resolution for concrete device image. When some program is made through linking of several -device images that come from different OS modules, for each OS module in cache -will be created a tuple with corresponding OS module id. +programs created from device images that come from different OS modules, +for each OS module in cache will be created a tuple with corresponding OS module +id. Example of modified cache structure when dynamic linking is involved: ``` // Application @@ -472,7 +500,7 @@ undefined behaviour, however it is possible to run and compile this example on Linux and Windows. Whereas on Linux only function `b()` from library libB is called, on Windows both versions of function `b()` are used. Most of backends online linkers act like static linkers, i.e. just merge -device images with each other, so it is not possible to correctly imitate +device code from different programs, so it is not possible to correctly imitate Windows behaviour in device code linking because attempts to do it will result in multiple definition errors. From 8ba2c92e49ea45031acaff5cde06402d9fa8997b Mon Sep 17 00:00:00 2001 From: Mariya Podchishchaeva Date: Tue, 18 May 2021 12:17:24 +0300 Subject: [PATCH 17/18] Modify PI section --- sycl/doc/SharedLibraries.md | 61 +++++++++++++++++++------------------ 1 file changed, 32 insertions(+), 29 deletions(-) diff --git a/sycl/doc/SharedLibraries.md b/sycl/doc/SharedLibraries.md index 5f7798c51f996..ceb9d892065a3 100755 --- a/sycl/doc/SharedLibraries.md +++ b/sycl/doc/SharedLibraries.md @@ -158,11 +158,11 @@ SYCL_EXTERNAL int LibDeviceFunc(int i) { // B.cpp class LibKernel; -/* ... */ +/* ... */ Q.submit([&](cl::sycl::handler &CGH) { CGH.parallel_for(/* ... */ [=](sycl::item i) { out[i] = LibDeviceFunc(i); -} /* ... */ +} /* ... */ ``` If user requested per-source device code split, then for this shared library `sycl-post-link` will create two device images and both of them will define @@ -261,29 +261,26 @@ kernels defined by this program. #### DPC++ runtime plugin interface (PI) changes -Before creating a program the function `piextDeviceSelectBinary` is used to -choose the most appropriate device image. Device image may have SPIR-V or native -binary code format. -It is possible that not all backends have possibility to link programs made from -device images of some format at runtime. So, in presence of dynamic linking the -`piextDeviceSelectBinary` function should be extended, so it chooses the -appropriate device image based on additional attributes of device images, such -as: -- List of imports (if present) -- Device image format -- Runtime linking support in corresponding backend - -Example: the backend doesn't have support of native binaries linking at -run-time but linking of SPIR-V is supported, the AOT-compiled device image with -required kernel have imports information attached which effectively means that -this device image needs runtime linking, but since native binaries linking is -not supported, the image with SPIR-V format will be chosen (or an error emitted -if there is no other device images). +During *device images collection* process RT considers modules as available for +linking using information about ability of chosen device backend to compile +and link programs created from particular device image format. The information +about ability to compile and link particular format of device code is provided +by PI plugin implementation for concrete backend. For this purpose +`piDeviceGetInfo` API is used. For each device image format supported by DPC++ +RT PI device extension is defined. Each extension is a string that can be +returned by `piDeviceGetInfo` call with query `PI_DEVICE_INFO_EXTENSIONS`. +Mapping of extension strings and formats that can be linked: +| Device image format | Extension string | Meaning | +|---------------------|------------------|---------| +| __SYCL_PI_DEVICE_BINARY_TARGET_SPIRV64 | "pi_ext_spirv64_linking" | Linking of SPIR-V 64-bit programs is supported| +| __SYCL_PI_DEVICE_BINARY_TARGET_SPIRV64_X86_64 | "pi_ext_spirv64_x86_64_linking" | Linking of 64-bit programs that were AOT compiled for CPU device is supported| +| __SYCL_PI_DEVICE_BINARY_TARGET_SPIRV64_GEN | "pi_ext_spirv64_gen_linking" | Linking of 64-bit programs that were AOT compiled for GPU device is supported| +| __SYCL_PI_DEVICE_BINARY_TARGET_SPIRV64_FPGA | "pi_ext_spirv64_fpga_linking" | Linking of 64-bit programs that were AOT compiled for FPGA device is supported| + To link several device images together `piProgramLink` API will be used. Depending on concrete plugin implementation and set of device image formats that can be linked at run-time, `piProgramLink` API may receive programs made from -device images in different formats as inputs (including SPIR-V and native code) -with a limitation that used images should have the same format. +device images in different formats as inputs (including SPIR-V and native code). ##### Support of runtime linking in backends @@ -291,13 +288,19 @@ with a limitation that used images should have the same format. format on OpenCL backend: - OpenCL plugin will use the existing OpenCL `clLinkProgram()` API to online link the SPIR-V modules together. - - The design requires a new Level Zero API to online link SPIR-V modules. -- The initial implementation will support dynamic linking of device code in native code - format on the Level Zero backend: - - L0 plugin will use the existing Level Zero `zeModuleDynamicLink()` API to do - the linking. - -In the future support may be extended to different formats. + - A new Level Zero API to online link programs on SPIR-V level is required for + better performance. + - While there is no Level Zero API to link programs on SPIR-V level, existing + `zeModuleDynamicLink()` can be used as fallback. + +- In order to support dynamic linking of AOT compiled device code the + following should be implemented on backends site: + - AOT compilers must allow to compile SPIR-V modules with unresolved symbols + and produce device code in format that can be linked in run time and allows + to reduce JIT overhead + - OpenCL program binary type CL_PROGRAM_BINARY_TYPE_[COMPILED_OBJECT/LIBRARY] + should have native code format or any other format that can be emitted by AOT + compiler and allows to reduce JIT overhead #### Device images collection and linking From 7b604191aa8ac551721275052769791680bedffb Mon Sep 17 00:00:00 2001 From: Mariya Podchishchaeva Date: Mon, 31 May 2021 20:14:28 +0300 Subject: [PATCH 18/18] Add a note that it is not a final version --- sycl/doc/SharedLibraries.md | 2 ++ 1 file changed, 2 insertions(+) diff --git a/sycl/doc/SharedLibraries.md b/sycl/doc/SharedLibraries.md index ceb9d892065a3..e163c5a165f0f 100755 --- a/sycl/doc/SharedLibraries.md +++ b/sycl/doc/SharedLibraries.md @@ -3,6 +3,8 @@ This document describes purpose and design of dynamic linking of device code feature. +**NOTE**: This is not a final version. The document is still in progress. + ## Background Sometimes users want to link device code dynamically at run time. One possible use case for such linkage - providing device functions via shared libraries.