From 9454aeef13d23dbae25d1253580598874422b6b3 Mon Sep 17 00:00:00 2001 From: Sergey Kanaev Date: Tue, 22 Sep 2020 12:13:51 +0300 Subject: [PATCH 1/9] [Doc] Add overview of kernel-program caching Signed-off-by: Sergey Kanaev --- sycl/doc/KernelProgramCache.md | 82 ++++++++++++++++++++++++++++++++++ 1 file changed, 82 insertions(+) create mode 100644 sycl/doc/KernelProgramCache.md diff --git a/sycl/doc/KernelProgramCache.md b/sycl/doc/KernelProgramCache.md new file mode 100644 index 0000000000000..4e5ed4ec3903b --- /dev/null +++ b/sycl/doc/KernelProgramCache.md @@ -0,0 +1,82 @@ +# A brief overview of kernel/program caching mechanism. + +The cache is employed when one submits kernel for execution or builds program or +kernel with SYCL API. At the same time programs and kernels are cached only when +they're built from C++ source, i.e. `program::build_with_kernel_type<>()` and +`program::get_kernel<>()` methods are employed. This restriction is implemented +via use of `program_impl::is_cacheable_with_options()` and +`program_impl::is_cacheable()` methods. The latter method only returns a boolean +flag which is set to false on default and is set to true in a single use-case. +One can find use-cases and cache filling in the [unit-tests](https://github.com/intel/llvm/blob/sycl/sycl/unittests/kernel-and-program/Cache.cpp). + +How does it work, i.e. at which point is the cache employed? At some point of +`ExecCGCommand`'s enqueue process the program manager's method will be called: +either `ProgramManager::getBuildPIProgram` or +`ProgramManager::getOrCreateKernel`. Now, both these methods will call template +function [`getOrBuild`](../source/detail/program_manager/program_manager.cpp#L149) +with multiple lambdas passed to it: + - Acquire function; + - GetCache function; + - Build function. + +Acquire function returns a locked version of cache. Locking is employed for +thread safety. The threads are blocked only for insert-or-acquire attempt, i.e. +when calling to `map::insert` in [`getOrBuild`](../source/detail/program_manager/program_manager.cpp#L149) +function. The rest of operation is done with the help of atomics and condition +variables (plus a mutex for proper work of condition variable). + +GetCache function returns a reference to mapping `key->value` out of locked +instance of cache. We will see rationale behind it a bit later. + +Build function actually builds the kernel or program. + +When we say "cache" we think about mapping of some key to value. These maps are +contained within [KernelProgramCache](https://github.com/intel/llvm/blob/sycl/sycl/source/detail/kernel_program_cache.hpp) +class instance which on its own belongs to `context_impl` class instance. +Kernel cache is per program mapping of kernel name plus device pair to +`BuildResult`[1](#remove-pointer). When `getOrBuild` +function is called the key for kernel cache is pair/tuple of kernel name and +device. Program cache maps triple (spec consts, kernel set id, device) to +`BuildResult`[1](#remove-pointer). + +Now, we have a helper [Locked](https://github.com/intel/llvm/blob/sycl/sycl/include/CL/sycl/detail/locked.hpp) +class. It's to use RAII to make code look cleaner a bit. Acquire function/lambda +will return a specialization of Locked class for reference of proper mapping. +Now, GetCache function will return the mapping to be employed i.e. it'll fetch +mapping of kernel name plus device to `BuildResult` for proper program as +`getOrBuild` will work with mapping of key (whichever it is) to `BuildResult` +specialization. 
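+
+For illustration only, the moving parts described above could be sketched
+roughly as follows. This is a simplified, self-contained mock-up: every name
+below is invented for the example and is not the actual definition from
+`locked.hpp` or `kernel_program_cache.hpp`.
+```C++
+#include <map>
+#include <mutex>
+#include <string>
+
+struct BuildResultSketch;                       // stands in for BuildResult<T>
+using KernelMapSketch = std::map<std::string, BuildResultSketch *>;
+
+// RAII wrapper: the mutex stays locked for as long as this object lives.
+template <typename Resource> class LockedSketch {
+public:
+  LockedSketch(Resource &Res, std::mutex &Mtx) : MValue(Res), MLock(Mtx) {}
+  Resource &get() { return MValue; }
+
+private:
+  Resource &MValue;
+  std::unique_lock<std::mutex> MLock;
+};
+
+struct PerContextCacheSketch {
+  std::mutex MMutex;
+  KernelMapSketch MKernelMap;
+
+  // "Acquire": lock the cache and hand it out wrapped in the RAII helper.
+  LockedSketch<KernelMapSketch> acquire() {
+    return LockedSketch<KernelMapSketch>(MKernelMap, MMutex);
+  }
+};
+
+// "GetCache": fetch the key->BuildResult mapping out of the locked cache.
+inline KernelMapSketch &getCache(LockedSketch<KernelMapSketch> &Locked) {
+  return Locked.get();
+}
+```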
+
+`BuildResult` structure contains three information fields - pointer to built
+kernel/program, build error (if applicable) and current build status
+(either of "in progress", "succeeded", "failed").
+
+Now, how `getOrBuild` works?
+First, we fetch the cache with sequential calls to Acquire and GetCache
+functions. Then, we check if we're the first ones who build this kernel/program.
+This is achieved with attempt to insert another key-value pair into the map.
+At this point we try to insert `BuildResult` stub instance with status equal to
+"in progress" which will allow other threads to know that someone is (i.e.
+we're) building the object (i.e. kernel or program) now. If insertion fails we
+will wait for building thread to finish with call to `waitUntilBuilt` function.
+This function will throw stored exception[2](#exception-data) upon
+build failure. This allows waiting threads to result the same as the building
+thread. Special case of the failure is when build result doesn't contain the
+error (i.e. the error wasn't of `cl::sycl::exception` type) and the pointer to
+object in `BuildResult` instance is nil. In this case the building thread has
+finished build process and returned the user an error. Though, this error could
+be of spurious/sporadic nature. Hence, the waiting thread will try to build the
+same object once more.
+
+`BuildResult` structure also contains synchronization objects: mutex and
+condition variable. We employ them to signal waiting threads that the build
+process for this kernel/program is finished (either successfully or with a
+failure).
+
+
+1: The use of `std::remove_pointer` was omitted in sake of
+simplicity here.
+
+2: Actually, we store contents of the exception: its message and
+error code.
+

From d141bef684033620c9ea630433252867cb3e81a6 Mon Sep 17 00:00:00 2001
From: Sergey Kanaev
Date: Wed, 23 Sep 2020 18:10:08 +0300
Subject: [PATCH 2/9] [Doc] Address comment

Signed-off-by: Sergey Kanaev
---
 sycl/doc/KernelProgramCache.md | 211 ++++++++++++++++++++++++++-------
 1 file changed, 170 insertions(+), 41 deletions(-)

diff --git a/sycl/doc/KernelProgramCache.md b/sycl/doc/KernelProgramCache.md
index 4e5ed4ec3903b..a2ea03a60e95e 100644
--- a/sycl/doc/KernelProgramCache.md
+++ b/sycl/doc/KernelProgramCache.md
@@ -1,55 +1,184 @@
 # A brief overview of kernel/program caching mechanism.
 
+## Rationale behind caching
+
+*Use-case #1.* Looped enqueue of the same kernel:
+```C++
+  using namespace cl::sycl;
+
+  queue Q;
+  std::vector<buffer<int, 2>> Bufs;
+
+  ...
+  // initialize Bufs with some number of buffers
+  ...
+
+  for (size_t Idx = 0; Idx < Bufs.size(); ++Idx) {
+    Q.submit([&](handler &CGH) {
+      auto Acc = Bufs[Idx].get_access<access::mode::read_write>(CGH);
+
+      CGH.parallel_for(
+        range<2>{N, M}, [=](item<2> Item) { ... });
+    });
+  }
+```
+
+*Use-case #2.* Enqueue of multiple kernels within a single program[1](#what-is-program):
+```C++
+  using namespace cl::sycl;
+
+  queue Q;
+
+  Q.submit([&](handler &CGH) {
+    ...
+
+    CGH.parallel_for(
+        range<2>{N_1, M_1}, [=](item<2> Item) { ... });
+  });
+  Q.submit([&](handler &CGH) {
+    ...
+
+    CGH.parallel_for(
+        range<2>{N_2, M_2}, [=](item<2> Item) { ... });
+  });
+  Q.submit([&](handler &CGH) {
+    ...
+
+    CGH.parallel_for(
+        range<2>{N_3, M_3}, [=](item<2> Item) { ... });
+  });
+  ...
+  Q.submit([&](handler &CGH) {
+    ...
+
+    CGH.parallel_for(
+        range<2>{N_K, M_K}, [=](item<2> Item) { ... });
+  });
+```
+
+Both these use-cases will need to built the program or kernel multiple times.
+When JIT is employed this process may take quite a lot of time.
+ +In order to eliminate this waste of run-time we introduce a kernel/program +caching. The cache is per-context and it caches underlying objects of non +interop kernels and programs which are built with no options. + +1: Here we use the term "program" in the same +sense as OpenCL does i.e. a set of kernels. + + +## Data structure of cache + +The cache stores underlying PI objects of `cl::sycl::program` and +`cl::sycl::kernel` in a per-context data storage. The storage consists of two +maps: one is for programs and the other is for kernels. + +Programs mapping's key consists of three components: +kernel set id[1](#what-is-ksid), specialized constants, device this +program is built for. + +Kernels mapping's key consists of three components too: program the kernel +belongs to, kernel name[2](#what-is-kname), device the program is +built for. + +1: Kernel set id is merely a number of translation +unit which contains at least one kernel. +2: Kernel name is mangled class name which is +provided to methods of `cl::sycl::handler` (e.g. `parallel_for` or +`single_task`). + + +## Points of improvement (things to do) + + - Implement LRU policy on cached objects. See [issue](https://github.com/intel/llvm/issues/2517). + - Allow for caching of objects built with some build options. + - Employ the same built object for multiple devices of the same ISA, + capabilities and so on. *NOTE:* It's not really known if it's possible to + check if two distinct devices are *exactly* the same. + - Improve testing: cover real use-cases. See currently covered cases [here](https://github.com/intel/llvm/blob/sycl/sycl/unittests/kernel-and-program/Cache.cpp). + + +## Implementation details + +The caches are represented with instance of [`KernelProgramCache`](https://github.com/intel/llvm/blob/sycl/sycl/source/detail/kernel_program_cache.hpp) +class. The class is instantiated in a per-context manner. + +The `KernelProgramCache` is the storage descrived above. + + +### When does the cache come at work? + The cache is employed when one submits kernel for execution or builds program or -kernel with SYCL API. At the same time programs and kernels are cached only when -they're built from C++ source, i.e. `program::build_with_kernel_type<>()` and -`program::get_kernel<>()` methods are employed. This restriction is implemented -via use of `program_impl::is_cacheable_with_options()` and -`program_impl::is_cacheable()` methods. The latter method only returns a boolean -flag which is set to false on default and is set to true in a single use-case. -One can find use-cases and cache filling in the [unit-tests](https://github.com/intel/llvm/blob/sycl/sycl/unittests/kernel-and-program/Cache.cpp). - -How does it work, i.e. at which point is the cache employed? At some point of -`ExecCGCommand`'s enqueue process the program manager's method will be called: -either `ProgramManager::getBuildPIProgram` or -`ProgramManager::getOrCreateKernel`. Now, both these methods will call template -function [`getOrBuild`](../source/detail/program_manager/program_manager.cpp#L149) +kernel with SYCL API. That means that the cache works when either user +explicitly calls `program::build_with_kernel_type<>()`/`program::get_kernel<>()` +methods or SYCL RT builds or gets the required kernel. Cacheability of an object +is verified with `program_impl::is_cacheable()` method. SYCL RT will check if +program is cacheable and will get the kernel with call to +`ProgramManager::getOrCreateKernel()` method. 
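+
+For example, the explicit path may look like the sketch below. This is an
+illustrative example only, built on the SYCL 1.2.1 `program` API mentioned
+above; the kernel name `CacheMe` is made up for the example:
+```C++
+#include <CL/sycl.hpp>
+using namespace cl::sycl;
+
+class CacheMe;
+
+int main() {
+  queue Q;
+  program Prg(Q.get_context());
+  // Built with no options and not from source, so per the rules above the
+  // underlying program is a candidate for caching.
+  Prg.build_with_kernel_type<CacheMe>();
+  kernel Krn = Prg.get_kernel<CacheMe>(); // kernel is cached per program
+  (void)Krn;
+
+  Q.submit([&](handler &CGH) {
+    CGH.parallel_for<CacheMe>(range<1>{1024}, [=](id<1>) { /* ... */ });
+  });
+  Q.wait();
+  return 0;
+}
+```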
+ + +*NOTE:* a kernel is only cacheable if and only if the program it belongs to is +cacheable. On the other hand if the program is cacheable, then each and every +kernel of this program will be cached also. + + +Invoked by user `program::build_with_kernel_type<>()` and +`program::get_kernel<>()` methods will call either +`ProgramManager::getBuildPIProgram()` or `ProgramManager::getOrCreateKernel()` +method respectively. Now, both these methods will call template +function [`getOrBuild()`](../source/detail/program_manager/program_manager.cpp#L149) with multiple lambdas passed to it: - Acquire function; - GetCache function; - Build function. -Acquire function returns a locked version of cache. Locking is employed for +*Acquire* function returns a locked version of cache. Locking is employed for thread safety. The threads are blocked only for insert-or-acquire attempt, i.e. when calling to `map::insert` in [`getOrBuild`](../source/detail/program_manager/program_manager.cpp#L149) function. The rest of operation is done with the help of atomics and condition variables (plus a mutex for proper work of condition variable). -GetCache function returns a reference to mapping `key->value` out of locked +*GetCache* function returns a reference to mapping `key->value` out of locked instance of cache. We will see rationale behind it a bit later. -Build function actually builds the kernel or program. - -When we say "cache" we think about mapping of some key to value. These maps are -contained within [KernelProgramCache](https://github.com/intel/llvm/blob/sycl/sycl/source/detail/kernel_program_cache.hpp) -class instance which on its own belongs to `context_impl` class instance. -Kernel cache is per program mapping of kernel name plus device pair to -`BuildResult`[1](#remove-pointer). When `getOrBuild` -function is called the key for kernel cache is pair/tuple of kernel name and -device. Program cache maps triple (spec consts, kernel set id, device) to -`BuildResult`[1](#remove-pointer). - -Now, we have a helper [Locked](https://github.com/intel/llvm/blob/sycl/sycl/include/CL/sycl/detail/locked.hpp) -class. It's to use RAII to make code look cleaner a bit. Acquire function/lambda -will return a specialization of Locked class for reference of proper mapping. -Now, GetCache function will return the mapping to be employed i.e. it'll fetch -mapping of kernel name plus device to `BuildResult` for proper program as -`getOrBuild` will work with mapping of key (whichever it is) to `BuildResult` -specialization. - -`BuildResult` structure contains three information fields - pointer to built -kernel/program, build error (if applicable) and current build status -(either of "in progress", "succeeded", "failed"). +*Build* function actually builds the kernel or program. + +Caching isn't done: + - when program is built out of source i.e. with + `program::build_with_source()` or `program::compile_with_source()` method; + - when program is result of linking of multiple programs. + + +### Thread-safety + +Why do we need thread safety here? It's quite possible to have a use-case when +the `cl::sycl::context` is shared across multiple threads (e.g. via sharing a +queue). Possibility of enqueueing multiple cacheable kernels simultaneously +within multiple threads makes us to provide thread-safety for the cache. + +It's worth of noting that we don't cache the PI resource (kernel or program) +on it's own. Instead we augment the resource with the status of build process. 
+Hence, what is cached is a wrapper structure `BuildResult` which contains three +information fields - pointer to built resource, build error (if applicable) and +current build status (either of "in progress", "succeeded", "failed"). + +One can find definition of `BuildResult` template in [KernelProgramCache](https://github.com/intel/llvm/blob/sycl/sycl/source/detail/kernel_program_cache.hpp). + +Pointer to built resource and build result are both atomic variables. Atomicity +of these variables allows one to hold lock on cache for quite a short time and +perform the rest of build/wait process without unwanted need of other threads to +wait on lock availability. + +A specialization of helper class [Locked](https://github.com/intel/llvm/blob/sycl/sycl/include/CL/sycl/detail/locked.hpp) +for reference of proper mapping is returned by Acquire function. The use of this +class implements RAII to make code look cleaner a bit. Now, GetCache function +will return the mapping to be employed i.e. it'll fetch mapping of kernel name +plus device to `BuildResult` for proper program as `getOrBuild` will work with +mapping of key (whichever it is) to `BuildResult` specialization. The structure +is specialized with either `PiKernel` or `PiProgram`[1](#remove-program). + + +### Core of caching mechanism Now, how `getOrBuild` works? First, we fetch the cache with sequential calls to Acquire and GetCache @@ -74,9 +203,9 @@ process for this kernl/program is finished (either successfuly or with a failure). -1: The use of `std::remove_pointer` was omitted in sake of -simplicity here. +1: The use of `std::remove_pointer` was omitted in +sake of simplicity here. -2: Actually, we store contents of the exception: its message and -error code. +2: Actually, we store contents of the exception: +its message and error code. From 39b20f520159ab5a8ce6cf1d18454c0cd7098543 Mon Sep 17 00:00:00 2001 From: Sergey Kanaev Date: Thu, 24 Sep 2020 11:54:12 +0300 Subject: [PATCH 3/9] Add line break Signed-off-by: Sergey Kanaev --- sycl/doc/KernelProgramCache.md | 1 + 1 file changed, 1 insertion(+) diff --git a/sycl/doc/KernelProgramCache.md b/sycl/doc/KernelProgramCache.md index a2ea03a60e95e..a8bbb6e93aaaf 100644 --- a/sycl/doc/KernelProgramCache.md +++ b/sycl/doc/KernelProgramCache.md @@ -83,6 +83,7 @@ built for. 1: Kernel set id is merely a number of translation unit which contains at least one kernel. + 2: Kernel name is mangled class name which is provided to methods of `cl::sycl::handler` (e.g. `parallel_for` or `single_task`). 
From c16e4ad989483122273ebfdc17c7a9c75d4b4f42 Mon Sep 17 00:00:00 2001 From: Sergey Kanaev Date: Thu, 24 Sep 2020 11:54:30 +0300 Subject: [PATCH 4/9] Add new file to TOC Signed-off-by: Sergey Kanaev --- sycl/doc/index.rst | 1 + 1 file changed, 1 insertion(+) diff --git a/sycl/doc/index.rst b/sycl/doc/index.rst index 29a4501fc7e41..82eae7f8a0f4f 100644 --- a/sycl/doc/index.rst +++ b/sycl/doc/index.rst @@ -27,3 +27,4 @@ Developing oneAPI DPC++ Compiler EnvironmentVariables PluginInterface ABIPolicyGuide + KernelProgramCache From 5091adef6441c8dc90d31b3efb5960c44e4937b5 Mon Sep 17 00:00:00 2001 From: Sergey Kanaev Date: Thu, 24 Sep 2020 12:16:15 +0300 Subject: [PATCH 5/9] Fix link Signed-off-by: Sergey Kanaev --- sycl/doc/KernelProgramCache.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/sycl/doc/KernelProgramCache.md b/sycl/doc/KernelProgramCache.md index a8bbb6e93aaaf..d2c632583f511 100644 --- a/sycl/doc/KernelProgramCache.md +++ b/sycl/doc/KernelProgramCache.md @@ -127,7 +127,7 @@ Invoked by user `program::build_with_kernel_type<>()` and `program::get_kernel<>()` methods will call either `ProgramManager::getBuildPIProgram()` or `ProgramManager::getOrCreateKernel()` method respectively. Now, both these methods will call template -function [`getOrBuild()`](../source/detail/program_manager/program_manager.cpp#L149) +function [`getOrBuild()`](https://github.com/intel/llvm/blob/sycl/sycl/source/detail/program_manager/program_manager.cpp#L149) with multiple lambdas passed to it: - Acquire function; - GetCache function; @@ -135,7 +135,7 @@ with multiple lambdas passed to it: *Acquire* function returns a locked version of cache. Locking is employed for thread safety. The threads are blocked only for insert-or-acquire attempt, i.e. -when calling to `map::insert` in [`getOrBuild`](../source/detail/program_manager/program_manager.cpp#L149) +when calling to `map::insert` in [`getOrBuild`](https://github.com/intel/llvm/blob/sycl/sycl/source/detail/program_manager/program_manager.cpp#L149) function. The rest of operation is done with the help of atomics and condition variables (plus a mutex for proper work of condition variable). From 40b7669091e95276e7c28c3d962ef7d30ba6b80c Mon Sep 17 00:00:00 2001 From: Sergey Kanaev Date: Wed, 30 Sep 2020 17:39:41 +0300 Subject: [PATCH 6/9] Address comments Signed-off-by: Sergey Kanaev --- sycl/doc/KernelProgramCache.md | 136 ++++++++++++++++++--------------- 1 file changed, 73 insertions(+), 63 deletions(-) diff --git a/sycl/doc/KernelProgramCache.md b/sycl/doc/KernelProgramCache.md index d2c632583f511..5747926db3a45 100644 --- a/sycl/doc/KernelProgramCache.md +++ b/sycl/doc/KernelProgramCache.md @@ -1,8 +1,15 @@ -# A brief overview of kernel/program caching mechanism. +# A brief overview of kernel and program caching mechanism. ## Rationale behind caching -*Use-case #1.* Looped enqueue of the same kernel: +During SYCL program execution SYCL runtime will create internal objects +representing kernels and programs, it may also invoke JIT compiler to bring +kernels in a program to executable state. Those runtime operations are quite +expensive, and in some cases caching approach can be employed to eliminate +redundant kernel or program object re-creation and online recompilation. Few +examples below illustrate scenarios where such optimization is possible. 
+
+*Use-case #1.* Submission of the same kernel in a loop:
 ```C++
   using namespace cl::sycl;
 
@@ -23,7 +30,7 @@
     });
   }
 ```
 
-*Use-case #2.* Enqueue of multiple kernels within a single program[1](#what-is-program):
+*Use-case #2.* Submission of multiple kernels within a single program[1](#what-is-program):
 ```C++
   using namespace cl::sycl;
 
@@ -56,36 +63,36 @@
   });
 ```
 
-Both these use-cases will need to built the program or kernel multiple times.
-When JIT is employed this process may take quite a lot of time.
+In both cases SYCL runtime will need to build the program and kernels multiple
+times, which may involve JIT compilation and take quite a lot of time.
 
-In order to eliminate this waste of run-time we introduce a kernel/program
+In order to eliminate this waste of run-time we introduce a kernel and program
 caching. The cache is per-context and it caches underlying objects of non
 interop kernels and programs which are built with no options.
 
-1: Here we use the term "program" in the same
-sense as OpenCL does i.e. a set of kernels.
+1: Here "program" means an internal SYCL runtime
+object corresponding to a SPIRV module or native binary defining a set of SYCL
+kernels and/or device functions.
 
 
 ## Data structure of cache
 
-The cache stores underlying PI objects of `cl::sycl::program` and
-`cl::sycl::kernel` in a per-context data storage. The storage consists of two
-maps: one is for programs and the other is for kernels.
+The cache stores underlying PI objects behind `cl::sycl::program` and
+`cl::sycl::kernel` user-levelobjects in a per-context data storage. The storage
+consists of two maps: one is for programs and the other is for kernels.
 
-Programs mapping's key consists of three components:
-kernel set id[1](#what-is-ksid), specialized constants, device this
-program is built for.
+The programs map's key consists of three components: kernel set id[1](#what-is-ksid),
+specialized constants, device this program is built for.
 
-Kernels mapping's key consists of three components too: program the kernel
+The kernels map's key consists of three components too: program the kernel
 belongs to, kernel name[2](#what-is-kname), device the program is
 built for.
 
-1: Kernel set id is merely a number of translation
-unit which contains at least one kernel.
+1: Kernel set id is an ordinal number of the device
+binary image the kernel is contained in.
 
-2: Kernel name is mangled class name which is
-provided to methods of `cl::sycl::handler` (e.g. `parallel_for` or
+2: Kernel name is the mangled name of the kernel ID class
+which is provided to methods of `cl::sycl::handler` (e.g. `parallel_for` or
 `single_task`).
 
 
 ## Points of improvement (things to do)
 
  - Implement LRU policy on cached objects. See [issue](https://github.com/intel/llvm/issues/2517).
  - Allow for caching of objects built with some build options.
  - Employ the same built object for multiple devices of the same ISA,
   capabilities and so on. *NOTE:* It's not really known if it's possible to
   check if two distinct devices are *exactly* the same.
 - Improve testing: cover real use-cases. See currently covered cases [here](https://github.com/intel/llvm/blob/sycl/sycl/unittests/kernel-and-program/Cache.cpp).
 
 
 ## Implementation details
 
 The caches are represented with instance of [`KernelProgramCache`](https://github.com/intel/llvm/blob/sycl/sycl/source/detail/kernel_program_cache.hpp)
-class. The class is instantiated in a per-context manner.
+class. The runtime creates one instance of the class per distinct SYCL context
+(A context object which is a result of copying another context object isn't
+"distinct", as it corresponds to the same underlying internal object
+representing a context).
 
-The `KernelProgramCache` is the storage descrived above.
+The `KernelProgramCache` is essentially a pair of maps as described above.
 
 
 ### When does the cache come at work?
 
-The cache is employed when one submits kernel for execution or builds program or
-kernel with SYCL API. That means that the cache works when either user
-explicitly calls `program::build_with_kernel_type<>()`/`program::get_kernel<>()`
-methods or SYCL RT builds or gets the required kernel. Cacheability of an object
-is verified with `program_impl::is_cacheable()` method. SYCL RT will check if
-program is cacheable and will get the kernel with call to
-`ProgramManager::getOrCreateKernel()` method.
+The cache is used when one submits a kernel for execution or builds a program or
+a kernel with SYCL API.
+That means that the cache works when either user explicitly calls
+`program::build_with_kernel_type<>()`/`program::get_kernel<>()` methods or SYCL
+RT builds a program or gets the required kernel as needed during application
+execution. Cacheability of an object can be tested with
+`program_impl::is_cacheable()` method. SYCL RT will only try to insert cacheable
+programs or kernels into the cache. This is done as a part of
+`ProgramManager::getOrCreateKernel()` method.
 
 
 *NOTE:* a kernel is only cacheable if and only if the program it belongs to is
 cacheable. On the other hand if the program is cacheable, then each and every
 kernel of this program will be cached also.
 
 
-Invoked by user `program::build_with_kernel_type<>()` and
-`program::get_kernel<>()` methods will call either
-`ProgramManager::getBuildPIProgram()` or `ProgramManager::getOrCreateKernel()`
-method respectively. Now, both these methods will call template
-function [`getOrBuild()`](https://github.com/intel/llvm/blob/sycl/sycl/source/detail/program_manager/program_manager.cpp#L149)
-with multiple lambdas passed to it:
+All requests to build a program or to create a kernel - whether they originate
+from explicit user API calls or from internal SYCL runtime execution logic - end
+up calling the function [`getOrBuild()`](https://github.com/intel/llvm/blob/sycl/sycl/source/detail/program_manager/program_manager.cpp#L149)
+with a number of lambda functions passed as arguments:
  - Acquire function;
  - GetCache function;
  - Build function.
 
@@ -145,20 +154,21 @@ instance of cache. We will see rationale behind it a bit later.
 
 *Build* function actually builds the kernel or program.
 
 Caching isn't done:
- - when program is built out of source i.e. with
-   `program::build_with_source()` or `program::compile_with_source()` method;
- - when program is result of linking of multiple programs.
+ - when program is built out of source with `program::build_with_source()` or
+   `program::compile_with_source()` method;
+ - when program is a result of linking multiple programs.
 
 
 ### Thread-safety
 
-Why do we need thread safety here? It's quite possible to have a use-case when
+Why do we need thread safety here? It is quite possible to have a use-case when
 the `cl::sycl::context` is shared across multiple threads (e.g. via sharing a
 queue). Possibility of enqueueing multiple cacheable kernels simultaneously
-within multiple threads makes us to provide thread-safety for the cache.
+from multiple threads requires us to provide thread-safety for the caching
+mechanisms.
 
-It's worth of noting that we don't cache the PI resource (kernel or program)
-on it's own. Instead we augment the resource with the status of build process.
+It is worth noting that we don't cache the PI resource (kernel or program)
+by itself. Instead we augment the resource with the status of build process.
@@ -167,36 +177,36 @@ One can find definition of `BuildResult` template in [KernelProgramCache](https: Pointer to built resource and build result are both atomic variables. Atomicity of these variables allows one to hold lock on cache for quite a short time and -perform the rest of build/wait process without unwanted need of other threads to -wait on lock availability. +perform the rest of build/wait process without forcing other threads to wait on +lock availability. A specialization of helper class [Locked](https://github.com/intel/llvm/blob/sycl/sycl/include/CL/sycl/detail/locked.hpp) for reference of proper mapping is returned by Acquire function. The use of this class implements RAII to make code look cleaner a bit. Now, GetCache function -will return the mapping to be employed i.e. it'll fetch mapping of kernel name -plus device to `BuildResult` for proper program as `getOrBuild` will work with -mapping of key (whichever it is) to `BuildResult` specialization. The structure -is specialized with either `PiKernel` or `PiProgram`[1](#remove-program). +will return the mapping to be employed that includes the 3 components: kernel +name, device as well as any specialization constants. These get added to +`BuildResult` and are cached. The `BuildResult` structure is specialized with +either `PiKernel` or `PiProgram`[1](#remove-program). ### Core of caching mechanism -Now, how `getOrBuild` works? +Now, let us see how 'getOrBuild' function works. First, we fetch the cache with sequential calls to Acquire and GetCache -functions. Then, we check if we're the first ones who build this kernel/program. -This is achieved with attempt to insert another key-value pair into the map. -At this point we try to insert `BuildResult` stub instance with status equal to -"in progress" which will allow other threads to know that someone is (i.e. -we're) building the object (i.e. kernel or program) now. If insertion fails we -will wait for building thread to finish with call to `waitUntilBuilt` function. -This function will throw stored exception[2](#exception-data) upon -build failure. This allows waiting threads to result the same as the building -thread. Special case of the failure is when build result doesn't contain the -error (i.e. the error wasn't of `cl::sycl::exception` type) and the pointer to -object in `BuildResult` instance is nil. In this case the building thread has -finished build process and returned the user an error. Though, this error could -be of spurious/sporadic nature. Hence, the waiting thread will try to build the -same object once more. +functions. Then, we check if this is the first attempt to build this kernel or +program. This is achieved with an attempt to insert another key-value pair into +the map. At this point we try to insert `BuildResult` stub instance with status +equal to "in progress" which will allow other threads to know that someone is +(i.e. we're) building the object (i.e. kernel or program) now. If insertion +fails, we will wait for building thread to finish with call to `waitUntilBuilt` +function. This function will throw stored exception[2](#exception-data) +upon build failure. This allows waiting threads to see the same result as the +building thread. Special case of the failure is when build result doesn't +contain the error (i.e. the error wasn't of `cl::sycl::exception` type) and the +pointer to object in `BuildResult` instance is nil. In this case, the building +thread has finished the build process and has returned an error to the user. 
+But this error may be sporadic in nature and may be spurious. Hence, the waiting
+thread will try to build the same object once more.
 
 `BuildResult` structure also contains synchronization objects: mutex and
 condition variable. We employ them to signal waiting threads that the build
 process for this kernel/program is finished (either successfully or with a
 failure).
 
 
-1: The use of `std::remove_pointer` was omitted in
-sake of simplicity here.
+1: The use of `std::remove_pointer` was omitted for
+the sake of simplicity here.
 
-2: Actually, we store contents of the exception: its message and
-error code.
+2: Actually, we store contents of the exception:
+its message and error code.

From 327715391eb188930bea62bceb97365a8c8b3957 Mon Sep 17 00:00:00 2001
From: Sergey Kanaev
Date: Mon, 5 Oct 2020 13:44:40 +0300
Subject: [PATCH 7/9] Add note

Signed-off-by: Sergey Kanaev
---
 sycl/doc/KernelProgramCache.md | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/sycl/doc/KernelProgramCache.md b/sycl/doc/KernelProgramCache.md
index 5747926db3a45..575f3fa38e77e 100644
--- a/sycl/doc/KernelProgramCache.md
+++ b/sycl/doc/KernelProgramCache.md
@@ -102,7 +102,8 @@ which is provided to methods of `cl::sycl::handler` (e.g. `parallel_for` or
  - Allow for caching of objects built with some build options.
  - Employ the same built object for multiple devices of the same ISA,
    capabilities and so on. *NOTE:* It's not really known if it's possible to
-   check if two distinct devices are *exactly* the same.
+   check if two distinct devices are *exactly* the same. Probably this should be
+   an improvement request for plugins.
  - Improve testing: cover real use-cases. See currently covered cases [here](https://github.com/intel/llvm/blob/sycl/sycl/unittests/kernel-and-program/Cache.cpp).
 
 

From 0142d6a4637c4b78421a115f443cd0a283cf1e06 Mon Sep 17 00:00:00 2001
From: Sergey Kanaev
Date: Tue, 6 Oct 2020 12:08:14 +0300
Subject: [PATCH 8/9] Address comments.

Signed-off-by: Sergey Kanaev
---
 sycl/doc/KernelProgramCache.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/sycl/doc/KernelProgramCache.md b/sycl/doc/KernelProgramCache.md
index 575f3fa38e77e..8a0c5c06d8b0b 100644
--- a/sycl/doc/KernelProgramCache.md
+++ b/sycl/doc/KernelProgramCache.md
@@ -78,7 +78,7 @@ ## Data structure of cache
 
 The cache stores underlying PI objects behind `cl::sycl::program` and
-`cl::sycl::kernel` user-levelobjects in a per-context data storage. The storage
+`cl::sycl::kernel` user-level objects in a per-context data storage. The storage
 consists of two maps: one is for programs and the other is for kernels.
 
 The programs map's key consists of three components: kernel set id[1](#what-is-ksid),
 specialized constants, device this program is built for.
@@ -187,7 +187,7 @@ class implements RAII to make code look cleaner a bit. Now, GetCache function
 will return the mapping to be employed that includes the 3 components: kernel
 name, device as well as any specialization constants. These get added to
 `BuildResult` and are cached. The `BuildResult` structure is specialized with
-either `PiKernel` or `PiProgram`[1](#remove-program).
+either `PiKernel` or `PiProgram`[1](#remove-pointer).
 
 
 ### Core of caching mechanism

From fc153b40436402672538a56496f7cdcbff600e02 Mon Sep 17 00:00:00 2001
From: Sergey Kanaev
Date: Fri, 9 Oct 2020 16:02:16 +0300
Subject: [PATCH 9/9] Address comments.
Signed-off-by: Sergey Kanaev
---
 sycl/doc/KernelProgramCache.md | 33 +++++++++++++++++++++++++++++----
 1 file changed, 29 insertions(+), 4 deletions(-)

diff --git a/sycl/doc/KernelProgramCache.md b/sycl/doc/KernelProgramCache.md
index 8a0c5c06d8b0b..88d1e11881ce8 100644
--- a/sycl/doc/KernelProgramCache.md
+++ b/sycl/doc/KernelProgramCache.md
@@ -176,10 +176,35 @@ current build status (either of "in progress", "succeeded", "failed").
 
 One can find definition of `BuildResult` template in [KernelProgramCache](https://github.com/intel/llvm/blob/sycl/sycl/source/detail/kernel_program_cache.hpp).
 
-Pointer to built resource and build result are both atomic variables. Atomicity
-of these variables allows one to hold lock on cache for quite a short time and
-perform the rest of build/wait process without forcing other threads to wait on
-lock availability.
+The built resource access synchronization approach aims at minimizing the time
+any thread holds the global lock guarding the maps to improve performance. To
+achieve that, the global lock is acquired only for the duration of the global
+map access. Actual build of the program happens outside of the lock, so other
+threads can request or build other programs in the meantime. A thread requesting
+a `BuildResult` instance via `getOrBuild` can go one of three ways:
+  A) Build result is **not** available, it is the first thread to request it.
+     Current thread will then execute the build letting others wait for the
+     result using the per-build result condition variable kept in `BuildResult`'s
+     `MBuildCV` field.
+  B) Build result is **not** available, another thread is already building the
+     result. Current thread will then wait for the result using the `MBuildCV`
+     condition variable.
+  C) Build result **is** available. The thread simply takes it from the `Ptr`
+     field w/o using any mutexes or condition variables.
+
+As noted before, access to `BuildResult` instance fields may occur from
+different threads simultaneously, but the global lock is no longer held. So, to
+make it safe and to make sure only one thread builds the requested program, the
+following is done:
+ - program build state is reflected in the `State` field, threads use
+   compare-and-swap technique to compete for who will do the build and become
+   thread A. Threads C will find 'DONE' in this field and immediately return
+   with the built result at hand.
+ - thread A and thread(s) B use the `MBuildCV` condition variable field and
+   `MBuildResultMutex` mutex field guarding that variable to implement the
+   "single producer-multiple consumers scheme".
+ - the build result itself appears in the 'Ptr' field when available.
+All fields are atomic because they can be accessed from multiple threads.
 
 A specialization of helper class [Locked](https://github.com/intel/llvm/blob/sycl/sycl/include/CL/sycl/detail/locked.hpp)
 for reference of proper mapping is returned by Acquire function. The use of this