From 9454aeef13d23dbae25d1253580598874422b6b3 Mon Sep 17 00:00:00 2001 From: Sergey Kanaev Date: Tue, 22 Sep 2020 12:13:51 +0300 Subject: [PATCH 1/9] [Doc] Add overview of kernel-program caching Signed-off-by: Sergey Kanaev --- sycl/doc/KernelProgramCache.md | 82 ++++++++++++++++++++++++++++++++++ 1 file changed, 82 insertions(+) create mode 100644 sycl/doc/KernelProgramCache.md diff --git a/sycl/doc/KernelProgramCache.md b/sycl/doc/KernelProgramCache.md new file mode 100644 index 0000000000000..4e5ed4ec3903b --- /dev/null +++ b/sycl/doc/KernelProgramCache.md @@ -0,0 +1,82 @@ +# A brief overview of kernel/program caching mechanism. + +The cache is employed when one submits kernel for execution or builds program or +kernel with SYCL API. At the same time programs and kernels are cached only when +they're built from C++ source, i.e. `program::build_with_kernel_type<>()` and +`program::get_kernel<>()` methods are employed. This restriction is implemented +via use of `program_impl::is_cacheable_with_options()` and +`program_impl::is_cacheable()` methods. The latter method only returns a boolean +flag which is set to false on default and is set to true in a single use-case. +One can find use-cases and cache filling in the [unit-tests](https://github.com/intel/llvm/blob/sycl/sycl/unittests/kernel-and-program/Cache.cpp). + +How does it work, i.e. at which point is the cache employed? At some point of +`ExecCGCommand`'s enqueue process the program manager's method will be called: +either `ProgramManager::getBuildPIProgram` or +`ProgramManager::getOrCreateKernel`. Now, both these methods will call template +function [`getOrBuild`](../source/detail/program_manager/program_manager.cpp#L149) +with multiple lambdas passed to it: + - Acquire function; + - GetCache function; + - Build function. + +Acquire function returns a locked version of cache. Locking is employed for +thread safety. The threads are blocked only for insert-or-acquire attempt, i.e. +when calling to `map::insert` in [`getOrBuild`](../source/detail/program_manager/program_manager.cpp#L149) +function. The rest of operation is done with the help of atomics and condition +variables (plus a mutex for proper work of condition variable). + +GetCache function returns a reference to mapping `key->value` out of locked +instance of cache. We will see rationale behind it a bit later. + +Build function actually builds the kernel or program. + +When we say "cache" we think about mapping of some key to value. These maps are +contained within [KernelProgramCache](https://github.com/intel/llvm/blob/sycl/sycl/source/detail/kernel_program_cache.hpp) +class instance which on its own belongs to `context_impl` class instance. +Kernel cache is per program mapping of kernel name plus device pair to +`BuildResult`[1](#remove-pointer). When `getOrBuild` +function is called the key for kernel cache is pair/tuple of kernel name and +device. Program cache maps triple (spec consts, kernel set id, device) to +`BuildResult`[1](#remove-pointer). + +Now, we have a helper [Locked](https://github.com/intel/llvm/blob/sycl/sycl/include/CL/sycl/detail/locked.hpp) +class. It's to use RAII to make code look cleaner a bit. Acquire function/lambda +will return a specialization of Locked class for reference of proper mapping. +Now, GetCache function will return the mapping to be employed i.e. it'll fetch +mapping of kernel name plus device to `BuildResult` for proper program as +`getOrBuild` will work with mapping of key (whichever it is) to `BuildResult` +specialization. 
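+
+For illustration only, the moving parts described above could be sketched
+roughly as follows. This is a simplified, self-contained mock-up: every name
+below is invented for the example and is not the actual definition from
+`locked.hpp` or `kernel_program_cache.hpp`.
+```C++
+#include <map>
+#include <mutex>
+#include <string>
+
+struct BuildResultSketch;                       // stands in for BuildResult<T>
+using KernelMapSketch = std::map<std::string, BuildResultSketch *>;
+
+// RAII wrapper: the mutex stays locked for as long as this object lives.
+template <typename Resource> class LockedSketch {
+public:
+  LockedSketch(Resource &Res, std::mutex &Mtx) : MValue(Res), MLock(Mtx) {}
+  Resource &get() { return MValue; }
+
+private:
+  Resource &MValue;
+  std::unique_lock<std::mutex> MLock;
+};
+
+struct PerContextCacheSketch {
+  std::mutex MMutex;
+  KernelMapSketch MKernelMap;
+
+  // "Acquire": lock the cache and hand it out wrapped in the RAII helper.
+  LockedSketch<KernelMapSketch> acquire() {
+    return LockedSketch<KernelMapSketch>(MKernelMap, MMutex);
+  }
+};
+
+// "GetCache": fetch the key->BuildResult mapping out of the locked cache.
+inline KernelMapSketch &getCache(LockedSketch<KernelMapSketch> &Locked) {
+  return Locked.get();
+}
+```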
+
+`BuildResult` structure contains three information fields - pointer to built
+kernel/program, build error (if applicable) and current build status
+(either of "in progress", "succeeded", "failed").
+
+Now, how `getOrBuild` works?
+First, we fetch the cache with sequential calls to Acquire and GetCache
+functions. Then, we check if we're the first ones who build this kernel/program.
+This is achieved with attempt to insert another key-value pair into the map.
+At this point we try to insert `BuildResult` stub instance with status equal to
+"in progress" which will allow other threads to know that someone is (i.e.
+we're) building the object (i.e. kernel or program) now. If insertion fails we
+will wait for building thread to finish with call to `waitUntilBuilt` function.
+This function will throw stored exception[2](#exception-data) upon
+build failure. This allows waiting threads to result the same as the building
+thread. Special case of the failure is when build result doesn't contain the
+error (i.e. the error wasn't of `cl::sycl::exception` type) and the pointer to
+object in `BuildResult` instance is nil. In this case the building thread has
+finished build process and returned the user an error. Though, this error could
+be of spurious/sporadic nature. Hence, the waiting thread will try to build the
+same object once more.
+
+`BuildResult` structure also contains synchronization objects: mutex and
+condition variable. We employ them to signal waiting threads that the build
+process for this kernel/program is finished (either successfully or with a
+failure).
+
+
+1: The use of `std::remove_pointer` was omitted in sake of
+simplicity here.
+
+2: Actually, we store contents of the exception: its message and
+error code.
+

From d141bef684033620c9ea630433252867cb3e81a6 Mon Sep 17 00:00:00 2001
From: Sergey Kanaev
Date: Wed, 23 Sep 2020 18:10:08 +0300
Subject: [PATCH 2/9] [Doc] Address comment

Signed-off-by: Sergey Kanaev
---
 sycl/doc/KernelProgramCache.md | 211 ++++++++++++++++++++++++++-------
 1 file changed, 170 insertions(+), 41 deletions(-)

diff --git a/sycl/doc/KernelProgramCache.md b/sycl/doc/KernelProgramCache.md
index 4e5ed4ec3903b..a2ea03a60e95e 100644
--- a/sycl/doc/KernelProgramCache.md
+++ b/sycl/doc/KernelProgramCache.md
@@ -1,55 +1,184 @@
 # A brief overview of kernel/program caching mechanism.
 
+## Rationale behind caching
+
+*Use-case #1.* Looped enqueue of the same kernel:
+```C++
+  using namespace cl::sycl;
+
+  queue Q;
+  std::vector<buffer<int, 2>> Bufs;
+
+  ...
+  // initialize Bufs with some number of buffers
+  ...
+
+  for (size_t Idx = 0; Idx < Bufs.size(); ++Idx) {
+    Q.submit([&](handler &CGH) {
+      auto Acc = Bufs[Idx].get_access<access::mode::read_write>(CGH);
+
+      CGH.parallel_for(
+        range<2>{N, M}, [=](item<2> Item) { ... });
+    });
+  }
+```
+
+*Use-case #2.* Enqueue of multiple kernels within a single program[1](#what-is-program):
+```C++
+  using namespace cl::sycl;
+
+  queue Q;
+
+  Q.submit([&](handler &CGH) {
+    ...
+
+    CGH.parallel_for(
+        range<2>{N_1, M_1}, [=](item<2> Item) { ... });
+  });
+  Q.submit([&](handler &CGH) {
+    ...
+
+    CGH.parallel_for(
+        range<2>{N_2, M_2}, [=](item<2> Item) { ... });
+  });
+  Q.submit([&](handler &CGH) {
+    ...
+
+    CGH.parallel_for(
+        range<2>{N_3, M_3}, [=](item<2> Item) { ... });
+  });
+  ...
+  Q.submit([&](handler &CGH) {
+    ...
+
+    CGH.parallel_for(
+        range<2>{N_K, M_K}, [=](item<2> Item) { ... });
+  });
+```
+
+Both these use-cases will need to built the program or kernel multiple times.
+When JIT is employed this process may take quite a lot of time.
+ +In order to eliminate this waste of run-time we introduce a kernel/program +caching. The cache is per-context and it caches underlying objects of non +interop kernels and programs which are built with no options. + +1: Here we use the term "program" in the same +sense as OpenCL does i.e. a set of kernels. + + +## Data structure of cache + +The cache stores underlying PI objects of `cl::sycl::program` and +`cl::sycl::kernel` in a per-context data storage. The storage consists of two +maps: one is for programs and the other is for kernels. + +Programs mapping's key consists of three components: +kernel set id[1](#what-is-ksid), specialized constants, device this +program is built for. + +Kernels mapping's key consists of three components too: program the kernel +belongs to, kernel name[2](#what-is-kname), device the program is +built for. + +1: Kernel set id is merely a number of translation +unit which contains at least one kernel. +2: Kernel name is mangled class name which is +provided to methods of `cl::sycl::handler` (e.g. `parallel_for` or +`single_task`). + + +## Points of improvement (things to do) + + - Implement LRU policy on cached objects. See [issue](https://github.com/intel/llvm/issues/2517). + - Allow for caching of objects built with some build options. + - Employ the same built object for multiple devices of the same ISA, + capabilities and so on. *NOTE:* It's not really known if it's possible to + check if two distinct devices are *exactly* the same. + - Improve testing: cover real use-cases. See currently covered cases [here](https://github.com/intel/llvm/blob/sycl/sycl/unittests/kernel-and-program/Cache.cpp). + + +## Implementation details + +The caches are represented with instance of [`KernelProgramCache`](https://github.com/intel/llvm/blob/sycl/sycl/source/detail/kernel_program_cache.hpp) +class. The class is instantiated in a per-context manner. + +The `KernelProgramCache` is the storage descrived above. + + +### When does the cache come at work? + The cache is employed when one submits kernel for execution or builds program or -kernel with SYCL API. At the same time programs and kernels are cached only when -they're built from C++ source, i.e. `program::build_with_kernel_type<>()` and -`program::get_kernel<>()` methods are employed. This restriction is implemented -via use of `program_impl::is_cacheable_with_options()` and -`program_impl::is_cacheable()` methods. The latter method only returns a boolean -flag which is set to false on default and is set to true in a single use-case. -One can find use-cases and cache filling in the [unit-tests](https://github.com/intel/llvm/blob/sycl/sycl/unittests/kernel-and-program/Cache.cpp). - -How does it work, i.e. at which point is the cache employed? At some point of -`ExecCGCommand`'s enqueue process the program manager's method will be called: -either `ProgramManager::getBuildPIProgram` or -`ProgramManager::getOrCreateKernel`. Now, both these methods will call template -function [`getOrBuild`](../source/detail/program_manager/program_manager.cpp#L149) +kernel with SYCL API. That means that the cache works when either user +explicitly calls `program::build_with_kernel_type<>()`/`program::get_kernel<>()` +methods or SYCL RT builds or gets the required kernel. Cacheability of an object +is verified with `program_impl::is_cacheable()` method. SYCL RT will check if +program is cacheable and will get the kernel with call to +`ProgramManager::getOrCreateKernel()` method. 
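+
+For example, the explicit path may look like the sketch below. This is an
+illustrative example only, built on the SYCL 1.2.1 `program` API mentioned
+above; the kernel name `CacheMe` is made up for the example:
+```C++
+#include <CL/sycl.hpp>
+using namespace cl::sycl;
+
+class CacheMe;
+
+int main() {
+  queue Q;
+  program Prg(Q.get_context());
+  // Built with no options and not from source, so per the rules above the
+  // underlying program is a candidate for caching.
+  Prg.build_with_kernel_type<CacheMe>();
+  kernel Krn = Prg.get_kernel<CacheMe>(); // kernel is cached per program
+  (void)Krn;
+
+  Q.submit([&](handler &CGH) {
+    CGH.parallel_for<CacheMe>(range<1>{1024}, [=](id<1>) { /* ... */ });
+  });
+  Q.wait();
+  return 0;
+}
+```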
+ + +*NOTE:* a kernel is only cacheable if and only if the program it belongs to is +cacheable. On the other hand if the program is cacheable, then each and every +kernel of this program will be cached also. + + +Invoked by user `program::build_with_kernel_type<>()` and +`program::get_kernel<>()` methods will call either +`ProgramManager::getBuildPIProgram()` or `ProgramManager::getOrCreateKernel()` +method respectively. Now, both these methods will call template +function [`getOrBuild()`](../source/detail/program_manager/program_manager.cpp#L149) with multiple lambdas passed to it: - Acquire function; - GetCache function; - Build function. -Acquire function returns a locked version of cache. Locking is employed for +*Acquire* function returns a locked version of cache. Locking is employed for thread safety. The threads are blocked only for insert-or-acquire attempt, i.e. when calling to `map::insert` in [`getOrBuild`](../source/detail/program_manager/program_manager.cpp#L149) function. The rest of operation is done with the help of atomics and condition variables (plus a mutex for proper work of condition variable). -GetCache function returns a reference to mapping `key->value` out of locked +*GetCache* function returns a reference to mapping `key->value` out of locked instance of cache. We will see rationale behind it a bit later. -Build function actually builds the kernel or program. - -When we say "cache" we think about mapping of some key to value. These maps are -contained within [KernelProgramCache](https://github.com/intel/llvm/blob/sycl/sycl/source/detail/kernel_program_cache.hpp) -class instance which on its own belongs to `context_impl` class instance. -Kernel cache is per program mapping of kernel name plus device pair to -`BuildResult`[1](#remove-pointer). When `getOrBuild` -function is called the key for kernel cache is pair/tuple of kernel name and -device. Program cache maps triple (spec consts, kernel set id, device) to -`BuildResult`[1](#remove-pointer). - -Now, we have a helper [Locked](https://github.com/intel/llvm/blob/sycl/sycl/include/CL/sycl/detail/locked.hpp) -class. It's to use RAII to make code look cleaner a bit. Acquire function/lambda -will return a specialization of Locked class for reference of proper mapping. -Now, GetCache function will return the mapping to be employed i.e. it'll fetch -mapping of kernel name plus device to `BuildResult` for proper program as -`getOrBuild` will work with mapping of key (whichever it is) to `BuildResult` -specialization. - -`BuildResult` structure contains three information fields - pointer to built -kernel/program, build error (if applicable) and current build status -(either of "in progress", "succeeded", "failed"). +*Build* function actually builds the kernel or program. + +Caching isn't done: + - when program is built out of source i.e. with + `program::build_with_source()` or `program::compile_with_source()` method; + - when program is result of linking of multiple programs. + + +### Thread-safety + +Why do we need thread safety here? It's quite possible to have a use-case when +the `cl::sycl::context` is shared across multiple threads (e.g. via sharing a +queue). Possibility of enqueueing multiple cacheable kernels simultaneously +within multiple threads makes us to provide thread-safety for the cache. + +It's worth of noting that we don't cache the PI resource (kernel or program) +on it's own. Instead we augment the resource with the status of build process. 
+Hence, what is cached is a wrapper structure `BuildResult` which contains three +information fields - pointer to built resource, build error (if applicable) and +current build status (either of "in progress", "succeeded", "failed"). + +One can find definition of `BuildResult` template in [KernelProgramCache](https://github.com/intel/llvm/blob/sycl/sycl/source/detail/kernel_program_cache.hpp). + +Pointer to built resource and build result are both atomic variables. Atomicity +of these variables allows one to hold lock on cache for quite a short time and +perform the rest of build/wait process without unwanted need of other threads to +wait on lock availability. + +A specialization of helper class [Locked](https://github.com/intel/llvm/blob/sycl/sycl/include/CL/sycl/detail/locked.hpp) +for reference of proper mapping is returned by Acquire function. The use of this +class implements RAII to make code look cleaner a bit. Now, GetCache function +will return the mapping to be employed i.e. it'll fetch mapping of kernel name +plus device to `BuildResult` for proper program as `getOrBuild` will work with +mapping of key (whichever it is) to `BuildResult` specialization. The structure +is specialized with either `PiKernel` or `PiProgram`[1](#remove-program). + + +### Core of caching mechanism Now, how `getOrBuild` works? First, we fetch the cache with sequential calls to Acquire and GetCache @@ -74,9 +203,9 @@ process for this kernl/program is finished (either successfuly or with a failure). -1: The use of `std::remove_pointer` was omitted in sake of -simplicity here. +1: The use of `std::remove_pointer` was omitted in +sake of simplicity here. -2: Actually, we store contents of the exception: its message and -error code. +2: Actually, we store contents of the exception: +its message and error code. From 39b20f520159ab5a8ce6cf1d18454c0cd7098543 Mon Sep 17 00:00:00 2001 From: Sergey Kanaev Date: Thu, 24 Sep 2020 11:54:12 +0300 Subject: [PATCH 3/9] Add line break Signed-off-by: Sergey Kanaev --- sycl/doc/KernelProgramCache.md | 1 + 1 file changed, 1 insertion(+) diff --git a/sycl/doc/KernelProgramCache.md b/sycl/doc/KernelProgramCache.md index a2ea03a60e95e..a8bbb6e93aaaf 100644 --- a/sycl/doc/KernelProgramCache.md +++ b/sycl/doc/KernelProgramCache.md @@ -83,6 +83,7 @@ built for. 1: Kernel set id is merely a number of translation unit which contains at least one kernel. + 2: Kernel name is mangled class name which is provided to methods of `cl::sycl::handler` (e.g. `parallel_for` or `single_task`). 
From c16e4ad989483122273ebfdc17c7a9c75d4b4f42 Mon Sep 17 00:00:00 2001 From: Sergey Kanaev Date: Thu, 24 Sep 2020 11:54:30 +0300 Subject: [PATCH 4/9] Add new file to TOC Signed-off-by: Sergey Kanaev --- sycl/doc/index.rst | 1 + 1 file changed, 1 insertion(+) diff --git a/sycl/doc/index.rst b/sycl/doc/index.rst index 29a4501fc7e41..82eae7f8a0f4f 100644 --- a/sycl/doc/index.rst +++ b/sycl/doc/index.rst @@ -27,3 +27,4 @@ Developing oneAPI DPC++ Compiler EnvironmentVariables PluginInterface ABIPolicyGuide + KernelProgramCache From 5091adef6441c8dc90d31b3efb5960c44e4937b5 Mon Sep 17 00:00:00 2001 From: Sergey Kanaev Date: Thu, 24 Sep 2020 12:16:15 +0300 Subject: [PATCH 5/9] Fix link Signed-off-by: Sergey Kanaev --- sycl/doc/KernelProgramCache.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/sycl/doc/KernelProgramCache.md b/sycl/doc/KernelProgramCache.md index a8bbb6e93aaaf..d2c632583f511 100644 --- a/sycl/doc/KernelProgramCache.md +++ b/sycl/doc/KernelProgramCache.md @@ -127,7 +127,7 @@ Invoked by user `program::build_with_kernel_type<>()` and `program::get_kernel<>()` methods will call either `ProgramManager::getBuildPIProgram()` or `ProgramManager::getOrCreateKernel()` method respectively. Now, both these methods will call template -function [`getOrBuild()`](../source/detail/program_manager/program_manager.cpp#L149) +function [`getOrBuild()`](https://github.com/intel/llvm/blob/sycl/sycl/source/detail/program_manager/program_manager.cpp#L149) with multiple lambdas passed to it: - Acquire function; - GetCache function; @@ -135,7 +135,7 @@ with multiple lambdas passed to it: *Acquire* function returns a locked version of cache. Locking is employed for thread safety. The threads are blocked only for insert-or-acquire attempt, i.e. -when calling to `map::insert` in [`getOrBuild`](../source/detail/program_manager/program_manager.cpp#L149) +when calling to `map::insert` in [`getOrBuild`](https://github.com/intel/llvm/blob/sycl/sycl/source/detail/program_manager/program_manager.cpp#L149) function. The rest of operation is done with the help of atomics and condition variables (plus a mutex for proper work of condition variable). From 40b7669091e95276e7c28c3d962ef7d30ba6b80c Mon Sep 17 00:00:00 2001 From: Sergey Kanaev Date: Wed, 30 Sep 2020 17:39:41 +0300 Subject: [PATCH 6/9] Address comments Signed-off-by: Sergey Kanaev --- sycl/doc/KernelProgramCache.md | 136 ++++++++++++++++++--------------- 1 file changed, 73 insertions(+), 63 deletions(-) diff --git a/sycl/doc/KernelProgramCache.md b/sycl/doc/KernelProgramCache.md index d2c632583f511..5747926db3a45 100644 --- a/sycl/doc/KernelProgramCache.md +++ b/sycl/doc/KernelProgramCache.md @@ -1,8 +1,15 @@ -# A brief overview of kernel/program caching mechanism. +# A brief overview of kernel and program caching mechanism. ## Rationale behind caching -*Use-case #1.* Looped enqueue of the same kernel: +During SYCL program execution SYCL runtime will create internal objects +representing kernels and programs, it may also invoke JIT compiler to bring +kernels in a program to executable state. Those runtime operations are quite +expensive, and in some cases caching approach can be employed to eliminate +redundant kernel or program object re-creation and online recompilation. Few +examples below illustrate scenarios where such optimization is possible. 
+
+*Use-case #1.* Submission of the same kernel in a loop:
 ```C++
   using namespace cl::sycl;
 
@@ -23,7 +30,7 @@
     });
   }
 ```
 
-*Use-case #2.* Enqueue of multiple kernels within a single program[1](#what-is-program):
+*Use-case #2.* Submission of multiple kernels within a single program[1](#what-is-program):
 ```C++
   using namespace cl::sycl;
 
@@ -56,36 +63,36 @@
   });
 ```
 
-Both these use-cases will need to built the program or kernel multiple times.
-When JIT is employed this process may take quite a lot of time.
+In both cases SYCL runtime will need to build the program and kernels multiple
+times, which may involve JIT compilation and take quite a lot of time.
 
-In order to eliminate this waste of run-time we introduce a kernel/program
+In order to eliminate this waste of run-time we introduce a kernel and program
 caching. The cache is per-context and it caches underlying objects of non
 interop kernels and programs which are built with no options.
 
-1: Here we use the term "program" in the same
-sense as OpenCL does i.e. a set of kernels.
+1: Here "program" means an internal SYCL runtime
+object corresponding to a SPIRV module or native binary defining a set of SYCL
+kernels and/or device functions.
 
 
 ## Data structure of cache
 
-The cache stores underlying PI objects of `cl::sycl::program` and
-`cl::sycl::kernel` in a per-context data storage. The storage consists of two
-maps: one is for programs and the other is for kernels.
+The cache stores underlying PI objects behind `cl::sycl::program` and
+`cl::sycl::kernel` user-levelobjects in a per-context data storage. The storage
+consists of two maps: one is for programs and the other is for kernels.
 
-Programs mapping's key consists of three components:
-kernel set id[1](#what-is-ksid), specialized constants, device this
-program is built for.
+The programs map's key consists of three components: kernel set id[1](#what-is-ksid),
+specialized constants, device this program is built for.
 
-Kernels mapping's key consists of three components too: program the kernel
+The kernels map's key consists of three components too: program the kernel
 belongs to, kernel name[2](#what-is-kname), device the program is
 built for.
 
-1: Kernel set id is merely a number of translation
-unit which contains at least one kernel.
+1: Kernel set id is an ordinal number of the device
+binary image the kernel is contained in.
 
-2: Kernel name is mangled class name which is
-provided to methods of `cl::sycl::handler` (e.g. `parallel_for` or
+2: Kernel name is the mangled name of the kernel ID class
+which is provided to methods of `cl::sycl::handler` (e.g. `parallel_for` or
 `single_task`).
 
 
 ## Points of improvement (things to do)
 
  - Implement LRU policy on cached objects. See [issue](https://github.com/intel/llvm/issues/2517).
  - Allow for caching of objects built with some build options.
  - Employ the same built object for multiple devices of the same ISA,
   capabilities and so on. *NOTE:* It's not really known if it's possible to
   check if two distinct devices are *exactly* the same.
 - Improve testing: cover real use-cases. See currently covered cases [here](https://github.com/intel/llvm/blob/sycl/sycl/unittests/kernel-and-program/Cache.cpp).
 
 
 ## Implementation details
 
 The caches are represented with instance of [`KernelProgramCache`](https://github.com/intel/llvm/blob/sycl/sycl/source/detail/kernel_program_cache.hpp)
-class. The class is instantiated in a per-context manner.
+class. The runtime creates one instance of the class per distinct SYCL context
+(A context object which is a result of copying another context object isn't
+"distinct", as it corresponds to the same underlying internal object
+representing a context).
 
-The `KernelProgramCache` is the storage descrived above.
+The `KernelProgramCache` is essentially a pair of maps as described above.
 
 
 ### When does the cache come at work?
 
-The cache is employed when one submits kernel for execution or builds program or
-kernel with SYCL API. That means that the cache works when either user
-explicitly calls `program::build_with_kernel_type<>()`/`program::get_kernel<>()`
-methods or SYCL RT builds or gets the required kernel. Cacheability of an object
-is verified with `program_impl::is_cacheable()` method. SYCL RT will check if
-program is cacheable and will get the kernel with call to
-`ProgramManager::getOrCreateKernel()` method.
+The cache is used when one submits a kernel for execution or builds a program or
+a kernel with SYCL API.
+That means that the cache works when either user explicitly calls
+`program::build_with_kernel_type<>()`/`program::get_kernel<>()` methods or SYCL
+RT builds a program or gets the required kernel as needed during application
+execution. Cacheability of an object can be tested with
+`program_impl::is_cacheable()` method. SYCL RT will only try to insert cacheable
+programs or kernels into the cache. This is done as a part of
+`ProgramManager::getOrCreateKernel()` method.
 
 
 *NOTE:* a kernel is only cacheable if and only if the program it belongs to is
 cacheable. On the other hand if the program is cacheable, then each and every
 kernel of this program will be cached also.
 
 
-Invoked by user `program::build_with_kernel_type<>()` and
-`program::get_kernel<>()` methods will call either
-`ProgramManager::getBuildPIProgram()` or `ProgramManager::getOrCreateKernel()`
-method respectively. Now, both these methods will call template
-function [`getOrBuild()`](https://github.com/intel/llvm/blob/sycl/sycl/source/detail/program_manager/program_manager.cpp#L149)
-with multiple lambdas passed to it:
+All requests to build a program or to create a kernel - whether they originate
+from explicit user API calls or from internal SYCL runtime execution logic - end
+up calling the function [`getOrBuild()`](https://github.com/intel/llvm/blob/sycl/sycl/source/detail/program_manager/program_manager.cpp#L149)
+with a number of lambda functions passed as arguments:
  - Acquire function;
  - GetCache function;
  - Build function.
 
@@ -145,20 +154,21 @@ instance of cache. We will see rationale behind it a bit later.
 
 *Build* function actually builds the kernel or program.
 
 Caching isn't done:
- - when program is built out of source i.e. with
-   `program::build_with_source()` or `program::compile_with_source()` method;
- - when program is result of linking of multiple programs.
+ - when program is built out of source with `program::build_with_source()` or
+   `program::compile_with_source()` method;
+ - when program is a result of linking multiple programs.
 
 
 ### Thread-safety
 
-Why do we need thread safety here? It's quite possible to have a use-case when
+Why do we need thread safety here? It is quite possible to have a use-case when
 the `cl::sycl::context` is shared across multiple threads (e.g. via sharing a
 queue). Possibility of enqueueing multiple cacheable kernels simultaneously
-within multiple threads makes us to provide thread-safety for the cache.
+from multiple threads requires us to provide thread-safety for the caching
+mechanisms.
 
-It's worth of noting that we don't cache the PI resource (kernel or program)
-on it's own. Instead we augment the resource with the status of build process.
+It is worth noting that we don't cache the PI resource (kernel or program)
+by itself. Instead we augment the resource with the status of build process.
@@ -167,36 +177,36 @@ One can find definition of `BuildResult` template in [KernelProgramCache](https: Pointer to built resource and build result are both atomic variables. Atomicity of these variables allows one to hold lock on cache for quite a short time and -perform the rest of build/wait process without unwanted need of other threads to -wait on lock availability. +perform the rest of build/wait process without forcing other threads to wait on +lock availability. A specialization of helper class [Locked](https://github.com/intel/llvm/blob/sycl/sycl/include/CL/sycl/detail/locked.hpp) for reference of proper mapping is returned by Acquire function. The use of this class implements RAII to make code look cleaner a bit. Now, GetCache function -will return the mapping to be employed i.e. it'll fetch mapping of kernel name -plus device to `BuildResult` for proper program as `getOrBuild` will work with -mapping of key (whichever it is) to `BuildResult` specialization. The structure -is specialized with either `PiKernel` or `PiProgram`[1](#remove-program). +will return the mapping to be employed that includes the 3 components: kernel +name, device as well as any specialization constants. These get added to +`BuildResult` and are cached. The `BuildResult` structure is specialized with +either `PiKernel` or `PiProgram`[1](#remove-program). ### Core of caching mechanism -Now, how `getOrBuild` works? +Now, let us see how 'getOrBuild' function works. First, we fetch the cache with sequential calls to Acquire and GetCache -functions. Then, we check if we're the first ones who build this kernel/program. -This is achieved with attempt to insert another key-value pair into the map. -At this point we try to insert `BuildResult` stub instance with status equal to -"in progress" which will allow other threads to know that someone is (i.e. -we're) building the object (i.e. kernel or program) now. If insertion fails we -will wait for building thread to finish with call to `waitUntilBuilt` function. -This function will throw stored exception[2](#exception-data) upon -build failure. This allows waiting threads to result the same as the building -thread. Special case of the failure is when build result doesn't contain the -error (i.e. the error wasn't of `cl::sycl::exception` type) and the pointer to -object in `BuildResult` instance is nil. In this case the building thread has -finished build process and returned the user an error. Though, this error could -be of spurious/sporadic nature. Hence, the waiting thread will try to build the -same object once more. +functions. Then, we check if this is the first attempt to build this kernel or +program. This is achieved with an attempt to insert another key-value pair into +the map. At this point we try to insert `BuildResult` stub instance with status +equal to "in progress" which will allow other threads to know that someone is +(i.e. we're) building the object (i.e. kernel or program) now. If insertion +fails, we will wait for building thread to finish with call to `waitUntilBuilt` +function. This function will throw stored exception[2](#exception-data) +upon build failure. This allows waiting threads to see the same result as the +building thread. Special case of the failure is when build result doesn't +contain the error (i.e. the error wasn't of `cl::sycl::exception` type) and the +pointer to object in `BuildResult` instance is nil. In this case, the building +thread has finished the build process and has returned an error to the user. 
+But this error may be sporadic in nature and may be spurious. Hence, the waiting
+thread will try to build the same object once more.
 
 `BuildResult` structure also contains synchronization objects: mutex and
 condition variable. We employ them to signal waiting threads that the build
 process for this kernel/program is finished (either successfully or with a
 failure).
 
 
-1: The use of `std::remove_pointer` was omitted in
-sake of simplicity here.
+1: The use of `std::remove_pointer` was omitted for
+the sake of simplicity here.
 
-2: Actually, we store contents of the exception: its message and
-error code.
+2: Actually, we store contents of the exception:
+its message and error code.

From 327715391eb188930bea62bceb97365a8c8b3957 Mon Sep 17 00:00:00 2001
From: Sergey Kanaev
Date: Mon, 5 Oct 2020 13:44:40 +0300
Subject: [PATCH 7/9] Add note

Signed-off-by: Sergey Kanaev
---
 sycl/doc/KernelProgramCache.md | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/sycl/doc/KernelProgramCache.md b/sycl/doc/KernelProgramCache.md
index 5747926db3a45..575f3fa38e77e 100644
--- a/sycl/doc/KernelProgramCache.md
+++ b/sycl/doc/KernelProgramCache.md
@@ -102,7 +102,8 @@ which is provided to methods of `cl::sycl::handler` (e.g. `parallel_for` or
  - Allow for caching of objects built with some build options.
  - Employ the same built object for multiple devices of the same ISA,
    capabilities and so on. *NOTE:* It's not really known if it's possible to
-   check if two distinct devices are *exactly* the same.
+   check if two distinct devices are *exactly* the same. Probably this should be
+   an improvement request for plugins.
  - Improve testing: cover real use-cases. See currently covered cases [here](https://github.com/intel/llvm/blob/sycl/sycl/unittests/kernel-and-program/Cache.cpp).
 
 

From 0142d6a4637c4b78421a115f443cd0a283cf1e06 Mon Sep 17 00:00:00 2001
From: Sergey Kanaev
Date: Tue, 6 Oct 2020 12:08:14 +0300
Subject: [PATCH 8/9] Address comments.

Signed-off-by: Sergey Kanaev
---
 sycl/doc/KernelProgramCache.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/sycl/doc/KernelProgramCache.md b/sycl/doc/KernelProgramCache.md
index 575f3fa38e77e..8a0c5c06d8b0b 100644
--- a/sycl/doc/KernelProgramCache.md
+++ b/sycl/doc/KernelProgramCache.md
@@ -78,7 +78,7 @@ ## Data structure of cache
 
 The cache stores underlying PI objects behind `cl::sycl::program` and
-`cl::sycl::kernel` user-levelobjects in a per-context data storage. The storage
+`cl::sycl::kernel` user-level objects in a per-context data storage. The storage
 consists of two maps: one is for programs and the other is for kernels.
 
 The programs map's key consists of three components: kernel set id[1](#what-is-ksid),
 specialized constants, device this program is built for.
@@ -187,7 +187,7 @@ class implements RAII to make code look cleaner a bit. Now, GetCache function
 will return the mapping to be employed that includes the 3 components: kernel
 name, device as well as any specialization constants. These get added to
 `BuildResult` and are cached. The `BuildResult` structure is specialized with
-either `PiKernel` or `PiProgram`[1](#remove-program).
+either `PiKernel` or `PiProgram`[1](#remove-pointer).
 
 
 ### Core of caching mechanism

From fc153b40436402672538a56496f7cdcbff600e02 Mon Sep 17 00:00:00 2001
From: Sergey Kanaev
Date: Fri, 9 Oct 2020 16:02:16 +0300
Subject: [PATCH 9/9] Address comments.
Signed-off-by: Sergey Kanaev
---
 sycl/doc/KernelProgramCache.md | 33 +++++++++++++++++++++++++++++----
 1 file changed, 29 insertions(+), 4 deletions(-)

diff --git a/sycl/doc/KernelProgramCache.md b/sycl/doc/KernelProgramCache.md
index 8a0c5c06d8b0b..88d1e11881ce8 100644
--- a/sycl/doc/KernelProgramCache.md
+++ b/sycl/doc/KernelProgramCache.md
@@ -176,10 +176,35 @@ current build status (either of "in progress", "succeeded", "failed").
 
 One can find definition of `BuildResult` template in [KernelProgramCache](https://github.com/intel/llvm/blob/sycl/sycl/source/detail/kernel_program_cache.hpp).
 
-Pointer to built resource and build result are both atomic variables. Atomicity
-of these variables allows one to hold lock on cache for quite a short time and
-perform the rest of build/wait process without forcing other threads to wait on
-lock availability.
+The built resource access synchronization approach aims at minimizing the time
+any thread holds the global lock guarding the maps to improve performance. To
+achieve that, the global lock is acquired only for the duration of the global
+map access. Actual build of the program happens outside of the lock, so other
+threads can request or build other programs in the meantime. A thread requesting
+a `BuildResult` instance via `getOrBuild` can go one of three ways:
+  A) Build result is **not** available, it is the first thread to request it.
+     Current thread will then execute the build letting others wait for the
+     result using the per-build result condition variable kept in `BuildResult`'s
+     `MBuildCV` field.
+  B) Build result is **not** available, another thread is already building the
+     result. Current thread will then wait for the result using the `MBuildCV`
+     condition variable.
+  C) Build result **is** available. The thread simply takes it from the `Ptr`
+     field w/o using any mutexes or condition variables.
+
+As noted before, access to `BuildResult` instance fields may occur from
+different threads simultaneously, but the global lock is no longer held. So, to
+make it safe and to make sure only one thread builds the requested program, the
+following is done:
+ - program build state is reflected in the `State` field, threads use
+   compare-and-swap technique to compete for who will do the build and become
+   thread A. Threads C will find 'DONE' in this field and immediately return
+   with the built result at hand.
+ - thread A and thread(s) B use the `MBuildCV` condition variable field and
+   `MBuildResultMutex` mutex field guarding that variable to implement the
+   "single producer-multiple consumers scheme".
+ - the build result itself appears in the 'Ptr' field when available.
+All fields are atomic because they can be accessed from multiple threads.
 
 A specialization of helper class [Locked](https://github.com/intel/llvm/blob/sycl/sycl/include/CL/sycl/detail/locked.hpp)
 for reference of proper mapping is returned by Acquire function. The use of this