uxlfoundation · anton-potapov · Mar 21, 2022 · Aug 3, 2021 · Mar 9, 2022 · Mar 10, 2022
diff --git a/doc/main/tbb_userguide/How_Task_Scheduler_Works.rst b/doc/main/tbb_userguide/How_Task_Scheduler_Works.rst
@@ -0,0 +1,50 @@
+.. _How_Task_Scheduler_Works:
+
+How Task Scheduler Works
+========================
+
+
+While the task scheduler is not bound to any particular type of parallelism, 
+it was designed to work efficiently for fork-join parallelism with lots of forks.
+This type of parallelism is typical for parallel algorithms such as `oneapi::tbb::parallel_for
+<https://spec.oneapi.io/versions/latest/elements/oneTBB/source/algorithms/functions/parallel_for_func.html>`_.
+
+Let's consider how fork-join parallelism maps onto the task scheduler in more detail.
+
+The scheduler runs tasks in a way that tries to achieve several targets simultaneously: 
+ - Utilize as many threads as possible to achieve actual parallelism
+ - Preserve data locality to make single-thread execution more efficient
+ - Minimize both memory demands and cross-thread communication to reduce overhead
+
+To achieve this, a balance between depth-first and breadth-first execution strategies 
+must be reached. Assuming that the task graph is finite, depth-first is better for 
+a sequential execution because:
+
+- **Strike when the cache is hot**. The deepest tasks are the most recently created tasks and therefore are the hottest in the cache.
+  Also, if they can be completed, the tasks that depend on them can continue executing, and though not the hottest in the cache,
+  they are still warmer than the older tasks deeper in the deque.
+
+- **Minimize space**. Execution of the shallowest task leads to the breadth-first unfolding of a graph. It creates an exponential
+  number of nodes that co-exist simultaneously. In contrast, depth-first execution creates the same number 
+  of nodes, but only a linear number can exist at the same time, since it creates a stack of other ready 
+  tasks.
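
The space argument above can be checked on a small model. The following is a hypothetical sketch, not oneTBB code: it assumes a full binary fork tree in which every task shallower than ``max_depth`` forks exactly two children, and the function name ``peak_ready_tasks`` is invented for this illustration. It counts how many created-but-not-yet-executed tasks coexist when the newest task runs first (depth-first) versus when the oldest runs first (breadth-first).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <deque>

// Toy model (an assumption of this sketch, not oneTBB behavior):
// a task is represented only by its depth in a full binary fork tree.
// Executing a task at depth < max_depth forks two children.
// Returns the largest number of ready tasks that exist at the same time.
std::size_t peak_ready_tasks(int max_depth, bool depth_first) {
    assert(max_depth >= 0);
    std::deque<int> ready{0};            // start with the root task
    std::size_t peak = ready.size();
    while (!ready.empty()) {
        int depth;
        if (depth_first) {               // run the newest (deepest) task
            depth = ready.back();
            ready.pop_back();
        } else {                         // run the oldest (shallowest) task
            depth = ready.front();
            ready.pop_front();
        }
        if (depth < max_depth) {         // fork two children
            ready.push_back(depth + 1);
            ready.push_back(depth + 1);
        }
        peak = std::max(peak, ready.size());
    }
    return peak;
}
```

Under these assumptions, a tree of depth 10 peaks at 11 ready tasks with depth-first order (linear in the depth) and at 1024 ready tasks with breadth-first order (exponential in the depth).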
+
+Each thread has its own deque of tasks that are ready to run. When a 
+thread spawns a task, it pushes the task onto the bottom of its deque.
+
+When a thread participates in the evaluation of tasks, it repeatedly executes 
+a task obtained by the first rule that applies from the following roughly equivalent ruleset:
+
+1. Get the task returned by the previously executed task, if any.
+
+2. Take a task from the bottom of its deque, if any.
+
+3. Steal a task from the top of another randomly chosen deque. If the 
+   selected deque is empty, the thread tries again to execute this rule until it succeeds.
+
+Rule 1 is described in :doc:`Task Scheduler Bypass <Task_Scheduler_Bypass>`. 
+The overall effect of rule 2 is to execute the *youngest* task spawned by the thread, 
+which causes depth-first execution until the thread runs out of work. 
+Then rule 3 applies. It steals the *oldest* task spawned by another thread, 
+which causes temporary breadth-first execution that converts potential parallelism 
+into actual parallelism.
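
The difference between rule 2 and rule 3 can be illustrated with a single-threaded toy model. This is a sketch under stated assumptions, not the oneTBB implementation: ``ToyWorker``, ``take_from_bottom``, ``steal_from_top``, and ``next_task`` are hypothetical names, tasks are plain strings, the victim is fixed rather than chosen at random, and a failed steal returns ``false`` instead of retrying.

```cpp
#include <cassert>
#include <deque>
#include <string>
#include <utility>

// Hypothetical toy worker; not a oneTBB type.
struct ToyWorker {
    // front() is the top (oldest task), back() is the bottom (youngest task).
    std::deque<std::string> ready;

    // Spawning pushes the new task onto the bottom of the owner's deque.
    void spawn(std::string task) { ready.push_back(std::move(task)); }

    // Rule 2: the owner takes the youngest task from the bottom.
    bool take_from_bottom(std::string& out) {
        if (ready.empty()) return false;
        out = std::move(ready.back());
        ready.pop_back();
        return true;
    }

    // Rule 3, seen from the victim: a thief takes the oldest task from the top.
    bool steal_from_top(std::string& out) {
        if (ready.empty()) return false;
        out = std::move(ready.front());
        ready.pop_front();
        return true;
    }
};

// Applies rule 2 first, then rule 3 against one fixed victim.
// Simplification: reports failure instead of retrying with another victim.
bool next_task(ToyWorker& self, ToyWorker& victim, std::string& out) {
    assert(&self != &victim);  // a worker does not steal from itself
    if (self.take_from_bottom(out)) return true;
    return victim.steal_from_top(out);
}
```

If worker ``a`` spawns ``A1``, ``A2``, and ``A3`` in that order, ``a`` itself next obtains ``A3`` (its youngest task), while an idle worker ``b`` obtains ``A1`` (the oldest task of ``a``). In this model the owner and the thief work at opposite ends of the deque.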
diff --git a/doc/main/tbb_userguide/Task_Scheduler_Bypass.rst b/doc/main/tbb_userguide/Task_Scheduler_Bypass.rst
@@ -0,0 +1,20 @@
+.. _Task_Scheduler_Bypass:
+
+Task Scheduler Bypass
+=====================
+
+Scheduler bypass is an optimization where you directly specify the next task to run. 
+According to the rules of execution described in :doc:`How Task Scheduler Works <How_Task_Scheduler_Works>`, 
+spawning a new task to be executed by the current thread involves the following steps:
+
+ 1. Push a new task onto the thread's deque.
+ 2. Continue to execute the current task until it is completed.
+ 3. Take the task from the thread's deque, unless another thread has already stolen it.
+
+Steps 1 and 3 introduce unnecessary deque operations or, even worse, allow stealing that can hurt 
+locality without adding significant parallelism. You can avoid these problems with the "Task Scheduler Bypass" technique, which directly specifies the preferred task to execute next 
+instead of spawning it. Then, as described in :doc:`How Task Scheduler Works <How_Task_Scheduler_Works>`,
+the returned task becomes the first candidate for the next task to be executed by the thread. Furthermore, this approach almost guarantees that 
+the task is executed by the current thread and not by any other thread.
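
The saving in deque operations can be made concrete with a single-threaded toy loop. This is a hypothetical sketch, not the oneTBB API: tasks are plain step indices forming a chain, ``run_chain`` and ``ChainStats`` are names invented here, and "returning the next task" is modeled by the ``next`` variable. There is no second thread, so stealing is not modeled.

```cpp
#include <cassert>
#include <deque>

// Counters collected by the toy loop (hypothetical type).
struct ChainStats {
    int executed;    // number of tasks that ran
    int deque_ops;   // number of pushes plus pops on the ready deque
};

// Runs a chain of `steps` tasks, where task i hands over to task i + 1.
// With use_bypass == false each task spawns its successor (push, later pop).
// With use_bypass == true each task returns its successor directly.
ChainStats run_chain(int steps, bool use_bypass) {
    assert(steps >= 1);
    std::deque<int> ready;
    ChainStats stats{0, 0};

    ready.push_back(0);              // the first task is always spawned
    ++stats.deque_ops;

    int next = -1;                   // -1 means "no task was returned"
    while (true) {
        int current;
        if (next >= 0) {             // rule 1: run the returned task
            current = next;
            next = -1;
        } else if (!ready.empty()) { // rule 2: take from the bottom
            current = ready.back();
            ready.pop_back();
            ++stats.deque_ops;
        } else {
            break;                   // no work left in this toy model
        }

        ++stats.executed;            // "execute" the task body
        if (current + 1 < steps) {
            if (use_bypass) {
                next = current + 1;  // bypass: hand over directly
            } else {
                ready.push_back(current + 1);  // spawn the successor
                ++stats.deque_ops;
            }
        }
    }
    return stats;
}
```

For a chain of 5 tasks, this model performs 10 deque operations when every successor is spawned and 2 when every successor is returned; both variants execute all 5 tasks.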
+
+Please note that, at the moment, the only way to use this optimization is through the preview feature of ``oneapi::tbb::task_group``.
diff --git a/doc/main/tbb_userguide/Task_Scheduler_Summary.rst b/doc/main/tbb_userguide/Task_Scheduler_Summary.rst
diff --git a/doc/main/tbb_userguide/The_Task_Scheduler.rst b/doc/main/tbb_userguide/The_Task_Scheduler.rst
@@ -16,5 +16,6 @@ onto one of the high-level templates, use the task scheduler.
 
    ../tbb_userguide/Task-Based_Programming
    ../tbb_userguide/When_Task-Based_Programming_Is_Inappropriate
-   ../tbb_userguide/Task_Scheduler_Summary
+   ../tbb_userguide/How_Task_Scheduler_Works
+   ../tbb_userguide/Task_Scheduler_Bypass
    ../tbb_userguide/Guiding_Task_Scheduler_Execution