[DOC] Composability of different threading runtimes (openvinotoolkit#26950)

### Details:
- *Document composability of different threading runtimes when running inference and other application logic on the CPU device*
 - *Document the threading impact for LLMs with the Optimum Intel API*

### Tickets:
 - *CVS-150542, CVS-145996*

---------

Signed-off-by: Chen, Peter <peter.chen@intel.com>
Co-authored-by: Tatiana Savina <tatiana.savina@intel.com>
2 people authored and CuriousPanCake committed Nov 6, 2024
1 parent 3398f28 commit 41e5f48
Showing 2 changed files with 33 additions and 0 deletions.
@@ -301,6 +301,19 @@ model to avoid extra computation. This is how it can be done for LLMs:
Now the model can be converted to OpenVINO using Optimum Intel Python API or CLI interfaces
mentioned above.

Execution on CPU device
##########################

As mentioned in the :ref:`Composability of different threading runtimes <Composability_of_different_threading_runtimes>` section, OpenVINO's default threading runtime,
oneTBB, keeps CPU cores active for a while after inference is done. The Optimum Intel Python API
calls Torch (via Hugging Face Transformers) for postprocessing, such as beam search or greedy search.
Torch uses OpenMP for threading, and OpenMP has to wait for the CPU cores that oneTBB keeps active.
By default, OpenMP uses `busy-wait <https://gcc.gnu.org/onlinedocs/libgomp/GOMP_005fSPINCOUNT.html>`__, which can delay the next OpenVINO inference as well.

It is recommended to:

* Limit the number of CPU threads used by Torch with `torch.set_num_threads <https://pytorch.org/docs/stable/generated/torch.set_num_threads.html>`__.
* Set the environment variable `OMP_WAIT_POLICY <https://gcc.gnu.org/onlinedocs/libgomp/OMP_005fWAIT_005fPOLICY.html>`__ to `PASSIVE`, which disables OpenMP `busy-wait <https://gcc.gnu.org/onlinedocs/libgomp/GOMP_005fSPINCOUNT.html>`__, as shown in the sketch below.
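
Below is a minimal sketch of how both settings might be applied before running generation with Optimum Intel. The model ID and the thread count are placeholders, and ``OMP_WAIT_POLICY`` has to be set before the OpenMP runtime is loaded, that is, before Torch is imported:

.. code-block:: python

   import os

   # Assumption: setting OMP_WAIT_POLICY before Torch (and its OpenMP runtime) is
   # loaded disables the OpenMP busy-wait for this process.
   os.environ["OMP_WAIT_POLICY"] = "PASSIVE"

   import torch
   from optimum.intel import OVModelForCausalLM
   from transformers import AutoTokenizer

   # Limit the CPU threads Torch uses for postprocessing (the value is an example).
   torch.set_num_threads(4)

   model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # placeholder model ID
   tokenizer = AutoTokenizer.from_pretrained(model_id)
   model = OVModelForCausalLM.from_pretrained(model_id, export=True)

   inputs = tokenizer("What is OpenVINO?", return_tensors="pt")
   outputs = model.generate(**inputs, max_new_tokens=32)
   print(tokenizer.decode(outputs[0], skip_special_tokens=True))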

Additional Resources
#####################
@@ -187,3 +187,23 @@ are executed in parallel.

For details on multi-stream execution, check the
:doc:`optimization guide <../../optimize-inference/optimizing-throughput/advanced_throughput_options>`.

.. _Composability_of_different_threading_runtimes:

Composability of different threading runtimes
#############################################

By default, OpenVINO is built with the `oneTBB <https://github.com/oneapi-src/oneTBB/>`__ threading library.
oneTBB has a `worker_wait` feature, similar to `OpenMP <https://www.openmp.org/>`__ `busy-wait <https://gcc.gnu.org/onlinedocs/libgomp/GOMP_005fSPINCOUNT.html>`__, which makes OpenVINO inference
threads wait actively for a while after a task is done. The intention is to avoid CPU inactivity in the
transition time between inference tasks.

In a pipeline that runs OpenVINO inference on the CPU along with other sequential application logic, using different threading runtimes (for example, oneTBB for OpenVINO inference
and OpenMP for the rest of the application) causes both runtimes to occupy CPU cores for additional time after their tasks are done, leading to overhead.

Recommended solutions:

- The most effective way is to use oneTBB for all computations in the pipeline.
- Rebuild OpenVINO with OpenMP if the other application logic uses OpenMP.
- Limit the number of threads for OpenVINO and the other parts, and let the OS do the scheduling.
- If the other application logic uses OpenMP, set the environment variable `OMP_WAIT_POLICY <https://gcc.gnu.org/onlinedocs/libgomp/OMP_005fWAIT_005fPOLICY.html>`__ to `PASSIVE` to disable OpenMP `busy-wait <https://gcc.gnu.org/onlinedocs/libgomp/GOMP_005fSPINCOUNT.html>`__, as shown in the sketch below.
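
The following minimal sketch illustrates the last two points, assuming the OpenMP-based part of the pipeline is Torch postprocessing; the model path and the thread counts are placeholders:

.. code-block:: python

   import os

   # Assumption: OMP_WAIT_POLICY must be set before the OpenMP runtime is loaded.
   os.environ["OMP_WAIT_POLICY"] = "PASSIVE"

   import openvino as ov
   import torch

   # Split the CPU threads between OpenVINO inference and the OpenMP-based logic
   # (the numbers and the model path below are examples only).
   core = ov.Core()
   compiled_model = core.compile_model("model.xml", "CPU", {"INFERENCE_NUM_THREADS": 8})
   torch.set_num_threads(4)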
