[DOC] Composability of different threading runtimes (openvinotoolkit#26950)

### Details:
- *Document composability of different threading runtimes when running inference and other application logic on the CPU device*
 - *Document the threading impact for LLMs with the Optimum Intel API*

### Tickets:
 - *CVS-150542, CVS-145996*

---------

Signed-off-by: Chen, Peter <peter.chen@intel.com>
Co-authored-by: Tatiana Savina <tatiana.savina@intel.com>
2 people authored and CuriousPanCake committed Nov 6, 2024
1 parent 3398f28 commit 41e5f48
Showing 2 changed files with 33 additions and 0 deletions.
@@ -301,6 +301,19 @@ model to avoid extra computation. This is how it can be done for LLMs:
Now the model can be converted to OpenVINO using Optimum Intel Python API or CLI interfaces
mentioned above.

Execution on CPU device
##########################

As mentioned in the :ref:`Composability of different threading runtimes <Composability_of_different_threading_runtimes>` section, OpenVINO's default threading runtime,
oneTBB, keeps CPU cores active for a while after inference is done. The Optimum Intel Python API
calls Torch (via Hugging Face Transformers) for postprocessing, such as beam search or greedy search.
Torch uses OpenMP for threading, and OpenMP has to wait for the CPU cores that oneTBB keeps active.
By default, OpenMP uses `busy-wait <https://gcc.gnu.org/onlinedocs/libgomp/GOMP_005fSPINCOUNT.html>`__, which can delay the next OpenVINO inference as well.

It is recommended to:

* Limit the number of CPU threads used by Torch with `torch.set_num_threads <https://pytorch.org/docs/stable/generated/torch.set_num_threads.html>`__.
* Set the environment variable `OMP_WAIT_POLICY <https://gcc.gnu.org/onlinedocs/libgomp/OMP_005fWAIT_005fPOLICY.html>`__ to `PASSIVE`, which disables OpenMP `busy-wait <https://gcc.gnu.org/onlinedocs/libgomp/GOMP_005fSPINCOUNT.html>`__, as shown in the sketch below.
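
Below is a minimal sketch of how both settings might be applied before running generation with Optimum Intel. The model ID and the thread count are placeholders, and ``OMP_WAIT_POLICY`` has to be set before the OpenMP runtime is loaded, that is, before Torch is imported:

.. code-block:: python

   import os

   # Assumption: setting OMP_WAIT_POLICY before Torch (and its OpenMP runtime) is
   # loaded disables the OpenMP busy-wait for this process.
   os.environ["OMP_WAIT_POLICY"] = "PASSIVE"

   import torch
   from optimum.intel import OVModelForCausalLM
   from transformers import AutoTokenizer

   # Limit the CPU threads Torch uses for postprocessing (the value is an example).
   torch.set_num_threads(4)

   model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # placeholder model ID
   tokenizer = AutoTokenizer.from_pretrained(model_id)
   model = OVModelForCausalLM.from_pretrained(model_id, export=True)

   inputs = tokenizer("What is OpenVINO?", return_tensors="pt")
   outputs = model.generate(**inputs, max_new_tokens=32)
   print(tokenizer.decode(outputs[0], skip_special_tokens=True))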

Additional Resources
#####################
@@ -187,3 +187,23 @@ are executed in parallel.

For details on multi-stream execution, check the
:doc:`optimization guide <../../optimize-inference/optimizing-throughput/advanced_throughput_options>`.

.. _Composability_of_different_threading_runtimes:

Composability of different threading runtimes
#############################################

By default, OpenVINO is built with the `oneTBB <https://github.com/oneapi-src/oneTBB/>`__ threading library.
oneTBB has a `worker_wait` feature, similar to `OpenMP <https://www.openmp.org/>`__ `busy-wait <https://gcc.gnu.org/onlinedocs/libgomp/GOMP_005fSPINCOUNT.html>`__, which makes OpenVINO inference
threads wait actively for a while after a task is done. The intention is to avoid CPU inactivity in the
transition time between inference tasks.

In a pipeline that runs OpenVINO inference on the CPU along with other sequential application logic, using different threading runtimes (for example, oneTBB for OpenVINO inference
and OpenMP for the rest of the application) causes both runtimes to occupy CPU cores for additional time after their tasks are done, leading to overhead.

Recommended solutions:

- The most effective way is to use oneTBB for all computations in the pipeline.
- Rebuild OpenVINO with OpenMP if the other application logic uses OpenMP.
- Limit the number of threads for OpenVINO and the other parts, and let the OS do the scheduling.
- If the other application logic uses OpenMP, set the environment variable `OMP_WAIT_POLICY <https://gcc.gnu.org/onlinedocs/libgomp/OMP_005fWAIT_005fPOLICY.html>`__ to `PASSIVE` to disable OpenMP `busy-wait <https://gcc.gnu.org/onlinedocs/libgomp/GOMP_005fSPINCOUNT.html>`__, as shown in the sketch below.
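
The following minimal sketch illustrates the last two points, assuming the OpenMP-based part of the pipeline is Torch postprocessing; the model path and the thread counts are placeholders:

.. code-block:: python

   import os

   # Assumption: OMP_WAIT_POLICY must be set before the OpenMP runtime is loaded.
   os.environ["OMP_WAIT_POLICY"] = "PASSIVE"

   import openvino as ov
   import torch

   # Split the CPU threads between OpenVINO inference and the OpenMP-based logic
   # (the numbers and the model path below are examples only).
   core = ov.Core()
   compiled_model = core.compile_model("model.xml", "CPU", {"INFERENCE_NUM_THREADS": 8})
   torch.set_num_threads(4)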
