[DOC] Composability of different threading runtimes #26950
Conversation
Signed-off-by: Chen, Peter <peter.chen@intel.com>
.. _Inference_threads_wait_actively:

Inference threads wait actively
Since this is a generic threading tips page, I would ask you to formalize the description in a more formal way:
- The case you are describing is an example of serial composability of different threading runtimes, so that should be the name of the section.
- We need to formalize the application we are focusing on: a pipeline with multiple OV inferences interleaved with some other application logic (maybe calls to another library) executed sequentially.
- We need to describe the reason for the performance issues in that scenario. You already shared some info about the active search for work, which takes CPU resources. It is worth explicitly mentioning that this is true for both TBB and OMP, so thread migration between the two areas will happen twice per pipeline iteration.
- 1ms is very specific to a particular library - I would avoid detailed numbers.
- Let's describe all possible solutions (a sketch of the scenario follows below):
  1. The most effective is to use oneTBB for all computations made in the pipeline.
  2. Rebuilding OV with OMP from source is another option.
  3. Limit the number of threads / disable pinning for OV or other parts of the pipeline to let the OS do better scheduling.
  4. In case the second runtime is OMP, the user can set OMP_WAIT_POLICY=PASSIVE to minimize the perf gap on the OMP->TBB runtime switch.

@wangleis Do you have anything to add?
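To make the scenario concrete, here is a minimal Python sketch of such a pipeline. The model path `model.xml`, the input shape, and the `other_logic` stand-in for an OpenMP-backed step are all hypothetical, for illustration only:

```python
# Sketch of serial composability: OpenVINO (oneTBB) inference interleaved
# with application logic backed by another threading runtime (e.g. OpenMP).
import numpy as np
import openvino as ov

core = ov.Core()
# "model.xml" and the input shape below are placeholders.
compiled = core.compile_model(core.read_model("model.xml"), "CPU")

def other_logic(x):
    # Stand-in for a call into an OpenMP-backed library. After it returns,
    # its worker threads keep spinning for a while, just as oneTBB workers
    # do after inference, so CPU resources migrate twice per iteration.
    return np.tanh(x)

data = np.random.rand(1, 3, 224, 224).astype(np.float32)
for _ in range(100):                              # sequential pipeline iterations
    result = compiled(data)[compiled.output(0)]   # oneTBB threads spin actively after this call
    other_logic(result)                           # OpenMP threads spin actively after this call
```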
LGTM, Thanks.
##########################

As mentioned in :ref:`Inference threads wait actively <Inference_threads_wait_actively>`, the OpenVINO default threading library,
oneTBB, keeps CPU cores active for 1ms after inference is done. When using the Optimum Intel Python API,
Same: 1ms is very specific to a particular library - I would avoid detailed numbers.
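For context, here is a minimal, hypothetical sketch of the Optimum Intel usage the quoted passage refers to; the model id is a placeholder, not a real checkpoint. The same oneTBB active-wait behavior applies between calls:

```python
# Hypothetical sketch: LLM inference through Optimum Intel on CPU. Between
# generate() calls, oneTBB worker threads keep spinning for a short while.
from optimum.intel import OVModelForCausalLM
from transformers import AutoTokenizer

model_id = "some-org/some-llm"  # placeholder model id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = OVModelForCausalLM.from_pretrained(model_id, export=True)

inputs = tokenizer("What is OpenVINO?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=16)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```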
By default, OpenVINO is built with the `oneTBB <https://github.com/oneapi-src/oneTBB/>`__ threading library.
oneTBB has a worker_wait feature, similar to `OpenMP <https://www.openmp.org/>`__ `busy-wait <https://gcc.gnu.org/onlinedocs/libgomp/GOMP_005fSPINCOUNT.html>`__, which makes OpenVINO inference
threads wait actively for a while after a task is done. The intention is to avoid CPU inactivity in the
tranaction time between inference tasks.
tranaction?

Changed to transition
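As a hedged illustration of the wait-policy knob mentioned in the solutions: if the non-OpenVINO part of the pipeline runs on OpenMP, OMP_WAIT_POLICY must be set before that runtime initializes, e.g. at the top of the script. Whether numpy is OpenMP-backed here is an assumption (it depends on the BLAS it was built with, e.g. MKL):

```python
# Sketch: make OpenMP threads sleep instead of busy-waiting. The variable is
# read when the OpenMP runtime first initializes, so set it before importing
# any OpenMP-backed library.
import os
os.environ["OMP_WAIT_POLICY"] = "PASSIVE"

import numpy as np  # imported only after the policy is set (assumed OpenMP-backed BLAS)

a = np.random.rand(2048, 2048)
b = a @ a  # threaded BLAS call; OMP workers now yield instead of spinning
```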
- The most effective way is to use oneTBB for all computations made in the pipeline.
- Rebuild OpenVINO with OpenMP, when the other application logic uses OpenMP as well.
- Limit the number of threads of OpenVINO and the other parts to let the OS do better scheduling.
- Set the environment variable `OMP_WAIT_POLICY <https://gcc.gnu.org/onlinedocs/libgomp/OMP_005fWAIT_005fPOLICY.html>`__ to PASSIVE, which will disable OpenMP `busy-wait <https://gcc.gnu.org/onlinedocs/libgomp/GOMP_005fSPINCOUNT.html>`__.
Still need to mention that the other part of the application should use OMP underneath.

Updated. Please review again.
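A sketch of the thread-limiting option, using the OpenVINO Python properties API (the model path and thread count are placeholders, and property names may vary between releases):

```python
# Sketch: cap OpenVINO's CPU threads and disable pinning so the OS can
# schedule oneTBB and the other runtime's threads side by side.
import openvino as ov
import openvino.properties as props
import openvino.properties.hint as hints

core = ov.Core()
compiled = core.compile_model(
    core.read_model("model.xml"),         # placeholder model path
    "CPU",
    {
        props.inference_num_threads: 8,   # leave cores for the other runtime
        hints.enable_cpu_pinning: False,  # let the OS migrate threads freely
    },
)
```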
Co-authored-by: Tatiana Savina <tatiana.savina@intel.com>
[DOC] Composability of different threading runtimes (#26950)

Details:
- Document composability of different threading runtimes when running inferences and other application logic on CPU device
- Document threading impact for LLM with Optimum Intel API

Tickets:
- CVS-150542, CVS-145996

Signed-off-by: Chen, Peter <peter.chen@intel.com>
Co-authored-by: Tatiana Savina <tatiana.savina@intel.com>