Workflows profiler #710
Merged
Conversation
PawelPeczek-Roboflow requested review from capjamesg, grzegorz-roboflow, yeldarby, probicheaux and hansent as code owners on October 3, 2024 09:23
grzegorz-roboflow previously approved these changes on Oct 4, 2024
grzegorz-roboflow approved these changes on Oct 4, 2024
Description
PR adding profiling capabilities to the Workflows Execution Engine, along with changes to improve speed.
Traces can be previewed using chrome://tracing/
Changes were also added to the inference_sdk client and InferencePipeline to enable gathering of profiling logs.
PS: there are still plenty of things to improve - compilation can still take 10-25ms, which is a lot if we consider really small models that (on GPU) could run inference in comparable time - in such a scenario Workflows basically slaughters throughput 😢 Hopefully, for larger models / more advanced workflows the overhead as a % of total execution is small, and for video processing the compilation only happens once.
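For readers unfamiliar with the trace format, the snippet below is a minimal, self-contained sketch of how timed spans can be dumped into the Chrome Trace Event JSON that chrome://tracing/ loads. The `SimpleProfiler` / `ProfilerEvent` names and the span structure are illustrative assumptions, not the actual profiler API added in this PR.

```python
import json
import time
from dataclasses import dataclass, field
from typing import List


@dataclass
class ProfilerEvent:
    # Hypothetical shape of a single profiled span - not the PR's actual data model.
    name: str
    start_ts: float  # seconds since epoch
    end_ts: float


@dataclass
class SimpleProfiler:
    # Minimal stand-in for a profiler that records named spans.
    events: List[ProfilerEvent] = field(default_factory=list)

    def record(self, name: str, start_ts: float, end_ts: float) -> None:
        self.events.append(ProfilerEvent(name=name, start_ts=start_ts, end_ts=end_ts))

    def export_chrome_trace(self, path: str) -> None:
        # Chrome Trace Event format: "X" (complete) events with microsecond timestamps.
        trace_events = [
            {
                "name": event.name,
                "ph": "X",
                "ts": event.start_ts * 1e6,
                "dur": (event.end_ts - event.start_ts) * 1e6,
                "pid": 0,
                "tid": 0,
            }
            for event in self.events
        ]
        with open(path, "w") as f:
            json.dump({"traceEvents": trace_events}, f)


if __name__ == "__main__":
    profiler = SimpleProfiler()
    start = time.time()
    time.sleep(0.01)  # pretend this is workflow compilation
    profiler.record("workflow_compilation", start, time.time())
    profiler.export_chrome_trace("trace.json")  # open chrome://tracing/ and load the file
```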
🎥 Inference Pipeline profiling - MacBook Pro
For InferencePipeline, the profiler accumulates information across consecutive processed frames:
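As a rough illustration of that kind of per-frame accumulation (class and method names below are assumptions for the sketch, not the actual InferencePipeline interface):

```python
import time
from collections import defaultdict
from contextlib import contextmanager
from statistics import mean
from typing import DefaultDict, List


class FrameProfiler:
    # Illustrative accumulator: keeps per-stage durations across consecutive frames.
    def __init__(self) -> None:
        self._durations: DefaultDict[str, List[float]] = defaultdict(list)

    @contextmanager
    def measure(self, stage: str):
        start = time.monotonic()
        try:
            yield
        finally:
            self._durations[stage].append(time.monotonic() - start)

    def summary(self) -> dict:
        # Average duration per stage over all processed frames.
        return {stage: mean(values) for stage, values in self._durations.items()}


profiler = FrameProfiler()
for _ in range(5):  # pretend these are consecutive video frames
    with profiler.measure("inference"):
        time.sleep(0.002)
    with profiler.measure("postprocessing"):
        time.sleep(0.001)
print(profiler.summary())
```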
🐎 Speed improvements - caching
As illustrated below, there were two obvious opportunities to improve the speed of the Workflows EE when processing requests in the inference server:
Solutions
Caching Workflows definitions
We no longer always pull the Workflow definition from the API - a memory / Redis cache now stores the definition for 15 minutes. A `use_cache` option is exposed in the request payload to disable cache read / write.
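A minimal sketch of that behaviour, assuming a plain in-memory TTL cache and a hypothetical `fetch_from_api` callable (the real implementation also supports Redis and is keyed / structured differently):

```python
import time
from typing import Callable, Dict, Tuple

WORKFLOW_DEFINITION_TTL = 15 * 60  # seconds - matches the 15 minutes mentioned above
_definition_cache: Dict[str, Tuple[float, dict]] = {}


def get_workflow_definition(
    workflow_id: str,
    fetch_from_api: Callable[[str], dict],
    use_cache: bool = True,
) -> dict:
    # `use_cache=False` skips both cache read and cache write,
    # mirroring the request-payload option described above.
    now = time.monotonic()
    if use_cache:
        cached = _definition_cache.get(workflow_id)
        if cached is not None:
            stored_at, definition = cached
            if now - stored_at < WORKFLOW_DEFINITION_TTL:
                return definition
    definition = fetch_from_api(workflow_id)
    if use_cache:
        _definition_cache[workflow_id] = (now, definition)
    return definition
```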
Caching `pydantic` models for a given set of blocks
Without dynamic blocks, the pool of blocks can only change when the `inference` process is loaded into memory - so for the whole runtime of the server the set of blocks is constant, and the entity used for manifest parsing only needs to be built once. A simple memory cache was added to keep these definitions. Enterprise blocks will change this state of affairs, but even then this simple cache can be quite effective - there will be a limited number of plugins, so with a fairly small cache we should be able to keep all variations of the entity in memory.
This simple caching will not work in the general sense for dynamic blocks.
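For illustration only - a sketch of the simple cache idea, memoizing the dynamically built `pydantic` entity on the tuple of loaded block manifest classes. `build_workflow_definition_entity` and the example manifests are hypothetical names, not the code added in this PR.

```python
from functools import lru_cache
from typing import List, Tuple, Type, Union

from pydantic import BaseModel, create_model


@lru_cache(maxsize=64)
def build_workflow_definition_entity(
    block_manifest_classes: Tuple[Type[BaseModel], ...],
) -> Type[BaseModel]:
    # The pool of blocks is constant for the lifetime of the `inference` process
    # (unless dynamic blocks are used), so the union of step manifests - and hence
    # the parsing entity - can be built once and reused for every request.
    steps_union = Union[block_manifest_classes]  # type: ignore[valid-type]
    return create_model(
        "WorkflowDefinition",
        version=(str, ...),
        inputs=(List[dict], ...),
        steps=(List[steps_union], ...),
        outputs=(List[dict], ...),
    )


class CropManifest(BaseModel):  # illustrative block manifests
    type: str
    name: str


class DetectionManifest(BaseModel):
    type: str
    name: str
    model_id: str


Entity = build_workflow_definition_entity((CropManifest, DetectionManifest))
# Second call with the same set of blocks hits the cache and returns the same class.
assert Entity is build_workflow_definition_entity((CropManifest, DetectionManifest))
```

Keyed this way, the entity is built once per unique set of manifest classes and reused for every subsequent request; a changing set of (dynamic) block classes would keep missing this cache, which is the limitation noted above.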
Results
🍒 Cherry-picked example when Hosted API is faster than self-hosted
This example is not intended to make the point that the hosted platform is faster than self-hosted - it only illustrates a scenario where the workflow hugely benefits from parallel requests AND the hosted platform workers are warm with respect to the required models AND the device self-hosting the server is not powerful enough to run multiple models at the same time.
Results
🏃 Benchmark on Tesla T4 - Workflows with model block vs inference server request
@PacificDou report abbreviated as SR
Model family: yolov8n - different input sizes
It was always the case that the EE was adding ~10-15ms of latency ❗
Type of change
How has this change been tested? Please provide a test case or example of how you tested the change.
Any specific deployment considerations
Docs