Workflows profiler #710
Merged
Conversation
PawelPeczek-Roboflow requested review from capjamesg, grzegorz-roboflow, yeldarby, probicheaux and hansent as code owners on October 3, 2024 09:23
grzegorz-roboflow previously approved these changes on Oct 4, 2024
grzegorz-roboflow approved these changes on Oct 4, 2024
Description
PR adding profiling capabilities to the Workflows Execution Engine, along with changes to improve speed.
Traces can be previewed using chrome://tracing/
Changes were also added to the inference_sdk client and InferencePipeline to enable gathering of profiling logs.
PS: there are still plenty of things to improve - compilation can still take 10-25ms, which is a lot if we consider really small models that (on GPU) could run inference in comparable time - in such a scenario Workflows basically slaughters throughput 😢 Hopefully, for larger models / more advanced workflows the overhead as a % of total execution is small, and for video processing the compilation only happens once.
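For readers unfamiliar with the trace format, the snippet below is a minimal, self-contained sketch of how timed spans can be dumped into the Chrome Trace Event JSON that chrome://tracing/ loads. The `SimpleProfiler` / `ProfilerEvent` names and the span structure are illustrative assumptions, not the actual profiler API added in this PR.

```python
import json
import time
from dataclasses import dataclass, field
from typing import List


@dataclass
class ProfilerEvent:
    # Hypothetical shape of a single profiled span - not the PR's actual data model.
    name: str
    start_ts: float  # seconds since epoch
    end_ts: float


@dataclass
class SimpleProfiler:
    # Minimal stand-in for a profiler that records named spans.
    events: List[ProfilerEvent] = field(default_factory=list)

    def record(self, name: str, start_ts: float, end_ts: float) -> None:
        self.events.append(ProfilerEvent(name=name, start_ts=start_ts, end_ts=end_ts))

    def export_chrome_trace(self, path: str) -> None:
        # Chrome Trace Event format: "X" (complete) events with microsecond timestamps.
        trace_events = [
            {
                "name": event.name,
                "ph": "X",
                "ts": event.start_ts * 1e6,
                "dur": (event.end_ts - event.start_ts) * 1e6,
                "pid": 0,
                "tid": 0,
            }
            for event in self.events
        ]
        with open(path, "w") as f:
            json.dump({"traceEvents": trace_events}, f)


if __name__ == "__main__":
    profiler = SimpleProfiler()
    start = time.time()
    time.sleep(0.01)  # pretend this is workflow compilation
    profiler.record("workflow_compilation", start, time.time())
    profiler.export_chrome_trace("trace.json")  # open chrome://tracing/ and load the file
```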
🎥 Inference Pipeline profiling - MacBook Pro
For InferencePipeline, the profiler accumulates information across consecutive processed frames:
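As a rough illustration of that kind of per-frame accumulation (class and method names below are assumptions for the sketch, not the actual InferencePipeline interface):

```python
import time
from collections import defaultdict
from contextlib import contextmanager
from statistics import mean
from typing import DefaultDict, List


class FrameProfiler:
    # Illustrative accumulator: keeps per-stage durations across consecutive frames.
    def __init__(self) -> None:
        self._durations: DefaultDict[str, List[float]] = defaultdict(list)

    @contextmanager
    def measure(self, stage: str):
        start = time.monotonic()
        try:
            yield
        finally:
            self._durations[stage].append(time.monotonic() - start)

    def summary(self) -> dict:
        # Average duration per stage over all processed frames.
        return {stage: mean(values) for stage, values in self._durations.items()}


profiler = FrameProfiler()
for _ in range(5):  # pretend these are consecutive video frames
    with profiler.measure("inference"):
        time.sleep(0.002)
    with profiler.measure("postprocessing"):
        time.sleep(0.001)
print(profiler.summary())
```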
🐎 Speed improvements - caching
As illustrated below, there were two obvious opportunities to improve the speed of the Workflows EE when processing requests in the inference server:
Solutions
Caching Workflows definitions
We no longer always pull the Workflow definition from the API - a memory / Redis cache now stores the definition for 15 minutes. A `use_cache` option is exposed in the request payload to disable cache read / write.
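A minimal sketch of that behaviour, assuming a plain in-memory TTL cache and a hypothetical `fetch_from_api` callable (the real implementation also supports Redis and is keyed / structured differently):

```python
import time
from typing import Callable, Dict, Tuple

WORKFLOW_DEFINITION_TTL = 15 * 60  # seconds - matches the 15 minutes mentioned above
_definition_cache: Dict[str, Tuple[float, dict]] = {}


def get_workflow_definition(
    workflow_id: str,
    fetch_from_api: Callable[[str], dict],
    use_cache: bool = True,
) -> dict:
    # `use_cache=False` skips both cache read and cache write,
    # mirroring the request-payload option described above.
    now = time.monotonic()
    if use_cache:
        cached = _definition_cache.get(workflow_id)
        if cached is not None:
            stored_at, definition = cached
            if now - stored_at < WORKFLOW_DEFINITION_TTL:
                return definition
    definition = fetch_from_api(workflow_id)
    if use_cache:
        _definition_cache[workflow_id] = (now, definition)
    return definition
```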
Caching `pydantic` models for a given set of blocks
Without dynamic blocks, the pool of blocks can only change when the `inference` process is loaded into memory - so for the whole runtime of the server the set of blocks is constant, and the entity used for manifest parsing only needs to be built once. A simple memory cache was added to keep these definitions. Enterprise blocks will change this state of affairs, but even then this simple cache can be quite effective - there will be a limited number of plugins, so with a fairly small cache we should be able to keep all variations of the entity in memory.
This simple caching will not work in the general sense for dynamic blocks.
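For illustration only - a sketch of the simple cache idea, memoizing the dynamically built `pydantic` entity on the tuple of loaded block manifest classes. `build_workflow_definition_entity` and the example manifests are hypothetical names, not the code added in this PR.

```python
from functools import lru_cache
from typing import List, Tuple, Type, Union

from pydantic import BaseModel, create_model


@lru_cache(maxsize=64)
def build_workflow_definition_entity(
    block_manifest_classes: Tuple[Type[BaseModel], ...],
) -> Type[BaseModel]:
    # The pool of blocks is constant for the lifetime of the `inference` process
    # (unless dynamic blocks are used), so the union of step manifests - and hence
    # the parsing entity - can be built once and reused for every request.
    steps_union = Union[block_manifest_classes]  # type: ignore[valid-type]
    return create_model(
        "WorkflowDefinition",
        version=(str, ...),
        inputs=(List[dict], ...),
        steps=(List[steps_union], ...),
        outputs=(List[dict], ...),
    )


class CropManifest(BaseModel):  # illustrative block manifests
    type: str
    name: str


class DetectionManifest(BaseModel):
    type: str
    name: str
    model_id: str


Entity = build_workflow_definition_entity((CropManifest, DetectionManifest))
# Second call with the same set of blocks hits the cache and returns the same class.
assert Entity is build_workflow_definition_entity((CropManifest, DetectionManifest))
```

Keyed this way, the entity is built once per unique set of manifest classes and reused for every subsequent request; a changing set of (dynamic) block classes would keep missing this cache, which is the limitation noted above.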
Results
🍒 Cherry-picked example when Hosted API is faster than self-hosted
This example is not intended to make the point that the hosted platform is faster than self-hosted - it only illustrates a scenario where the workflow hugely benefits from parallel requests AND the hosted platform workers are warm with respect to the required models AND the device self-hosting the server is not powerful enough to run multiple models at the same time.
Results
🏃 Benchmark on Tesla T4 - Workflows with model block vs inference server request
@PacificDou report abbreviated as SR
Model family: yolov8n - different input sizes
It was always the case that the EE was adding ~10-15ms of latency ❗
Type of change
How has this change been tested? Please provide a test case or example of how you tested the change.
Any specific deployment considerations
Docs