Continuous batching for single GPU LLM inference #2628
frontend/server/src/main/java/org/pytorch/serve/wlm/WorkerThread.java
frontend/server/src/main/java/org/pytorch/serve/wlm/BatchAggregator.java
frontend/server/src/main/java/org/pytorch/serve/wlm/ContinuousBatching.java
logger = logging.getLogger(__name__)


class StreamingHandler(BaseHandler):
Should we move this handler to ts_handler/distributed, or move the core function to handler_utils/distributed?
Let's postpone this to a later PR. I want more clarity on the details of the TP implementation first, and to see how much the two overlap, so that we only move the generic part into core.
Codecov Report
@@            Coverage Diff             @@
##           master    #2628      +/-   ##
==========================================
+ Coverage   71.34%   72.39%   +1.05%
==========================================
  Files          85       85
  Lines        3905     3956      +51
  Branches       58       58
==========================================
+ Hits         2786     2864      +78
+ Misses       1115     1088      -27
  Partials        4        4
…atching_for_streaming
LGTM
Description
This PR enables continuous batching for LLM inference by adding a new batch aggregator that keeps each job in the batch until it has finished.
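
To illustrate the idea, here is a minimal Python sketch of an aggregator that carries unfinished jobs over from one decoding step to the next. It is not the code from this PR (the actual aggregator lives in the Java frontend). The names `Job`, `ContinuousBatchAggregator`, `decode_step`, and `run` are hypothetical, and refilling free slots from a waiting queue is an assumption about how such an aggregator would typically behave.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Job:
    """Hypothetical request that needs `tokens_left` more decoding steps."""

    job_id: str
    tokens_left: int
    output: list = field(default_factory=list)

    @property
    def finished(self) -> bool:
        return self.tokens_left <= 0


class ContinuousBatchAggregator:
    """Illustrative aggregator: unfinished jobs stay in the batch between steps."""

    def __init__(self, batch_size: int):
        self.batch_size = batch_size
        self.queue = deque()  # jobs waiting for a free slot
        self.batch = []       # jobs currently being decoded

    def submit(self, job: Job) -> None:
        self.queue.append(job)

    def next_batch(self) -> list:
        # Drop finished jobs and keep the rest in the batch.
        self.batch = [job for job in self.batch if not job.finished]
        # Assumption: free slots are refilled from the waiting queue.
        while self.queue and len(self.batch) < self.batch_size:
            self.batch.append(self.queue.popleft())
        return self.batch


def decode_step(batch: list) -> None:
    """Stand-in for one model forward pass: every job emits one token."""
    for job in batch:
        job.output.append(f"tok{len(job.output)}")
        job.tokens_left -= 1


def run(jobs: list, batch_size: int) -> list:
    """Decode until all jobs finish; return the job ids seen in each step."""
    aggregator = ContinuousBatchAggregator(batch_size)
    for job in jobs:
        aggregator.submit(job)
    history = []
    while True:
        batch = aggregator.next_batch()
        if not batch:
            break
        history.append([job.job_id for job in batch])
        decode_step(batch)
    return history
```

With jobs that need 1, 3, and 2 tokens and a batch size of 2, the short job leaves after the first step and the waiting job takes its slot, while the long job stays in the batch throughout instead of being re-queued.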
Fixes #(issue)
Type of change
Please delete options that are not relevant.
Feature/Issue validation/testing
Please describe the Unit or Integration tests that you ran to verify your changes and relevant result summary. Provide instructions so it can be reproduced.
Please also list any relevant details for your test configuration.
Checklist: