@mayabar mayabar commented Oct 30, 2025

Problem: When handling requests with stream=true, the worker is returned to the pool of free workers before streaming completes. As a result, the same worker can be assigned additional requests while it is still streaming, and the decode time is calculated incorrectly.

Solution: Run processRequest in a separate goroutine, and use a WaitGroup to ensure that the worker is not released until all streaming chunks have been sent.
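
For illustration, here is a minimal sketch of the pattern, not the code from this PR. Only the name processRequest comes from the description above; the request and worker types, the handle method, and the free and sent channels are hypothetical stand-ins. The worker is pushed back to the free pool only after wg.Wait() returns, that is, after the last chunk has been sent.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// request is a hypothetical stand-in for a simulator request.
type request struct {
	id     int
	chunks int // number of streaming chunks to send
}

// worker is a hypothetical stand-in for a pooled worker.
type worker struct {
	id int
}

// processRequest sends every chunk of the response and calls wg.Done
// only after the last chunk has been sent.
func processRequest(req request, wg *sync.WaitGroup, sent chan<- string) {
	defer wg.Done()
	for i := 0; i < req.chunks; i++ {
		time.Sleep(time.Millisecond) // simulated per-chunk decode time
		sent <- fmt.Sprintf("req %d chunk %d", req.id, i)
	}
}

// handle runs processRequest in its own goroutine and blocks on the
// WaitGroup, so the worker re-enters the free pool only once
// streaming has finished.
func (w *worker) handle(req request, free chan<- *worker, sent chan<- string) {
	var wg sync.WaitGroup
	wg.Add(1)
	go processRequest(req, &wg, sent)
	wg.Wait()
	free <- w
}

func main() {
	free := make(chan *worker, 1)
	sent := make(chan string, 8)

	w := &worker{id: 1}
	w.handle(request{id: 7, chunks: 3}, free, sent)

	close(sent)
	for s := range sent {
		fmt.Println(s)
	}
	fmt.Println("worker released:", (<-free).id)
}
```

Because handle returns only after the WaitGroup is done, any timing taken around it also covers the full streaming period, which is what a decode-time measurement needs.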

Signed-off-by: Maya Barnea <mayab@il.ibm.com>
@mayabar mayabar requested a review from shmuelk October 30, 2025 12:18
…lculations for requests in streaming mode

Signed-off-by: Maya Barnea <mayab@il.ibm.com>

shmuelk commented Oct 30, 2025

/lgtm
/approve

@github-actions github-actions bot added the lgtm label Oct 30, 2025
@github-actions github-actions bot merged commit 658e3e5 into llm-d:main Oct 30, 2025
4 checks passed
@mayabar mayabar deleted the streaming-in-queue branch November 4, 2025 09:13