[BugFix][V1] Fix parallel sampling finishing/aborts #14512
Conversation
- Requests not finishing at the same time - Coalescing outputs from different child requests - Aborting n > 1 requests (parent request id needs to propagate to child requests) Thanks to @himanshujaju for reporting the first two of these bugs Signed-off-by: Nick Hill <nhill@redhat.com>
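For context, parallel sampling with n > 1 fans a parent request out into multiple child requests, and the third bug above is that aborts on the parent ID weren't reaching the children. A minimal sketch of that parent/child ID relationship (names and ID scheme are hypothetical, not vLLM's actual API):

```python
from dataclasses import dataclass, field

@dataclass
class ParentRequest:
    """Hypothetical sketch: tracks child requests for n > 1 sampling."""
    request_id: str
    n: int
    child_requests: set = field(default_factory=set)

    def fan_out(self):
        # Each child gets an id derived from the parent id, so that an
        # abort arriving with the parent id can be propagated to every
        # child request still in flight.
        for i in range(self.n):
            child_id = f"{i}_{self.request_id}"
            self.child_requests.add(child_id)
            yield child_id

parent = ParentRequest("req-0", n=3)
children = list(parent.fan_out())
```

Aborting the parent then reduces to aborting everything left in `parent.child_requests`.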
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs don't trigger a full CI run by default; only a limited subset runs. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either: Add 🚀
self.num_cached_tokens = num_cached_tokens

@classmethod
def new(
This method is now unused
self.prompt = next_output.prompt
self.prompt_token_ids = next_output.prompt_token_ids
self.prompt_logprobs = next_output.prompt_logprobs
This was another preexisting bug
completion.finish_reason = next_completion.finish_reason
completion.stop_reason = next_completion.stop_reason
These were also preexisting omissions
I don't love that we have a higher layer aggregating delta RequestOutputs using this method (not least because it seems I basically missed it completely!)
There's no particular reason why OutputProcessor.process_outputs() couldn't ensure that it has aggregated before pushing the RequestOutput to the per-request queue - it's not like there's any advantage to pushing to the queue early, since this method doesn't yield
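To make the discussion concrete, here is a stripped-down sketch of what "aggregating delta RequestOutputs" means (a simplified stand-in, not vLLM's actual RequestOutput class):

```python
class DeltaOutput:
    """Hypothetical simplified stand-in for a streaming delta output."""

    def __init__(self, text, finish_reason=None, stop_reason=None):
        self.text = text
        self.finish_reason = finish_reason
        self.stop_reason = stop_reason

    def add(self, next_out):
        # Aggregate a later delta into this one: concatenate the text
        # and carry forward the newest finish/stop reasons. (Failing to
        # copy these two fields is the kind of omission fixed in this PR.)
        self.text += next_out.text
        self.finish_reason = next_out.finish_reason
        self.stop_reason = next_out.stop_reason

out = DeltaOutput("Hel")
out.add(DeltaOutput("lo", finish_reason="stop"))
```

If the aggregation step silently drops a field, the consumer never learns the request finished, which is exactly how the "requests not finishing" symptom can manifest.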
This would be tricky and require some care though - I wouldn't attempt my suggestion as part of fixing this bug 👍
Actually this aggregation is a non-deterministic performance optimization/safety-net. If things are running fast enough, i.e. the queue stays drained, then no aggregation will happen; but we have seen circumstances where the queues back up, and in that case it's better to coalesce the messages than to incur the overhead of sending them separately (and the additional asyncio task churn that entails).
We could still do it in the output processor, I guess, but that would mean doing something a bit more custom than an asyncio.Queue (e.g. just using an asyncio.Event).
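The coalescing safety-net described above can be sketched roughly like this (a hypothetical helper built on a plain asyncio.Queue, not the actual vLLM code):

```python
import asyncio

async def get_coalesced(queue, merge):
    """Wait for at least one output, then drain anything already queued,
    merging backed-up items into one result instead of handing them to
    the consumer one by one."""
    out = await queue.get()
    while not queue.empty():
        # Queue has backed up: fold the extra items into the first one.
        out = merge(out, queue.get_nowait())
    return out

async def demo():
    q = asyncio.Queue()
    for chunk in ("a", "b", "c"):  # simulate a backed-up queue
        q.put_nowait(chunk)
    return await get_coalesced(q, lambda x, y: x + y)

result = asyncio.run(demo())
```

When the consumer keeps up, the queue is empty on each call and no merging happens, which matches the "non-deterministic optimization" framing above.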
FWIW I am planning to change this as a follow-on.
It seems the tests for this code path aren't enabled in CI, and when I had been running them locally I wasn't aware I had to set
Suggest we include the following in this PR:
Adding the TODO for now because attempting to enable all of
markmc left a comment
I guess all my comments suggest further refactoring, but none of them should take priority over getting the bug fix merged
vllm/v1/engine/output_processor.py (outdated)
if not outputs:
    return None
request_id = self.parent_req.request_id
finished = not self.parent_req.child_requests
Everything under the "parent_req is not None" case should be encapsulated in parallel_sampling.py IMO - e.g. the parent request ID and finished handling shouldn't be in this file
self.request_states[request_id] = req_state
self.lora_states.add_request(req_state)
if parent_req:
    self.parent_requests[parent_req.request_id] = parent_req
Would this request_id -> ParentRequest mapping make more sense at the AsyncLLM level?
The only place we're using it is in aborts - so we could have this stuff at the AsyncLLM level:
parent = self.parent_requests.pop(request_id, None)
if parent and parent.child_requests:
self.abort_requests(parent.child_requests)
request_ids_to_abort.extend(parent.child_requests)
> Would this request_id -> ParentRequest mapping make more sense at the AsyncLLM level?
Kind of yes, but AsyncLLM doesn't otherwise currently track request-level state, and this logic would then need to be duplicated in AsyncLLM and LLMEngine. So I'm not sure which is best, TBH.
self,
request_ids: list[str],
) -> None:
request_ids: Iterable[str],
These are parent request IDs, but OutputProcessor.add_request() deals with child request IDs - don't love the asymmetry
As per my comment above, it would be good to check we have CI coverage for all three of these if possible:
As discussed with @russellb just now, this is reproducible just running
Can we change this test to use spawn?
Signed-off-by: Nick Hill <nhill@redhat.com>
See #14579 - the issue is in the structured output tests, and we're struggling to reproduce it. You should be able to enable
Signed-off-by: Nick Hill <nhill@redhat.com>
This is a genuine failure in the CI.
Which, the structured output one?
This is the valid failure:
n = 3
max_tokens = 5
max_tokens = 50  # we want some to finish earlier than others
This file isn't actually run in the V1 tests in the buildkite config
See #14579
Sorry, I will get this fixed up; I had intermittent access on a plane yesterday so just pushed without testing.
Signed-off-by: Nick Hill <nhill@redhat.com>
Signed-off-by: Nick Hill <nhill@redhat.com> Signed-off-by: Richard Liu <ricliu@google.com>
Signed-off-by: Nick Hill <nhill@redhat.com> Signed-off-by: Louis Ulmer <ulmerlouis@gmail.com>
Signed-off-by: Nick Hill <nhill@redhat.com>
Signed-off-by: Nick Hill <nhill@redhat.com> Signed-off-by: Mu Huai <tianbowen.tbw@antgroup.com>
Thanks to @himanshujaju for reporting the first two of these bugs in Slack.
cc @markmc @afeldman-nm