[V1] Avoid redundant input processing in n>1 case #14985
Conversation
Follow-on streamlining of the V1 parallel sampling implementation. Includes removing the unnecessary/unused request_id arg from tokenizer encode methods.

Signed-off-by: Nick Hill <nhill@redhat.com>
vllm/v1/engine/llm_engine.py
Outdated
```python
parent_req = ParentRequest(request_id, params)
for idx in range(n):
    request_id, params = parent_req.get_child_info(idx)
    child_req = request if idx == n - 1 else copy(request)
```
are we okay with shallow copy here?
Yes, since the EngineCoreRequest is immediately serialized. Even in the in-proc case, none of the contents are mutated.
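To illustrate why a shallow copy is safe here, a minimal sketch using a hypothetical `Req` dataclass as a stand-in for EngineCoreRequest (not the real class): rebinding a field on the copy does not affect the original, while unmutated fields are simply shared.

```python
from copy import copy
from dataclasses import dataclass

# Hypothetical stand-in for EngineCoreRequest, for illustration only.
@dataclass
class Req:
    request_id: str
    prompt_token_ids: list

original = Req("parent", [1, 2, 3])
child = copy(original)           # shallow copy: fields are shared, not duplicated
child.request_id = "parent_0"    # rebinding a field only affects the copy

assert original.request_id == "parent"
assert child.prompt_token_ids is original.prompt_token_ids  # shared reference
```

This holds only as long as neither side mutates shared containers in place, which matches the comment above: the contents are serialized and never mutated.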
Signed-off-by: Nick Hill <nhill@redhat.com>
This pull request has merge conflicts that must be resolved before it can be merged.
afeldman-nm left a comment:
Thanks, it will be great to eliminate input processing redundancy. Mostly just nits in terms of feedback.
vllm/v1/engine/llm_engine.py
Outdated
```python
# 3) Make a new RequestState and queue.
self.output_processor.add_request(child_req, parent_req, idx)
# 3) Add the request to EngineCore.
self.engine_core.add_request(child_req)
```
Nit: it seems we are not enumerating processing steps in the prior comments, so the "3)" can be eliminated.
vllm/v1/engine/async_llm.py
Outdated
```python
parent_req = ParentRequest(request_id, params)
for idx in range(n):
    request_id, params = parent_req.get_child_info(idx)
    child_req = request if idx == n - 1 else copy(request)
```
Nit: why not

```diff
-child_req = request if idx == n - 1 else copy(request)
+child_req = request if idx == 0 else copy(request)
```
? Unless there is a specific reason to pass the original request data structure to child n-1, why not simplify by passing it to child 0?
Just being defensive again: we are dispatching each copy of the request, so it is potentially better not to copy the ones that have already been sent.
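The trade-off being discussed can be sketched as a small fan-out helper (a simplified illustration, not the vLLM implementation): by reusing the original object for the *last* child, only `n - 1` copies are made, and every object handed out earlier is a copy that can no longer alias the original.

```python
from copy import copy

def fan_out(request, n):
    # Reuse the original object for the final child so only n-1 copies
    # are made; earlier children get copies, since the original is still
    # needed while they are being dispatched.
    children = []
    for idx in range(n):
        child = request if idx == n - 1 else copy(request)
        children.append(child)
    return children

original = {"request_id": "r"}
children = fan_out(original, 3)

assert children[-1] is original                       # last child reuses the original
assert all(c is not original for c in children[:-1])  # earlier children are copies
```

Passing the original to child 0 instead would work too, but then the original would already be in flight while later copies are made from it, which is what the "defensive" comment above is about.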
vllm/v1/engine/async_llm.py
Outdated
```python
if n == 1:
    await self._add_request(request, None, 0, queue)
else:
```
Not sure if worthwhile, but perhaps you could rewrite this section to make the else redundant:
```python
if n == 1:
    await self._add_request(request, None, 0, queue)
    return queue

# Fan out child requests (for n>1).
parent_req = ParentRequest(request_id, params)
for idx in range(n):
    request_id, params = parent_req.get_child_info(idx)
    child_req = request if idx == n - 1 else copy(request)
    child_req.request_id = request_id
    child_req.sampling_params = params
    await self._add_request(child_req, parent_req, idx, queue)
return queue
```
I like this because it expresses that there is a "quick exit" when n==1 and a longer process otherwise.
```python
# Process raw inputs into the request.
request = self.processor.process_inputs(request_id, prompt, params,
                                        arrival_time, lora_request,
                                        trace_headers,
                                        prompt_adapter_request,
                                        priority)
```
Nit: kwargs?
I just moved the call which was already there :)
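For context on the kwargs nit, a quick sketch of the style being suggested, using a hypothetical function with a signature shaped like `process_inputs` (the parameter names mirror the call above, but this is not the real vLLM API): keyword arguments make a long call site self-documenting and order-independent.

```python
# Hypothetical signature mirroring the process_inputs call, for illustration.
def process_inputs(request_id, prompt, params, arrival_time=None,
                   lora_request=None, trace_headers=None,
                   prompt_adapter_request=None, priority=0):
    return {"request_id": request_id, "priority": priority}

# Keyword arguments name each value at the call site:
request = process_inputs(
    request_id="req-0",
    prompt="Hello",
    params={"n": 2},
    arrival_time=0.0,
    priority=1,
)

assert request == {"request_id": "req-0", "priority": 1}
```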
```python
# Convert Input --> Request.
request = self.processor.process_inputs(request_id, prompt, params,
                                        arrival_time, lora_request,
                                        trace_headers,
                                        prompt_adapter_request,
                                        priority)
```
Nit: kwargs?
same as above
vllm/v1/engine/async_llm.py
Outdated
```python
parent_req = ParentRequest(request_id, params)
for idx in range(n):
    request_id, params = parent_req.get_child_info(idx)
    child_req = request if idx == n - 1 else copy(request)
```
Nit: maybe child_request instead of child_req, if you prefer not to use abbreviations?
Sure. I would typically abbreviate if it helped to avoid line wrapping.
vllm/v1/engine/async_llm.py
Outdated
```python
else:
    # Fan out child requests (for n>1).
    parent_req = ParentRequest(request_id, params)
```
Nit: maybe parent_request, if you prefer not to use abbreviations?
vllm/v1/engine/llm_engine.py
Outdated
```python
else:
    # Fan out child requests (for n>1).
    parent_req = ParentRequest(request_id, params)
```
Nit: maybe parent_request, if you prefer not to use abbreviations?
vllm/v1/engine/llm_engine.py
Outdated
```python
parent_req = ParentRequest(request_id, params)
for idx in range(n):
    request_id, params = parent_req.get_child_info(idx)
    child_req = request if idx == n - 1 else copy(request)
```
Nit: maybe child_request, if you prefer not to use abbreviations?
…eproc # Conflicts: # vllm/v1/engine/async_llm.py
Signed-off-by: Nick Hill <nhill@redhat.com>
@njhill Please let me know if this PR is good to merge!
@WoosukKwon yes it's ready, thanks! I don't know why the CI tests have started randomly OOMing.
Signed-off-by: Nick Hill <nhill@redhat.com> Signed-off-by: Louis Ulmer <ulmerlouis@gmail.com>
Signed-off-by: Nick Hill <nhill@redhat.com> Signed-off-by: Mu Huai <tianbowen.tbw@antgroup.com>
Follow-on streamlining of V1 parallel sampling implementation. No need to repeat input processing for each sub-request.
Includes removing the unnecessary/unused request_id arg from tokenizer encode methods.
cc @markmc @afeldman-nm