[V1] Avoid redundant input processing in n>1 case #14985
Conversation
Follow-on streamlining of the V1 parallel sampling implementation. Includes removing the unnecessary/unused request_id arg from tokenizer encode methods.

Signed-off-by: Nick Hill <nhill@redhat.com>
vllm/v1/engine/llm_engine.py
Outdated
```python
parent_req = ParentRequest(request_id, params)
for idx in range(n):
    request_id, params = parent_req.get_child_info(idx)
    child_req = request if idx == n - 1 else copy(request)
```
are we okay with shallow copy here?
Yes, since the EngineCoreRequest is immediately serialized. Even in the in-proc case, none of the contents are mutated.
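To illustrate why a shallow copy is safe here, a minimal sketch using a hypothetical `Req` dataclass as a stand-in for EngineCoreRequest (not the real class): rebinding a field on the copy does not affect the original, while unmutated fields are simply shared.

```python
from copy import copy
from dataclasses import dataclass

# Hypothetical stand-in for EngineCoreRequest, for illustration only.
@dataclass
class Req:
    request_id: str
    prompt_token_ids: list

original = Req("parent", [1, 2, 3])
child = copy(original)           # shallow copy: fields are shared, not duplicated
child.request_id = "parent_0"    # rebinding a field only affects the copy

assert original.request_id == "parent"
assert child.prompt_token_ids is original.prompt_token_ids  # shared reference
```

This holds only as long as neither side mutates shared containers in place, which matches the comment above: the contents are serialized and never mutated.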
Signed-off-by: Nick Hill <nhill@redhat.com>
This pull request has merge conflicts that must be resolved before it can be merged.
afeldman-nm left a comment:
Thanks, it will be great to eliminate input processing redundancy. Mostly just nits in terms of feedback.
vllm/v1/engine/llm_engine.py
Outdated
```python
# 3) Make a new RequestState and queue.
self.output_processor.add_request(child_req, parent_req, idx)
# 3) Add the request to EngineCore.
self.engine_core.add_request(child_req)
```
Nit: it seems we are not enumerating processing steps in the prior comments, so the "3)" can be eliminated.
vllm/v1/engine/async_llm.py
Outdated
```python
parent_req = ParentRequest(request_id, params)
for idx in range(n):
    request_id, params = parent_req.get_child_info(idx)
    child_req = request if idx == n - 1 else copy(request)
```
Nit: why not

```diff
-child_req = request if idx == n - 1 else copy(request)
+child_req = request if idx == 0 else copy(request)
```
? Unless there is a specific reason to pass the original request data structure to child n-1, why not simplify by passing it to child 0?
Just being defensive again: we are dispatching each copy of the request, so it is potentially better not to copy the ones that have already been sent.
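The trade-off being discussed can be sketched as a small fan-out helper (a simplified illustration, not the vLLM implementation): by reusing the original object for the *last* child, only `n - 1` copies are made, and every object handed out earlier is a copy that can no longer alias the original.

```python
from copy import copy

def fan_out(request, n):
    # Reuse the original object for the final child so only n-1 copies
    # are made; earlier children get copies, since the original is still
    # needed while they are being dispatched.
    children = []
    for idx in range(n):
        child = request if idx == n - 1 else copy(request)
        children.append(child)
    return children

original = {"request_id": "r"}
children = fan_out(original, 3)

assert children[-1] is original                       # last child reuses the original
assert all(c is not original for c in children[:-1])  # earlier children are copies
```

Passing the original to child 0 instead would work too, but then the original would already be in flight while later copies are made from it, which is what the "defensive" comment above is about.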
vllm/v1/engine/async_llm.py
Outdated
```python
if n == 1:
    await self._add_request(request, None, 0, queue)
else:
```
Not sure if worthwhile, but perhaps you could rewrite this section to make the else redundant:
```python
if n == 1:
    await self._add_request(request, None, 0, queue)
    return queue

# Fan out child requests (for n>1).
parent_req = ParentRequest(request_id, params)
for idx in range(n):
    request_id, params = parent_req.get_child_info(idx)
    child_req = request if idx == n - 1 else copy(request)
    child_req.request_id = request_id
    child_req.sampling_params = params
    await self._add_request(child_req, parent_req, idx, queue)
return queue
```
I like this because it expresses that there is a "quick exit" when n==1 and a longer process otherwise.
```python
# Process raw inputs into the request.
request = self.processor.process_inputs(request_id, prompt, params,
                                        arrival_time, lora_request,
                                        trace_headers,
                                        prompt_adapter_request,
                                        priority)
```
Nit: kwargs?
I just moved the call which was already there :)
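For context on the kwargs nit, a quick sketch of the style being suggested, using a hypothetical function with a signature shaped like `process_inputs` (the parameter names mirror the call above, but this is not the real vLLM API): keyword arguments make a long call site self-documenting and order-independent.

```python
# Hypothetical signature mirroring the process_inputs call, for illustration.
def process_inputs(request_id, prompt, params, arrival_time=None,
                   lora_request=None, trace_headers=None,
                   prompt_adapter_request=None, priority=0):
    return {"request_id": request_id, "priority": priority}

# Keyword arguments name each value at the call site:
request = process_inputs(
    request_id="req-0",
    prompt="Hello",
    params={"n": 2},
    arrival_time=0.0,
    priority=1,
)

assert request == {"request_id": "req-0", "priority": 1}
```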
```python
# Convert Input --> Request.
request = self.processor.process_inputs(request_id, prompt, params,
                                        arrival_time, lora_request,
                                        trace_headers,
                                        prompt_adapter_request,
                                        priority)
```
Nit: kwargs?
same as above
vllm/v1/engine/async_llm.py
Outdated
```python
parent_req = ParentRequest(request_id, params)
for idx in range(n):
    request_id, params = parent_req.get_child_info(idx)
    child_req = request if idx == n - 1 else copy(request)
```
Nit: maybe child_request instead of child_req, if you prefer not to use abbreviations?
Sure. I would typically abbreviate if it helped to avoid line wrapping.
vllm/v1/engine/async_llm.py
Outdated
```python
else:
    # Fan out child requests (for n>1).
    parent_req = ParentRequest(request_id, params)
```
Nit: maybe parent_request, if you prefer not to use abbreviations?
vllm/v1/engine/llm_engine.py
Outdated
```python
else:
    # Fan out child requests (for n>1).
    parent_req = ParentRequest(request_id, params)
```
Nit: maybe parent_request, if you prefer not to use abbreviations?
vllm/v1/engine/llm_engine.py
Outdated
```python
parent_req = ParentRequest(request_id, params)
for idx in range(n):
    request_id, params = parent_req.get_child_info(idx)
    child_req = request if idx == n - 1 else copy(request)
```
Nit: maybe child_request, if you prefer not to use abbreviations?
…eproc # Conflicts: # vllm/v1/engine/async_llm.py
Signed-off-by: Nick Hill <nhill@redhat.com>
@njhill Please let me know if this PR is good to merge!
@WoosukKwon yes it's ready, thanks! I don't know why the CI tests have started randomly OOMing.
Signed-off-by: Nick Hill <nhill@redhat.com> Signed-off-by: Louis Ulmer <ulmerlouis@gmail.com>
Signed-off-by: Nick Hill <nhill@redhat.com> Signed-off-by: Mu Huai <tianbowen.tbw@antgroup.com>
Follow-on streamlining of V1 parallel sampling implementation. No need to repeat input processing for each sub-request.
Includes removing the unnecessary/unused request_id arg from tokenizer encode methods.
cc @markmc @afeldman-nm