
[Tracking issue] [Help wanted]: Multi-step scheduling follow-ups #7528

Open
7 of 17 tasks
comaniac opened this issue Aug 14, 2024 · 5 comments
Labels: help wanted, misc

Comments

comaniac (Collaborator) commented Aug 14, 2024

Co-authored with @SolitaryThinker @Yard1 @rkooo567

We are landing multi-step scheduling (#7000) to amortize scheduling overhead for better ITL (inter-token latency) and throughput. Since the first version of multi-step scheduling does not work with some existing features, this issue tracks the progress of supporting them so that multi-step scheduling can become a common, practical feature in vLLM.
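For context, multi-step scheduling runs the scheduler once and then executes several model steps against that schedule. A minimal usage sketch, assuming the `num_scheduler_steps` engine argument that #7000 introduces (the exact knob may differ):

```python
from vllm import LLM, SamplingParams

# Assumption: `num_scheduler_steps` is the multi-step knob from #7000.
# Run up to 8 model steps per scheduler invocation to amortize
# scheduling overhead across decode iterations.
llm = LLM(model="facebook/opt-125m", num_scheduler_steps=8)

outputs = llm.generate(
    ["The capital of France is"],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```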

Performance

Chunked Prefill

It is tricky for multi-step scheduling to work with chunked prefill, for the following reasons:

  1. Chunked prefill schedules prefill and decode requests into the same batch.
  2. Prefill requests only need a few steps (at most ceil(prompt_tokens / chunk_size) steps), which can be far fewer than the configured number of steps (e.g., 8); see the quick calculation after this list.
  3. We cannot turn a prefill request into a decode request without re-scheduling and re-preparing inputs.
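
To make point 2 concrete, a quick back-of-the-envelope check (all numbers are illustrative):

```python
import math

prompt_tokens = 1500     # illustrative prompt length
chunk_size = 512         # illustrative chunked-prefill token budget
configured_steps = 8     # configured multi-step count (k)

# A prefill needs at most ceil(prompt_tokens / chunk_size) steps.
prefill_steps = math.ceil(prompt_tokens / chunk_size)
print(prefill_steps)  # 3

# The prefill is done after 3 of the 8 steps, but it cannot become a
# decode request mid-window (point 3 above), so it would idle for the
# remaining 5 steps unless the batch is re-scheduled.
```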

As a result, we need a scheduling policy for handling prefill requests in multi-step scheduling. Here are two possible policies we could consider at this moment:

  1. Force Single Step: Force a single step whenever there are prefill requests in the batch. This may work well for offline batching, but not for online serving, because new requests keep arriving.
  2. Ignore Prefill: Ignore prefill requests from the second step onward, meaning that a prefill request does nothing for the remaining (k-1) steps. This may work better for online serving.

Since no single scheduling policy works well for all scenarios, it is better to implement both approaches and let users choose between them via configuration. Moreover, because we may come up with better policies in the future, these policies should be pluggable; a rough sketch follows below.
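
A rough sketch of what such a pluggable hook could look like (class and method names are hypothetical, not an existing vLLM interface):

```python
from abc import ABC, abstractmethod


class MultiStepPrefillPolicy(ABC):
    """Hypothetical hook: decide how many steps to run for a batch."""

    @abstractmethod
    def num_steps(self, has_prefill: bool, configured_steps: int) -> int:
        ...


class ForceSingleStep(MultiStepPrefillPolicy):
    """Policy 1: fall back to a single step when a prefill is present,
    so the prefill can turn into a decode at the very next schedule."""

    def num_steps(self, has_prefill: bool, configured_steps: int) -> int:
        return 1 if has_prefill else configured_steps


class IgnorePrefill(MultiStepPrefillPolicy):
    """Policy 2: always run the configured k steps; prefill requests
    simply idle after their first step until the next scheduling round."""

    def num_steps(self, has_prefill: bool, configured_steps: int) -> int:
        return configured_steps


# The scheduler would consult the configured policy once per batch:
policy: MultiStepPrefillPolicy = ForceSingleStep()
assert policy.num_steps(has_prefill=True, configured_steps=8) == 1
assert policy.num_steps(has_prefill=False, configured_steps=8) == 8
```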

The action items are:

Misc

Functionality

comaniac added the misc and help wanted labels on Aug 14, 2024
@robertgshaw2-neuralmagic (Collaborator)

SolitaryThinker (Contributor) commented Aug 19, 2024

Additions for tracking. I will take up both of these. cc @zhuohan123

rkooo567 (Collaborator) commented

I think we can also try making it work with the new SPMD architecture, which can simplify the code and improve performance, especially for PP (pipeline parallelism).

SolitaryThinker (Contributor) commented

  • ADAG / SPMD integration
  • LoRA support

vrdn-23 (Contributor) commented Oct 23, 2024

Is multi-step scheduling not supported with LoRA at all? Does that mean any incoming LoRA requests bypass multi-step scheduling?
cc @SolitaryThinker
