Add experimental Dual-Batch Overlap mechanism to VLLM #20448
Closed · SageMoore wants to merge 124 commits into vllm-project:main from neuralmagic:sage/dbo-eager-decode-only
Conversation
> This pull request has merge conflicts that must be resolved before it can be merged.

**SageMoore** (Contributor, Author): I'm reverting this back to a draft state while I break off various components and merge them separately. The first will be #21153.
This PR was done in collaboration with @LucasWilkinson
This PR adds experimental support for Dual-Batch Overlap in vLLM. In its current state, it will only be enabled when a user provides the `--enable-microbatching` flag. Furthermore, it will only be used when all DP groups are running full-decode batches.

This PR is purely infrastructural, meaning it's slow. We will attempt to improve performance, assuming this approach is accepted, in follow-on PRs. The immediate next step is to add support for full cudagraphs.
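As an illustration of the "full-decode batches on all DP groups" condition, here is a hypothetical helper (our own sketch, not the PR's actual gating code), assuming a pure-decode batch is one where every request contributes exactly one new token:

```python
# Hypothetical check (not vLLM code): DBO only applies when every
# data-parallel rank is running a pure-decode batch, i.e. each request
# on each rank is generating exactly one new token this step.
def all_ranks_decode_only(tokens_per_rank: list[list[int]]) -> bool:
    """tokens_per_rank[r] holds the new-token count per request on rank r."""
    return all(all(n == 1 for n in reqs) for reqs in tokens_per_rank)

assert all_ranks_decode_only([[1, 1], [1]]) is True
assert all_ranks_decode_only([[1, 7], [1]]) is False  # rank 0 has a prefill
```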
To implement Dual-Batch Overlap (DBO), at a high level, we split the batch into two microbatches. Then, using two threads and two CUDA streams (one for communication and one for computation), we overlap the dispatch and combine all-to-all kernels of one microbatch with the compute kernels of the other microbatch.
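The split itself can be pictured with a minimal helper (our own illustration, not the PR's code): divide a batch of N tokens into two contiguous slices of roughly equal size, which can then be handed to the two microbatch threads.

```python
# Minimal sketch: split a batch of num_tokens tokens into two contiguous
# microbatch slices. The real implementation also builds per-microbatch
# attention metadata from these slices.
def split_into_microbatches(num_tokens: int) -> tuple[slice, slice]:
    mid = num_tokens // 2
    return slice(0, mid), slice(mid, num_tokens)

first, second = split_into_microbatches(10)
tokens = list(range(10))
# The two slices partition the batch with no overlap and no gap:
assert tokens[first] + tokens[second] == tokens
```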
When microbatching is enabled and supported, the `GPUModelRunner` will split the batch into two token slices. These token slices are then passed into the attention metadata builders during `_prepare_inputs` to generate one attention metadata object per microbatch. When actually running the model, the model runner will spawn two microbatching threads that communicate with each other using a `UBatchContext`. Each of these threads will then run `self.model` with the appropriate attention metadata.

Without any additional modifications to the code, this would just result in one microbatch running to completion before the other microbatch starts. In order to get overlap, we've added a "yield" call that can be inserted into the all-to-all kernels to interleave the two microbatches. The `yield_and_switch_from_compute_to_comm` function yields the CPU from this thread (thread A) to the other microbatching thread (thread B). Once thread A has resumed execution, either because thread B yielded the CPU or finished its execution, it will swap over to the communication stream and start dispatching kernels there. `yield_and_switch_from_comm_to_compute` behaves similarly but in the opposite direction: it swaps from the communication stream to the compute stream.

There are both GPU and CPU events to synchronize all of this. That being said, it is absolutely critical that only one microbatching thread is running at a time, meaning the other one is waiting on an event. It is also absolutely critical that both microbatches run the exact same number of yields.
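The CPU-side hand-off can be sketched with plain Python threading (a toy model under our own names, no CUDA streams or vLLM code): two "microbatch" threads pass the CPU back and forth at explicit yield points, and because each thread blocks on its own event after yielding, only one runs at a time. If the two threads performed different numbers of yields, the final hand-off would never come, which is why both microbatches must run the exact same number of yields.

```python
import threading

# Toy model of the yield mechanism: one event per microbatch thread marks
# whose turn it is to run. Exactly one thread is runnable at any moment.
class ToyUBatchContext:
    def __init__(self) -> None:
        self.turn = [threading.Event(), threading.Event()]
        self.turn[0].set()  # microbatch 0 gets the CPU first

    def yield_to_peer(self, me: int) -> None:
        self.turn[me].clear()    # mark ourselves as waiting first
        self.turn[1 - me].set()  # hand the CPU to the other thread
        self.turn[me].wait()     # block until it yields back (or finishes)

log = []

def run_microbatch(ctx: ToyUBatchContext, idx: int, num_yields: int) -> None:
    ctx.turn[idx].wait()  # wait for our first turn
    for step in range(num_yields):
        log.append((idx, step))  # real code would dispatch kernels here
        ctx.yield_to_peer(idx)
    ctx.turn[1 - idx].set()      # we're done; let the peer run to completion

ctx = ToyUBatchContext()
threads = [threading.Thread(target=run_microbatch, args=(ctx, i, 3))
           for i in (0, 1)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# The two microbatches strictly alternate, one step at a time:
assert log == [(0, 0), (1, 0), (0, 1), (1, 1), (0, 2), (1, 2)]
```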
lm_eval results
**With Microbatching**

```
VLLM_ALL2ALL_BACKEND=pplx vllm serve --model="deepseek-ai/DeepSeek-V2-Lite" --trust-remote-code --data-parallel-size 2 --enable-expert-parallel --gpu-memory-utilization 0.75 --port 4444 --disable-log-requests --enforce-eager --enable-microbatching
VLLM_ALL2ALL_BACKEND=pplx vllm serve --model="deepseek-ai/DeepSeek-V2-Lite" --trust-remote-code --data-parallel-size 2 --tensor-parallel-size 2 --enable-expert-parallel --gpu-memory-utilization 0.75 --port 4444 --disable-log-requests --enforce-eager --enable-microbatching
```

**Without Microbatching**

```
VLLM_ALL2ALL_BACKEND=pplx vllm serve --model="deepseek-ai/DeepSeek-V2-Lite" --trust-remote-code --data-parallel-size 2 --enable-expert-parallel --gpu-memory-utilization 0.75 --port 4444 --disable-log-requests --enforce-eager
VLLM_ALL2ALL_BACKEND=pplx vllm serve --model="deepseek-ai/DeepSeek-V2-Lite" --trust-remote-code --data-parallel-size 2 --tensor-parallel-size 2 --enable-expert-parallel --gpu-memory-utilization 0.75 --port 4444 --disable-log-requests --enforce-eager
```