[ROCm][FEAT] Integrate AITER CustomAllreduce in cuda communicator. #23336
base: main
Conversation
Control it by the env flag VLLM_ROCM_USE_AITER_CUSTOM_ALL_REDUCE (default: True).
Can you refactor this to extend the current dispatching instead of using conditional imports?
@ProExpertProg The requested modification has been applied.
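For context, a minimal sketch of what resolver-style dispatching could look like; the function name and the fallback import path are assumptions based on vLLM's in-tree implementation, not taken verbatim from this PR's diff:

```python
# Sketch only: resolve the implementation once, behind a single function,
# instead of conditional imports at each call site.
from typing import Type


def dispatch_custom_allreduce() -> Type:
    """Return the CustomAllreduce implementation for the current setup."""
    if is_rocm_aiter_custom_allreduce_enabled():  # env-flag helper from this PR
        from aiter.dist.custom_all_reduce import CustomAllreduce
    else:
        # vLLM's in-tree implementation as the fallback
        from vllm.distributed.device_communicators.custom_all_reduce import (
            CustomAllreduce)
    return CustomAllreduce
```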
Looks reasonable. Just nits.
"""Dispatch the custom allreduce implementation based on the platform.""" | ||
if is_rocm_aiter_custom_allreduce_enabled(): | ||
from aiter.dist.custom_all_reduce import CustomAllreduce | ||
logger.info("Using aiter.dist.custom_all_reduce for ROCm platform") |
Nit: use `info_once` here instead of `info`.
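Applying the nit is a one-line change; vLLM's logger exposes `info_once`, which emits a given message only the first time it is logged:

```python
# The suggested change: log once per process instead of on every dispatch.
logger.info_once("Using aiter.dist.custom_all_reduce for ROCm platform")
```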
```diff
-        self.ca_comm: Optional[CustomAllreduce] = None
+        self.ca_comm: Optional[CustomAllreduce] = None  # type: ignore
```
Nit: I think if you add a `__call__` method to the `CustomAllreduceProtocol` you can get rid of the `# type: ignore`.
@SageMoore Unfortunately, this solution didn't work out. To get rid of the `# type: ignore`, I had to change the type annotation to `CustomAllreduceProtocol` and implement the additional methods and attributes required from the `CustomAllreduce` class. With this, mypy now passes.
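A rough sketch of that final approach, with Protocol members modeled on vLLM's `CustomAllreduce`; the exact member set in the PR may differ:

```python
from typing import Optional, Protocol

import torch


class CustomAllreduceProtocol(Protocol):
    """Structural type covering both the in-tree and AITER implementations."""

    disabled: bool

    def should_custom_ar(self, inp: torch.Tensor) -> bool: ...

    def custom_all_reduce(self, input: torch.Tensor) -> Optional[torch.Tensor]: ...

    def close(self) -> None: ...


# The attribute annotation then becomes:
# self.ca_comm: Optional[CustomAllreduceProtocol] = None
```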
@vllmellm @SageMoore I would suggest having the new aiter_comm not as a complete replacement of the current custom allreduce but as an alternative to it, enabled/disabled by an env var or config.
@ilmarkov Thank you for the suggestion. Indeed, keeping it as a separate alternative makes sense; based on your comment, I have added back a separate environment flag for enabling/disabling it.
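A sketch of how such a flag could be read, assuming it follows vLLM's usual pattern of also requiring the umbrella `VLLM_ROCM_USE_AITER` switch (an assumption based on the test commands below, which set both; the PR's actual wiring may differ):

```python
import os


def is_rocm_aiter_custom_allreduce_enabled() -> bool:
    # Default True per the commit message; also gated on the umbrella
    # AITER switch (assumption, see the lead-in above).
    return (os.environ.get("VLLM_ROCM_USE_AITER", "0") == "1"
            and os.environ.get(
                "VLLM_ROCM_USE_AITER_CUSTOM_ALL_REDUCE", "1") == "1")
```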
@vllmellm Sorry if I shared the idea unclearly. I am suggesting to add …
@ilmarkov I see what you mean; thank you for elaborating. I understand your point of view; however, the …
@vllmellm Isn't there any scenario where we could use aiter_comm for a certain range of input sizes and ca_comm for the rest, e.g. for better performance?
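Hypothetically, such a hybrid could pick a backend per call from the message size; everything below (names, threshold) is illustrative and not part of this PR:

```python
import torch

# Illustrative cutoff; a real value would have to come from benchmarking.
AITER_MAX_BYTES = 2 * 1024 * 1024


def pick_comm(input_: torch.Tensor, aiter_comm, ca_comm):
    """Use aiter_comm below the size cutoff and ca_comm above it."""
    nbytes = input_.numel() * input_.element_size()
    return aiter_comm if nbytes <= AITER_MAX_BYTES else ca_comm
```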
@ilmarkov Could you guide us on how to evaluate the speed of quick reduce while also taking into account the quantization error it introduces? How should we set the quantization level?
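One hypothetical way to measure this (not from the PR): run the quantizing quick-reduce path and an exact fp32 allreduce on the same input, then report the relative error alongside the latency for each quantization level:

```python
import torch
import torch.distributed as dist


def quick_reduce_rel_error(x: torch.Tensor, quick_reduce) -> float:
    """Relative L2 error of a quantizing allreduce vs. an fp32 reference.

    `quick_reduce` is a placeholder callable for the path under test;
    assumes a torch.distributed process group is already initialized.
    """
    ref = x.detach().float().clone()
    dist.all_reduce(ref)             # exact fp32 reference
    out = quick_reduce(x.clone())    # quantized path under test
    return ((out.float() - ref).norm() / ref.norm()).item()
```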
Purpose
Integrate the AITER custom allreduce into the CUDA communicator, which boosts model performance.
This PR was tested against commit 5ee37dc of the aiter package.
Benchmark Results
deepseek-ai/DeepSeek-V3 tp8
meta-llama/Llama-4-Scout-17B-16E-Instruct tp8
Qwen/Qwen3-235B-A22B-FP8 tp4
Benchmark setting:
```bash
python vllm/benchmarks/benchmark_serving.py --backend vllm --model "$model_name" --dataset-name random --num-prompts 50 --request-rate 10 --random-input-len 1000 --random-output-len 1000
```
Test Plan
Test the models affected by this change using lm_eval on the gsm8k dataset.
Environment setting:
Step 1: run vllm serve
```bash
VLLM_USE_V1=1 VLLM_ROCM_USE_AITER=1 VLLM_ROCM_USE_AITER_CUSTOM_ALL_REDUCE=1 SAFETENSORS_FAST_GPU=1 \
vllm serve $MODEL_NAME --compilation-config '{"cudagraph_mode": "FULL_AND_PIECEWISE"}' --trust-remote-code --max-model-len 32768 -tp 8 --block-size 1 --swap-space 16 --distributed-executor-backend mp
```
Step 2: run lm_eval
```bash
lm_eval --model local-completions --tasks gsm8k --model_args model=$MODEL_NAME,base_url=http://localhost:8000/v1/completions --trust_remote_code --num_fewshot 5 --batch_size 100
```
Test Results
deepseek-ai/DeepSeek-V3 tp8