[Feature][Rocm] add quick all reduce for rocm #19744
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a small subset of tests runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
Summary of Changes
Hello @lihaoyang-amd, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request significantly enhances vLLM's distributed communication capabilities on AMD GPUs by introducing a highly optimized 'Quick All-Reduce' feature. It integrates a specialized quickreduce library, providing support for various quantization levels to accelerate all-reduce operations. The changes include adding low-level C++/HIP kernels, exposing them through Python bindings, and implementing intelligent dispatch logic to automatically select the most performant communication strategy based on tensor properties and system configuration, aiming to improve performance for large tensor all-reduces.
Highlights
- New Feature: Quick All-Reduce: Introduced a 'Quick All-Reduce' feature specifically optimized for AMD GPUs (ROCm), leveraging the quickreduce library for enhanced distributed communication performance.
- Quantization Support: The new quick all-reduce supports various quantization levels, including FP16, INT8, INT6, and INT4, allowing for flexible performance tuning based on precision requirements.
- Intelligent Dispatch Logic: Integrated the quick all-reduce into vLLM's existing custom all-reduce framework, implementing logic to dynamically select between the standard custom all-reduce and the new quick all-reduce based on message size, data type, and the distributed world size (a sketch of this selection idea follows this list).
- Low-Level Kernel Implementation: Added new C++/HIP kernels and Python bindings for the quick all-reduce operations, including specialized code for efficient IPC memory handling and GPU-specific intrinsics for AMD's CDNA architecture.
- Configurable Behavior: Introduced new environment variables (VLLM_ROCM_QR_QUANT_REGIME and VLLM_ROCM_QR_CAST_BF16_TO_FP16) to allow users to configure the quick all-reduce's quantization level and bfloat16 casting behavior.
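To make the dispatch idea above concrete, here is a minimal sketch in plain PyTorch. It is not the PR's actual code: the function name, threshold constant, and supported world sizes are hypothetical placeholders; the real selection logic lives in vLLM's custom allreduce integration.

```python
# Hypothetical sketch of the dispatch idea: prefer a quick-allreduce path for
# large fp16/bf16 tensors on supported world sizes, otherwise fall back to the
# standard custom allreduce / RCCL path. Names and values are illustrative only.
import torch

QUICK_ALLREDUCE_MIN_BYTES = 16 * 1024 * 1024   # illustrative threshold, not the PR's value
SUPPORTED_WORLD_SIZES = (2, 4, 8)              # assumption for this sketch

def use_quick_allreduce(t: torch.Tensor, world_size: int) -> bool:
    """Decide whether the hypothetical quick-allreduce path should handle this tensor."""
    if world_size not in SUPPORTED_WORLD_SIZES:
        return False
    if t.dtype not in (torch.float16, torch.bfloat16):
        return False
    # Only large messages benefit; small ones stay on the existing path.
    return t.numel() * t.element_size() >= QUICK_ALLREDUCE_MIN_BYTES
```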
Code Review
This pull request introduces a 'Quick All Reduce' feature, primarily targeting ROCm/HIP environments. It adds new C++/HIP kernels for an accelerated all-reduce operation, including support for quantization (FP16, INT8, INT6, INT4). The changes include new source files for the QuickReduce implementation, CMake build system updates, PyTorch C++ bindings, Python wrappers, and integration into the CustomAllreduce class. Key areas of feedback include a critical bug in a quantization codec, a potential runtime error in Python logic, and a suggestion for type clarity in C++ header declarations.
This pull request has merge conflicts that must be resolved before it can be merged.
Can you merge from main to see if the CI failures are resolved?
This pull request has merge conflicts that must be resolved before it can be merged.
Overall looks good and great work on this PR! I left a few more comments, but I think they should be easy to address, thx!
LGTM!
@lihaoyang-amd do you know why these are failing? Was it already failing?
Yes, you can refer to some PRs that were merged before this one in the commit history.
@lihaoyang-amd Thanks. As I found out in #17888 (comment), it has indeed been failing for at least a few commits/days earlier (see #20045 and buildkite.com/vllm/ci/builds/22676#0197a8d1-7fc7-4a19-bb6d-c1664f589dc9).
For ROCm only.
1. Adds a quickreduce alternative to custom allreduce and RCCL. For large inputs, quick allreduce is used instead of custom allreduce and RCCL (see the kernel benchmark results below).
2. The collective is only enabled on AMD MI300, for fp16/bf16 inputs, and when custom allreduce is enabled. The kernels support full-precision as well as quantized int8, int6, and int4 allreduce (symmetric quantization with group size 32); a small illustration of this scheme follows the list.
3. Quickreduce can be enabled by setting the VLLM_ROCM_QUICK_REDUCE_QUANTIZATION=[NONE|FP|INT8|INT6|INT4] environment variable. Quickreduce supports int8, int6, and int4 quantization; NONE turns quick allreduce off.
4. The PR provides both fp16 and bf16 kernels, but given the lack of intrinsics for bf16 math operations, the bf16 kernels perform worse (see the kernel benchmark results below), so by default bf16 allreduce inputs are converted to fp16. To disable this behavior, set the environment variable VLLM_ROCM_QR_CAST_BF16_TO_FP16=0.
5. Since quickreduce only shows performance benefits at medium/large input sizes (see the kernel benchmarks), vLLM keeps using custom allreduce for small inputs. The lower limit for enabling quickreduce was chosen based on experimental results.
6. The default maximum input size for quickreduce is 2 GB. For users with limited GPU memory the preset buffer may be too large; the value (in MB) can be adjusted via VLLM_ROCM_QUICK_REDUCE_MAX_SIZE_BYTES_MB (a configuration sketch follows the list).
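As a usage illustration only (not code from this PR), the sketch below shows one way to set the environment variables named above before constructing a vLLM engine. The chosen quantization level, buffer size, and parallel size are arbitrary example values, not recommendations.

```python
# Example: configure quick allreduce via the environment variables described above,
# then build a vLLM engine. Values are illustrative; setting them before importing
# vLLM keeps things simple.
import os

os.environ["VLLM_ROCM_QUICK_REDUCE_QUANTIZATION"] = "INT8"      # NONE|FP|INT8|INT6|INT4
os.environ["VLLM_ROCM_QR_CAST_BF16_TO_FP16"] = "1"              # cast bf16 inputs to fp16
os.environ["VLLM_ROCM_QUICK_REDUCE_MAX_SIZE_BYTES_MB"] = "512"  # shrink the 2 GB default buffer

from vllm import LLM

llm = LLM(model="meta-llama/Llama-3.1-70B-Instruct",
          dtype="float16",
          tensor_parallel_size=8)
print(llm.generate("Hello")[0].outputs[0].text)
```

And a minimal sketch of the symmetric group-size-32 quantization idea referenced in point 2; this is plain PyTorch for illustration, not the PR's HIP kernel code.

```python
# Illustrative symmetric per-group (group size 32) int8 quantization, as a rough
# model of the quantized allreduce payload; not the PR's implementation.
import torch

GROUP_SIZE = 32

def quantize_int8_groups(x: torch.Tensor):
    """Symmetrically quantize a 1-D fp16 tensor to int8 in groups of 32 values."""
    groups = x.view(-1, GROUP_SIZE).float()
    # One scale per group, chosen so the largest magnitude maps to 127.
    scale = (groups.abs().amax(dim=1, keepdim=True) / 127.0).clamp_min(1e-8)
    q = torch.clamp(torch.round(groups / scale), -127, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return (q.float() * scale).half().view(-1)

x = torch.randn(1024, dtype=torch.float16)
q, s = quantize_int8_groups(x)
print((dequantize(q, s) - x).abs().max())   # small quantization error
```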
Kernels benchmark
The baseline is custom allreduce for data sizes below 16 MB and RCCL for data sizes above 16 MB.
(Benchmark plots for TP=2, TP=4, and TP=8 are attached to the PR.)
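For readers who want to reproduce kernel-level numbers like these, here is a minimal timing sketch using plain torch.distributed. It is not the benchmark script used for this PR; the backend, message sizes, and launch method (torchrun with one process per GPU) are assumptions.

```python
# Minimal allreduce latency sketch (assumption: launched with torchrun, one process per GPU).
import os
import time
import torch
import torch.distributed as dist

def benchmark_allreduce(num_bytes: int, iters: int = 50) -> float:
    """Return average allreduce latency in microseconds for an fp16 tensor of num_bytes."""
    n = num_bytes // 2  # fp16 is 2 bytes per element
    x = torch.randn(n, dtype=torch.float16, device="cuda")
    for _ in range(5):                      # warmup
        dist.all_reduce(x)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        dist.all_reduce(x)
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters * 1e6

if __name__ == "__main__":
    dist.init_process_group(backend="nccl")  # RCCL on ROCm builds of PyTorch
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
    for size_mb in (1, 16, 64, 256):
        us = benchmark_allreduce(size_mb * 1024 * 1024)
        if dist.get_rank() == 0:
            print(f"{size_mb} MB: {us:.1f} us")
    dist.destroy_process_group()
```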
E2E server benchmark, float16
Server:
VLLM_USE_V1=1 VLLM_USE_TRITON_FLASH_ATTN=0 vllm serve meta-llama/Llama-3.1-70B-Instruct --block_size=32 --disable-log-requests --no-enable-prefix-caching -tp $tp --dtype float16
Client:
python benchmarks/benchmark_serving.py --model meta-llama/Llama-3.1-70B-Instruct --dataset-name sonnet --dataset-path benchmarks/sonnet.txt --num-prompts 500 --request-rate 10 --ignore-eos
(Results for TP=8 and TP=4 are attached to the PR.)
E2E server benchmark, bfloat16
Server:
VLLM_USE_V1=1 VLLM_USE_TRITON_FLASH_ATTN=0 vllm serve model_path --block_size=32 --disable-log-requests --no-enable-prefix-caching -tp $tp --dtype auto
Client:
python benchmarks/benchmark_serving.py --model model_path --dataset-name sonnet --dataset-path benchmarks/sonnet.txt --num-prompts 500 --request-rate 10 --ignore-eos
(Run with VLLM_ROCM_QR_CAST_BF16_TO_FP16=1.)
Evaluation results
MMLU benchmark (Llama 3.1 70B, TP=8) and GSM8K (with the bf16-to-fp16 cast enabled via VLLM_ROCM_QR_CAST_BF16_TO_FP16=1); result tables are attached to the PR.
The initial PR was proposed by @ilmarkov, and we collaborated on it.