Custom all reduce kernels #2192
Conversation
It's not quite ready to merge. I'm requesting comments.
Force-pushed from d6acc8b to 16447e5.
@hanzhi713 This is awesome! Many thanks for the PR! A quick question: do you happen to know about the custom all-reduce kernels in TRT-LLM? Is this PR related to those kernels?
This is included in the PR description.
@WoosukKwon Correctness and functionality wise, this PR should be ready. I checked a few models and there are only occasional generation differences (due to numerical differences). See the diff below for reference; left is without fast allreduce and right is with fast allreduce: https://www.diffchecker.com/hiJejMpy/

Tested with:

```python
from vllm import LLM, SamplingParams

# Sample prompts.
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
] * 32
# Create a sampling params object.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=32)
# Create an LLM.
llm = LLM(model="TheBloke/Llama-2-70B-fp16", tensor_parallel_size=8, disable_fast_allreduce=True)  # or False for fast allreduce
# Generate texts from the prompts. The output is a list of RequestOutput objects
# that contain the prompt, generated text, and other information.
outputs = llm.generate(prompts, sampling_params)
# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
A 5% throughput improvement from optimizing all-reduce with custom kernels is quite impressive. Well done!
Yes, especially considering this is mainly a latency optimization.
@hanzhi713 have you compared pytorch/pytorch#114001 with your custom reduce ops?
Hi @hanzhi713, thanks again for the awesome PR! I did about one third of the review, mostly on code style, and will look into the actual implementation next.
@hanzhi713 BTW I got this error when using 2 L4 GPUs:
I guess I have to check this. All the topologies I have access to support P2P, but some platforms don't.
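(For quick diagnosis, here is a small sketch of how one might check pairwise P2P access from Python using the public PyTorch API; this is illustrative only, not the detection code in this PR.)

```python
# Hedged sketch: check pairwise CUDA P2P access with PyTorch's public API.
import torch

def all_pairs_p2p(num_gpus: int) -> bool:
    # True only if every GPU pair can access each other's memory directly.
    return all(
        torch.cuda.can_device_access_peer(i, j)
        for i in range(num_gpus)
        for j in range(num_gpus)
        if i != j
    )

if __name__ == "__main__":
    print(all_pairs_p2p(torch.cuda.device_count()))
```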
Haha, it's fine. Writing a review is often slower than modifying the code directly.
Hi @hanzhi713, thanks again for the great work!
I left some comments mainly on the C++ part, as I still need a bit more time to complete my review on the CUDA kernel part. Overall, I learned a lot while reading your code and really appreciate it. However, it seems the code can be improved in terms of simplicity. Please take a look at my review comments.
csrc/custom_all_reduce_test.cu (outdated)
Just curious: Can we port this test to Python so that we can use Ray instead of MPI? This change would also make it easier to include this test in our CI.
Might be possible, but it's tricky to get the performance measurement correct in Python, especially for NCCL kernels. Each kernel's runtime is so short (<=10us for the smallest size) that removing any overhead is important.
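(As a rough illustration of the overhead concern, here is a minimal CUDA-event timing sketch in plain PyTorch; the helper name and loop counts are assumptions, not code from this PR.)

```python
# Hedged sketch: time a CUDA op with CUDA events so that Python launch
# overhead is excluded from the measured interval.
import torch

def time_cuda_op_us(fn, warmup: int = 10, iters: int = 100) -> float:
    """Average device time per call, in microseconds."""
    for _ in range(warmup):
        fn()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) * 1000.0 / iters  # elapsed_time returns ms
```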
```python
_CA_HANDLE = None
_IS_CAPTURING = False
_SUPPORTED_WORLD_SIZES = [2, 4, 6, 8]
```
Just curious: Why should the number of GPUs be even? Which part of the code should we fix if we want to support an odd number of GPUs?
Well, I think my kernels do support an odd number of GPUs; I just never tested them. Do other parts of vLLM support an odd number of GPUs (e.g. tensor parallel linear)?
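(A hypothetical sketch of what such a gate might look like; the helper name and the NVLink condition are assumptions, not the actual vLLM code.)

```python
# Hedged sketch: gate the custom all-reduce path on tested world sizes and
# topology, falling back to NCCL otherwise. Illustrative only.
_SUPPORTED_WORLD_SIZES = [2, 4, 6, 8]

def can_use_custom_allreduce(world_size: int, full_nvlink: bool) -> bool:
    # Odd world sizes may work in principle but are untested, so they are
    # excluded here; unsupported cases fall back to NCCL.
    return world_size in _SUPPORTED_WORLD_SIZES and full_nvlink
```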
```cpp
if (world_size_ == 2) {                                  \
  KL(ngpus, cross_device_reduce_1stage);                 \
} else if (full_nvlink_) {                               \
  if ((world_size_ <= 4 && bytes < 512 * 1024) ||        \
      (world_size_ <= 8 && bytes < 256 * 1024)) {        \
    KL(ngpus, cross_device_reduce_1stage);               \
  } else {                                               \
    KL(ngpus, cross_device_reduce_2stage);               \
  }                                                      \
```
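(For readability, a plain-Python restatement of the dispatch heuristic above; the branch not shown in the excerpt is assumed to be the non-custom fallback path.)

```python
# Hedged restatement of the kernel selection macro above, for readability only.
def pick_kernel(world_size: int, full_nvlink: bool, nbytes: int) -> str:
    if world_size == 2:
        return "cross_device_reduce_1stage"
    if full_nvlink:
        one_stage = (world_size <= 4 and nbytes < 512 * 1024) or \
                    (world_size <= 8 and nbytes < 256 * 1024)
        return "cross_device_reduce_1stage" if one_stage else "cross_device_reduce_2stage"
    return "fallback"  # assumed: branch not shown in the excerpt
```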
I actually don't fully understand the underlying principle behind this kernel selection process. How did you set the thresholds (512KB and 256KB)? Why are the thresholds different for ngpus=4 and ngpus=8?
LGTM! Many thanks again for submitting the PR and addressing the reviews. I made some minor changes again so we can merge the PR ASAP. Hope this is OK with you.
Besides, there are a few items we hope to work on in this thread:
- Enabling this optimization for cloud A10G/L4 GPUs. For some reason, CUDA P2P access is not possible for these GPUs in cloud environments. We need to investigate the problem.
- Refactoring the code for FastAllreduce initialization and buffer registration. I feel we can simplify this further.
We'd be happy to have your (and anyone's) inputs to the above items!
@Yard1 I noticed your comment on multiple captures. On my end, multiple captures work and produce correct results (using examples/offline_inference.py). Also, my unit test (tests/distributed/test_custom_all_reduce.py) uses multiple captures too. I'm not quite sure how you're using it. I just moved
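(To illustrate what "multiple captures" means at the framework level, here is a minimal plain-PyTorch sketch of capturing two CUDA graphs and replaying both; this is not the vLLM capture path.)

```python
# Hedged sketch: capture the same computation into two separate CUDA graphs
# and replay both. Plain PyTorch, independent of vLLM's capture logic.
import torch

x = torch.randn(1024, device="cuda")
graphs, outputs = [], []
for _ in range(2):
    g = torch.cuda.CUDAGraph()
    with torch.cuda.graph(g):
        y = x * 2  # stand-in for the captured model forward pass
    graphs.append(g)
    outputs.append(y)

for g in graphs:
    g.replay()
torch.cuda.synchronize()
```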
See this doc for a detailed writeup and experiments:
Latency-optimal allreduce and cuda graph optimization.pdf
Latency and memory
Tested with
L = Latency, M = Memory
Hypothesis on why memory usage is lower with fast allreduce:
Throughput
Performance and memory note
Implementation note
Since I originally implemented fast allreduce on top of my own fork, I made some changes compared to the original version in the doc. Note that the performance numbers in the writeup doc are not valid because my fork differs significantly from upstream. The main changes are:
There was also extensive effort to make it work with CUDA graphs automatically (automatic IPC buffer registration). My previous implementation required manually allocating a global buffer and changing model code to write matmul outputs to it.
The one-hop and two-hop all-reduce kernels work very similarly to NVIDIA TensorRT-LLM's kernels (https://github.com/NVIDIA/TensorRT-LLM/blob/release/0.5.0/cpp/tensorrt_llm/kernels/customAllReduceKernels.cu). However, they were developed independently before TensorRT-LLM's release.
Note on some source files
Caveats
Compared to NCCL allreduce, there are some caveats for the fast allreduce path.
1 should be automatically handled and checked. 2 should be a non-issue since all usage of tensor_model_parallelism uses its return value.
TODOs
[ ] (maybe) nit: bind the C++ class properly with pybind (not using C-style binding). Since we don't want to introduce PyTorch dependencies to the header file, we need an additional layer of wrapper anyway.