Custom all reduce kernels #2192
Conversation
It's not quite ready to merge. I'm requesting comments.
Force-pushed from d6acc8b to 16447e5.
@hanzhi713 This is awesome! Many thanks for the PR! A quick question: do you happen to know about the custom all-reduce kernels in TRT-LLM? Is this PR related to those kernels?
This is included in the PR description.
@WoosukKwon Correctness and functionality wise, this PR should be ready. I checked a few models and there are only occasional generation differences (due to numerical differences). See the diff below for reference; left is without fast allreduce and right is with fast allreduce: https://www.diffchecker.com/hiJejMpy/

Tested with:

```python
from vllm import LLM, SamplingParams

# Sample prompts.
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
] * 32
# Create a sampling params object.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=32)
# Create an LLM.
llm = LLM(model="TheBloke/Llama-2-70B-fp16", tensor_parallel_size=8, disable_fast_allreduce=True)  # or False for fast allreduce
# Generate texts from the prompts. The output is a list of RequestOutput objects
# that contain the prompt, generated text, and other information.
outputs = llm.generate(prompts, sampling_params)
# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
A 5% throughput improvement from optimizing all-reduce with custom kernels is quite impressive. Well done!
Yes, especially considering this is mainly a latency optimization.
@hanzhi713 have you compared pytorch/pytorch#114001 with your custom reduce ops?
Hi @hanzhi713, thanks again for the awesome PR! I did about one third of the review, mostly on code style, and will look into the actual implementation next.
@hanzhi713 BTW I got this error when using 2 L4 GPUs:
I guess I have to check this. All the topologies I have access to support P2P, but some platforms don't.
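(For quick diagnosis, here is a small sketch of how one might check pairwise P2P access from Python using the public PyTorch API; this is illustrative only, not the detection code in this PR.)

```python
# Hedged sketch: check pairwise CUDA P2P access with PyTorch's public API.
import torch

def all_pairs_p2p(num_gpus: int) -> bool:
    # True only if every GPU pair can access each other's memory directly.
    return all(
        torch.cuda.can_device_access_peer(i, j)
        for i in range(num_gpus)
        for j in range(num_gpus)
        if i != j
    )

if __name__ == "__main__":
    print(all_pairs_p2p(torch.cuda.device_count()))
```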
Haha, it's fine. Writing a review is often slower than modifying the code directly.
Hi @hanzhi713, thanks again for the great work!
I left some comments mainly on the C++ part, as I still need a bit more time to complete my review on the CUDA kernel part. Overall, I learned a lot while reading your code and really appreciate it. However, it seems the code can be improved in terms of simplicity. Please take a look at my review comments.
csrc/custom_all_reduce_test.cu (outdated)
Just curious: Can we port this test to Python so that we can use Ray instead of MPI? This change would also make it easier to include this test in our CI.
Might be possible, but it's tricky to get the performance measurement correct in Python, especially for NCCL kernels. Each kernel's runtime is so short (<=10us for the smallest size) that removing any overhead is important.
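(As a rough illustration of the overhead concern, here is a minimal CUDA-event timing sketch in plain PyTorch; the helper name and loop counts are assumptions, not code from this PR.)

```python
# Hedged sketch: time a CUDA op with CUDA events so that Python launch
# overhead is excluded from the measured interval.
import torch

def time_cuda_op_us(fn, warmup: int = 10, iters: int = 100) -> float:
    """Average device time per call, in microseconds."""
    for _ in range(warmup):
        fn()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) * 1000.0 / iters  # elapsed_time returns ms
```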
```python
_CA_HANDLE = None
_IS_CAPTURING = False
_SUPPORTED_WORLD_SIZES = [2, 4, 6, 8]
```
Just curious: Why should the number of GPUs be even? Which part of the code should we fix if we want to support an odd number of GPUs?
Well, I think my kernels do support an odd number of GPUs; I just never tested them. Do other parts of vLLM support an odd number of GPUs (e.g. tensor parallel linear)?
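(A hypothetical sketch of what such a gate might look like; the helper name and the NVLink condition are assumptions, not the actual vLLM code.)

```python
# Hedged sketch: gate the custom all-reduce path on tested world sizes and
# topology, falling back to NCCL otherwise. Illustrative only.
_SUPPORTED_WORLD_SIZES = [2, 4, 6, 8]

def can_use_custom_allreduce(world_size: int, full_nvlink: bool) -> bool:
    # Odd world sizes may work in principle but are untested, so they are
    # excluded here; unsupported cases fall back to NCCL.
    return world_size in _SUPPORTED_WORLD_SIZES and full_nvlink
```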
```cpp
if (world_size_ == 2) {                                  \
  KL(ngpus, cross_device_reduce_1stage);                 \
} else if (full_nvlink_) {                               \
  if ((world_size_ <= 4 && bytes < 512 * 1024) ||        \
      (world_size_ <= 8 && bytes < 256 * 1024)) {        \
    KL(ngpus, cross_device_reduce_1stage);               \
  } else {                                               \
    KL(ngpus, cross_device_reduce_2stage);               \
  }                                                      \
```
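(For readability, a plain-Python restatement of the dispatch heuristic above; the branch not shown in the excerpt is assumed to be the non-custom fallback path.)

```python
# Hedged restatement of the kernel selection macro above, for readability only.
def pick_kernel(world_size: int, full_nvlink: bool, nbytes: int) -> str:
    if world_size == 2:
        return "cross_device_reduce_1stage"
    if full_nvlink:
        one_stage = (world_size <= 4 and nbytes < 512 * 1024) or \
                    (world_size <= 8 and nbytes < 256 * 1024)
        return "cross_device_reduce_1stage" if one_stage else "cross_device_reduce_2stage"
    return "fallback"  # assumed: branch not shown in the excerpt
```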
I actually don't fully understand the underlying principle behind this kernel selection process. How did you set the thresholds (512KB and 256KB)? Why are the thresholds different for ngpus=4 and ngpus=8?
LGTM! Many thanks again for submitting the PR and addressing the reviews. I made some minor changes again so we can merge the PR ASAP. Hope this is OK with you.
Besides, there are a few items we hope to work on in this thread:
- Enabling this optimization for cloud A10G/L4 GPUs. For some reason, CUDA P2P access is not possible for these GPUs in cloud environments. We need to investigate the problem.
- Refactoring the code for FastAllreduce initialization and buffer registration. I feel we can simplify this further.
We'd be happy to have your (and anyone's) inputs to the above items!
@Yard1 I noticed your comment on multiple captures. On my end, multiple captures work and produce correct results (using examples/offline_inference.py). Also, my unit test (tests/distributed/test_custom_all_reduce.py) uses multiple captures too. I'm not quite sure how you're using it. I just moved
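(To illustrate what "multiple captures" means at the framework level, here is a minimal plain-PyTorch sketch of capturing two CUDA graphs and replaying both; this is not the vLLM capture path.)

```python
# Hedged sketch: capture the same computation into two separate CUDA graphs
# and replay both. Plain PyTorch, independent of vLLM's capture logic.
import torch

x = torch.randn(1024, device="cuda")
graphs, outputs = [], []
for _ in range(2):
    g = torch.cuda.CUDAGraph()
    with torch.cuda.graph(g):
        y = x * 2  # stand-in for the captured model forward pass
    graphs.append(g)
    outputs.append(y)

for g in graphs:
    g.replay()
torch.cuda.synchronize()
```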
See this doc for a detailed writeup and experiments:
Latency-optimal allreduce and cuda graph optimization.pdf
Latency and memory
Tested with
L = Latency, M = Memory
Hypothesis on why memory usage is lower with fast allreduce:
Throughput
Performance and memory note
Implementation note
Since I originally implemented fast allreduce on top of my own fork, I made some changes compared to the original version in the doc. Note that the performance numbers in the writeup doc are not valid because my fork differs significantly from upstream. The main changes are:
There was also extensive effort to make it work with CUDA graphs automatically (automatic IPC buffer registration). My previous implementation required manually allocating a global buffer and changing model code to write matmul outputs to it.
The one-hop and two-hop all-reduce kernels work very similarly to NVIDIA TensorRT-LLM's kernels (https://github.com/NVIDIA/TensorRT-LLM/blob/release/0.5.0/cpp/tensorrt_llm/kernels/customAllReduceKernels.cu). However, they were developed independently before TensorRT-LLM's release.
Note on some source files
Caveats
Compared to NCCL allreduce, there are some caveats for the fast allreduce path.
1 should be automatically handled and checked. 2 should be a non-issue since all usage of tensor_model_parallelism uses its return value.
TODOs
[ ] (maybe) nit: bind the C++ class properly with pybind (not using C-style binding). Since we don't want to introduce PyTorch dependencies to the header file, we need an additional layer of wrapper anyway.