CUDA: enable peer access between devices #2470
Conversation
I forgot to add: this PR would also enable NVLink if someone were to use it (P40s do not have NVLink).
I tried this with 3090 Ti + 3080 on WSL2 and there isn't really a difference; however, I am limited to 7B because it fails with "out of memory" errors for anything bigger than that. I suspect that this is because I cannot enable above-4G decoding with both GPUs on my system. The performance with 7B on the two GPUs is very bad for me: from 25 ms/token with a single GPU to 170 ms/token with two GPUs. I wonder if writing the result directly to the memory of the main GPU in the kernel would be faster than the memcpy.
Direct communication without NVLink via PCIe only works on Linux, I think. You should be able to check this via
Yes, you're right. Using the Windows native version of
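As an aside, one way to check this programmatically is to query `cudaDeviceCanAccessPeer` for every device pair; a minimal sketch, not necessarily the check the comment was referring to:

```cpp
// Minimal sketch: report whether each pair of visible CUDA devices could
// use peer-to-peer access. Purely illustrative, not llama.cpp code.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int n_devices = 0;
    cudaGetDeviceCount(&n_devices);
    for (int i = 0; i < n_devices; ++i) {
        for (int j = 0; j < n_devices; ++j) {
            if (i == j) continue;
            int can_access = 0;
            cudaDeviceCanAccessPeer(&can_access, i, j);
            printf("device %d -> device %d: peer access %s\n",
                   i, j, can_access ? "possible" : "not possible");
        }
    }
    return 0;
}
```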
Some profiling data for perplexity calculations using 7b q4_0:
It seems that enabling peer access makes
On 2x3090 it improved top t/s beyond exllama. I have actual NVLink. 13 t/s in textgen vs 10 t/s. For prompt processing, it stopped being noticeable on at least mid-length contexts (1-2k). Time to first token was making me want to go back to GPTQ. I did not try to split a model across all 3 of my cards yet post-change. I've got the exact same Xeon v4 as you, but all layers are offloaded.
Force-pushed 42c5d3c to 8649481
I currently don't have GGUF models higher than 7b ready to re-test performance, but for 7b the relative performance after rebasing seems to be the same.
I was waiting on the llama Python bindings to catch up to test the new implementation. When I ran 70b Q6 I added a print string showing whether peering was enabled. It enabled for the 3090s but not the P40. Still, the speed for a model split across 3 cards of 2 different generations isn't bad. This cannot be done with GPTQ without 3x3090.
Greetings. From my understanding of this thread, the implementation slows down prompt processing but does in fact enable loading larger models into pooled VRAM. IMO this is a valuable feature; maybe gate the merge behind a CMake flag?
Multi-GPU is already supported by llama.cpp. In this PR's current state, it's entirely a performance tradeoff between token generation and prompt processing when using multiple GPUs.
As I said in the issue: the proper way to implement this would be to not just enable peer access but to then also adapt
I would have to spin up on this specifically (probably worth doing anyway?); if you already know what to do, by all means it's yours.
I do not get any kind of processing slowdown with proper NVLink. It just beats exllama, simple as that. There is no tradeoff for me with fully offloaded models that I can tell. I'd rather get 17 t/s than 10 t/s. If there is a slowdown, it will happen to 220+ t/s prompt processing and I'll take it. I am really unsure why people think there is pooled memory. Even with NVLink it's not pooled; the cards simply skip the step of sending anything back to the CPU and communicate directly.
I'm noobing here. Explain how to load the 70B model with full CUDA offload to two 3090 Ti. What hardware is required to upgrade from a system with one 3090 Ti to match your results? NVLink not needed?
I just tell it to offload 83 layers and then do the memory split like 41,42. You don't need NVLink, but you'll get the 10 t/s rather than more. You definitely need 48GB, and more for quants above Q4_K_M. The highest I ran is Q6, but that requires using the P40 and then I only get 7 t/s.
What would be valuable additions if I request access to an A100? How can I write a proposal for access based on this repo? I'm curious whether it's worth my time to try to get access to the big-iron cluster or just add another card to my personal system.
Which A100? If it's the 40GB you will come up short.
Hoping to program Ascend for something entertaining: https://www.osc.edu/resources/technical_support/supercomputers/ascend |
You can train quite well there. I would leave inference to 2x3090 setups. Training 70b on only 48GB is slow, even as Q4.
So would you recommend the proposal output to be new model weights after training the base 70B model with more data? I'm trying to envision why a project about llama.cpp would get funded.
I don't know what proposal you could write. If you had an idea for a dataset to produce some kind of model, I'm sure they would give you compute time. But just to run inference I think you are better off buying another card. Off the top of my head I would combine teatime + todd proxy outputs with first cleaned Orca, Guanaco and then Airoboros to see which one would work better. But I doubt someone would fund my pet anti-alignment projects.
So I tried an implementation where GPUs directly read from/write to the VRAM of other GPUs, but as it turns out that is slower than copying in one batch at the beginning or the end. So I think the way to go is to add a toggle (with enabled peer access being the default). More generally, it seems that the biggest source of kernel launch overhead is the cuBLAS GEMM for the KV cache (due to the many attention heads). For token generation the overhead is much lower because the kernels I wrote can process all attention heads at once, and I think that's why enabling peer access makes it faster.
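For context, the "copying in one batch" alternative boils down to a single peer copy of the finished result to the main GPU. A rough sketch, with illustrative pointer and size names rather than the PR's actual code:

```cpp
// Sketch of a single batched writeback: the result computed on a secondary
// GPU is copied to the main GPU in one cudaMemcpyPeerAsync call instead of
// kernels writing to remote VRAM element by element. With peer access the
// copy goes directly over PCIe/NVLink; without it, the driver stages it
// through host memory.
#include <cuda_runtime.h>

static void copy_result_to_main_device(
        void * dst_main, int main_device,
        const void * src, int src_device,
        size_t nbytes, cudaStream_t stream) {
    cudaMemcpyPeerAsync(dst_main, main_device, src, src_device, nbytes, stream);
}
```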
You must have done something, because exllama also enables peer access and it's slower.
After #3110 I finally understand the prompt processing performance regression from enabling peer access. The problem is that by default the data goes device -> host -> device with buffering on the host. If you enable peer access the data goes directly device -> device with no buffering, so with 2+ GPUs communicating with the main device they can block each other's data transfers and the performance goes down. Unfortunately, you apparently cannot just toggle peer-to-peer data transfer on a per-API-call basis. The proper way to fix this would be to reduce the bandwidth needed for data transfers (e.g. by transferring f16 or q8_0) or to manually implement a buffered device -> host -> device data transfer for the writeback to the main GPU. But for now I think we can add a function that sets peer access depending on batch size and NVLink availability and call it at the beginning of each eval.
Actually, now that I think about it, it may be possible to fix the performance by reordering the data transfers via CUDA events.
Won't this do nothing as-is, because I took this out when testing?
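The manually buffered device -> host -> device variant mentioned above could look roughly like the following sketch, which stages the copy through a pinned host buffer so both halves stay asynchronous. All names are hypothetical, not llama.cpp code:

```cpp
// Sketch, under the assumptions stated above: writeback to the main GPU
// routed explicitly through a pinned host buffer (allocated once via
// cudaMallocHost) instead of relying on peer-to-peer transfers.
#include <cuda_runtime.h>

static void buffered_writeback(
        void * dst, int dst_device,
        const void * src, int src_device,
        void * host_pinned, size_t nbytes,
        cudaStream_t src_stream, cudaStream_t dst_stream,
        cudaEvent_t staged /* created once with cudaEventCreate */) {
    // device -> host on the source device's stream
    cudaSetDevice(src_device);
    cudaMemcpyAsync(host_pinned, src, nbytes, cudaMemcpyDeviceToHost, src_stream);
    cudaEventRecord(staged, src_stream);

    // host -> device on the destination device's stream, ordered after the stage
    cudaSetDevice(dst_device);
    cudaStreamWaitEvent(dst_stream, staged, 0);
    cudaMemcpyAsync(dst, host_pinned, nbytes, cudaMemcpyHostToDevice, dst_stream);
}
```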
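One way such event-based ordering could look (purely illustrative, not the scheme that was actually tried) is to chain the per-device writebacks so each copy to the main GPU starts only after the previous one has finished:

```cpp
// Sketch: serialize writebacks from several devices to the main GPU with
// CUDA events so the transfers do not contend for the link simultaneously.
#include <cuda_runtime.h>

static void ordered_writeback(
        void ** dst_chunks, const void ** src_chunks, const size_t * nbytes,
        const int * devices, int n_devices, int main_device,
        cudaStream_t * streams, cudaEvent_t * done) {
    for (int i = 0; i < n_devices; ++i) {
        cudaSetDevice(devices[i]);
        if (i > 0) {
            // wait until the previous device's copy has completed
            cudaStreamWaitEvent(streams[i], done[i - 1], 0);
        }
        cudaMemcpyPeerAsync(dst_chunks[i], main_device,
                            src_chunks[i], devices[i], nbytes[i], streams[i]);
        cudaEventRecord(done[i], streams[i]);
    }
}
```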
Thanks. I was not aware.
Force-pushed 8649481 to 9ae9b86
I tried an implementation where I used CUDA events to enforce a specific data transfer order, but the performance was still worse than with peer access enabled. I have now done an implementation that, at the beginning of an eval, enables or disables peer access based on batch size. The threshold for peer access can be controlled via a compile option
With NVLink enabled a higher value would probably perform better, but unfortunately CUDA does not seem to let you query NVLink status between devices.
Could this be done without adding a new API? Maybe in
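For readers following along, the per-eval toggle described above amounts to something like the sketch below. The threshold constant and function name are placeholders (the real compile option's name does not survive in this excerpt), and error codes such as "peer access already enabled" are simply cleared and ignored:

```cpp
// Sketch: enable peer access between the main device and the others for
// small batches (token generation), disable it for large batches (prompt
// processing). Placeholder names; not the PR's actual implementation.
#include <cuda_runtime.h>

#ifndef PEER_MAX_BATCH_SIZE
#define PEER_MAX_BATCH_SIZE 128   // placeholder for the compile-time threshold
#endif

static void set_peer_access(int main_device, int n_devices, int n_batch) {
    const bool enable = n_batch <= PEER_MAX_BATCH_SIZE;
    for (int dev = 0; dev < n_devices; ++dev) {
        if (dev == main_device) continue;

        int can_access = 0;
        cudaDeviceCanAccessPeer(&can_access, main_device, dev);
        if (can_access) {
            cudaSetDevice(main_device);
            if (enable) cudaDeviceEnablePeerAccess(dev, 0);
            else        cudaDeviceDisablePeerAccess(dev);
            cudaGetLastError();  // ignore "already enabled" / "not enabled"
        }

        cudaDeviceCanAccessPeer(&can_access, dev, main_device);
        if (can_access) {
            cudaSetDevice(dev);
            if (enable) cudaDeviceEnablePeerAccess(main_device, 0);
            else        cudaDeviceDisablePeerAccess(main_device);
            cudaGetLastError();
        }
    }
}
```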
Force-pushed 9ae9b86 to 212aab2
Force-pushed 212aab2 to 3f2b38f
Force-pushed 3f2b38f to 02540ff
I've been using an n_batch of 512 and it definitely speeds up with NVLink. If it turns peer access off unless the batch is 128 or less, then it would basically never use it. What I am wondering is what this does with the P40s, because I now have 3 of them and peer access never enables between them on Linux (I added a print string for when it succeeds). I only see it on the 3090s.
If you compile with
It could be an issue with the motherboard having to support it, or NUMA affinity for dual-CPU systems; I don't know.
There are at least 2 on one NUMA node. The motherboard is designed for P40 support. That's probably why adding the printf would have been good here, because then you know if any got enabled. In any case, I'll compile it with the high batch size and see what happens and whether it's all as before.
If those 2 are on a different NUMA node than the one set as the main device, peer access can still not be enabled. Only main device <-> other devices peer access is ever enabled.
So then I can either enable NVLink for the 3090s, or put all the P40s on one CPU and set one as main to enable access between them? And peer access won't work between the 3090s and the single P40 via PCIe, despite being on the same NUMA node, because of the generation difference? I just tested this with some print strings and access enables like before when I pumped the max to 4096. This time it turns on at the first generation rather than at initialization.
For some reason, this becomes unbearably slow when using two A6000s from vast.ai. This is not fixed by the follow-up PR #3231. I'm not sure what the reason for this might be. The slowdown happens only when I activate both A6000s. If I set
This is the output from
Is there a way to disable the new feature at runtime? If you need more information, please let me know. I'm using the Docker image
There is no option to disable it at runtime, but you should be able to disable it by compiling with
I ran into the same thing as @sgoll today and managed to track it down to this commit. I'm using my own hardware but have both cards (P40s) passed to a Linux VM instead of using them directly. My guess is that virtualization (Hyper-V, in my case) somehow interferes with the peer access logic. Maybe having multiple CPUs/NUMA nodes also contributes to it. Setting the peer access compile flag to zero fixes it as expected. It seems like a pretty rare edge case, considering this is the only mention of this issue I saw. Still, it might be worth adding a note about it to the
Another edge case is when loading a llama.cpp model after using exllama. Peer access is already enabled and it goes down. But I just restart it.
This PR enables peer access between CUDA devices if possible. As a consequence, devices can communicate directly via PCIe instead of using the CPU as an intermediary. This makes token generation faster:
However, for some reason that I don't yet understand, it also makes prompt processing slightly slower. Peer access makes memory allocation slower, but I don't think that is the cause. In any case, I think that even with the decrease in prompt processing speed this would be a net positive.
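In outline, "enable peer access if possible" looks like the minimal sketch below. It enables access for every device pair for simplicity, whereas the PR reportedly only enables it between the main device and the other devices; the actual ggml CUDA code differs:

```cpp
// Minimal sketch of enabling peer access wherever the hardware allows it.
#include <cuda_runtime.h>

static void enable_peer_access_all(void) {
    int n_devices = 0;
    cudaGetDeviceCount(&n_devices);
    for (int i = 0; i < n_devices; ++i) {
        cudaSetDevice(i);
        for (int j = 0; j < n_devices; ++j) {
            if (i == j) continue;
            int can_access = 0;
            cudaDeviceCanAccessPeer(&can_access, i, j);
            if (can_access) {
                cudaDeviceEnablePeerAccess(j, 0);
                cudaGetLastError();  // ignore "already enabled"
            }
        }
    }
}
```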