
CUDA: enable peer access between devices #2470

Merged
1 commit merged into ggerganov:master on Sep 17, 2023

Conversation

JohannesGaessler
Collaborator

This PR enables peer access between CUDA devices if possible. As a consequence devices can communicate directly via PCIe instead of using the CPU as an intermediary. This makes token generation faster:

| GPU | Model | Test | t/s 1x P40 | t/s master | t/s PR | Speedup |
|---|---|---|---|---|---|---|
| 3x P40 | 7b q4_0 | tg128 | 50.29 | 44.48 | 56.23 | 1.26 |
| 3x P40 | 13b q4_0 | tg128 | 27.86 | 30.32 | 37.46 | 1.24 |
| 3x P40 | 33b q4_0 | tg128 | 12.11 | 15.85 | 18.54 | 1.17 |
| 3x P40 | 70b q6_K | tg128 | - | 7.36 | 8.15 | 1.11 |
| 3x P40 | 7b q4_0 | pp | 703.32 | 384.37 | 365.01 | 0.95 |
| 3x P40 | 13b q4_0 | pp | 384.52 | 243.87 | 224.95 | 0.92 |
| 3x P40 | 33b q4_0 | pp | 158.15 | 121.24 | 111.78 | 0.92 |
| 3x P40 | 70b q6_K | pp | - | 58.15 | 54.86 | 0.94 |

However, for some reason that I don't yet understand, it also makes prompt processing slightly slower. Peer access makes memory allocation slower, but I don't think that is the cause. In any case, I think this would be a net positive even with the decrease in prompt processing speed.
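
(For readers unfamiliar with the CUDA side of this: below is a minimal sketch of what "enabling peer access" means at the API level. It is an illustration of the general pattern, not the exact code in this PR; the function name and error handling are simplified.)

```cpp
#include <cuda_runtime.h>

// Sketch: try to let `device` access memory allocated on `peer_device` directly.
bool try_enable_peer_access(int device, int peer_device) {
    int can_access = 0;
    cudaDeviceCanAccessPeer(&can_access, device, peer_device);
    if (!can_access) {
        return false; // topology, driver, or OS does not allow direct access
    }

    cudaSetDevice(device);
    // After this call, copies and kernels running on `device` can touch memory
    // allocated on `peer_device` directly over PCIe/NVLink, without the driver
    // staging the data through a host buffer.
    const cudaError_t err = cudaDeviceEnablePeerAccess(peer_device, 0);
    return err == cudaSuccess || err == cudaErrorPeerAccessAlreadyEnabled;
}
```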

@JohannesGaessler
Collaborator Author

I forgot to add: this PR would also enable NVLink if someone were to use it (P40s do not have NVLink).

@slaren
Collaborator

slaren commented Jul 31, 2023

I tried this with a 3090 Ti + 3080 on WSL2 and there isn't really a difference; however, I am limited to 7B because it fails with "out of memory" errors for anything bigger than that. I suspect that this is because I cannot enable above-4G decoding with both GPUs on my system. The performance with 7B and the two GPUs is very bad for me, going from 25 ms/token with a single GPU to 170 ms/token with two GPUs.

I wonder if writing the result directly to the memory of the main GPU in the kernel would be faster than the memcpy.

@JohannesGaessler
Collaborator Author

Direct communication via PCIe without NVLink only works on Linux, I think. You should be able to check this via `nvidia-smi topo -p2p r`.

@slaren
Collaborator

slaren commented Jul 31, 2023

Yes, you're right: cudaDeviceCanAccessPeer returns 0. nvidia-smi says CNS = Chipset not supported, which again seems related to the above-4G decoding issue.

The Windows-native version of nvidia-smi doesn't even recognize the argument: ERROR: Option topo is not recognized. Please run 'nvidia-smi -h'. So I guess this is not going to work under Windows regardless.

@JohannesGaessler
Collaborator Author

JohannesGaessler commented Jul 31, 2023

Some profiling data for perplexity calculations using 7b q4_0:

| CUDA API call | Total (no peer access) | Per call (no peer access) | Total (with peer access) | Per call (with peer access) |
|---|---|---|---|---|
| cudaMemcpyAsync | 11.798 s | 2.362 ms | 41.563 ms | 8.322 µs |
| cudaLaunchKernel | 546.796 ms | 8.340 µs | 7.537 s | 114.965 µs |
| cudaMemcpy2DAsync | 261.229 ms | 9.503 µs | 5.234 s | 190.392 µs |

It seems that enabling peer access makes cudaMemcpyAsync much faster but it also makes cudaMemcpy2DAsync slower and greatly increases kernel launch overhead.

@Ph0rk0z

Ph0rk0z commented Aug 22, 2023

On 2x 3090 it improved top t/s beyond exllama; I have actual NVLink: 13 t/s in textgen vs 10 t/s. For prompt processing, the slowdown stopped being noticeable on at least mid-length contexts (1-2k). Time to first token was making me want to go back to GPTQ.

I did not try splitting a model across all 3 of my cards yet post-change. I've got the exact same Xeon v4 as you, but all layers are offloaded.

@JohannesGaessler
Collaborator Author

I currently don't have GGUF models larger than 7b ready to re-test performance, but for 7b the relative performance after rebasing seems to be the same.

@Ph0rk0z

Ph0rk0z commented Aug 27, 2023

I was waiting on the llama Python bindings to catch up to test the new implementation.

When I ran 70b Q6 I added a print to show that peering was enabled. It enabled for the 3090s but not the P40. Still, the speed for a model split across 3 cards of 2 different generations isn't bad. This cannot be done with GPTQ without 3x 3090.

@ghost

ghost commented Aug 30, 2023

Greetings. From my understanding of this thread, the implementation slows down prompt processing but does in fact enable loading larger models into pooled VRAM. IMO this is a valuable feature; maybe make the merge depend on a CMake flag?

@cebtenzzre
Collaborator

> does in fact enable loading larger models into pooled VRAM

Multi-GPU is already supported by llama.cpp. In this PR's current state, it's entirely a performance tradeoff between token generation and prompt processing when using multiple GPUs.

@JohannesGaessler
Collaborator Author

As I said in the issue: the proper way to implement this would be to not just enable peer access but to then also adapt ggml_cuda_op. If this also fixes the prompt processing speed, we don't need to add a compile option and can just enable peer access unconditionally if it is supported. I want to do this at some point, but since you said "Willing and able to implement anything required to enabled pooled VRAM. Looking for direction and suggested starting points to generate the PR." I would also be fine with letting you take a crack at it.

@ghost

ghost commented Aug 30, 2023

I would have to spin up on this specifically (probably worth doing anyway?). If you already know what to do, by all means it's yours.

@Ph0rk0z

Ph0rk0z commented Aug 30, 2023

I do not get any kind of prompt processing slowdown with proper NVLink. It just beats exllama, simple as that. There is no tradeoff for me with fully offloaded models that I can tell. I'd rather get 17 t/s than 10 t/s. If there is a slowdown, it will happen to 220+ t/s prompt processing, and I'll take it.

I am really unsure why people think there is pooled memory. Even with NVLink it's not pooled; the cards simply skip the step of sending anything back to the CPU and communicate directly.

@ghost

ghost commented Aug 30, 2023

I'm a noob here. How do you load the 70B model with full CUDA offload to two 3090 Ti? What hardware is required to upgrade from a system with one 3090 Ti to match your results? Is NVLink not needed?

@Ph0rk0z

Ph0rk0z commented Aug 30, 2023

I just tell it to offload 83 layers and then set the memory split, like 41,42.

You don't need NVLink, but you'll get the 10 t/s rather than more. You definitely need 48 GB, and more for quants above Q4_K_M. The highest I ran is Q6, but that requires using the P40, and then I only get 7 t/s.

@ghost

ghost commented Aug 31, 2023

What would be valuable additions if I requested access to an A100? How can I write a proposal for access based on this repo? I'm curious whether it's worth my time to try to get access to the big-iron cluster or to just add another card to my personal system.

@Ph0rk0z

Ph0rk0z commented Aug 31, 2023

Which A100? If it's the 40 GB one, you will come up short.

@ghost

ghost commented Aug 31, 2023

Hoping to program Ascend for something entertaining: https://www.osc.edu/resources/technical_support/supercomputers/ascend

@Ph0rk0z

Ph0rk0z commented Aug 31, 2023

You can train quite well there. I would leave inference to 2x 3090 setups. Training 70b on only 48 GB is slow, even as Q4.

@ghost

ghost commented Aug 31, 2023

So would you recommend that the proposal's output be new model weights after training the base 70B model with more data? I'm trying to envision why a project about llama.cpp would get funded.

@Ph0rk0z

Ph0rk0z commented Aug 31, 2023

I don't know what proposal you could write. If you had an idea for a dataset to produce some kind of model I'm sure they would give you compute time. But just to run inference I think you are better off buying another card.

Off the top of my head I would combine teatime+todd proxy outputs with first cleaned orca, guanaco and then airoboros to see which one would work better. But I doubt someone would fund my pet anti-alignment projects.

@JohannesGaessler
Collaborator Author

So I tried an implementation where GPUs directly read from/write to the VRAM of other GPUs, but as it turns out that is slower than copying in one batch at the beginning or the end. So I think the way to go is to add a toggle (with peer access enabled being the default).

More generally, it seems that the biggest source of kernel launch overhead is the cuBLAS GEMM for the KV cache (due to the many attention heads). For token generation the overhead is much lower because the kernels I wrote can process all attention heads at once and I think that's why enabling peer access makes it faster.

@Ph0rk0z

Ph0rk0z commented Sep 1, 2023

You did something right, because exllama also enables peer access and it's slower.

@JohannesGaessler
Collaborator Author

After #3110 I finally understand the prompt processing performance regression from enabling peer access. The problem is that by default the data goes device -> host -> device with buffering on the host. If you enable peer access the data goes directly device -> device with no buffering, so with 2+ GPUs communicating with the main device they can block each other's data transfers and the performance goes down.

Unfortunately you apparently cannot just toggle peer-to-peer data transfer on a per-API-call basis, though. The proper way to fix this would be to reduce the bandwidth needed for data transfers (e.g. by transferring f16 or q8_0) or by manually implementing a buffered device -> host -> device data transfer for the writeback to the main GPU. But for now I think we can add a function that sets peer access depending on batch size and NVLink availability and call it at the beginning of each eval.
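
(To illustrate the two transfer paths being discussed: the sketch below uses cudaMemcpyPeerAsync as a stand-in for the writeback copy. The actual ggml code uses cudaMemcpyAsync/cudaMemcpy2DAsync, but the behaviour with respect to peer access is the same.)

```cpp
#include <cuda_runtime.h>

// Sketch of the writeback to the main GPU discussed above.
// - Peer access disabled: the driver stages the copy through a host buffer
//   (device -> host -> device).
// - Peer access enabled: the copy goes directly device -> device over
//   PCIe/NVLink, so several GPUs writing back to the main device at once
//   can block each other's transfers.
void write_back_to_main_gpu(void * dst, int main_device,
                            const void * src, int src_device,
                            size_t size, cudaStream_t stream) {
    cudaMemcpyPeerAsync(dst, main_device, src, src_device, size, stream);
}
```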

@JohannesGaessler
Collaborator Author

> The proper way to fix this would be to reduce the bandwidth needed for data transfers (e.g. by transferring f16 or q8_0) or by manually implementing a buffered device -> host -> device data transfer for the writeback to the main GPU.

Actually, now that I think about it, it may be possible to fix the performance by reordering the data transfers via CUDA events.
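
(A sketch of what "reordering the data transfers via CUDA events" could look like: one secondary GPU's stream waits on an event recorded by another, so the writebacks to the main GPU are serialized instead of competing for the link. All names are illustrative; this is not the implementation that was eventually tested.)

```cpp
#include <cuda_runtime.h>

// Sketch: serialize the writeback copies of two secondary GPUs with an event
// so that they do not compete for the link to the main device at the same time.
void ordered_writebacks(void * dst0, const void * src0, cudaStream_t stream_dev1,
                        void * dst1, const void * src1, cudaStream_t stream_dev2,
                        size_t size, cudaEvent_t copy_done) {
    // First writeback on device 1's stream, then record an event on that stream.
    cudaMemcpyAsync(dst0, src0, size, cudaMemcpyDeviceToDevice, stream_dev1);
    cudaEventRecord(copy_done, stream_dev1);

    // Device 2's stream starts its copy only after device 1's copy has finished.
    cudaStreamWaitEvent(stream_dev2, copy_done, 0);
    cudaMemcpyAsync(dst1, src1, size, cudaMemcpyDeviceToDevice, stream_dev2);
}
```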

@Ph0rk0z

Ph0rk0z commented Sep 10, 2023

Won't this do nothing as-is because of the #ifdef NDEBUG?

I took this out when testing.

@JohannesGaessler
Collaborator Author

NDEBUG is defined unless you enable debugging, so by default it should do something.
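
(For anyone who wants to check which branch their build takes, here is a tiny standalone test: in a release build NDEBUG is defined and assert() is disabled, so code guarded by #ifdef NDEBUG is compiled in.)

```cpp
#include <cstdio>

int main() {
#ifdef NDEBUG
    // Release build: NDEBUG is defined, so guarded code like the peer access
    // setup in this PR is compiled in.
    printf("NDEBUG is defined: the peer access code path is active\n");
#else
    // Debug build: NDEBUG is not defined and the guarded code is skipped.
    printf("NDEBUG is not defined: the peer access code path is skipped\n");
#endif
    return 0;
}
```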

@Ph0rk0z

Ph0rk0z commented Sep 10, 2023

Thanks. I was not aware.

@JohannesGaessler
Collaborator Author

I tried an implementation where I used CUDA events to enforce a specific data transfer order, but the performance was still worse than with peer access enabled. I now did an implementation that, at the beginning of an eval, enables or disables peer access based on batch size. The threshold for peer access can be controlled via the compile option LLAMA_CUDA_PEER_MAX_BATCH_SIZE and the default is 128. I am basing this value on the following results that I got with LLAMA_CUDA_PEER_MAX_BATCH_SIZE=512:

| model | test | n_batch | t/s master | t/s PR | Speedup |
|---|---|---|---|---|---|
| LLaMA 7B mostly Q4_0 | pp 512 | 1 | 49.21 ± 0.04 | 67.04 ± 0.11 | 1.36 |
| LLaMA 7B mostly Q4_0 | pp 512 | 2 | 31.71 ± 0.01 | 35.55 ± 0.01 | 1.12 |
| LLaMA 7B mostly Q4_0 | pp 512 | 4 | 61.18 ± 0.01 | 68.49 ± 0.02 | 1.12 |
| LLaMA 7B mostly Q4_0 | pp 512 | 8 | 108.74 ± 0.02 | 119.72 ± 0.04 | 1.10 |
| LLaMA 7B mostly Q4_0 | pp 512 | 16 | 193.34 ± 0.08 | 214.39 ± 0.09 | 1.11 |
| LLaMA 7B mostly Q4_0 | pp 512 | 32 | 321.10 ± 0.06 | 354.31 ± 0.22 | 1.10 |
| LLaMA 7B mostly Q4_0 | pp 512 | 64 | 478.07 ± 0.16 | 524.90 ± 0.23 | 1.10 |
| LLaMA 7B mostly Q4_0 | pp 512 | 128 | 588.65 ± 0.21 | 638.28 ± 0.28 | 1.08 |
| LLaMA 7B mostly Q4_0 | pp 512 | 256 | 744.01 ± 0.43 | 595.75 ± 0.06 | 0.80 |
| LLaMA 7B mostly Q4_0 | pp 512 | 512 | 904.99 ± 0.37 | 708.86 ± 0.62 | 0.78 |
| LLaMA 7B mostly Q4_0 | tg 128 | 1 | 52.95 ± 0.02 | 71.53 ± 0.07 | 1.35 |

With NVLink enabled, a higher value would probably perform better, but unfortunately CUDA does not seem to let you query the NVLink status between devices.
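
(A sketch of the per-eval toggle described above, with the threshold supplied at compile time. The macro name matches the compile option; the function name and structure are illustrative rather than the exact code in the PR.)

```cpp
#include <cuda_runtime.h>

#ifndef LLAMA_CUDA_PEER_MAX_BATCH_SIZE
#define LLAMA_CUDA_PEER_MAX_BATCH_SIZE 128 // default threshold described above
#endif

// Sketch: called at the beginning of each eval. Peer access between the main
// device and every other device is enabled for small batches (token generation)
// and disabled for large prompt batches.
void set_peer_access(int n_tokens, int main_device, int device_count) {
    const bool enable = n_tokens <= LLAMA_CUDA_PEER_MAX_BATCH_SIZE;

    for (int id = 0; id < device_count; ++id) {
        if (id == main_device) {
            continue;
        }

        int can_access = 0;
        cudaDeviceCanAccessPeer(&can_access, id, main_device);
        if (!can_access) {
            continue;
        }

        cudaSetDevice(id);
        if (enable) {
            cudaDeviceEnablePeerAccess(main_device, 0);  // may report "already enabled"
        } else {
            cudaDeviceDisablePeerAccess(main_device);    // may report "not enabled"
        }
    }
}
```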

@slaren
Collaborator

slaren commented Sep 16, 2023

Could this be done without adding a new API? Maybe in ggml_cuda_mul_mat by checking the dimensions of the tensors.

@Ph0rk0z

Ph0rk0z commented Sep 17, 2023

I've been using an n_batch of 512 and it definitely speeds up with NVLink. If it turns peer access off unless the batch is 128 or less, then it would basically never be used.

What I am wondering is what this does with P40s, because I now have 3 of them and peer access never enables between them on Linux (I added a print to show when it succeeded). I only see it on the 3090s.

JohannesGaessler merged commit 111163e into ggerganov:master on Sep 17, 2023
33 checks passed
@JohannesGaessler
Collaborator Author

> I've been using an n_batch of 512 and it definitely speeds up with NVLink. If it turns peer access off unless the batch is 128 or less, then it would basically never be used.

If you compile with LLAMA_CUDA_PEER_MAX_BATCH_SIZE=4096, NVLink should still be used on master. As I said, there unfortunately does not seem to be an easy way to query whether or not 2 GPUs are connected via NVLink, so I chose the default for what I think is the more common case, that people are running Linux.

> What I am wondering is what this does with P40s, because I now have 3 of them and peer access never enables between them on Linux (I added a print to show when it succeeded). I only see it on the 3090s.

It could be an issue with the motherboard having to support it, or NUMA affinity on dual-CPU systems; I don't know.

@Ph0rk0z

Ph0rk0z commented Sep 17, 2023

There are at least 2 on one NUMA node. The motherboard is designed for P40 support. That's probably why adding the printf would have been good here, because then you know if any got enabled. In any case, I'll compile it with the high batch size and see what happens and whether it's all as before.

@JohannesGaessler
Collaborator Author

> There are at least 2 on one NUMA node.

If those 2 are on a different NUMA node than the one set as main device, peer access still cannot be enabled. Only main device <-> other device peer access is ever enabled.

@Ph0rk0z

Ph0rk0z commented Sep 17, 2023

So then I can either enable NVLink for the 3090s, or put all the P40s on one CPU and set one of them as main to enable access between them?

And peer access won't work between the 3090s and the single P40 via PCIe, despite being on the same NUMA node, because of the generation difference?

I just tested this with some print statements, and access enables like before when I bumped the max to 4096. This time it turns on at the first generation rather than at initialization.

@sgoll

sgoll commented Sep 22, 2023

For some reason, this becomes unbearably slow when using two A6000s from vast.ai. This is not fixed by the follow-up PR #3231. I'm not sure what the reason for this might be.[^1]

The slowdown happens only when I activate both A6000s. If I set CUDA_VISIBLE_DEVICES to run only on a single card, everything seems fine. If I go back to a commit before this PR was merged, everything is fine as well.

This is the output from nvidia-smi:

```
$ nvidia-smi topo -p2p r
        GPU0    GPU1
 GPU0   X       OK
 GPU1   OK      X

Legend:

  X    = Self
  OK   = Status Ok
  CNS  = Chipset not supported
  GNS  = GPU not supported
  TNS  = Topology not supported
  NS   = Not supported
  U    = Unknown

$ nvidia-smi nvlink -s
GPU 0: NVIDIA RTX A6000 (UUID: GPU-a1d15108-c4ec-4e7a-1641-ffdf3129e846)
NVML: Unable to retrieve NVLink information as all links are inActive
GPU 1: NVIDIA RTX A6000 (UUID: GPU-39c0997f-dc94-76d3-4c10-19dde04dcced)
NVML: Unable to retrieve NVLink information as all links are inActive
```

Is there a way to disable the new feature at runtime? If you need more information, please let me know. I'm using the Docker image nvidia/cuda:12.2.0-devel-ubuntu22.04.

Footnotes

[^1]: By unbearably slow, I mean inference goes from dozens of tokens per second (on a 70B model) down to one token every 30 seconds!

@JohannesGaessler
Collaborator Author

There is no option to disable it at runtime, but you should be able to disable it by compiling with LLAMA_CUDA_PEER_MAX_BATCH_SIZE=0.

pkrmf pushed a commit to morlockstudios-com/llama.cpp that referenced this pull request Sep 26, 2023
@city96

city96 commented Oct 18, 2023

I ran into the same thing as @sgoll today and managed to track it down to this commit. I'm using my own hardware, but have both cards (P40s) passed through to a Linux VM instead of using them directly. My guess is that virtualization (Hyper-V, in my case) somehow interferes with the peer access logic. Maybe having multiple CPUs/NUMA nodes also contributes to it. Setting the peer access compile flag to zero fixes it as expected.

It seems like a pretty rare edge case, considering this is the only mention of this issue I've seen. Still, it might be worth adding a note about it to the LLAMA_CUDA_PEER_MAX_BATCH_SIZE line of the README. Just my 2c.

@Ph0rk0z

Ph0rk0z commented Oct 19, 2023

Another edge case is loading a llama.cpp model after using exllama: peer access is already enabled and performance goes down. But I just restart it.
