
CUDA: enable peer access between devices #2470

Merged
1 commit merged into ggerganov:master on Sep 17, 2023

Conversation

JohannesGaessler
Collaborator

This PR enables peer access between CUDA devices if possible. As a consequence devices can communicate directly via PCIe instead of using the CPU as an intermediary. This makes token generation faster:

| GPU | Model | Test | t/s 1x P40 | t/s master | t/s PR | Speedup |
|---|---|---|---|---|---|---|
| 3x P40 | 7b q4_0 | tg128 | 50.29 | 44.48 | 56.23 | 1.26 |
| 3x P40 | 13b q4_0 | tg128 | 27.86 | 30.32 | 37.46 | 1.24 |
| 3x P40 | 33b q4_0 | tg128 | 12.11 | 15.85 | 18.54 | 1.17 |
| 3x P40 | 70b q6_K | tg128 | - | 7.36 | 8.15 | 1.11 |
| 3x P40 | 7b q4_0 | pp | 703.32 | 384.37 | 365.01 | 0.95 |
| 3x P40 | 13b q4_0 | pp | 384.52 | 243.87 | 224.95 | 0.92 |
| 3x P40 | 33b q4_0 | pp | 158.15 | 121.24 | 111.78 | 0.92 |
| 3x P40 | 70b q6_K | pp | - | 58.15 | 54.86 | 0.94 |

However, for some reason that I don't yet understand, it also makes prompt processing slightly slower. Peer access makes memory allocation slower, but I don't think that is the cause. In any case, I think this would be a net positive even with the decrease in prompt processing speed.
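
(For readers unfamiliar with the CUDA side of this: below is a minimal sketch of what "enabling peer access" means at the API level. It is an illustration of the general pattern, not the exact code in this PR; the function name and error handling are simplified.)

```cpp
#include <cuda_runtime.h>

// Sketch: try to let `device` access memory allocated on `peer_device` directly.
bool try_enable_peer_access(int device, int peer_device) {
    int can_access = 0;
    cudaDeviceCanAccessPeer(&can_access, device, peer_device);
    if (!can_access) {
        return false; // topology, driver, or OS does not allow direct access
    }

    cudaSetDevice(device);
    // After this call, copies and kernels running on `device` can touch memory
    // allocated on `peer_device` directly over PCIe/NVLink, without the driver
    // staging the data through a host buffer.
    const cudaError_t err = cudaDeviceEnablePeerAccess(peer_device, 0);
    return err == cudaSuccess || err == cudaErrorPeerAccessAlreadyEnabled;
}
```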

@JohannesGaessler
Collaborator Author

I forgot to add: this PR would also enable NVLink if someone were to use it (P40s do not have NVLink).

@slaren
Collaborator

slaren commented Jul 31, 2023

I tried this with a 3090 Ti + 3080 on WSL2 and there isn't really a difference; however, I am limited to 7B because it fails with "out of memory" errors for anything bigger than that. I suspect that this is because I cannot enable above-4G decoding with both GPUs on my system. The performance with 7B and the two GPUs is very bad for me, going from 25 ms/token with a single GPU to 170 ms/token with two GPUs.

I wonder if writing the result directly to the memory of the main GPU in the kernel would be faster than the memcpy.

@JohannesGaessler
Collaborator Author

Direct communication via PCIe without NVLink only works on Linux, I think. You should be able to check this via `nvidia-smi topo -p2p r`.

@slaren
Collaborator

slaren commented Jul 31, 2023

Yes, you're right: cudaDeviceCanAccessPeer returns 0. nvidia-smi says CNS = Chipset not supported, which again seems related to the above-4G decoding issue.

The Windows-native version of nvidia-smi doesn't even recognize the argument: ERROR: Option topo is not recognized. Please run 'nvidia-smi -h'. So I guess this is not going to work under Windows regardless.

@JohannesGaessler
Collaborator Author

JohannesGaessler commented Jul 31, 2023

Some profiling data for perplexity calculations using 7b q4_0:

| CUDA API call | Total (no peer access) | Per call (no peer access) | Total (with peer access) | Per call (with peer access) |
|---|---|---|---|---|
| cudaMemcpyAsync | 11.798 s | 2.362 ms | 41.563 ms | 8.322 µs |
| cudaLaunchKernel | 546.796 ms | 8.340 µs | 7.537 s | 114.965 µs |
| cudaMemcpy2DAsync | 261.229 ms | 9.503 µs | 5.234 s | 190.392 µs |

It seems that enabling peer access makes cudaMemcpyAsync much faster but it also makes cudaMemcpy2DAsync slower and greatly increases kernel launch overhead.

@Ph0rk0z

Ph0rk0z commented Aug 22, 2023

On 2x 3090 it improved top t/s beyond exllama; I have actual NVLink: 13 t/s in textgen vs 10 t/s. For prompt processing, the slowdown stopped being noticeable on at least mid-length contexts (1-2k). Time to first token was making me want to go back to GPTQ.

I did not try splitting a model across all 3 of my cards yet post-change. I've got the exact same Xeon v4 as you, but all layers are offloaded.

@JohannesGaessler
Collaborator Author

I currently don't have GGUF models larger than 7b ready to re-test performance, but for 7b the relative performance after rebasing seems to be the same.

@Ph0rk0z

Ph0rk0z commented Aug 27, 2023

I was waiting on the llama Python bindings to catch up to test the new implementation.

When I ran 70b Q6 I added a print to show that peering was enabled. It enabled for the 3090s but not the P40. Still, the speed for a model split across 3 cards of 2 different generations isn't bad. This cannot be done with GPTQ without 3x 3090.

@ghost

ghost commented Aug 30, 2023

Greetings. From my understanding of this thread, the implementation slows down prompt processing but does in fact enable loading larger models into pooled VRAM. IMO this is a valuable feature; maybe make the merge depend on a CMake flag?

@cebtenzzre
Collaborator

> does in fact enable loading larger models into pooled VRAM

Multi-GPU is already supported by llama.cpp. In this PR's current state, it's entirely a performance tradeoff between token generation and prompt processing when using multiple GPUs.

@JohannesGaessler
Collaborator Author

As I said in the issue: the proper way to implement this would be to not just enable peer access but to then also adapt ggml_cuda_op. If this also fixes the prompt processing speed, we don't need to add a compile option and can just enable peer access unconditionally if it is supported. I want to do this at some point, but since you said "Willing and able to implement anything required to enabled pooled VRAM. Looking for direction and suggested starting points to generate the PR." I would also be fine with letting you take a crack at it.

@ghost

ghost commented Aug 30, 2023

I would have to spin up on this specifically (probably worth doing anyway?). If you already know what to do, by all means it's yours.

@Ph0rk0z

Ph0rk0z commented Aug 30, 2023

I do not get any kind of prompt processing slowdown with proper NVLink. It just beats exllama, simple as that. There is no tradeoff for me with fully offloaded models that I can tell. I'd rather get 17 t/s than 10 t/s. If there is a slowdown, it will happen to 220+ t/s prompt processing, and I'll take it.

I am really unsure why people think there is pooled memory. Even with NVLink it's not pooled; the cards simply skip the step of sending anything back to the CPU and communicate directly.

@ghost

ghost commented Aug 30, 2023

I'm a noob here. How do you load the 70B model with full CUDA offload to two 3090 Ti? What hardware is required to upgrade from a system with one 3090 Ti to match your results? Is NVLink not needed?

@Ph0rk0z

Ph0rk0z commented Aug 30, 2023

I just tell it to offload 83 layers and then set the memory split, like 41,42.

You don't need NVLink, but you'll get the 10 t/s rather than more. You definitely need 48 GB, and more for quants above Q4_K_M. The highest I ran is Q6, but that requires using the P40, and then I only get 7 t/s.

@ghost

ghost commented Aug 31, 2023

What would be valuable additions if I requested access to an A100? How can I write a proposal for access based on this repo? I'm curious whether it's worth my time to try to get access to the big-iron cluster or to just add another card to my personal system.

@Ph0rk0z

Ph0rk0z commented Aug 31, 2023

Which A100? If it's the 40 GB one, you will come up short.

@ghost

ghost commented Aug 31, 2023

Hoping to program Ascend for something entertaining: https://www.osc.edu/resources/technical_support/supercomputers/ascend

@Ph0rk0z

Ph0rk0z commented Aug 31, 2023

You can train quite well there. I would leave inference to 2x 3090 setups. Training 70b on only 48 GB is slow, even as Q4.

@ghost

ghost commented Aug 31, 2023

So would you recommend that the proposal's output be new model weights after training the base 70B model with more data? I'm trying to envision why a project about llama.cpp would get funded.

@Ph0rk0z

Ph0rk0z commented Aug 31, 2023

I don't know what proposal you could write. If you had an idea for a dataset to produce some kind of model I'm sure they would give you compute time. But just to run inference I think you are better off buying another card.

Off the top of my head I would combine teatime+todd proxy outputs with first cleaned orca, guanaco and then airoboros to see which one would work better. But I doubt someone would fund my pet anti-alignment projects.

@JohannesGaessler
Collaborator Author

So I tried an implementation where GPUs directly read from/write to the VRAM of other GPUs, but as it turns out that is slower than copying in one batch at the beginning or the end. So I think the way to go is to add a toggle (with peer access enabled being the default).

More generally, it seems that the biggest source of kernel launch overhead is the cuBLAS GEMM for the KV cache (due to the many attention heads). For token generation the overhead is much lower because the kernels I wrote can process all attention heads at once and I think that's why enabling peer access makes it faster.

@Ph0rk0z

Ph0rk0z commented Sep 1, 2023

You did something right, because exllama also enables peer access and it's slower.

@JohannesGaessler
Collaborator Author

After #3110 I finally understand the prompt processing performance regression from enabling peer access. The problem is that by default the data goes device -> host -> device with buffering on the host. If you enable peer access the data goes directly device -> device with no buffering, so with 2+ GPUs communicating with the main device they can block each other's data transfers and the performance goes down.

Unfortunately you apparently cannot just toggle peer-to-peer data transfer on a per-API-call basis, though. The proper way to fix this would be to reduce the bandwidth needed for data transfers (e.g. by transferring f16 or q8_0) or by manually implementing a buffered device -> host -> device data transfer for the writeback to the main GPU. But for now I think we can add a function that sets peer access depending on batch size and NVLink availability and call it at the beginning of each eval.
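
(To illustrate the two transfer paths being discussed: the sketch below uses cudaMemcpyPeerAsync as a stand-in for the writeback copy. The actual ggml code uses cudaMemcpyAsync/cudaMemcpy2DAsync, but the behaviour with respect to peer access is the same.)

```cpp
#include <cuda_runtime.h>

// Sketch of the writeback to the main GPU discussed above.
// - Peer access disabled: the driver stages the copy through a host buffer
//   (device -> host -> device).
// - Peer access enabled: the copy goes directly device -> device over
//   PCIe/NVLink, so several GPUs writing back to the main device at once
//   can block each other's transfers.
void write_back_to_main_gpu(void * dst, int main_device,
                            const void * src, int src_device,
                            size_t size, cudaStream_t stream) {
    cudaMemcpyPeerAsync(dst, main_device, src, src_device, size, stream);
}
```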

@JohannesGaessler
Collaborator Author

> The proper way to fix this would be to reduce the bandwidth needed for data transfers (e.g. by transferring f16 or q8_0) or by manually implementing a buffered device -> host -> device data transfer for the writeback to the main GPU.

Actually, now that I think about it, it may be possible to fix the performance by reordering the data transfers via CUDA events.
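
(A sketch of what "reordering the data transfers via CUDA events" could look like: one secondary GPU's stream waits on an event recorded by another, so the writebacks to the main GPU are serialized instead of competing for the link. All names are illustrative; this is not the implementation that was eventually tested.)

```cpp
#include <cuda_runtime.h>

// Sketch: serialize the writeback copies of two secondary GPUs with an event
// so that they do not compete for the link to the main device at the same time.
void ordered_writebacks(void * dst0, const void * src0, cudaStream_t stream_dev1,
                        void * dst1, const void * src1, cudaStream_t stream_dev2,
                        size_t size, cudaEvent_t copy_done) {
    // First writeback on device 1's stream, then record an event on that stream.
    cudaMemcpyAsync(dst0, src0, size, cudaMemcpyDeviceToDevice, stream_dev1);
    cudaEventRecord(copy_done, stream_dev1);

    // Device 2's stream starts its copy only after device 1's copy has finished.
    cudaStreamWaitEvent(stream_dev2, copy_done, 0);
    cudaMemcpyAsync(dst1, src1, size, cudaMemcpyDeviceToDevice, stream_dev2);
}
```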

@Ph0rk0z

Ph0rk0z commented Sep 10, 2023

Won't this do nothing as-is because of the #ifdef NDEBUG?

I took this out when testing.

@JohannesGaessler
Collaborator Author

NDEBUG is defined unless you enable debugging, so by default it should do something.
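
(For anyone who wants to check which branch their build takes, here is a tiny standalone test: in a release build NDEBUG is defined and assert() is disabled, so code guarded by #ifdef NDEBUG is compiled in.)

```cpp
#include <cstdio>

int main() {
#ifdef NDEBUG
    // Release build: NDEBUG is defined, so guarded code like the peer access
    // setup in this PR is compiled in.
    printf("NDEBUG is defined: the peer access code path is active\n");
#else
    // Debug build: NDEBUG is not defined and the guarded code is skipped.
    printf("NDEBUG is not defined: the peer access code path is skipped\n");
#endif
    return 0;
}
```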

@Ph0rk0z

Ph0rk0z commented Sep 10, 2023

Thanks. I was not aware.

@JohannesGaessler
Collaborator Author

I tried an implementation where I used CUDA events to enforce a specific data transfer order, but the performance was still worse than with peer access enabled. I now did an implementation that, at the beginning of an eval, enables or disables peer access based on batch size. The threshold for peer access can be controlled via the compile option LLAMA_CUDA_PEER_MAX_BATCH_SIZE and the default is 128. I am basing this value on the following results that I got with LLAMA_CUDA_PEER_MAX_BATCH_SIZE=512:

| model | test | n_batch | t/s master | t/s PR | Speedup |
|---|---|---|---|---|---|
| LLaMA 7B mostly Q4_0 | pp 512 | 1 | 49.21 ± 0.04 | 67.04 ± 0.11 | 1.36 |
| LLaMA 7B mostly Q4_0 | pp 512 | 2 | 31.71 ± 0.01 | 35.55 ± 0.01 | 1.12 |
| LLaMA 7B mostly Q4_0 | pp 512 | 4 | 61.18 ± 0.01 | 68.49 ± 0.02 | 1.12 |
| LLaMA 7B mostly Q4_0 | pp 512 | 8 | 108.74 ± 0.02 | 119.72 ± 0.04 | 1.10 |
| LLaMA 7B mostly Q4_0 | pp 512 | 16 | 193.34 ± 0.08 | 214.39 ± 0.09 | 1.11 |
| LLaMA 7B mostly Q4_0 | pp 512 | 32 | 321.10 ± 0.06 | 354.31 ± 0.22 | 1.10 |
| LLaMA 7B mostly Q4_0 | pp 512 | 64 | 478.07 ± 0.16 | 524.90 ± 0.23 | 1.10 |
| LLaMA 7B mostly Q4_0 | pp 512 | 128 | 588.65 ± 0.21 | 638.28 ± 0.28 | 1.08 |
| LLaMA 7B mostly Q4_0 | pp 512 | 256 | 744.01 ± 0.43 | 595.75 ± 0.06 | 0.80 |
| LLaMA 7B mostly Q4_0 | pp 512 | 512 | 904.99 ± 0.37 | 708.86 ± 0.62 | 0.78 |
| LLaMA 7B mostly Q4_0 | tg 128 | 1 | 52.95 ± 0.02 | 71.53 ± 0.07 | 1.35 |

With NVLink enabled, a higher value would probably perform better, but unfortunately CUDA does not seem to let you query the NVLink status between devices.
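
(A sketch of the per-eval toggle described above, with the threshold supplied at compile time. The macro name matches the compile option; the function name and structure are illustrative rather than the exact code in the PR.)

```cpp
#include <cuda_runtime.h>

#ifndef LLAMA_CUDA_PEER_MAX_BATCH_SIZE
#define LLAMA_CUDA_PEER_MAX_BATCH_SIZE 128 // default threshold described above
#endif

// Sketch: called at the beginning of each eval. Peer access between the main
// device and every other device is enabled for small batches (token generation)
// and disabled for large prompt batches.
void set_peer_access(int n_tokens, int main_device, int device_count) {
    const bool enable = n_tokens <= LLAMA_CUDA_PEER_MAX_BATCH_SIZE;

    for (int id = 0; id < device_count; ++id) {
        if (id == main_device) {
            continue;
        }

        int can_access = 0;
        cudaDeviceCanAccessPeer(&can_access, id, main_device);
        if (!can_access) {
            continue;
        }

        cudaSetDevice(id);
        if (enable) {
            cudaDeviceEnablePeerAccess(main_device, 0);  // may report "already enabled"
        } else {
            cudaDeviceDisablePeerAccess(main_device);    // may report "not enabled"
        }
    }
}
```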

@slaren
Collaborator

slaren commented Sep 16, 2023

Could this be done without adding a new API? Maybe in ggml_cuda_mul_mat by checking the dimensions of the tensors.

@Ph0rk0z

Ph0rk0z commented Sep 17, 2023

I've been using an n_batch of 512 and it definitely speeds up with NVLink. If it turns peer access off unless the batch is 128 or less, then it would basically never be used.

What I am wondering is what this does with P40s, because I now have 3 of them and peer access never enables between them on Linux (I added a print to show when it succeeded). I only see it on the 3090s.

JohannesGaessler merged commit 111163e into ggerganov:master on Sep 17, 2023
33 checks passed
@JohannesGaessler
Collaborator Author

> I've been using an n_batch of 512 and it definitely speeds up with NVLink. If it turns peer access off unless the batch is 128 or less, then it would basically never be used.

If you compile with LLAMA_CUDA_PEER_MAX_BATCH_SIZE=4096, NVLink should still be used on master. As I said, there unfortunately does not seem to be an easy way to query whether or not 2 GPUs are connected via NVLink, so I chose the default for what I think is the more common case, that people are running Linux.

> What I am wondering is what this does with P40s, because I now have 3 of them and peer access never enables between them on Linux (I added a print to show when it succeeded). I only see it on the 3090s.

It could be an issue with the motherboard having to support it, or NUMA affinity on dual-CPU systems; I don't know.

@Ph0rk0z

Ph0rk0z commented Sep 17, 2023

There are at least 2 on one NUMA node. The motherboard is designed for P40 support. That's probably why adding the printf would have been good here, because then you know if any got enabled. In any case, I'll compile it with the high batch size and see what happens and whether it's all as before.

@JohannesGaessler
Collaborator Author

> There are at least 2 on one NUMA node.

If those 2 are on a different NUMA node than the one set as main device, peer access still cannot be enabled. Only main device <-> other device peer access is ever enabled.

@Ph0rk0z

Ph0rk0z commented Sep 17, 2023

So then I can either enable NVLink for the 3090s, or put all the P40s on one CPU and set one of them as main to enable access between them?

And peer access won't work between the 3090s and the single P40 via PCIe, despite being on the same NUMA node, because of the generation difference?

I just tested this with some print statements, and access enables like before when I bumped the max to 4096. This time it turns on at the first generation rather than at initialization.

@sgoll

sgoll commented Sep 22, 2023

For some reason, this becomes unbearably slow when using two A6000s from vast.ai. This is not fixed by the follow-up PR #3231. I'm not sure what the reason for this might be.[^1]

The slowdown happens only when I activate both A6000s. If I set CUDA_VISIBLE_DEVICES to run only on a single card, everything seems fine. If I go back to a commit before this PR was merged, everything is fine as well.

This is the output from nvidia-smi:

```
$ nvidia-smi topo -p2p r
        GPU0    GPU1
 GPU0   X       OK
 GPU1   OK      X

Legend:

  X    = Self
  OK   = Status Ok
  CNS  = Chipset not supported
  GNS  = GPU not supported
  TNS  = Topology not supported
  NS   = Not supported
  U    = Unknown

$ nvidia-smi nvlink -s
GPU 0: NVIDIA RTX A6000 (UUID: GPU-a1d15108-c4ec-4e7a-1641-ffdf3129e846)
NVML: Unable to retrieve NVLink information as all links are inActive
GPU 1: NVIDIA RTX A6000 (UUID: GPU-39c0997f-dc94-76d3-4c10-19dde04dcced)
NVML: Unable to retrieve NVLink information as all links are inActive
```

Is there a way to disable the new feature at runtime? If you need more information, please let me know. I'm using the Docker image nvidia/cuda:12.2.0-devel-ubuntu22.04.

Footnotes

[^1]: By unbearably slow, I mean inference goes from dozens of tokens per second (on a 70B model) down to one token every 30 seconds!

@JohannesGaessler
Collaborator Author

There is no option to disable it at runtime, but you should be able to disable it by compiling with LLAMA_CUDA_PEER_MAX_BATCH_SIZE=0.

pkrmf pushed a commit to morlockstudios-com/llama.cpp that referenced this pull request Sep 26, 2023
@city96

city96 commented Oct 18, 2023

I ran into the same thing as @sgoll today and managed to track it down to this commit. I'm using my own hardware, but have both cards (P40s) passed through to a Linux VM instead of using them directly. My guess is that virtualization (Hyper-V, in my case) somehow interferes with the peer access logic. Maybe having multiple CPUs/NUMA nodes also contributes to it. Setting the peer access compile flag to zero fixes it as expected.

It seems like a pretty rare edge case, considering this is the only mention of this issue I've seen. Still, it might be worth adding a note about it to the LLAMA_CUDA_PEER_MAX_BATCH_SIZE line of the README. Just my 2c.

@Ph0rk0z

Ph0rk0z commented Oct 19, 2023

Another edge case is loading a llama.cpp model after using exllama: peer access is already enabled and performance goes down. But I just restart it.
