llama : add option to override model tensor buffers #11397

Draft · wants to merge 6 commits into master

Conversation

slaren
Member

@slaren slaren commented Jan 24, 2025

Adds a command-line parameter --override-tensor (-ot) that allows changing the buffer type where a model tensor is allocated. This gives users fine-grained control over which tensors are offloaded to each device.

How is this useful? For example, to force the experts in MoE models to stay on the CPU while offloading the rest to the GPU, you could use -ngl 99 -ot exps=CPU. This may allow more efficient offloading schemes.

The syntax is <tensor name pattern>=<buffer type>. Currently the pattern is just a string search (edit: this is no longer the case, it is now a C++ regex search), i.e. any tensor whose name contains <tensor name pattern> will be matched and loaded into the given buffer type. Multiple overrides can be given by separating them with commas, or by passing the -ot option multiple times. To see which tensors are being matched, enable debugging output with -v.

At this point it is just a demo, feel free to experiment and report if you find any interesting uses.

Edit: added regex support. For example, to keep the experts of layers 20-99 on the CPU you could use -ot "[2-9][0-9]\.ffn_.*_exps\.=CPU"
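For illustration, a minimal sketch of how this could look on the command line (the model path, build location, and patterns here are placeholders, not part of the PR):

# keep all routed-expert tensors on the CPU and print which tensors are matched (-v)
./build/bin/llama-server -m model.gguf -ngl 99 -v -ot exps=CPU

# same idea with a regex, keeping only the experts of layers 20-99 on the CPU
./build/bin/llama-server -m model.gguf -ngl 99 -v -ot "[2-9][0-9]\.ffn_.*_exps\.=CPU"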

TODO:

  • Fix pipeline parallelism check
  • Support overriding KV cache allocation

@slaren slaren added the demo Demonstrate some concept or idea, not intended to be merged label Jan 24, 2025
@slaren slaren changed the title llama : add option to override tensor buffers llama : add option to override model tensor buffers Jan 24, 2025
@slaren slaren added the need feedback Testing and feedback with results are needed label Jan 24, 2025
@bmtwl
Contributor

bmtwl commented Jan 26, 2025

Is there a chance that the direction you're taking these changes might allow for scheduling specific threads to work on specific tensors? With R1 coming out, I'm very interested in reviving my work on trying to improve memory locality to increase CPU inference speeds.

@slaren
Member Author

slaren commented Jan 26, 2025

No, that's something that would need to be handled at a lower level in the CPU backend.

@bmtwl
Contributor

bmtwl commented Jan 26, 2025

No, that's something that would need to be handled at a lower level in the CPU backend.

Thanks for the reply @slaren. I figured it wouldn't directly help, but thought that maybe you'd be adding useful metadata to tensor objects that could help coordinate affinity in the future. I'll start a fresh branch and see how far I get.

At this point it is just a demo, feel free to experiment and report if you find any interesting uses.

I'll also try to pull this branch and test it to see what the speedup and sysmem savings look like.

@bmtwl
Contributor

bmtwl commented Jan 27, 2025

Quick, non-scientific initial test with Deepseek R1 at q6 on llama-server with -ot exps=CPU:

-ngl 0 = 4.65t/s
-ngl 10 = 5.15t/s
-ngl 20 = 5.64t/s
-ngl 30 = 6.10t/s
-ngl 40 = 6.95t/s

So there is definitely a major speedup potential for this patch. I can't offload all 62 layers for this model because I only have 24GB VRAM, but I expect the trend would continue in the same general direction. This is without dropping caches, so it's inefficient, but I didn't have the time to do a proper drop/reload cycle since it takes so long to be read back into memory on each test run.

@saood06

saood06 commented Jan 27, 2025

Quick, non-scientific initial test with Deepseek R1 at q6 on llama-server with -ot exps=CPU:

-ngl 0 = 4.65t/s -ngl 10 = 5.15t/s -ngl 20 = 5.64t/s -ngl 30 = 6.10t/s -ngl 40 = 6.95t/s

So there is definitely a major speedup potential for this patch. I can't offload all 62 layers for this model because I only have 24GB VRAM, but I expect the trend would continue in the same general direction. This is without dropping caches, so it's inefficient, but I didn't have the time to do a proper drop/reload cycle since it takes so long to be read back into memory on each test run.

@bmtwl
Do you mind testing performance with -nkvo?

@jukofyork
Contributor

What are the shared expert tensors called in llama.cpp - is there a pattern that catches the routed experts (that only activate 1/32 of the time), but doesn't catch the shared experts?

@slaren
Member Author

slaren commented Jan 28, 2025

I believe the pattern exps will not match the shared experts, since they are called ffn_xxx_shexp.weight. You can use the gguf preview feature in huggingface to see the names of the tensors. Also remember that you can use multiple patterns, it doesn't have to be a single one.
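As a hypothetical illustration of that distinction (tensor names as described above):

# routed experts (blk.N.ffn_{up,gate,down}_exps.weight) are pinned to the CPU;
# shared experts (blk.N.ffn_*_shexp.weight) are not matched, so they follow -ngl as usual
-ot "ffn_.*_exps\.=CPU"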

@jukofyork
Contributor

I believe the pattern exps will not match the shared experts, since they are called ffn_xxx_shexp.weight. You can use the gguf preview feature in huggingface to see the names of the tensors. Also remember that you can use multiple patterns, it doesn't have to be a single one.

Thanks - I'll give this a try later in the week.

This PR together with this Reddit post opens up an interesting possibility:

https://old.reddit.com/r/LocalLLaMA/comments/1ibbloy/158bit_deepseek_r1_131gb_dynamic_gguf/

of quantising up/gate projections to q2_k and down projections to q4_k (or something similar), then keeping everything else as q8_0.

Sadly I need to move some stuff about to get space to upscale the fp8 download to bf16 before I can try it, but will report back when I do.

@jukofyork
Contributor

Quick, non-scientific initial test with Deepseek R1 at q6 on llama-server with -ot exps=CPU:

-ngl 0 = 4.65t/s -ngl 10 = 5.15t/s -ngl 20 = 5.64t/s -ngl 30 = 6.10t/s -ngl 40 = 6.95t/s

So there is definitely a major speedup potential for this patch. I can't offload all 62 layers for this model because I only have 24GB VRAM, but I expect the trend would continue in the same general direction. This is without dropping caches, so it's inefficient, but I didn't have the time to do a proper drop/reload cycle since it takes so long to be read back into memory on each test run.

It might be worth trying q4_0, as it should almost let you offload all the layers, and IIRC it should be slightly faster to dequantise than the K-quants?

@jukofyork
Contributor

Is there a chance that the direction you're taking these changes might allow for scheduling specific threads to work on specific tensors? With R1 coming out, I'm very interested in reviving my work on trying to improve memory locality to increase CPU inference speeds.

Just being able to split the experts between NUMA nodes would make a big difference, but I'm not sure how easy that would be, as IIRC the experts' tensors are all merged into one huge tensor now?

@BarfingLemurs
Contributor

During normal operation, when I fit a model between RAM and VRAM, does the offloading follow a set layer sequence? (layer 0 is chosen first to be offloaded to the GPU, then layer 1, etc.)

Between GPU offloading and RAM, which takes priority?

Quick, non-scientific initial test with Deepseek R1 at q6 on llama-server with -ot exps=CPU:

-ngl 0 = 4.65t/s -ngl 10 = 5.15t/s -ngl 20 = 5.64t/s -ngl 30 = 6.10t/s -ngl 40 = 6.95t/s

So there is definitely a major speedup potential for this patch. I can't offload all 62 layers for this model because I only have 24GB VRAM, but I expect the trend would continue in the same general direction. This is without dropping caches, so it's inefficient, but I didn't have the time to do a proper drop/reload cycle since it takes so long to be read back into memory on each test run.

Do you remember how much of a speedup? No need for extensive benchmarks, just a rough % estimate.

@saood06

saood06 commented Feb 2, 2025

Quick, non-scientific initial test with Deepseek R1 at q6 on llama-server with -ot exps=CPU:

-ngl 0 = 4.65t/s -ngl 10 = 5.15t/s -ngl 20 = 5.64t/s -ngl 30 = 6.10t/s -ngl 40 = 6.95t/s

I can't seem to offload more than 29 layers of R1 (unsloth's UD-IQ2_XXS) via RPC. 29 layers and below work fine, but 30 just crashes my rpc_server, with no error output. It is not an issue of VRAM: even with the context set very low, so that it takes up nowhere near my GPU's limits, it still crashes.

@jukofyork
Contributor

Quick, non-scientific initial test with Deepseek R1 at q6 on llama-server with -ot exps=CPU:
-ngl 0 = 4.65t/s -ngl 10 = 5.15t/s -ngl 20 = 5.64t/s -ngl 30 = 6.10t/s -ngl 40 = 6.95t/s

I can't seem to offload more than 29 layers of R1 (unsloth's UD-IQ2_XXS) via RPC. 29 layers and below work fine, but 30 just crashes my rpc_server, with no error output. It is not an issue of VRAM: even with the context set very low, so that it takes up nowhere near my GPU's limits, it still crashes.

I had a similar problem: if I used a single GPU (via CUDA_VISIBLE_DEVICES=0) it ran fine, and if I used both GPUs with the --no-kv-offload option it also ran fine (but much slower).

If I didn't use either of these, it tried to allocate this 1.4TB monster buffer:

llama_init_from_model: pipeline parallelism enabled (n_copies=4)
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 1407257.91 MiB on device 0: cudaMalloc failed: out of memory
ggml_gallocr_reserve_n: failed to allocate CUDA0 buffer of size 1475616865280
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 351268.28 MiB on device 0: cudaMalloc failed: out of memory
ggml_gallocr_reserve_n: failed to allocate CUDA0 buffer of size 368331484928
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 353465.98 MiB on device 0: cudaMalloc failed: out of memory
ggml_gallocr_reserve_n: failed to allocate CUDA0 buffer of size 370635939584

After some searching I found this issue:

#7217

and recompiled using -DGGML_SCHED_MAX_COPIES=1 and now it's working fine.
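(For reference, a rebuild along these lines should reproduce that, assuming a standard CUDA build:)

# limit the scheduler to a single graph copy, which avoids the large pipeline-parallelism buffers
cmake -B build -DGGML_CUDA=ON -DGGML_SCHED_MAX_COPIES=1
cmake --build build --config Release -j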

(It's likely nothing to do with this PR, but thought it might help!)

@jukofyork
Contributor

@saood06

I figured it out: you have to reorder the devices so the local CUDA devices are last:

#11606
#11424

and mainly these:

#11435

You don't need to run RPC servers for local devices.

#9296
#11424

For those that don't get it (like me initially), you first need to check the device names using the --list-devices option (example below):

 $ llama.cpp/build/bin/llama-server --rpc <IP1>:<PORT1> --rpc <IP2>:<PORT2> --list-devices
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA GeForce RTX XXXX, compute capability 8.6, VMM: yes
  Device 1: NVIDIA GeForce GTX YYYY, compute capability 7.5, VMM: yes
Available devices:
  CUDA0: NVIDIA GeForce RTX XXXX (A MiB, B MiB free)
  CUDA1: NVIDIA GeForce GTX YYYY (A MiB, B MiB free)
  RPC[IP1:PORT1]: RPC[IP1:PORT1] (A MiB, B MiB free)
  RPC[IP2:PORT2]: RPC[IP2:PORT2] (A MiB, B MiB free)

The device names are listed under "Available devices". The next time you launch llama-server, use the --device option with the order you want for your devices. An example:

$ llama.cpp/build/bin/llama-server --rpc <IP1>:<PORT1> --rpc <IP2>:<PORT2> \
--device RPC[IP1:PORT1],CUDA0,CUDA1,RPC[IP2:PORT2] \
-ngl 33 --tensor_split 3/20/10/0 --device-draft CUDA1,RPC[IP2:PORT2] -ngld 99 [...]

This way, you can set up the order however you want. In the complicated example above, the main model is offloaded to the first RPC device (using IP1:PORT1 address), mostly on the CUDA0 device, and partially to the CUDA1 device, while the draft model is offloaded to the CUDA1 device and the second RPC device (using IP2:PORT2 address).

This means the following works:

--device "RPC[IP1:PORT1],RPC[IP1:PORT2],RPC[IP1:PORT1],RPC[IP2:PORT2],CUDA0,CUDA1"

But if I don't do this, I get OOM errors with plenty of VRAM left, like you had.

@saood06

saood06 commented Feb 5, 2025

I'm testing this with and without #11446. Without it, on unsloth's UD-IQ2_XXS, I was only able to offload 29 layers; with it, I was able to allocate only 28 (on a Q4_K_S quant). This is not a VRAM issue: there would be plenty of spare VRAM, and it would even get past allocation and reach warmup, where the rpc-server would then just crash.

The other issue is performance: the more layers I allocate, the worse performance gets, while bmtwl shows a performance increase with more layers offloaded using non-RPC-based offloading.

@ro99

ro99 commented Feb 5, 2025

I am able to load the model with llama-server -m /mnt/models/DeepSeek-R1-GGUF/DeepSeek-R1-UD-Q2_K_XL/DeepSeek-R1-UD-Q2_K_XL-00001-of-00005.gguf --threads 28 --host 0.0.0.0 --port 5001 -c 8192 -ngl 99 -ot exps=CPU :

PID DEV TYPE GPU MEM HOST MEM Command
16431 0 Compute 13294MiB 54% 215686MiB /opt/llama.cpp/build/bin/llama-server -m /mnt/models/DeepSeek-R1-GGUF/DeepSeek-R1-UD-Q2_K_XL/DeepSeek-R1-UD-Q2_K_XL-000
16431 2 Compute 12088MiB 49% 215686MiB /opt/llama.cpp/build/bin/llama-server -m /mnt/models/DeepSeek-R1-GGUF/DeepSeek-R1-UD-Q2_K_XL/DeepSeek-R1-UD-Q2_K_XL-000
16431 3 Compute 11616MiB 47% 215686MiB /opt/llama.cpp/build/bin/llama-server -m /mnt/models/DeepSeek-R1-GGUF/DeepSeek-R1-UD-Q2_K_XL/DeepSeek-R1-UD-Q2_K_XL-000
16431 1 Compute 11488MiB 47% 215686MiB /opt/llama.cpp/build/bin/llama-server -m /mnt/models/DeepSeek-R1-GGUF/DeepSeek-R1-UD-Q2_K_XL/DeepSeek-R1-UD-Q2_K_XL-000

But as soon as I send the prompt I receive:

/opt/llama.cpp/ggml/src/ggml-alloc.c:182: not enough space in the buffer
ggml_dyn_tallocr_alloc: not enough space in the buffer to allocate 18446744073709550624 bytes, largest block available 9223372036854775807 bytes
[New LWP 16444]
[New LWP 16445]
[New LWP 16446]
[New LWP 16447]
...
[New LWP 16533]
[New LWP 16534]
[New LWP 16535]
[New LWP 16536]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007f1e950d0bd7 in wait4 () from /lib/x86_64-linux-gnu/libc.so.6
#0  0x00007f1e950d0bd7 in wait4 () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007f1e95527fc1 in ggml_abort () from /opt/llama.cpp/build/bin/libggml-base.so
#2  0x00007f1e9553619c in ggml_gallocr_allocate_node () from /opt/llama.cpp/build/bin/libggml-base.so
#3  0x00007f1e955369d0 in ggml_gallocr_reserve_n () from /opt/llama.cpp/build/bin/libggml-base.so
#4  0x00007f1e9553c244 in ggml_backend_sched_alloc_graph () from /opt/llama.cpp/build/bin/libggml-base.so
#5  0x00007f1e95646030 in llama_decode_impl(llama_context&, llama_batch) () from /opt/llama.cpp/build/bin/libllama.so
#6  0x00007f1e95646f57 in llama_decode () from /opt/llama.cpp/build/bin/libllama.so
#7  0x000055f47d6647c9 in server_context::update_slots() ()
#8  0x000055f47d64f4d1 in server_queue::start_loop() ()
#9  0x000055f47d5fd067 in main ()
[Inferior 1 (process 16431) detached]
Aborted (core dumped)

Without the --override-tensor and offloading 20 layers to the GPU it works fine.

Testing with 4x RTX 3090 and 320GiB RAM. Built with cmake -B build -DGGML_CUDA=ON -DGGML_SCHED_MAX_COPIES=1.

@jukofyork
Contributor

Without the --override-tensor and offloading 20 layers to the GPU it works fine.

Testing with 4x RTX 3090 and 320GiB RAM. Built with cmake -B build -DGGML_CUDA=ON -DGGML_SCHED_MAX_COPIES=1.

Maybe try -ngl 61 to keep the output layer on the CPU too (that oddly worked for me earlier when I was having trouble with the RPC stuff).

@ro99

ro99 commented Feb 5, 2025

Maybe try -ngl 61 to keep the output layer on the CPU too (that oddly worked for me earlier when I was having trouble with the RPC stuff).

No luck, still the same issue.

Oddly enough, the issue only happens when sending more than 450 tokens.

@slaren
Member Author

slaren commented Feb 5, 2025

ggml_dyn_tallocr_alloc: not enough space in the buffer to allocate 18446744073709550624 bytes

It's trying to allocate a tensor of size 2^64, which suggests there is an integer overflow somewhere. If you set the environment variable GGML_SCHED_DEBUG=2, it will print the graph before allocating it, which may give some indication of which tensor is causing this. Or just change the error message in ggml_dyn_tallocr_alloc to include the tensor name.
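For example, a hypothetical invocation (the model path and other options are placeholders):

# print the scheduler's graph before allocation so the offending tensor can be spotted in the log
GGML_SCHED_DEBUG=2 ./build/bin/llama-server -m model.gguf -ngl 99 -ot exps=CPU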

@ro99

ro99 commented Feb 6, 2025

It's trying to allocate a tensor of size 2^64, which suggests there is an integer overflow somewhere. If you set the environment variable GGML_SCHED_DEBUG=2, it will print the graph before allocating it, which may give some indication of which tensor is causing this. Or just change the error message in ggml_dyn_tallocr_alloc to include the tensor name.

It is the CPU#ffn_moe_topk-60#0 tensor.

Is it possible to try to force this particular one to be allocated into the GPU buffer?

@slaren
Member Author

slaren commented Feb 6, 2025

This is most likely a bug; we need to understand why it is happening and fix it. Since you mentioned that it only happens with large prompts, I suspect that this is caused by a zero-sized tensor. When evaluating a batch where no logits are required (which happens when evaluating a prompt that needs to be split into multiple ubatches), zero-size tensors are created to skip the calculation of the logits.
I cannot run this model, so I would need your help to figure out why this is happening. Can you print more details about the tensor? Something like this should do it:

diff --git a/ggml/src/ggml-alloc.c b/ggml/src/ggml-alloc.c
index 9a3bf9f29..470ef13e6 100644
--- a/ggml/src/ggml-alloc.c
+++ b/ggml/src/ggml-alloc.c
@@ -179,6 +179,9 @@ static size_t ggml_dyn_tallocr_alloc(struct ggml_dyn_tallocr * alloc, size_t siz
             // this should never happen
             GGML_LOG_ERROR("%s: not enough space in the buffer to allocate %zu bytes, largest block available %zu bytes\n",
                     __func__, size, max_avail);
+            GGML_LOG_ERROR("%s: tensor: %s, shape: %ld %ld %ld %ld, size: %zu",
+                __func__, tensor->name, tensor->ne[0], tensor->ne[1], tensor->ne[2], tensor->ne[3],
+                ggml_nbytes(tensor));
             GGML_ABORT("not enough space in the buffer");
         }
     }

@slaren
Member Author

slaren commented Feb 6, 2025

Ok nvm, I think I see the problem. I will push a possible fix soon.

@jukofyork
Contributor

I've got it working:

numactl --interleave=all ./llama.cpp/build/bin/llama-server --host 192.168.1.111 --port 8080 \
  --model ./DeepSeek-R1-mla-Q5_K_XL.gguf --chat-template deepseek3 --alias "DeepSeek-R1-mla-Q5_K_XL" --ctx_size 32768 \
  --n-gpu-layers 62 --numa distribute --threads 30 \
  --temp 0.6 --min-p 0.0 --top-p 1.0 --top-k 0 --rpc 192.168.1.112:50050,192.168.1.112:50051,192.168.1.113:50050,192.168.1.113:50051 \
  --device "RPC[192.168.1.112:50050],RPC[192.168.1.112:50051],RPC[192.168.1.113:50050],RPC[192.168.1.113:50051],CUDA0,CUDA1" \
  --tensor-split 0,0,0,0,31,31 \
  --override-tensor 'blk\.([3-8])\..*_exps\.=RPC[192.168.1.112:50050]' \
  --override-tensor 'blk\.([9]|1[0-4])\..*_exps\.=RPC[192.168.1.112:50051]' \
  --override-tensor 'blk\.(1[5-9]|20)\..*_exps\.=RPC[192.168.1.113:50050]' \
  --override-tensor 'blk\.(2[1-6])\..*_exps\.=RPC[192.168.1.113:50051]' \
  --override-tensor 'blk\.(2[7-9]|[3-5][0-9]|60)\..*_exps\.=CPU'
llama_kv_cache_init:      CUDA0 KV buffer size =  2108.01 MiB
llama_kv_cache_init:      CUDA1 KV buffer size =  2040.01 MiB
llama_init_from_model: KV self size  =    0.00 MiB, K (f16):    0.00 MiB, V (f16):    0.00 MiB
llama_init_from_model: KV self size  = 2196.00 MiB, K^R (f16):  244.00 MiB, c^KV (f16): 1952.00 MiB
llama_init_from_model:  CUDA_Host  output buffer size =     0.49 MiB
llama_init_from_model: RPC[192.168.1.112:50050] compute buffer size =   159.00 MiB
llama_init_from_model: RPC[192.168.1.112:50051] compute buffer size =   159.00 MiB
llama_init_from_model: RPC[192.168.1.113:50050] compute buffer size =   159.00 MiB
llama_init_from_model: RPC[192.168.1.113:50051] compute buffer size =   159.00 MiB
llama_init_from_model:      CUDA0 compute buffer size = 16731.50 MiB
llama_init_from_model:      CUDA1 compute buffer size = 16666.00 MiB
llama_init_from_model:        CPU compute buffer size =    78.01 MiB
llama_init_from_model: graph nodes  = 5208 (with bs=512), 5330 (with bs=1)
llama_init_from_model: graph splits = 183 (with bs=512), 119 (with bs=1)

I think this could be a super powerful command line option when mixed with RPC! Thanks for adding this!

If anybody has a Mac Studio they want to test this on then I can help craft the regexes to test it - I'm interested to see what sort of boost you could get without so many stages of latency.

@Dango233

Dango233 commented Feb 10, 2025

I've got it working: [...]

If anybody has a Mac Studio they want to test this on then I can help craft the regexes to test it - I'm interested to see what sort of boost you could get without so many stages of latency.

I'm up for the testing - I have a Mac Studio M2 Ultra 192GB <---10Gbps---> 13700K + 192GB DDR5 + RTX 6000 Ada.
I'll try running this myself first and see if I can get it rolling.

Also, if it's helpful (seems to be?), I can get a Thunderbolt Gen4 eGPU case and plug my RTX 6000 Ada in there...

@jukofyork
Contributor

Also, if it's helpful (seems to be?), I can get a Thunderbolt Gen4 eGPU case and plug my RTX 6000 Ada in there...

It didn't help me due to the latency between the parts all pushing the hidden state.

I used 10Gbit Ethernet for all the machines, so I'm not sure upping to 40Gbit (or whatever Thunderbolt is) will make that much difference - I think the problem is latency rather than bandwidth for this part, sadly.

Possibly using InfiniBand might help, as IIRC it has lower latency, but I'm not sure.

I think the eventual solution would be to have RPC use a better method of pipeline parallelism, like DeepSpeed:

deepspeedai/DeepSpeed#1110

It would definitely help batch processing, and mixed data and pipeline parallelism would remove some latency for setups with multiple GPUs per machine like mine.

@Dango233

Just figured out an eGPU won't help, as Apple silicon cannot run CUDA...
Not sure if RPC is the bottleneck here - my RTX got maxed out - probably due to the lack of flash attention?

@saood06

saood06 commented Feb 10, 2025

Not sure if RPC is the bottleneck here - my RTX got maxed out - probably due to the lack of flash attention?

Can you post speeds (with whatever configurations you tested)? Also, I'm not sure how much flash attention would impact speed, but it would shrink that compute buffer.

@jukofyork
Contributor

jukofyork commented Feb 10, 2025

Not sure if RPC is the bottleneck here - my RTX got maxed out - probably due to the lack of flash attention?

Can you post speeds (with whatever configurations you tested)? Also, I'm not sure how much flash attention would impact speed, but it would shrink that compute buffer.

I think the RPC stuff is never really gonna work properly until it can do async buffering: the way it is set up now, each stage in the pipeline stalls for every communication, and this adds the full latency. If it were async and buffered, the next job would start almost immediately with no wait, and you could probably optimise this even more by having the fastest devices at the start of the pipeline and the slowest at the end, to get almost no degradation from latency.

@Dango233

Not sure if RPC is the bottleneck here - my RTX got maxed out - probably due to the lack of flash attention?

Can you post speeds (with whatever configurations you tested)? Also, I'm not sure how much flash attention would impact speed, but it would shrink that compute buffer.

The GPU utilization could be an illusion. I'll try to get some numbers across different setups.

@abc-nix

abc-nix commented Feb 15, 2025

Many many thanks, @slaren, for this PR. I really hope it gets merged.

I have used this --override-tensor option to improve token generation speed by over 70% for Mixtral 8x22B.

What I have learned so far (don't know if it is applicable for R1):

  • On Mixtral, there are 3 types of expert related tensors: ffn_gate_exps, ffn_up_exps and ffn_down_exps.
  • Try to offload as many layers as possible to the GPU by keeping all expert-related tensors on the CPU (as explained in the PR description, -ot exps=CPU).
  • In most cases, you will have to keep the last layer on the CPU (for Mixtral 8x22B q4_k_m, that is 56 of 57 layers offloaded to GPU; the last layer will not fit).
  • Once you have all (minus one) layers offloaded to GPU, this is the order I found that improves token generation the most:
    1. Try to get as many ffn_down_exps tensors as possible on GPU. This means you need to override the ffn_gate_exps tensors and ffn_up_exps tensors and keep them on CPU (-ot ffn_gate_exps=CPU -ot ffn_up_exps=CPU)
    2. Once full, start offloading the ffn_up_exps tensors to GPU.
    3. Finally, offload any ffn_gate_exps tensors to GPU until full.

This is the order I found that best improves token generation for Mixtral on my system. This is with non-RPC devices (CUDA), and may not carry over in the same way to other kinds of backends.
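As a rough sketch of that order (hypothetical layer ranges; this assumes, as in the commands below, that earlier -ot patterns take precedence over later ones):

# step 1: all expert tensors on the CPU, dense layers offloaded with -ngl
-ngl 56 -ot exps=CPU
# step 2: let ffn_down_exps follow the offloaded layers, keep gate/up experts on the CPU
-ngl 56 -ot ffn_gate_exps=CPU -ot ffn_up_exps=CPU
# step 3: carve out a range of ffn_up_exps for a GPU, keep the remaining gate/up experts on the CPU
-ngl 56 -ot "(2[3-9]|3[0-9]|4[0-3])\.ffn_up_exps=CUDA1" -ot ffn_gate_exps=CPU -ot ffn_up_exps=CPU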

Many thanks again for this PR.

@saood06

saood06 commented Feb 15, 2025

I have used this --override-tensor option to improve token generation speed by over 70% for Mixtral 8x22B.

If you don't mind can you post some more info:
How much VRAM/What GPU? Also what quant did you use for this? Can you post the actual performance numbers with the configurations you tested?

@abc-nix

abc-nix commented Feb 16, 2025

I have used this --override-tensor option to improve token generation speed by over 70% for Mixtral 8x22B.

If you don't mind can you post some more info: How much VRAM/What GPU? Also what quant did you use for this? Can you post the actual performance numbers with the configurations you tested?

Sure. I will try to summarize.

Note 1: It is no longer a 70% improvement, as I have been able to fit two extra layers by using q8 KV cache. It is now "only" a 60% improvement.

Note 2: When I say "All layers" I mean 56 layers. The last layer is always kept on CPU.

Note 3: All tests were performed with llama-server, with a long prompt of about 18K tokens.


Model and system information:

  • Model: WizardLM-2-8x22B.i1-Q4_K_M.gguf (mradermacher repo)
  • System: i7-10700 with 128GiB RAM (DDR4- 3867 MT/s) - Debian 12
  • GPUs:
    • RTX 3090 (CUDA0, 24 GiB VRAM) - PCIe 16x, power limit 220W
    • RTX 3090 (CUDA1, 24 GiB VRAM) - PCIe 4x, power limit 200 W
    • GTX 1660 Super (CUDA2, 6 GiB VRAM) - PCIe 1x
  • Context window: 32K

Summary table:

| | Baseline (normal layer offloading) | With override tensors (best) | Improvement |
|---|---|---|---|
| Prompt Processing (T/s) | 100.03 | 74.39 | -26% |
| Token Generation (T/s) | 2.08 | 3.45 | +66% |
| CUDA0 (layers) | 15 | 35 | |
| CUDA1 (layers) | 15 | 21 | |
| CUDA2 (layers) | 3 | 0 | |

TG improves, but PP is a bit worse (but acceptable). More details below.

Command and results

Baseline (normal layer offloading):

  • Layers offloaded: 33/57 layers (15 layers on RTX 3090 + 15 layers on RTX 3090 + 3 layers on GTX 1660 Super)

Partial command:

GPU_LAYERS="-fa -ngl 33 -b 1024 -ub 512 --tensor_split 15/15/3 -ctv q8_0 -ctk q8_0"

Raw results:

prompt eval time =  180153.06 ms / 18020 tokens (   10.00 ms per token,   100.03 tokens per second)
       eval time =   95548.09 ms /   199 tokens (  480.14 ms per token,     2.08 tokens per second)
      total time =  275701.15 ms / 18219 tokens

With override tensors (best):

  • Layers offloaded: 56/57 (35 layers on RTX 3090 + 21 layers on RTX 3090 + 0 layers on GTX 1660 Super)
  • ffn_down_exps tensors offloaded fully to CUDA0 and CUDA1 (automatic with layers)
  • 33/56 ffn_up_exps tensors offloaded to GPU (0+21+12)
  • ffn_gate_exps kept on CPU

Partial command:

GPU_LAYERS="-fa -ngl 56 -b 1024 -ub 512 --tensor_split 35/21/0 \
-ot ([2][3-9]|[3][0-9]|[4][0-3]).ffn_up_exps=CUDA1 \
-ot ([4][4-9]|[5][0-9]).ffn_up_exps=CUDA2 \
-ot ffn_gate_exps=CPU -ot ffn_up_exps=CPU -ctv q8_0 -ctk q8_0"

Raw results:

prompt eval time =  242236.42 ms / 18020 tokens (   13.44 ms per token,    74.39 tokens per second)
       eval time =   62097.34 ms /   214 tokens (  290.17 ms per token,     3.45 tokens per second)
      total time =  304333.76 ms / 18234 tokens

Experiments regarding exps tensors "importance"

These are the tests I performed to find which exps tensors should be offloaded to GPU first, and which can be kept on CPU. My aim is to improve token generation/prediction.

During these experiments:

  • Only tested with the main 2x RTX 3090 GPUs to reduce wasted time (using the GTX 1660 slows down PP).
  • All layers (56/57) are offloaded to GPU (34 to CUDA 0, and 22 to CUDA 1)
  • Only offload 1 group of exps tensors each time to GPU and measure speed.

Summary table of offloaded exps tensors to GPU

| | no exps | only ffn_gate_exps | only ffn_up_exps | only ffn_down_exps |
|---|---|---|---|---|
| Prompt Processing (T/s) | 65.36 | 74.69 | 80.29 | 93.64 |
| Token Generation (T/s) | 1.84 | 2.31 | 2.34 | 2.54 |

Best was offloading the ffn_down_exps tensors, followed by the ffn_up_exps tensors. Raw details below.

Experiment details

Baseline (all exps tensors on CPU):

Partial command:

GPU_LAYERS="-fa -ngl 56 -b 1024 -ub 512 --tensor_split 34/22/0 -ot exps=CPU -ctv q8_0 -ctk q8_0"

Raw results:

prompt eval time =  275711.20 ms / 18020 tokens (   15.30 ms per token,    65.36 tokens per second)
       eval time =  148688.36 ms /   273 tokens (  544.65 ms per token,     1.84 tokens per second)
      total time =  424399.56 ms / 18293 tokens

Offloading ffn_gate_exps to GPU:

Test with all layers offloaded to GPU (automatic for ffn_gate_exps), keeping ffn_up_exps and ffn_down_exps tensors on CPU.

Partial command:

GPU_LAYERS="-fa -ngl 56 -b 1024 -ub 512 --tensor_split 34/22/0 \
-ot ffn_up_exps=CPU -ot ffn_down_exps=CPU -ctv q8_0 -ctk q8_0"

Raw results:

prompt eval time =  241260.09 ms / 18020 tokens (   13.39 ms per token,    74.69 tokens per second)
       eval time =  133838.82 ms /   309 tokens (  433.14 ms per token,     2.31 tokens per second)
      total time =  375098.92 ms / 18329 tokens

Offloading ffn_up_exps to GPU:

Test with all layers offloaded to GPU (automatic for ffn_up_exps), keeping ffn_gate_exps and ffn_down_exps tensors on CPU.

Partial command:

GPU_LAYERS="-fa -ngl 56 -b 1024 -ub 512 --tensor_split 34/22/0 \
-ot ffn_down_exps=CPU -ot ffn_gate_exps=CPU -ctv q8_0 -ctk q8_0"

Raw results:

prompt eval time =  224444.22 ms / 18020 tokens (   12.46 ms per token,    80.29 tokens per second)
       eval time =  116046.18 ms /   272 tokens (  426.64 ms per token,     2.34 tokens per second)
      total time =  340490.41 ms / 18292 tokens

Offloading ffn_down_exps to GPU:

Test with all layers offloaded to GPU (automatic for ffn_down_exps), keeping ffn_gate_exps and ffn_up_exps tensors on CPU.

Partial command:

GPU_LAYERS="-fa -ngl 56 -b 1024 -ub 512 --tensor_split 34/22/0 \
-ot ffn_up_exps=CPU -ot ffn_gate_exps=CPU -ctv q8_0 -ctk q8_0"

Raw results:

prompt eval time =  192436.88 ms / 18020 tokens (   10.68 ms per token,    93.64 tokens per second)
       eval time =  102894.79 ms /   261 tokens (  394.23 ms per token,     2.54 tokens per second)
      total time =  295331.68 ms / 18281 tokens

(Summary table moved to top of section)

@Reactantvr

Reactantvr commented Feb 16, 2025

I thought I would give this a test, but maybe I am doing something wrong. It seems to give me no change in performance.

My system is a threadripper 7965WX, 512 GB system memory, 3090. I am trying to run this on windows 10.

It seems to fill up my GPU as well as system memory, so I imagine it is using the GPU. I've tried over a dozen commands to get it working, from a simple "llama-server --model DeepSeek-R1-Q4_K_M-00001-of-00011.gguf -ngl 99 -ot exps=CPU" to much more complicated ones with different options. I either get the same t/s or lower.

Maybe it is my GPU? I have a 5090 on the way and will test this again when that arrives.

If there are any launch commands you want me to try, I will give it a go.

@lingster

lingster commented Feb 19, 2025

@slaren : Thank you so much for this PR. Hopefully some of these test results will be useful feedback:

So running with the following (AMD EPYC 7713 64-core processor, 256GB RAM + 2x A6000 Ada), with the tensor settings taken from https://github.com/kvcache-ai/ktransformers/blob/main/ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat.yaml:

./build/bin/llama-cli -ub 256 --no-mmap  --tensor-split 19,20   --model /data/gguf/DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf     --cache-type-k q4_0     --threads 16     --prio 2     --temp 0.6     --ctx-size 8192     --seed 3407     --n-gpu-layers 36     -no-cnv     --prompt "<|User|>You are an expert python developer. Write a factorial function using python.<|Assistant|>" -ot '^model\\.layers\\.(?!.*self_attn\\.kv_b_proj).*$=CUDA0'  -ot 'ktransformers.models.modeling_deepseek_v3.DeepseekV3RotaryEmbedding=CUDA0' -ot '^model.embed_tokens=CPU' -ot '^model\\.layers\\..*\\.mlp$=CUDA0' -ot '^model\\.layers\\..*\\.self_attn$=CUDA1'

yields:

llama_perf_sampler_print:    sampling time =      24.12 ms /   266 runs   (    0.09 ms per token, 11029.11 tokens per second)
llama_perf_context_print:        load time =   64039.06 ms
llama_perf_context_print: prompt eval time =    1754.94 ms /    17 tokens (  103.23 ms per token,     9.69 tokens per second)
llama_perf_context_print:        eval time =   59164.02 ms /   248 runs   (  238.56 ms per token,     4.19 tokens per second)
llama_perf_context_print:       total time =   61035.05 ms /   265 tokens

compared with no -ot flag:

llama_perf_sampler_print:    sampling time =      38.74 ms /   343 runs   (    0.11 ms per token,  8852.98 tokens per second)
llama_perf_context_print:        load time =   63491.03 ms
llama_perf_context_print: prompt eval time =    1782.06 ms /    17 tokens (  104.83 ms per token,     9.54 tokens per second)
llama_perf_context_print:        eval time =   80634.65 ms /   325 runs   (  248.11 ms per token,     4.03 tokens per second)
llama_perf_context_print:       total time =   82707.95 ms /   342 tokens

GPUs are running at no more than 25%. Hope this is useful.
[screenshot]

finally running with:

./build/bin/llama-cli -ub 256 --no-mmap  --tensor-split 19,20   --model /data/gguf/DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf     --cache-type-k q4_0     --threads 16     --prio 2     --temp 0.6     --ctx-size 8192     --seed 3407     --n-gpu-layers 36     -no-cnv     --prompt "<|User|>You are an expert python developer. Write a factorial function using python.<|Assistant|>" -ot ffn_up_exps=CPU -ot ffn_gate_exps=CPU
llama_perf_sampler_print:    sampling time =      26.05 ms /   300 runs   (    0.09 ms per token, 11514.55 tokens per second)
llama_perf_context_print:        load time =  117929.27 ms
llama_perf_context_print: prompt eval time =    3093.32 ms /    17 tokens (  181.96 ms per token,     5.50 tokens per second)
llama_perf_context_print:        eval time =   81377.05 ms /   282 runs   (  288.57 ms per token,     3.47 tokens per second)
llama_perf_context_print:       total time =   84748.35 ms /   299 tokens

Performance is lower given that GPU memory does not appear to be fully utilised:
[screenshot]

@slaren
Member Author

slaren commented Feb 19, 2025

So running with the following:

The names of these tensors do not match the names of the tensors in llama.cpp. I suggest running with -v to see which tensors are being affected by the filter, and using the gguf preview in HF to see the list of tensors in a model (for example, try: https://huggingface.co/unsloth/DeepSeek-R1-GGUF/tree/main/DeepSeek-R1-UD-IQ1_S?show_file_info=DeepSeek-R1-UD-IQ1_S%2FDeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf).

Performance is lower given that GPU memory does not appear to be fully utilised:

You would need to increase the value of -ngl to take advantage of the GPU memory freed by the tensor overrides, or use a different set of overrides.
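(As a hypothetical starting point, that is essentially the example from the PR description:)

# pin the expert tensors to the CPU, then raise -ngl so the freed VRAM is used for more layers
-ngl 99 -ot exps=CPU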

@jukofyork
Contributor

--cache-type-k q4_0 likely hurts performance a lot too.

@lingster

@slaren : thanks for pointing out that I was using the incorrect tensor names (in fact the ktransformers rules were using the model names from safetensors-format files and not gguf). So now I have rerun some tests and can see improved GPU usage, increasing to 50%:

./build/bin/llama-cli -ub 512 --no-mmap  --tensor-split 20,19   --model /data/gguf/DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf      --threads 16     --prio 2     --temp 0.6     --ctx-size 8192     --seed 3407     --n-gpu-layers 99   -ctk iq4_nl  -no-cnv     --prompt "<|User|>You are an expert python developer. Write a factorial function using python.<|Assistant|>" -ot 'ffn_up_exps=CPU'  -ot 'ffn_down_exps=CPU' -ot 'attn_kv_b=CUDA0' -ot 'ffn_up=CUDA0' -ot 'ffn_norm=CUDA1' -ot 'attn=CUDA1'

[screenshot]

llama_perf_sampler_print:    sampling time =      27.61 ms /   366 runs   (    0.08 ms per token, 13256.55 tokens per second)
llama_perf_context_print:        load time =   60802.72 ms
llama_perf_context_print: prompt eval time =    2907.77 ms /    17 tokens (  171.05 ms per token,     5.85 tokens per second)
llama_perf_context_print:        eval time =  107544.58 ms /   348 runs   (  309.04 ms per token,     3.24 tokens per second)
llama_perf_context_print:       total time =  110841.42 ms /   365 tokens

However, using the -ot option it seems impossible to utilise the full memory on the GPUs; the ffn_gate/ffn_up/ffn_down layers are simply too large to be loaded into 48GB of VRAM. This results in ~3.2 tok/s.

The best combination appears to be -ngl 36 --tensor-split 19,20, where I can get over 4.2 tok/s.

It seems that the bottleneck would be the CPU memory. With -ot we get more GPU utilisation, but this doesn't seem to make up for the time lost to having some of the layers on slower CPU memory.

@jukofyork : I'm using --ctk q4_0 as per the unsloth blog (https://unsloth.ai/blog/deepseekr1-dynamic). If I remove this and use the default, I get CUDA OOM. I have tried different ctk values, but there don't appear to be any noticeable performance improvements.

@Reactantvr

Reactantvr commented Feb 19, 2025

So, sadly, this is not compatible with 5090s, but the current llama cpp is. I guess I will have to wait to test this with a 5090. This is the same build I used with my 3090, so the only variable here is the GPU being swapped.

[screenshot]

@lingster

lingster commented Feb 20, 2025

So, sadly, this is not compatible with 5090s, but the current llama cpp is. I guess I will have to wait to test this with a 5090. This is the same build I used with my 3090, so the only variable here is the GPU being swapped.

[screenshot]

Have you tried recompiling with the right CUDA architecture flag set? #4215

Looking at the nvidia docs sm_100 is what you need:
https://docs.nvidia.com/cuda/blackwell-compatibility-guide/index.html#application-compatibility-on-blackwell-architecture

@Reactantvr

Reactantvr commented Feb 21, 2025

So, sadly, this is not compatible with 5090s, but the current llama cpp is. I guess I will have to wait to test this with a 5090. This is the same build I used with my 3090, so the only variable here is the GPU being swapped.
[screenshot]

Have you tried recompiling with the right CUDA architecture flag set? #4215

Looking at the nvidia docs sm_100 is what you need: https://docs.nvidia.com/cuda/blackwell-compatibility-guide/index.html#application-compatibility-on-blackwell-architecture

I looked into that and it seems to have done the trick.
For compiling it, I just changed the first command.

cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES="100"

This changed it to sm_100 while it compiled.

I still need to mess with settings to get the best speed, but here is the very first run.

llama-server --model DeepSeek-R1-Q4_K_M-00001-of-00011.gguf --flash-attn --threads 36 --temp 0.6 --min-p 0.05 --ctx-size 2048 --no-mmap -ngl 36 -ot exps=CPU

I am getting about 28% higher t/s for eval_time. For prompt eval_time, around a 50% improvement. (6.2 t/s / 14.1 t/s). This one leaves a lot of room for context as it only uses 17 GB of GPU memory.

llama-server --model DeepSeek-R1-Q4_K_M-00001-of-00011.gguf --flash-attn --threads 36 --temp 0.6 --min-p 0.05 --ctx-size 2048 --no-mmap -ngl 62 -ot exps=CPU

This command uses 26 GB of GPU memory, so there is still 6 GB left for extra context beyond 2k (I tested this and it uses 31.1 GB at 4096 context). This gets me around 7.8 t/s eval time and 20.5 t/s prompt eval time.

Overall, the changes you made lead to a 66% performance increase on eval time and around a 100% performance increase on prompt eval time vs CPU-only on a Threadripper 7965WX, 512 GB memory, 5090. You are an absolute genius.

If you have some proper benches you want me to run, let me know.

@Reactantvr

Reactantvr commented Feb 21, 2025

Another update.

llama-server --model DeepSeek-R1-Q4_K_M-00001-of-00011.gguf --flash-attn --threads 40 --temp 0.6 --min-p 0.05 --ctx-size 4096 --no-mmap -ngl 62 -ot exps=CPU

This uses up all my threads completely and I get a small performance bump.

[screenshot]

82% performance increase now on eval time. Makes me really want a 64-core Threadripper now, and also a second 5090 for more context. Using 31 GB of GPU memory right now at 4k. I am also curious whether getting double the system memory bandwidth will make a difference after the 64-core Threadripper upgrade. Maybe I can get up to 10-15 t/s.

Another thing I noticed is that it no longer drops off a cliff in inference speed as I continue a story. After 1k context generated, then another new 2k context, the new t/s was still 8.01 t/s. If this was CPU, it would have dropped by 25% by then.

The only real limiting factor is that 3.5k context seems like the absolute upper limit. I was having trouble with 4k context. I really need more context.

Another issue is that prompt eval time is actually all over the place. Sometimes it is fast, sometimes it does this:

[screenshot]

Another update:

I found that --flash-attn makes no difference. Also, I changed --no-mmap to --mlock and I get consistent prompt eval now, around 12 t/s. Still pretty amazing for running Q4 of R1 on CPU with one consumer-grade GPU.

[screenshot]

Yet another update. This time using Unsloth DeepSeek-R1-UD-IQ2_XXS-00001-of-00004.gguf. This model is still really good and uses only ~200 GB system memory and 27.5 GB GPU memory at 3k context.

[screenshot]

Was able to get 3600 context max with this unsloth model.

The only real limiting factor with this setup is context. Any chance the KV cache allocation override will resolve this issue?

@slaren
Member Author

slaren commented Feb 22, 2025

Thanks for all the testing. I will try to get this ready for merging over the next few days.

@ubergarm

ubergarm commented Feb 23, 2025

@Reactantvr

found that --flash-attn makes no difference.

Yeah, flash-attn is not supported yet in llama.cpp for DeepSeek-R1, pretty sure - check out #11557

I changed --no-mmap to --mlock and I get consistent prompt eval now

This is likely because without those args, llama.cpp defaults to normal mmap(), which may not immediately cache all the weights from disk into the page cache, causing some variation in performance, I think. Using those args forces it all to be pre-allocated and in RAM, ready to go.
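A sketch of the difference (the model path is a placeholder; --mlock additionally pins the pages so they cannot be swapped out):

# default: weights are mmap()ed and paged in lazily, so early runs can be slower
./build/bin/llama-server -m model.gguf -ngl 62 -ot exps=CPU
# read everything into RAM up front instead of relying on the page cache
./build/bin/llama-server -m model.gguf -ngl 62 -ot exps=CPU --no-mmap
# keep mmap but lock the mapped pages in RAM
./build/bin/llama-server -m model.gguf -ngl 62 -ot exps=CPU --mlock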

Thanks for the benchmarks. I'm revisiting this exciting branch after playing with ktransformers and trying to figure out how they get almost 2x inference speeds on R1. I noticed that when I disabled CUDA Graphs on ktransformers, it performs almost the same as llama.cpp again... however, CUDA Graphs only work when not offloading any experts into VRAM, hrmm...

Anyway, enjoying the quest for more tok/sec! Cheers!

@saood06

saood06 commented Feb 23, 2025

The only real limiting factor with this setup is context. Any chance the KV cache allocation override will resolve this issue?

You can try it with the PR the comment is from and the modification shown at the bottom of the comment: #11446 (comment). This further comment showed it worked: #11446 (comment)

@lingster

@Reactantvr Thanks for sharing your test results. Just curious, what is the rating of the DIMM memory you are using in your setup? If you run nvtop, do you see your GPU running at max compute? For me it seems that in my testing CPU memory is the limiting factor/bottleneck.

@Reactantvr

Reactantvr commented Feb 23, 2025

@Reactantvr Thanks for sharing your test results. Just curious, what is the rating of the DIMM memory you are using in your setup? If you run nvtop, do you see your GPU running at max compute? For me it seems that in my testing CPU memory is the limiting factor/bottleneck.

My memory is 8x64 V-Color DDR5 6000 running at 4800. I didn't bother overclocking it yet because I am on 4 CCDs, which should limit me to around 230 GB/s. I assume I would not get more bandwidth until I upgrade to a 64 core Threadripper. Waiting on Shimada Peak for that. I'll probably run it at 6400 once I get that CPU.

I've never used nvtop. Plus, I am doing everything in Windows 10, so not sure if I can use it. I can give you stats from GPU-Z. Looks like GPU load is around 18-19%. This was using DeepSeek-R1-UD-IQ2_XXS.

[screenshot]

@Readon

Readon commented Feb 24, 2025

Works perfectly for me with dual E5 v2 + 2080 Ti running DeepSeek-R1-UD-Q2_K_XL. It boosts token generation speed from 1.8 tps to 3.3 tps. When disabling one NUMA node, it can increase to 3.8 tps.
