Replies: 1 comment
You can do this now with |
With the release of DeepSeek, there are now models whose computational requirements differ drastically even within a single layer.
For the example model, it has the following characteristics:
- Medium memory density for the attention part.
- Very low memory density in the fully connected part (under 3% of the expert weights are touched per token).
- High memory density for the KV cache (it is compressed, so it needs more compute to decompress).
GPUs are great at tasks with high compute and low memory requirements. CPUs are much more useful for the fully connected part (with only a handful of the 256 experts triggered for each token), because it would take an inordinate number of cards, or extremely expensive HBM hardware, to hold it all in VRAM.
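To put rough numbers on that, here is a quick back-of-the-envelope sketch. The dimensions below (7168 hidden width, 2048 per-expert FFN width, 58 MoE layers, 8 of 256 routed experts active per token) are my assumptions about a DeepSeek-V3-style config, not figures from this post:

```python
# Back-of-the-envelope: how much of the routed-expert weight mass is
# actually used per token in an assumed DeepSeek-V3-style MoE config.
hidden = 7168            # model width (assumed)
expert_ffn = 2048        # per-expert FFN width (assumed)
moe_layers = 58          # layers with routed experts (assumed)
routed_experts, active_routed = 256, 8   # assumed routing config

params_per_expert = 3 * hidden * expert_ffn              # gate/up/down projections
total_expert_params = params_per_expert * routed_experts * moe_layers
active_expert_params = params_per_expert * active_routed * moe_layers

print(f"total routed-expert params : {total_expert_params / 1e9:.0f} B")    # ~654 B
print(f"touched per token          : {active_expert_params / 1e9:.1f} B")   # ~20 B
print(f"fraction touched per token : {active_routed / routed_experts:.1%}") # ~3%
```

Under these assumptions the experts dominate the model's size, but only a few percent of their weights are read per token, which is why cheap, large system RAM plus CPU compute is a plausible home for them while the dense attention part stays on the GPU.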
Currently, a whole layer is either offloaded or not, which makes the model much slower in both prompt processing (pp) and token generation (tg) than it needs to be.
I'd like to suggest the possibility of offloading these three calculations separately, or at least separating the attention/KV part from the FC part, i.e. offloading 'half a layer'. I suspect DeepSeek and other MoE models would run much faster on a setup such as an Epyc Genoa plus a single 48 GB GPU with this enhancement than without it.
Here, --no-offload-experts would be the new functionality: it leaves the 'experts' in RAM, processed by the CPU, while the rest of the model goes into VRAM and is processed by the GPU. That part is pretty small in comparison (only about 9B parameters), so it plus the compressed cache could easily fit on a single GPU, getting you a significant speedup; I would estimate around 2x at least, given that the attention and MLA cache are quite compute intensive. Loading the first 3 layers (which are 'always active' experts) in VRAM as well would raise the VRAM requirement to about 31 GB and run even faster.
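As a rough sanity check on the "fits on a single GPU" claim, here is a sketch of the VRAM budget for that split. The ~9B non-expert parameter count is taken from the estimate above; the 8-bit weight assumption, the 61-layer depth, and the per-token MLA cache size (512-dim latent plus 64-dim RoPE part per layer) are my assumptions:

```python
# Rough VRAM budget for "attention + dense part on GPU, experts on CPU".
# All figures are estimates under the stated assumptions.
non_expert_params = 9e9        # ~9B non-expert weights (figure from the post above)
bytes_per_weight = 1.0         # assume roughly 8-bit quantization

layers = 61                    # assumed model depth
kv_per_token_per_layer = 512 + 64   # assumed compressed MLA cache entries per layer
kv_bytes = 2                   # fp16 cache entries
context_tokens = 32_768

weights_gb = non_expert_params * bytes_per_weight / 1e9
kv_cache_gb = layers * kv_per_token_per_layer * kv_bytes * context_tokens / 1e9

print(f"non-expert weights  : {weights_gb:5.1f} GB")                # ~9 GB
print(f"compressed KV cache : {kv_cache_gb:5.1f} GB")               # ~2.3 GB at 32k context
print(f"total on the GPU    : {weights_gb + kv_cache_gb:5.1f} GB")  # well under 48 GB
```

Even with less aggressive quantization, longer contexts, or the first 3 dense layers added, a 48 GB card would still have headroom under these assumptions.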
Thoughts?