Replies: 1 comment
You can do this now with |
With the release of DeepSeek, there are now models whose computational requirements differ drastically even within a single layer.
For the example model, it has the following characteristics:
- Medium memory density for the attention part.
- Very low memory density in the fully connected part (under 3% of the expert weights are touched per token).
- High memory density for the KV cache (it is compressed, so it needs more compute to decompress).
GPUs are great at tasks with high compute and low memory requirements. CPUs are much more useful for the fully connected part (with only a handful of the 256 experts triggered for each token), because it would take an inordinate number of cards, or extremely expensive HBM hardware, to hold it all in VRAM.
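To put rough numbers on that, here is a quick back-of-the-envelope sketch. The dimensions below (7168 hidden width, 2048 per-expert FFN width, 58 MoE layers, 8 of 256 routed experts active per token) are my assumptions about a DeepSeek-V3-style config, not figures from this post:

```python
# Back-of-the-envelope: how much of the routed-expert weight mass is
# actually used per token in an assumed DeepSeek-V3-style MoE config.
hidden = 7168            # model width (assumed)
expert_ffn = 2048        # per-expert FFN width (assumed)
moe_layers = 58          # layers with routed experts (assumed)
routed_experts, active_routed = 256, 8   # assumed routing config

params_per_expert = 3 * hidden * expert_ffn              # gate/up/down projections
total_expert_params = params_per_expert * routed_experts * moe_layers
active_expert_params = params_per_expert * active_routed * moe_layers

print(f"total routed-expert params : {total_expert_params / 1e9:.0f} B")    # ~654 B
print(f"touched per token          : {active_expert_params / 1e9:.1f} B")   # ~20 B
print(f"fraction touched per token : {active_routed / routed_experts:.1%}") # ~3%
```

Under these assumptions the experts dominate the model's size, but only a few percent of their weights are read per token, which is why cheap, large system RAM plus CPU compute is a plausible home for them while the dense attention part stays on the GPU.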
Currently, a whole layer is either offloaded or not, which makes the model much slower in both prompt processing (pp) and token generation (tg) than it needs to be.
I'd like to suggest the possibility of offloading these three calculations separately, or at least separating the attention/KV part from the FC part, i.e. offloading 'half a layer'. I suspect DeepSeek and other MoE models would run much faster on a setup such as an Epyc Genoa plus a single 48 GB GPU with this enhancement than without it.
Here, --no-offload-experts would be the new functionality: it leaves the 'experts' in RAM, processed by the CPU, while the rest of the model goes into VRAM and is processed by the GPU. That part is pretty small in comparison (only about 9B parameters), so it plus the compressed cache could easily fit on a single GPU, getting you a significant speedup; I would estimate around 2x at least, given that the attention and MLA cache are quite compute intensive. Loading the first 3 layers (which are 'always active' experts) in VRAM as well would raise the VRAM requirement to about 31 GB and run even faster.
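As a rough sanity check on the "fits on a single GPU" claim, here is a sketch of the VRAM budget for that split. The ~9B non-expert parameter count is taken from the estimate above; the 8-bit weight assumption, the 61-layer depth, and the per-token MLA cache size (512-dim latent plus 64-dim RoPE part per layer) are my assumptions:

```python
# Rough VRAM budget for "attention + dense part on GPU, experts on CPU".
# All figures are estimates under the stated assumptions.
non_expert_params = 9e9        # ~9B non-expert weights (figure from the post above)
bytes_per_weight = 1.0         # assume roughly 8-bit quantization

layers = 61                    # assumed model depth
kv_per_token_per_layer = 512 + 64   # assumed compressed MLA cache entries per layer
kv_bytes = 2                   # fp16 cache entries
context_tokens = 32_768

weights_gb = non_expert_params * bytes_per_weight / 1e9
kv_cache_gb = layers * kv_per_token_per_layer * kv_bytes * context_tokens / 1e9

print(f"non-expert weights  : {weights_gb:5.1f} GB")                # ~9 GB
print(f"compressed KV cache : {kv_cache_gb:5.1f} GB")               # ~2.3 GB at 32k context
print(f"total on the GPU    : {weights_gb + kv_cache_gb:5.1f} GB")  # well under 48 GB
```

Even with less aggressive quantization, longer contexts, or the first 3 dense layers added, a 48 GB card would still have headroom under these assumptions.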
Thoughts?