Llama 4 / MoE idea #12781
-
You can already load the entire model into GPU memory and swap between GPU memory and RAM by enabling GGML_CUDA_ENABLE_UNIFIED_MEMORY as described under https://github.com/ggml-org/llama.cpp/blob/master/docs/build.md#unified-memory. I wouldn't really call it fast. For imatrix computation, keeping the model in RAM while using GPU acceleration is twice as fast, but feel free to play around with it. The main reason GGML_CUDA_ENABLE_UNIFIED_MEMORY is slow is that copying data from RAM to GPU memory is limited to PCIe 4.0 x16 speed, which is much slower than RAM speed, which in turn is much slower than GPU memory speed. The caching method you describe, with its manually optimized GPU memory management, will likely perform better than GGML_CUDA_ENABLE_UNIFIED_MEMORY, but I doubt it will be faster than just running the model from CPU with GPU acceleration; I could be wrong, though. If you are interested in how unified memory works, I recommend reading https://developer.nvidia.com/blog/unified-memory-cuda-beginners/
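To make the unified-memory cost concrete, here is a minimal standalone CUDA sketch (not llama.cpp code, and with error checking omitted): it contrasts a `cudaMallocManaged` buffer, whose pages are migrated to the GPU over PCIe on first touch, with a buffer made device-resident via one explicit copy. The 1 GiB buffer size and the trivial kernel are placeholders chosen only to illustrate the migration cost.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder "work" kernel; stands in for a GEMM that reads the weights.
__global__ void scale(float *w, size_t n, float s) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) w[i] *= s;
}

int main() {
    const size_t n     = 1ull << 28;        // ~1 GiB of fp32 "weights"
    const size_t bytes = n * sizeof(float);

    // Unified memory: visible to CPU and GPU, pages migrate on demand.
    float *managed = nullptr;
    cudaMallocManaged(&managed, bytes);
    for (size_t i = 0; i < n; ++i) managed[i] = 1.0f;  // pages now resident on the CPU side

    cudaEvent_t t0, t1;
    cudaEventCreate(&t0); cudaEventCreate(&t1);

    // First GPU touch: pages are pulled across PCIe before the kernel can use them.
    cudaEventRecord(t0);
    scale<<<(unsigned)((n + 255) / 256), 256>>>(managed, n, 2.0f);
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);
    float ms_managed = 0.0f;
    cudaEventElapsedTime(&ms_managed, t0, t1);

    // Device-resident buffer: pay the copy once, then run at VRAM bandwidth.
    float *device = nullptr;
    cudaMalloc(&device, bytes);
    cudaMemcpy(device, managed, bytes, cudaMemcpyDefault);

    cudaEventRecord(t0);
    scale<<<(unsigned)((n + 255) / 256), 256>>>(device, n, 2.0f);
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);
    float ms_device = 0.0f;
    cudaEventElapsedTime(&ms_device, t0, t1);

    printf("managed (first touch on GPU): %.2f ms\n", ms_managed);
    printf("device-resident:              %.2f ms\n", ms_device);

    cudaFree(managed); cudaFree(device);
    return 0;
}
```

The gap between the two timings is essentially the PCIe transfer described above, which is the same cost any per-token expert swapping scheme has to pay whenever the needed weights are not already in VRAM.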
-
As nicoboss already wrote, the bottleneck seems to be the on-demand copying of the experts into GPU memory. In Scout, every MLP is half routed experts and half shared expert. In Maverick, an MoE layer is only used in every other layer. It may be more efficient to perform inference in a way where the routed experts are computed on the CPU, which loads their weights from system RAM, while the attention layers and shared experts reside in GPU memory. The problem is, of course, that the CPU will bottleneck inference, since it dominates the MLP computation time. On the other hand, it should be possible to fit all the shared expert weights of Scout completely into 24 GB of VRAM (it's only about 6B parameters). The same goes for Maverick.
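A quick back-of-envelope check of that VRAM claim, as host-only code (compiles with nvcc or g++). The 6B GPU-resident parameter count is taken from the comment above; the bits-per-weight values are rough typical figures for common GGUF quantizations, not measured numbers.

```cuda
#include <cstdio>

int main() {
    const double gpu_resident_params = 6e9;   // shared experts + attention (approx., from the discussion)
    const double vram_budget_gib     = 24.0;  // e.g. a 24 GB consumer GPU

    struct Quant { const char *name; double bits_per_weight; };
    const Quant quants[] = {
        {"F16",    16.0},
        {"Q8_0",    8.5},
        {"Q4_K_M",  4.8},  // rough average; varies per tensor
    };

    for (const Quant &q : quants) {
        double gib = gpu_resident_params * q.bits_per_weight / 8.0 / (1024.0 * 1024.0 * 1024.0);
        printf("%-7s ~%5.1f GiB of weights -> %s in %.0f GiB VRAM (before KV cache/activations)\n",
               q.name, gib, gib < vram_budget_gib ? "fits" : "does NOT fit", vram_budget_gib);
    }
    return 0;
}
```

Even at F16 the GPU-resident part comes out around 11 GiB, so the shared experts and attention weights fit comfortably; the routed expert weights would then stay in system RAM, which is exactly why the CPU ends up dominating the MLP time in this split.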
-
I'm wondering if this idea could work.
I’d like to run Llama 4 Scout (17B active, 16 experts total) on a machine with 24GB VRAM and large RAM.
Idea:
Do you think this makes sense in llama.cpp's architecture?
Or is it too slow or complex to be practical?