Llama 4 / MoE idea #12781
-
You can already load the entire model into GPU memory and swap between GPU memory and RAM by enabling GGML_CUDA_ENABLE_UNIFIED_MEMORY as described under https://github.com/ggml-org/llama.cpp/blob/master/docs/build.md#unified-memory. I wouldn't really call it fast. For imatrix computation, keeping the model in RAM while using GPU acceleration is twice as fast, but feel free to play around with it. The main reason GGML_CUDA_ENABLE_UNIFIED_MEMORY is slow is that copying data from RAM to GPU memory is limited to PCIe 4.0 x16 speed, which is much slower than RAM speed, which in turn is much slower than GPU memory speed. The caching method you describe, with its manually optimized GPU memory management, will likely perform better than GGML_CUDA_ENABLE_UNIFIED_MEMORY, but I doubt it will be faster than just running the model from CPU with GPU acceleration; I could be wrong, though. If you are interested in how unified memory works, I recommend reading https://developer.nvidia.com/blog/unified-memory-cuda-beginners/
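To make the unified-memory cost concrete, here is a minimal standalone CUDA sketch (not llama.cpp code, and with error checking omitted): it contrasts a `cudaMallocManaged` buffer, whose pages are migrated to the GPU over PCIe on first touch, with a buffer made device-resident via one explicit copy. The 1 GiB buffer size and the trivial kernel are placeholders chosen only to illustrate the migration cost.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder "work" kernel; stands in for a GEMM that reads the weights.
__global__ void scale(float *w, size_t n, float s) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) w[i] *= s;
}

int main() {
    const size_t n     = 1ull << 28;        // ~1 GiB of fp32 "weights"
    const size_t bytes = n * sizeof(float);

    // Unified memory: visible to CPU and GPU, pages migrate on demand.
    float *managed = nullptr;
    cudaMallocManaged(&managed, bytes);
    for (size_t i = 0; i < n; ++i) managed[i] = 1.0f;  // pages now resident on the CPU side

    cudaEvent_t t0, t1;
    cudaEventCreate(&t0); cudaEventCreate(&t1);

    // First GPU touch: pages are pulled across PCIe before the kernel can use them.
    cudaEventRecord(t0);
    scale<<<(unsigned)((n + 255) / 256), 256>>>(managed, n, 2.0f);
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);
    float ms_managed = 0.0f;
    cudaEventElapsedTime(&ms_managed, t0, t1);

    // Device-resident buffer: pay the copy once, then run at VRAM bandwidth.
    float *device = nullptr;
    cudaMalloc(&device, bytes);
    cudaMemcpy(device, managed, bytes, cudaMemcpyDefault);

    cudaEventRecord(t0);
    scale<<<(unsigned)((n + 255) / 256), 256>>>(device, n, 2.0f);
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);
    float ms_device = 0.0f;
    cudaEventElapsedTime(&ms_device, t0, t1);

    printf("managed (first touch on GPU): %.2f ms\n", ms_managed);
    printf("device-resident:              %.2f ms\n", ms_device);

    cudaFree(managed); cudaFree(device);
    return 0;
}
```

The gap between the two timings is essentially the PCIe transfer described above, which is the same cost any per-token expert swapping scheme has to pay whenever the needed weights are not already in VRAM.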
-
As nicoboss already wrote, the bottleneck seems to be the on-demand copying of the experts into GPU memory. In Scout, every MLP is half routed experts and half shared expert. In Maverick, an MoE layer is only used in every other layer. It may be more efficient to perform inference in a way where the routed experts are computed on the CPU, which loads their weights from system RAM, while the attention layers and shared experts reside in GPU memory. The problem is, of course, that the CPU will bottleneck inference, since it dominates the MLP computation time. On the other hand, it should be possible to fit all the shared expert weights of Scout completely into 24 GB of VRAM (it's only about 6B parameters). The same goes for Maverick.
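A quick back-of-envelope check of that VRAM claim, as host-only code (compiles with nvcc or g++). The 6B GPU-resident parameter count is taken from the comment above; the bits-per-weight values are rough typical figures for common GGUF quantizations, not measured numbers.

```cuda
#include <cstdio>

int main() {
    const double gpu_resident_params = 6e9;   // shared experts + attention (approx., from the discussion)
    const double vram_budget_gib     = 24.0;  // e.g. a 24 GB consumer GPU

    struct Quant { const char *name; double bits_per_weight; };
    const Quant quants[] = {
        {"F16",    16.0},
        {"Q8_0",    8.5},
        {"Q4_K_M",  4.8},  // rough average; varies per tensor
    };

    for (const Quant &q : quants) {
        double gib = gpu_resident_params * q.bits_per_weight / 8.0 / (1024.0 * 1024.0 * 1024.0);
        printf("%-7s ~%5.1f GiB of weights -> %s in %.0f GiB VRAM (before KV cache/activations)\n",
               q.name, gib, gib < vram_budget_gib ? "fits" : "does NOT fit", vram_budget_gib);
    }
    return 0;
}
```

Even at F16 the GPU-resident part comes out around 11 GiB, so the shared experts and attention weights fit comfortably; the routed expert weights would then stay in system RAM, which is exactly why the CPU ends up dominating the MLP time in this split.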
-
I'm wondering if this idea could work.
I’d like to run Llama 4 Scout (17B active, 16 experts total) on a machine with 24GB VRAM and large RAM.
Idea:
Do you think this makes sense in llama.cpp's architecture?
Or is it too slow or complex to be practical?