
LLaMA NUMA could be better #1437

Closed

@zrm

Description

llama.cpp is memory bound, so let's see what has a lot of memory bandwidth:

NVIDIA V100 32GB: 900GB/s
2S Epyc 9000 (12xDDR5-4800/S): 922GB/s
NVIDIA A100 40GB: 1555GB/s
2S Xeon Max (HBM): 2TB/s
NVIDIA A100 80GB: 2TB/s
8S Xeon Scalable v4 (8xDDR5-4800/S): 2.45TB/s

NUMA systems have a lot of bandwidth because each socket brings its own memory channels (or HBM, for Xeon Max). Okay, but the cheapest thing there is ~$6000. What if I'm not rich?

(~$350 w/ 16GB, max ~128GB) common PC (2xDDR4-3200): 51GB/s
(~$450 w/ 8GB, ~$600 w/ 16GB) Mac Mini M1: 68GB/s
(~$600 w/ 8GB, ~$800 w/ 16GB) Mac Mini M2: 100GB/s
(~$200 w/ 64GB, max ~768GB) 2S Xeon E5 v1 (4xDDR3-1600/S): 102GB/s [no F16C so f16 models slower]
(~$250 w/ 64GB, max ~768GB) 2S Xeon E5 v2 (4xDDR3-1866/S): 119GB/s
(~$350 w/ 128GB, max ~3000GB) 2S Xeon E5 v4 (4xDDR4-2400/S): 154GB/s
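
(Sanity check on those numbers: peak bandwidth is sockets × channels × MT/s × 8 bytes per transfer, so the 2S E5 v2 comes to 2 × 4 × 1866 × 8 B ≈ 119 GB/s, and the common PC to 2 × 3200 × 8 B ≈ 51 GB/s.)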

Hmm. Xeon E5-2690 v1 for $9 each on eBay. Let's see how we do.

$ lscpu
...
CPU(s): 16
On-line CPU(s) list: 0-15
Thread(s) per core: 2
Core(s) per socket: 8
Socket(s): 1
NUMA node(s): 1

$ ./main -m models/7B/ggml-model-q4_0.bin -n 512 -p "Building a website can be done in 10 simple steps:"
...
llama_print_timings: sample time = 406.79 ms / 512 runs ( 0.79 ms per token)
llama_print_timings: prompt eval time = 27899.73 ms / 271 tokens ( 102.95 ms per token)
llama_print_timings: eval time = 74773.93 ms / 510 runs ( 146.62 ms per token)

Not terrible for 11-year-old hardware. Let's try it with two sockets:

$ lscpu
...
CPU(s): 32
On-line CPU(s) list: 0-31
Thread(s) per core: 2
Core(s) per socket: 8
Socket(s): 2
NUMA node(s): 2

$ ./main -m models/7B/ggml-model-q4_0.bin -n 512 -p "Building a website can be done in 10 simple steps:"
...
llama_print_timings: sample time = 438.34 ms / 512 runs ( 0.86 ms per token)
llama_print_timings: prompt eval time = 27083.17 ms / 271 tokens ( 99.94 ms per token)
llama_print_timings: eval time = 129373.98 ms / 510 runs ( 253.67 ms per token)

Twice as many cores, twice as much memory bandwidth, and it's slower.
Oh: get_num_physical_cores() is broken. It returns only 8 of the 16 physical cores because the "cpu cores" field in /proc/cpuinfo is per-socket. I submitted a pull request.
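
For illustration, one way to get a socket-aware count is to deduplicate (physical id, core id) pairs instead of trusting the per-socket "cpu cores" field. A minimal sketch, not the code from the pull request:

#include <fstream>
#include <set>
#include <string>
#include <utility>

// Hypothetical sketch of a socket-aware physical core count on Linux.
// Each logical CPU block in /proc/cpuinfo lists "physical id" (socket)
// and "core id"; hyperthread siblings share the same pair.
int count_physical_cores() {
    std::ifstream cpuinfo("/proc/cpuinfo");
    std::set<std::pair<int, int>> cores;  // unique (socket, core) pairs
    int physical_id = -1;
    std::string line;
    while (std::getline(cpuinfo, line)) {
        if (line.rfind("physical id", 0) == 0) {
            physical_id = std::stoi(line.substr(line.find(':') + 1));
        } else if (line.rfind("core id", 0) == 0) {
            int core_id = std::stoi(line.substr(line.find(':') + 1));
            cores.insert({physical_id, core_id});  // dedups hyperthreads
        }
    }
    return (int) cores.size();
}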

$ ./main -m models/7B/ggml-model-q4_0.bin -n 512 -p "Building a website can be done in 10 simple steps:" -t 16
...
llama_print_timings: sample time = 451.48 ms / 512 runs ( 0.88 ms per token)
llama_print_timings: prompt eval time = 16092.04 ms / 271 tokens ( 59.38 ms per token)
llama_print_timings: eval time = 102018.05 ms / 510 runs ( 200.04 ms per token)

Well, the prompt eval time is better. Maybe it benefits from hyperthreading?

$ ./main -m models/7B/ggml-model-q4_0.bin -n 512 -p "Building a website can be done in 10 simple steps:" -t 32
...
llama_print_timings: sample time = 399.47 ms / 512 runs ( 0.78 ms per token)
llama_print_timings: prompt eval time = 14734.68 ms / 271 tokens ( 54.37 ms per token)
llama_print_timings: eval time = 97250.82 ms / 510 runs ( 190.69 ms per token)

Still something's not right.

$ numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 16 17 18 19 20 21 22 23
node 0 size: 96609 MB
node 0 free: 96320 MB
node 1 cpus: 8 9 10 11 12 13 14 15 24 25 26 27 28 29 30 31
node 1 size: 64506 MB
node 1 free: 60183 MB
node distances:
node   0   1
  0:  10  20
  1:  20  10

There it is. The whole model is loaded into the memory of one node: the file is mmap()ed and faulted in from a single thread, so the kernel's first-touch policy puts every page on that thread's node. Let's drop the page cache (otherwise the pages already resident on node 0 get reused) and try node interleave.

# echo 3 > /proc/sys/vm/drop_caches

$ numactl --interleave=0-1 ./main -m models/7B/ggml-model-q4_0.bin -n 512 -p "Building a website can be done in 10 simple steps:" -t 32
...
llama_print_timings: sample time = 397.83 ms / 512 runs ( 0.78 ms per token)
llama_print_timings: prompt eval time = 14894.56 ms / 271 tokens ( 54.96 ms per token)
llama_print_timings: eval time = 57045.66 ms / 510 runs ( 111.85 ms per token)

That's an improvement. Now it's >30% faster than a single socket and basically the same speed as my Ryzen 5 5600G from 2021, for about half the price. Let's see what happens on a machine with 4 NUMA nodes (16C/32T):

$ ./main -m models/7B/ggml-model-q4_0.bin -n 512 -p "Building a website can be done in 10 simple steps:" -t 16
...
llama_print_timings: sample time = 456.06 ms / 512 runs ( 0.89 ms per token)
llama_print_timings: prompt eval time = 13954.33 ms / 271 tokens ( 51.49 ms per token)
llama_print_timings: eval time = 108925.89 ms / 510 runs ( 213.58 ms per token)

$ ./main -m models/7B/ggml-model-q4_0.bin -n 512 -p "Building a website can be done in 10 simple steps:" -t 32
...
llama_print_timings: sample time = 514.30 ms / 512 runs ( 1.00 ms per token)
llama_print_timings: prompt eval time = 14288.35 ms / 271 tokens ( 52.72 ms per token)
llama_print_timings: eval time = 109354.09 ms / 510 runs ( 214.42 ms per token)

# echo 3 > /proc/sys/vm/drop_caches

$ numactl --interleave=0-3 ./main -m models/7B/ggml-model-q4_0.bin -n 512 -p "Building a website can be done in 10 simple steps:" -t 16
...
llama_print_timings: sample time = 477.99 ms / 512 runs ( 0.93 ms per token)
llama_print_timings: prompt eval time = 14164.87 ms / 271 tokens ( 52.27 ms per token)
llama_print_timings: eval time = 67402.83 ms / 510 runs ( 132.16 ms per token)

$ numactl --interleave=0-3 ./main -m models/7B/ggml-model-q4_0.bin -n 512 -p "Building a website can be done in 10 simple steps:" -t 32
...
llama_print_timings: sample time = 489.53 ms / 512 runs ( 0.96 ms per token)
llama_print_timings: prompt eval time = 14511.16 ms / 271 tokens ( 53.55 ms per token)
llama_print_timings: eval time = 48623.21 ms / 510 runs ( 95.34 ms per token)

125% faster is alright. (Eval drops from ~214 to ~95 ms per token versus the same 32-thread run without interleave.)

I can submit a pull request that does the same thing (with a dependency on libnuma) if you want it.
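
The minimal version of that change is a one-time libnuma call before the model is mmap()ed, equivalent to running under numactl --interleave=all. A sketch (numa_available, numa_set_interleave_mask, and numa_all_nodes_ptr are real libnuma API; where to hook this in is up for discussion):

#include <numa.h>  // link with -lnuma

// Hypothetical sketch: ask the kernel to interleave all new allocations,
// including the page-cache pages we fault in while loading the model,
// across every NUMA node.
static void numa_interleave_init() {
    if (numa_available() < 0) {
        return;  // kernel without NUMA support; nothing to do
    }
    // from here on, pages this task touches are spread round-robin
    // across all nodes instead of landing on the faulting thread's node
    numa_set_interleave_mask(numa_all_nodes_ptr);
}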

But this is not the best we can do. Interleave spreads the model's pages across the nodes round-robin, with no regard to which threads will use them, so there is still heavy, slow cross-node memory access; it's just better than all the cores contending for the memory of one node.

The better way is to explicitly load 1/Nth of the model on each node and then have a thread pool per node which is assigned the operations on that subset of the model.
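
A sketch of that structure, assuming libnuma for placement and pthreads for the pools; node_pool, pin_to_node, worker_main, make_pools, shard_bytes, and n_workers are all hypothetical names, not existing ggml code:

#include <numa.h>      // link with -lnuma
#include <pthread.h>
#include <sched.h>
#include <vector>

// Hypothetical sketch of the per-node layout: allocate each node's shard
// of the weights locally with numa_alloc_onnode(), then pin that node's
// worker threads to its CPUs so they only ever read local memory.

struct node_pool {
    int    node;
    void * shard;  // this node's 1/Nth of the model
    std::vector<pthread_t> workers;
};

static void pin_to_node(int node) {
    // restrict the calling thread to the CPUs of one NUMA node
    struct bitmask * cpus = numa_allocate_cpumask();
    numa_node_to_cpus(node, cpus);
    cpu_set_t set;
    CPU_ZERO(&set);
    for (unsigned i = 0; i < cpus->size; ++i) {
        if (numa_bitmask_isbitset(cpus, i)) CPU_SET(i, &set);
    }
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    numa_free_cpumask(cpus);
}

static void * worker_main(void * arg) {
    node_pool * pool = (node_pool *) arg;
    pin_to_node(pool->node);
    // ... pull ops that touch pool->shard from this node's queue ...
    return nullptr;
}

static std::vector<node_pool> make_pools(size_t shard_bytes, int n_workers) {
    int n_nodes = numa_num_configured_nodes();
    std::vector<node_pool> pools(n_nodes);
    for (int n = 0; n < n_nodes; ++n) {
        pools[n].node  = n;
        // physically place this shard on node n, not wherever we fault it
        pools[n].shard = numa_alloc_onnode(shard_bytes, n);
        for (int w = 0; w < n_workers; ++w) {
            pthread_t t;
            pthread_create(&t, nullptr, worker_main, &pools[n]);
            pools[n].workers.push_back(t);
        }
    }
    return pools;
}

The payoff is that every weight read is node-local; the cost is that the graph scheduler has to know which shard each operation touches.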
