
Extremely slow generation on M2 MBP #1085

Closed
weiqi-dyania opened this issue Apr 20, 2023 · 2 comments


@weiqi-dyania

Hi,

I have an M2 MBP with 8 GB of memory. I compiled the code from the latest main branch and tested it on the quantized llama-7b model, but generation is extremely slow: it takes more than 3 minutes just to generate 48 tokens. Please see my logs below. I followed the steps in the README exactly and used the same example for testing. I used all 8 threads by specifying -t 8. Is there anything I'm missing? Thanks in advance!

-> % make -j && ./main -m ./models/7B/ggml-model-q4_0.bin -t 8 -p "Building a website can be done in 10 simple steps:" -n 48
I llama.cpp build info:
I UNAME_S:  Darwin
I UNAME_P:  arm
I UNAME_M:  arm64
I CFLAGS:   -I.              -O3 -DNDEBUG -std=c11   -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -pthread -DGGML_USE_ACCELERATE
I CXXFLAGS: -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread
I LDFLAGS:   -framework Accelerate
I CC:       Apple clang version 14.0.3 (clang-1403.0.22.14.1)
I CXX:      Apple clang version 14.0.3 (clang-1403.0.22.14.1)

make: Nothing to be done for `default'.
main: seed = 1682010023
llama.cpp: loading model from ./models/7B/ggml-model-q4_0.bin
llama_model_load_internal: format     = ggjt v1 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =  59.11 KB
llama_model_load_internal: mem required  = 5809.32 MB (+ 1026.00 MB per state)
llama_init_from_file: kv self size  =  256.00 MB

system_info: n_threads = 8 / 8 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
sampling: temp = 0.800000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.100000
generate: n_ctx = 512, n_batch = 8, n_predict = 48, n_keep = 0


 Building a website can be done in 10 simple steps:
Find a name for your site. You will want to use this name on your business card, stationery, and on the Web. Try not to use hyphens, underscores or numbers. The best names are short,
llama_print_timings:        load time =  2603.27 ms
llama_print_timings:      sample time =    40.35 ms /    48 runs   (    0.84 ms per run)
llama_print_timings: prompt eval time =  5681.11 ms /    14 tokens (  405.79 ms per token)
llama_print_timings:        eval time = 196000.64 ms /    47 runs   ( 4170.23 ms per run)
llama_print_timings:       total time = 201747.07 ms
@prusnak
Collaborator

prusnak commented Apr 20, 2023

The chip has only 4 performance cores - use -t 4, not -t 8.

Also, your system does not have a lot of memory, so my guess is that it is swapping a lot.

Try rebooting the machine so nothing else is loaded and try running llama as the first program.
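As a side note, the thread-count advice can be scripted rather than hard-coded. A minimal sketch, assuming the `hw.perflevel0.physicalcpu` sysctl key that Apple Silicon Macs expose for the performance-core count, with a fallback of 4 (the M2's P-core count) when the key is absent:

```shell
# Query the number of performance cores on Apple Silicon;
# fall back to 4 if the sysctl key is unavailable on this system.
NPERF=$(sysctl -n hw.perflevel0.physicalcpu 2>/dev/null || echo 4)
echo "$NPERF"
```

You can then pass the result to the binary as `-t "$NPERF"` instead of picking a number by hand.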

@prusnak prusnak closed this as not planned Apr 20, 2023
@weiqi-dyania
Author

Cool, thanks. Rebooting the machine fixed it! Interestingly, I observed similar memory consumption of the main process before and after rebooting: both runs consume 200 MB+ according to top, yet generation speed is significantly improved.
