
Extremely slow generation on M2 MBP #1085

Closed
weiqi-dyania opened this issue Apr 20, 2023 · 2 comments


@weiqi-dyania

Hi,

I have an M2 MBP with 8 GB of memory. I compiled the code from the latest main branch and tested it on the quantized llama-7b model, but generation is extremely slow: it takes more than 3 minutes just to generate 48 tokens. Please see my logs below. I followed the steps in the README exactly and used the same example for testing. I used all 8 threads by specifying -t 8. Is there anything I'm missing? Thanks in advance!

-> % make -j && ./main -m ./models/7B/ggml-model-q4_0.bin -t 8 -p "Building a website can be done in 10 simple steps:" -n 48
I llama.cpp build info:
I UNAME_S:  Darwin
I UNAME_P:  arm
I UNAME_M:  arm64
I CFLAGS:   -I.              -O3 -DNDEBUG -std=c11   -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -pthread -DGGML_USE_ACCELERATE
I CXXFLAGS: -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread
I LDFLAGS:   -framework Accelerate
I CC:       Apple clang version 14.0.3 (clang-1403.0.22.14.1)
I CXX:      Apple clang version 14.0.3 (clang-1403.0.22.14.1)

make: Nothing to be done for `default'.
main: seed = 1682010023
llama.cpp: loading model from ./models/7B/ggml-model-q4_0.bin
llama_model_load_internal: format     = ggjt v1 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =  59.11 KB
llama_model_load_internal: mem required  = 5809.32 MB (+ 1026.00 MB per state)
llama_init_from_file: kv self size  =  256.00 MB

system_info: n_threads = 8 / 8 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
sampling: temp = 0.800000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.100000
generate: n_ctx = 512, n_batch = 8, n_predict = 48, n_keep = 0


 Building a website can be done in 10 simple steps:
Find a name for your site. You will want to use this name on your business card, stationery, and on the Web. Try not to use hyphens, underscores or numbers. The best names are short,
llama_print_timings:        load time =  2603.27 ms
llama_print_timings:      sample time =    40.35 ms /    48 runs   (    0.84 ms per run)
llama_print_timings: prompt eval time =  5681.11 ms /    14 tokens (  405.79 ms per token)
llama_print_timings:        eval time = 196000.64 ms /    47 runs   ( 4170.23 ms per run)
llama_print_timings:       total time = 201747.07 ms
@prusnak
Collaborator

prusnak commented Apr 20, 2023

The chip has only 4 performance cores - use -t 4, not -t 8.

Also, your system does not have a lot of memory, so my guess is that it is swapping a lot.

Try rebooting the machine so nothing else is loaded and try running llama as the first program.
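As a side note, the thread-count advice can be scripted rather than hard-coded. A minimal sketch, assuming the `hw.perflevel0.physicalcpu` sysctl key that Apple Silicon Macs expose for the performance-core count, with a fallback of 4 (the M2's P-core count) when the key is absent:

```shell
# Query the number of performance cores on Apple Silicon;
# fall back to 4 if the sysctl key is unavailable on this system.
NPERF=$(sysctl -n hw.perflevel0.physicalcpu 2>/dev/null || echo 4)
echo "$NPERF"
```

You can then pass the result to the binary as `-t "$NPERF"` instead of picking a number by hand.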

@prusnak prusnak closed this as not planned Apr 20, 2023
@weiqi-dyania
Author

Cool, thanks. Rebooting the machine fixed it! Interestingly, I observed similar memory consumption of the main process before and after rebooting: both runs consume 200 MB+ according to top, yet generation speed is significantly improved.
