I have an M2 MBP with 8 GB of memory. I compiled the code from the latest main branch and tested the quantized llama-7b model, but generation is extremely slow: it takes more than 3 minutes just to generate 48 tokens. Please see my logs below. I followed the steps in the README exactly and used the same example for the test, with all 8 threads enabled via -t 8. Is there anything I'm missing? Thanks in advance!
-> % make -j && ./main -m ./models/7B/ggml-model-q4_0.bin -t 8 -p "Building a website can be done in 10 simple steps:" -n 48
I llama.cpp build info:
I UNAME_S: Darwin
I UNAME_P: arm
I UNAME_M: arm64
I CFLAGS: -I. -O3 -DNDEBUG -std=c11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -pthread -DGGML_USE_ACCELERATE
I CXXFLAGS: -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread
I LDFLAGS: -framework Accelerate
I CC: Apple clang version 14.0.3 (clang-1403.0.22.14.1)
I CXX: Apple clang version 14.0.3 (clang-1403.0.22.14.1)
make: Nothing to be done for `default'.
main: seed = 1682010023
llama.cpp: loading model from ./models/7B/ggml-model-q4_0.bin
llama_model_load_internal: format = ggjt v1 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 512
llama_model_load_internal: n_embd = 4096
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 32
llama_model_load_internal: n_layer = 32
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 2 (mostly Q4_0)
llama_model_load_internal: n_ff = 11008
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 59.11 KB
llama_model_load_internal: mem required = 5809.32 MB (+ 1026.00 MB per state)
llama_init_from_file: kv self size = 256.00 MB
system_info: n_threads = 8 / 8 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
sampling: temp = 0.800000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.100000
generate: n_ctx = 512, n_batch = 8, n_predict = 48, n_keep = 0
Building a website can be done in 10 simple steps:
Find a name for your site. You will want to use this name on your business card, stationery, and on the Web. Try not to use hyphens, underscores or numbers. The best names are short,
llama_print_timings: load time = 2603.27 ms
llama_print_timings: sample time = 40.35 ms / 48 runs ( 0.84 ms per run)
llama_print_timings: prompt eval time = 5681.11 ms / 14 tokens ( 405.79 ms per token)
llama_print_timings: eval time = 196000.64 ms / 47 runs ( 4170.23 ms per run)
llama_print_timings: total time = 201747.07 ms
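For what it's worth, roughly 4170 ms per token is far slower than a q4_0 7B model normally runs on an M2, and the "mem required = 5809.32 MB (+ 1026.00 MB per state)" line leaves very little headroom on an 8 GB machine, so the likely cause is that the weights cannot stay resident and are being paged in from disk on every evaluation. A minimal check, assuming memory pressure rather than a build problem (these are stock macOS tools, not llama.cpp commands), is to look at swap and free memory right before launching ./main:

sysctl vm.swapusage   # how much swap is already in use
vm_stat               # free / active / inactive / wired page counts

If swap usage is already high and free pages are low before the run, the 5.8 GB of weights have nowhere to sit and every token ends up waiting on disk; freeing up RAM (closing other apps, or simply rebooting) before running should make a big difference.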
Cool, thanks. Rebooting the machine makes things work! Interestingly, before and after the reboot I observed similar memory consumption for the main process (200 MB+ in both cases, according to top), yet the generation speed is significantly improved.
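A plausible explanation for the unchanged top figure, assuming this build memory-maps the model file (the ggjt format was introduced for mmap loading), is that the weights are accounted to the file cache rather than to the process's resident size, so top's ~200 MB largely reflects the KV cache and working buffers and says nothing about whether the 5.8 GB of weights are actually resident. One way to confirm, again with a stock macOS tool rather than anything in llama.cpp, is to watch paging activity while tokens are being generated:

vm_stat 1   # print paging statistics every second; pageins climbing steadily
            # during generation means the weights are being re-read from disk

After a reboot there is enough free RAM for the mapped weights to stay cached, which is consistent with the large speedup even though the per-process footprint in top looks the same.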