
kv model size is twice as large as llama.cpp/main #66

Closed
hengjiUSTC opened this issue Apr 10, 2023 · 6 comments

Comments

@hengjiUSTC

  1. Issue one
    I use the following code to load the model: model, tokenizer = LlamaCppModel.from_pretrained(MODEL_PATH), and got this output:
llama_model_load: loading model from '/Users/jiheng/Documents/meta/code/fc/llama.cpp/models/ggml-model-q4_1_FineTuned.bin' - please wait ...
llama_model_load: n_vocab = 32000
llama_model_load: n_ctx   = 512
llama_model_load: n_embd  = 5120
llama_model_load: n_mult  = 256
llama_model_load: n_head  = 40
llama_model_load: n_layer = 40
llama_model_load: n_rot   = 128
llama_model_load: f16     = 3
llama_model_load: n_ff    = 13824
llama_model_load: n_parts = 2
llama_model_load: type    = 2
llama_model_load: ggml map size = 9311.39 MB
llama_model_load: ggml ctx size = 101.25 KB
llama_model_load: mem required  = 11359.49 MB (+ 3216.00 MB per state)
llama_model_load: loading tensors from '/Users/jiheng/Documents/meta/code/fc/llama.cpp/models/ggml-model-q4_1_FineTuned.bin'
llama_model_load: model size =  9310.96 MB / num tensors = 363
llama_init_from_file: kv self size  =  800.00 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 | 

But when loading with llama.cpp/main:

llama_model_load: loading model from 'models/ggml-model-q4_1_Finetuned.bin' - please wait ...
llama_model_load: n_vocab = 32000
llama_model_load: n_ctx   = 512
llama_model_load: n_embd  = 5120
llama_model_load: n_mult  = 256
llama_model_load: n_head  = 40
llama_model_load: n_layer = 40
llama_model_load: n_rot   = 128
llama_model_load: f16     = 3
llama_model_load: n_ff    = 13824
llama_model_load: n_parts = 2
llama_model_load: type    = 2
llama_model_load: ggml map size = 9311.39 MB
llama_model_load: ggml ctx size = 101.25 KB
llama_model_load: mem required  = 11359.49 MB (+ 1608.00 MB per state)
llama_model_load: loading tensors from 'models/ggml-model-q4_1_Finetuned.bin'
llama_model_load: model size =  9310.96 MB / num tensors = 363
llama_init_from_file: kv self size  =  400.00 MB

system_info: n_threads = 7 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 | 

The kv self size is different.

  2. For this specific model, I couldn't get any result back from llama-cpp-python, but llama.cpp/main gives a correct response.
hengjiUSTC (Author)

Probably related to #49; the same slowness is also observed.

abetlen (Owner) commented Apr 10, 2023

@hengjiUSTC those outputs look slightly different. Do you mind cloning the repo and comparing against the version of llama.cpp that's pinned here? It should be fairly recent (just updated earlier today).

hengjiUSTC (Author) commented Apr 11, 2023

I updated both the llama.cpp repo and vendor/llama.cpp to the same (newest) version and rebuilt the C++ code. The kv self size is still different. Link to the model: https://huggingface.co/Pi3141/gpt4-x-alpaca-native-13B-ggml/tree/main; it also happens with another model: https://huggingface.co/Pi3141/alpaca-native-13B-ggml/tree/main. I don't think it is caused by a version mismatch of the llama.cpp code.
I use very simple code in Python:

>>> from llama_cpp import Llama
>>> llm = Llama(model_path="/Users/jiheng/Documents/meta/code/llama.cpp/models/ggml-model-q4_1_FineTuned.bin")
llama.cpp: loading model from /Users/jiheng/Documents/meta/code/llama.cpp/models/ggml-model-q4_1_FineTuned.bin
llama_model_load_internal: format     = ggjt v1 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 5120
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 40
llama_model_load_internal: n_layer    = 40
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: f16        = 3
llama_model_load_internal: n_ff       = 13824
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size =  73.73 KB
llama_model_load_internal: mem required  = 11359.03 MB (+ 3216.00 MB per state)
llama_init_from_file: kv self size  =  800.00 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 | 

And on the llama.cpp side:

./main -m models/ggml-model-q4_1_Finetuned.bin --color -f ./prompts/alpaca.txt -ins -b 256 --top_k 10000 --temp 0.2 --repeat_penalty 1 -t 7

main: seed = 1681179629
llama.cpp: loading model from models/ggml-model-q4_1_Finetuned.bin
llama_model_load_internal: format     = ggjt v1 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 5120
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 40
llama_model_load_internal: n_layer    = 40
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: f16        = 3
llama_model_load_internal: n_ff       = 13824
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size =  73.73 KB
llama_model_load_internal: mem required  = 11359.03 MB (+ 1608.00 MB per state)
llama_init_from_file: kv self size  =  400.00 MB

system_info: n_threads = 7 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 | 
main: interactive mode on.
Reverse prompt: '### Instruction:

'
sampling: temp = 0.200000, top_k = 10000, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.000000
generate: n_ctx = 512, n_batch = 256, n_predict = 128, n_keep = 21

hengjiUSTC (Author) commented Apr 11, 2023

I changed to a different computer and used the model https://huggingface.co/Pi3141/alpaca-native-13B-ggml/tree/main; the result is still the same.
llama-cpp-python (kv size is doubled):

>>> from llama_cpp import Llama
>>> lm = Llama(model_path="/Users/jiheng/Downloads/ggml-model-q4_1.bin")
llama.cpp: loading model from /Users/jiheng/Downloads/ggml-model-q4_1.bin
llama_model_load_internal: format     = ggjt v1 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 6656
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 52
llama_model_load_internal: n_layer    = 60
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: f16        = 3
llama_model_load_internal: n_ff       = 17920
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 30B
llama_model_load_internal: ggml ctx size = 110.30 KB
llama_model_load_internal: mem required  = 25573.12 MB (+ 6248.00 MB per state)
llama_init_from_file: kv self size  = 1560.00 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |

llama.cpp

source examples/alpaca.sh
main: seed = 1681185433
llama.cpp: loading model from /Users/jiheng/Downloads/ggml-model-q4_1.bin
llama_model_load_internal: format     = ggjt v1 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 6656
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 52
llama_model_load_internal: n_layer    = 60
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: f16        = 3
llama_model_load_internal: n_ff       = 17920
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 30B
llama_model_load_internal: ggml ctx size = 110.30 KB
llama_model_load_internal: mem required  = 25573.12 MB (+ 3124.00 MB per state)
llama_init_from_file: kv self size  =  780.00 MB

system_info: n_threads = 7 / 12 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 | 
main: interactive mode on.
Reverse prompt: '### Instruction:

'
sampling: temp = 0.200000, top_k = 10000, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.000000
generate: n_ctx = 512, n_batch = 256, n_predict = 128, n_keep = 21


The result is the same even when I directly use the build under vendor/llama.cpp.

ghost commented Apr 15, 2023

The reason for the larger kv size is that f16_kv is set to false by default, causing llama.cpp to use a 32-bit kv cache. The official llama.cpp default is a 16-bit kv cache, which is therefore half the size. Currently I'm unsure why this lib uses the 32-bit cache by default.
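
For reference, the reported numbers line up with a simple size estimate: the cache holds one K and one V vector of length n_embd per layer per context position, so its size is n_layer * n_ctx * n_embd * 2 * bytes_per_element. Below is a minimal back-of-the-envelope sketch (not part of the original thread) that reproduces the figures printed in the logs above:

# KV cache size check, assuming one K and one V vector of length n_embd
# per layer per context position (matches the values llama.cpp prints).
def kv_cache_mb(n_layer, n_ctx, n_embd, bytes_per_elem):
    n_elems = n_layer * n_ctx * n_embd * 2  # two tensors: K and V
    return n_elems * bytes_per_elem / (1024 * 1024)

# 13B model from the logs: n_layer=40, n_ctx=512, n_embd=5120
print(kv_cache_mb(40, 512, 5120, 2))  # f16 cache -> 400.0 MB (llama.cpp/main)
print(kv_cache_mb(40, 512, 5120, 4))  # f32 cache -> 800.0 MB (llama-cpp-python)

# 30B model from the logs: n_layer=60, n_ctx=512, n_embd=6656
print(kv_cache_mb(60, 512, 6656, 2))  # f16 cache -> 780.0 MB
print(kv_cache_mb(60, 512, 6656, 4))  # f32 cache -> 1560.0 MB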

abetlen (Owner) commented Apr 15, 2023

The reason for the larger kv size is that f16_kv is set to false by default, causing llama.cpp to use a 32-bit kv cache. The official llama.cpp default is a 16-bit kv cache, which is therefore half the size. Currently I'm unsure why this lib uses the 32-bit cache by default.

Thank you. I think this default changed since I originally implemented the parameters for the Llama class; I'll fix this.
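
In the meantime, a possible workaround (an untested sketch, assuming the Llama constructor exposes the f16_kv flag discussed above; the model path is a placeholder) is to request the 16-bit cache explicitly:

from llama_cpp import Llama

# Explicitly ask for a 16-bit KV cache to match llama.cpp/main's default.
llm = Llama(
    model_path="/path/to/ggml-model-q4_1.bin",
    f16_kv=True,
)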
