
Save prompt after initial prompt eval (fixes #1257) #1258

Closed

Conversation

ivanstepanovftw (Collaborator) commented Apr 30, 2023
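
The idea, per the title, is to write the session file right after the initial prompt eval rather than only at the end of generation. A minimal sketch of that idea, not the actual diff; path_session and session_tokens are the variable names used by examples/main/main.cpp of this era:

```cpp
// Hedged sketch: save the evaluated prompt state immediately, so a later run
// can restore the KV cache before any tokens are generated.
// llama_save_session_file writes the KV cache plus the token history.
if (!path_session.empty()) {
    llama_save_session_file(ctx, path_session.c_str(),
                            session_tokens.data(), session_tokens.size());
}
```

The transcripts below compare the two behaviors: before the change, the second run restores the saved state and continues differently from the first run; after the fix, both runs produce identical output, though the second run re-evaluates the prompt instead of reusing the session.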

BEFORE, First run

/p/i/llama.cpp/cmake-build-relwithdebinfo/bin/main -m models/7B/ggml-model-q4_0_0.bin --prompt "Hello World War" --threads 8 --seed 1 --n_predict 16 --ignore-eos --repeat_last_n 0 --temp 0 --session session/7B/hello.bin
main: seed = 1
llama.cpp: loading model from models/7B/ggml-model-q4_0_0.bin
llama_model_load_internal: format     = ggjt v1 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =  59.11 KB
llama_model_load_internal: mem required  = 5809.32 MB (+ 1026.00 MB per state)
llama_init_from_file: kv self size  =  256.00 MB

system_info: n_threads = 8 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 | 
main: attempting to load saved session from session/7B/hello.bin..
main: session file does not exist, will create
sampling: repeat_last_n = 0, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.000000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 16, n_keep = 0


 Hello World Warriors!
I'm so excited to be here today to share my first
llama_print_timings:        load time =   431.16 ms
llama_print_timings:      sample time =     0.41 ms /    16 runs   (    0.03 ms per run)
llama_print_timings: prompt eval time =   278.80 ms /     4 tokens (   69.70 ms per token)
llama_print_timings:        eval time =  2586.86 ms /    15 runs   (  172.46 ms per run)
llama_print_timings:       total time =  3381.52 ms

Process finished with exit code 0

BEFORE, Second run

/p/i/llama.cpp/cmake-build-relwithdebinfo/bin/main -m models/7B/ggml-model-q4_0_0.bin --prompt "Hello World War" --threads 8 --seed 1 --n_predict 16 --ignore-eos --repeat_last_n 0 --temp 0 --session session/7B/hello.bin
main: seed = 1
llama.cpp: loading model from models/7B/ggml-model-q4_0_0.bin
llama_model_load_internal: format     = ggjt v1 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =  59.11 KB
llama_model_load_internal: mem required  = 5809.32 MB (+ 1026.00 MB per state)
llama_init_from_file: kv self size  =  256.00 MB

system_info: n_threads = 8 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 | 
main: attempting to load saved session from session/7B/hello.bin..
main: loaded 270726188 bytes of session data!
main: session file has exact match for prompt!
sampling: repeat_last_n = 0, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.000000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 16, n_keep = 0


 Hello World War II was a global military conflict that lasted from 1939 to
llama_print_timings:        load time =   547.61 ms
llama_print_timings:      sample time =     0.46 ms /    16 runs   (    0.03 ms per run)
llama_print_timings: prompt eval time =     0.00 ms /     1 tokens (    0.00 ms per token)
llama_print_timings:        eval time =  2687.25 ms /    16 runs   (  167.95 ms per run)
llama_print_timings:       total time =  3055.03 ms

Process finished with exit code 0

AFTER FIX, First run

/p/i/llama.cpp/cmake-build-relwithdebinfo/bin/main -m models/7B/ggml-model-q4_0_0.bin --prompt "Hello World War" --threads 8 --seed 1 --n_predict 16 --ignore-eos --repeat_last_n 0 --temp 0 --session session/7B/hello.bin
main: seed = 1
llama.cpp: loading model from models/7B/ggml-model-q4_0_0.bin
llama_model_load_internal: format     = ggjt v1 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =  59.11 KB
llama_model_load_internal: mem required  = 5809.32 MB (+ 1026.00 MB per state)
llama_init_from_file: kv self size  =  256.00 MB

system_info: n_threads = 8 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 | 
main: attempting to load saved session from session/7B/hello.bin..
main: session file does not exist, will create
sampling: repeat_last_n = 0, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.000000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 16, n_keep = 0


 Hello World Warriors!
I'm so excited to be here today to share my first
llama_print_timings:        load time =   885.16 ms
llama_print_timings:      sample time =     0.37 ms /    16 runs   (    0.02 ms per run)
llama_print_timings: prompt eval time =   305.21 ms /     4 tokens (   76.30 ms per token)
llama_print_timings:        eval time =  2555.70 ms /    15 runs   (  170.38 ms per run)
llama_print_timings:       total time =  3445.11 ms

Process finished with exit code 0

AFTER FIX, Second run

/p/i/llama.cpp/cmake-build-relwithdebinfo/bin/main -m models/7B/ggml-model-q4_0_0.bin --prompt "Hello World War" --threads 8 --seed 1 --n_predict 16 --ignore-eos --repeat_last_n 0 --temp 0 --session session/7B/hello.bin
main: seed = 1
llama.cpp: loading model from models/7B/ggml-model-q4_0_0.bin
llama_model_load_internal: format     = ggjt v1 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =  59.11 KB
llama_model_load_internal: mem required  = 5809.32 MB (+ 1026.00 MB per state)
llama_init_from_file: kv self size  =  256.00 MB

system_info: n_threads = 8 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 | 
main: attempting to load saved session from session/7B/hello.bin..
main: loaded 270726188 bytes of session data!
sampling: repeat_last_n = 0, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.000000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 16, n_keep = 0


 Hello World Warriors!
I'm so excited to be here today to share my first
llama_print_timings:        load time =  1544.60 ms
llama_print_timings:      sample time =     0.35 ms /    16 runs   (    0.02 ms per run)
llama_print_timings: prompt eval time =   341.93 ms /     4 tokens (   85.48 ms per token)
llama_print_timings:        eval time =  2585.20 ms /    15 runs   (  172.35 ms per run)
llama_print_timings:       total time =  4134.06 ms

Process finished with exit code 0

ivanstepanovftw requested a review from ejones on April 30, 2023 at 20:27

ivanstepanovftw (Collaborator, Author)

I probably need to bump the session file version.
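
If the change affects how session files are written, a version bump would make older files fail to load cleanly rather than be misread. A hypothetical illustration, assuming the LLAMA_SESSION_VERSION constant in llama.h:

```cpp
// Hypothetical: bump the session format version so llama_load_session_file
// rejects files written by older builds instead of misinterpreting them.
#define LLAMA_SESSION_VERSION 2   // was 1
```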

ivanstepanovftw (Collaborator, Author) commented Apr 30, 2023

Closing this because I am very confused, and it does not solve the issue.

ejones (Collaborator) commented May 1, 2023

FWIW, we should probably do/switch to this, but based on my understanding it's not sufficient to just save after eval. If the prompt eval gets batched, that code path will be hit multiple times during prompt eval. Some additional bookkeeping is required (probably just storing the prompt length and comparing n_past to it).
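
A minimal sketch of that bookkeeping, assuming the names n_past, embd_inp, path_session, and session_tokens from examples/main/main.cpp; needs_session_save is a hypothetical flag:

```cpp
// Record the prompt length once, before the generation loop starts.
const int n_prompt = (int) embd_inp.size();
bool needs_session_save = !path_session.empty();

// Inside the loop, after each llama_eval() batch: only save once the whole
// (possibly batched) prompt has actually been evaluated.
if (needs_session_save && n_past >= n_prompt) {
    llama_save_session_file(ctx, path_session.c_str(),
                            session_tokens.data(), session_tokens.size());
    needs_session_save = false;  // save exactly once
}
```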
