Different outputs with --session flag (cache prompt #1169) #1257

Closed
ivanstepanovftw opened this issue Apr 30, 2023 · 10 comments
Labels
bug Something isn't working

Comments

@ivanstepanovftw
Collaborator

ivanstepanovftw commented Apr 30, 2023

Introduced in #1169

Current Behavior

Cache miss:

/p/i/llama.cpp/cmake-build-relwithdebinfo/bin/main -m models/7B/ggml-model-q4_0_0.bin --prompt "Hello, World!" --threads 8 --seed 1 --n_predict 16 --ignore-eos --repeat_last_n 0 --temp 0 --session session/7B/hello.bin
main: seed = 1
llama.cpp: loading model from models/7B/ggml-model-q4_0_0.bin
llama_model_load_internal: format     = ggjt v1 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =  59.11 KB
llama_model_load_internal: mem required  = 5809.32 MB (+ 1026.00 MB per state)
llama_init_from_file: kv self size  =  256.00 MB

system_info: n_threads = 8 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 | 
main: attempting to load saved session from session/7B/hello.bin..
main: session file does not exist, will create
sampling: repeat_last_n = 0, repeat_penalty = 1.000000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 0, tfs_z = 1.000000, top_p = 1.000000, typical_p = 1.000000, temp = 0.000000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 16, n_keep = 0


 Hello, World! I’m a 20-something year old living in the beautiful Pacific
llama_print_timings:        load time =   514.34 ms
llama_print_timings:      sample time =     0.39 ms /    16 runs   (    0.02 ms per run)
llama_print_timings: prompt eval time =   362.71 ms /     5 tokens (   72.54 ms per token)
llama_print_timings:        eval time =  2644.36 ms /    15 runs   (  176.29 ms per run)
llama_print_timings:       total time =  3486.63 ms

Process finished with exit code 0

Cache hit:

/p/i/llama.cpp/cmake-build-relwithdebinfo/bin/main -m models/7B/ggml-model-q4_0_0.bin --prompt "Hello, World!" --threads 8 --seed 1 --n_predict 16 --ignore-eos --repeat_last_n 0 --temp 0 --session session/7B/hello.bin
main: seed = 1
llama.cpp: loading model from models/7B/ggml-model-q4_0_0.bin
llama_model_load_internal: format     = ggjt v1 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =  59.11 KB
llama_model_load_internal: mem required  = 5809.32 MB (+ 1026.00 MB per state)
llama_init_from_file: kv self size  =  256.00 MB

system_info: n_threads = 8 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 | 
main: attempting to load saved session from session/7B/hello.bin..
main: could not load session file, will recreate
main: session file has exact match for prompt!
sampling: repeat_last_n = 0, repeat_penalty = 1.000000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 0, tfs_z = 1.000000, top_p = 1.000000, typical_p = 1.000000, temp = 0.000000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 16, n_keep = 0


 Hello, World!
 using System;
using System.Collections.Generic;

llama_print_timings:        load time =   515.93 ms
llama_print_timings:      sample time =     0.32 ms /    16 runs   (    0.02 ms per run)
llama_print_timings: prompt eval time =     0.00 ms /     1 tokens (    0.00 ms per token)
llama_print_timings:        eval time =  2805.72 ms /    16 runs   (  175.36 ms per run)
llama_print_timings:       total time =  3116.06 ms

Process finished with exit code 0
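
For context, the --session flag in examples/main sits on top of the session C API added in #1169 (llama_load_session_file / llama_save_session_file). The sketch below is a simplified, illustrative version of the restore path, not the exact main.cpp code; the helper name try_restore_session is made up for the example.

#include <cstdio>
#include <vector>
#include "llama.h"

// Illustrative sketch of the --session restore path (not verbatim main.cpp).
// Tokens returned here that match the prompt prefix are counted as already
// evaluated, so the generation loop advances n_past over them instead of
// re-running llama_eval. If that count disagrees with what the KV cache really
// contains, the first sampled token comes from a different position than in a
// cold run, which is one way the outputs above can diverge even at temp = 0.
static std::vector<llama_token> try_restore_session(llama_context * ctx,
                                                    const char * path,
                                                    int n_ctx) {
    std::vector<llama_token> session_tokens(n_ctx);
    size_t n_token_count_out = 0;
    if (!llama_load_session_file(ctx, path, session_tokens.data(),
                                 session_tokens.size(), &n_token_count_out)) {
        std::fprintf(stderr, "failed to load session file '%s'\n", path);
        return {};
    }
    session_tokens.resize(n_token_count_out);
    return session_tokens;
}
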
@ivanstepanovftw
Collaborator Author

CC: @ejones

@ivanstepanovftw

This comment was marked as resolved.

@ivanstepanovftw

This comment was marked as outdated.

@ivanstepanovftw ivanstepanovftw changed the title [User] Different outputs with --session flag (after #1169) Different outputs with --session flag (cache prompt #1169) Apr 30, 2023
ivanstepanovftw added a commit to ivanstepanovftw/llama.cpp that referenced this issue Apr 30, 2023
@ivanstepanovftw
Collaborator Author

I cannot find a fix for that yet...

@ivanstepanovftw ivanstepanovftw added the bug Something isn't working label Apr 30, 2023
@ivanstepanovftw
Collaborator Author

The issue is with n_past, as I understand it.

@ivanstepanovftw
Collaborator Author

ivanstepanovftw commented Apr 30, 2023

So the hotfix is to subtract one token from last_n_tokens when saving the session:

llama_save_session_file(ctx, session_filepath.c_str(), last_n_tokens.data(), last_n_tokens.size() - 1);
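
One way to read this hotfix (an interpretation, not something stated elsewhere in the thread): the last entry in last_n_tokens is the token that has just been sampled but not yet passed through llama_eval, so it has no KV-cache entries behind it. Saving it makes a later warm run treat one more position as evaluated than the cache actually holds, shifting n_past by one. A hedged sketch of the guarded save; the wrapper name is invented for illustration:

#include <vector>
#include "llama.h"

// Hypothetical helper, not the actual main.cpp code: drop the token that was
// sampled but not yet evaluated before writing the session, so the saved token
// count matches the number of positions really present in the KV cache.
static bool save_session_without_pending_token(llama_context * ctx,
                                               const char * path,
                                               const std::vector<llama_token> & tokens) {
    const size_t n_evaluated = tokens.empty() ? 0 : tokens.size() - 1;
    return llama_save_session_file(ctx, path, tokens.data(), n_evaluated);
}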

ivanstepanovftw added a commit to ivanstepanovftw/llama.cpp that referenced this issue May 1, 2023
@ejones
Collaborator

ejones commented May 1, 2023

Thanks, will comment on the PR I see you just opened

@ivanstepanovftw
Collaborator Author

Fixed in #1263

@ejones
Collaborator

ejones commented May 5, 2023

@ivanstepanovftw is cold -> warm repeatable output a strong requirement? From my understanding of the RNG, it seems you only get this if you have the RNG state at the exact token position. When using a prefix of a restored state, the pseudo-RNG is farther along, so sampling will be different.

I ask because I'm trying to simplify the case of saving prompt + generation. The prompt state can be recovered as a prefix, but sampling will be different due to the RNG. I could be missing something though.
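
A tiny standalone illustration of that point (assuming the sampler draws from a std::mt19937 seeded once per run, which is how llama.cpp's sampling RNG behaves to my understanding): two generators that start from the same seed stop agreeing as soon as one of them has consumed extra draws, which is exactly the situation when only a prefix of a saved state is reused.

#include <cstdio>
#include <random>

// Two identically seeded generators diverge once one of them has consumed
// extra draws; restoring only a prefix of a saved state leaves the RNG
// "farther along" than it was at that token position during the cold run.
int main() {
    std::mt19937 cold(1);
    std::mt19937 warm(1);
    for (int i = 0; i < 3; ++i) {
        (void) warm(); // pretend the warm run already sampled 3 extra tokens
    }
    std::uniform_real_distribution<float> u(0.0f, 1.0f);
    std::printf("cold: %f, warm: %f\n", u(cold), u(warm)); // values differ
}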

@ivanstepanovftw
Collaborator Author

ivanstepanovftw commented May 18, 2023

I have tested it with greedy sampling (temp=0)
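
Worth noting for anyone following along: at temp = 0 the sampler reduces to an argmax over the logits of the last evaluated position, so no random numbers are drawn and the RNG offset cannot explain the divergence above. A minimal sketch of what greedy selection amounts to with the C API of this era; the helper name is illustrative:

#include <algorithm>
#include "llama.h"

// Greedy (temp = 0) decoding: pick the highest-logit token for the last
// evaluated position. No RNG is involved, so cold and warm runs should agree
// as long as n_past and the KV cache line up.
static llama_token sample_greedy(llama_context * ctx) {
    const float * logits  = llama_get_logits(ctx);
    const int     n_vocab = llama_n_vocab(ctx);
    const float * best    = std::max_element(logits, logits + n_vocab);
    return (llama_token) (best - logits);
}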
