Different outputs with --session flag (cache prompt #1169) #1257

Closed
ivanstepanovftw opened this issue Apr 30, 2023 · 10 comments
Labels
bug Something isn't working

Comments

@ivanstepanovftw
Collaborator

ivanstepanovftw commented Apr 30, 2023

Introduced in #1169

Current Behavior

Cache miss:

/p/i/llama.cpp/cmake-build-relwithdebinfo/bin/main -m models/7B/ggml-model-q4_0_0.bin --prompt "Hello, World!" --threads 8 --seed 1 --n_predict 16 --ignore-eos --repeat_last_n 0 --temp 0 --session session/7B/hello.bin
main: seed = 1
llama.cpp: loading model from models/7B/ggml-model-q4_0_0.bin
llama_model_load_internal: format     = ggjt v1 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =  59.11 KB
llama_model_load_internal: mem required  = 5809.32 MB (+ 1026.00 MB per state)
llama_init_from_file: kv self size  =  256.00 MB

system_info: n_threads = 8 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 | 
main: attempting to load saved session from session/7B/hello.bin..
main: session file does not exist, will create
sampling: repeat_last_n = 0, repeat_penalty = 1.000000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 0, tfs_z = 1.000000, top_p = 1.000000, typical_p = 1.000000, temp = 0.000000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 16, n_keep = 0


 Hello, World! I’m a 20-something year old living in the beautiful Pacific
llama_print_timings:        load time =   514.34 ms
llama_print_timings:      sample time =     0.39 ms /    16 runs   (    0.02 ms per run)
llama_print_timings: prompt eval time =   362.71 ms /     5 tokens (   72.54 ms per token)
llama_print_timings:        eval time =  2644.36 ms /    15 runs   (  176.29 ms per run)
llama_print_timings:       total time =  3486.63 ms

Process finished with exit code 0

Cache hit:

/p/i/llama.cpp/cmake-build-relwithdebinfo/bin/main -m models/7B/ggml-model-q4_0_0.bin --prompt "Hello, World!" --threads 8 --seed 1 --n_predict 16 --ignore-eos --repeat_last_n 0 --temp 0 --session session/7B/hello.bin
main: seed = 1
llama.cpp: loading model from models/7B/ggml-model-q4_0_0.bin
llama_model_load_internal: format     = ggjt v1 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =  59.11 KB
llama_model_load_internal: mem required  = 5809.32 MB (+ 1026.00 MB per state)
llama_init_from_file: kv self size  =  256.00 MB

system_info: n_threads = 8 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 | 
main: attempting to load saved session from session/7B/hello.bin..
main: could not load session file, will recreate
main: session file has exact match for prompt!
sampling: repeat_last_n = 0, repeat_penalty = 1.000000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 0, tfs_z = 1.000000, top_p = 1.000000, typical_p = 1.000000, temp = 0.000000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 16, n_keep = 0


 Hello, World!
 using System;
using System.Collections.Generic;

llama_print_timings:        load time =   515.93 ms
llama_print_timings:      sample time =     0.32 ms /    16 runs   (    0.02 ms per run)
llama_print_timings: prompt eval time =     0.00 ms /     1 tokens (    0.00 ms per token)
llama_print_timings:        eval time =  2805.72 ms /    16 runs   (  175.36 ms per run)
llama_print_timings:       total time =  3116.06 ms

Process finished with exit code 0
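
For context, the --session flag in examples/main sits on top of the session C API added in #1169 (llama_load_session_file / llama_save_session_file). The sketch below is a simplified, illustrative version of the restore path, not the exact main.cpp code; the helper name try_restore_session is made up for the example.

#include <cstdio>
#include <vector>
#include "llama.h"

// Illustrative sketch of the --session restore path (not verbatim main.cpp).
// Tokens returned here that match the prompt prefix are counted as already
// evaluated, so the generation loop advances n_past over them instead of
// re-running llama_eval. If that count disagrees with what the KV cache really
// contains, the first sampled token comes from a different position than in a
// cold run, which is one way the outputs above can diverge even at temp = 0.
static std::vector<llama_token> try_restore_session(llama_context * ctx,
                                                    const char * path,
                                                    int n_ctx) {
    std::vector<llama_token> session_tokens(n_ctx);
    size_t n_token_count_out = 0;
    if (!llama_load_session_file(ctx, path, session_tokens.data(),
                                 session_tokens.size(), &n_token_count_out)) {
        std::fprintf(stderr, "failed to load session file '%s'\n", path);
        return {};
    }
    session_tokens.resize(n_token_count_out);
    return session_tokens;
}
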
@ivanstepanovftw
Collaborator Author

CC: @ejones

@ivanstepanovftw

This comment was marked as resolved.

@ivanstepanovftw

This comment was marked as outdated.

@ivanstepanovftw ivanstepanovftw changed the title [User] Different outputs with --session flag (after #1169) Different outputs with --session flag (cache prompt #1169) Apr 30, 2023
ivanstepanovftw added a commit to ivanstepanovftw/llama.cpp that referenced this issue Apr 30, 2023
@ivanstepanovftw
Collaborator Author

I cannot find a fix for that yet...

@ivanstepanovftw ivanstepanovftw added the bug Something isn't working label Apr 30, 2023
@ivanstepanovftw
Collaborator Author

The issue is with n_past, as I understand it.

@ivanstepanovftw
Collaborator Author

ivanstepanovftw commented Apr 30, 2023

So the hotfix is to subtract one token from last_n_tokens when saving the session:

llama_save_session_file(ctx, session_filepath.c_str(), last_n_tokens.data(), last_n_tokens.size() - 1);
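
One way to read this hotfix (an interpretation, not something stated elsewhere in the thread): the last entry in last_n_tokens is the token that has just been sampled but not yet passed through llama_eval, so it has no KV-cache entries behind it. Saving it makes a later warm run treat one more position as evaluated than the cache actually holds, shifting n_past by one. A hedged sketch of the guarded save; the wrapper name is invented for illustration:

#include <vector>
#include "llama.h"

// Hypothetical helper, not the actual main.cpp code: drop the token that was
// sampled but not yet evaluated before writing the session, so the saved token
// count matches the number of positions really present in the KV cache.
static bool save_session_without_pending_token(llama_context * ctx,
                                               const char * path,
                                               const std::vector<llama_token> & tokens) {
    const size_t n_evaluated = tokens.empty() ? 0 : tokens.size() - 1;
    return llama_save_session_file(ctx, path, tokens.data(), n_evaluated);
}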

ivanstepanovftw added a commit to ivanstepanovftw/llama.cpp that referenced this issue May 1, 2023
@ejones
Collaborator

ejones commented May 1, 2023

Thanks, will comment on the PR I see you just opened

@ivanstepanovftw
Collaborator Author

Fixed in #1263

@ejones
Collaborator

ejones commented May 5, 2023

@ivanstepanovftw is cold -> warm repeatable output a strong requirement? From my understanding of the RNG, it seems you only get this if you have the RNG state at the exact token position. When using a prefix of a restored state, the pseudo-RNG is farther along, so sampling will be different.

I ask because I'm trying to simplify the case of saving prompt + generation. The prompt state can be recovered as a prefix, but sampling will be different due to the RNG. I could be missing something though.
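
A tiny standalone illustration of that point (assuming the sampler draws from a std::mt19937 seeded once per run, which is how llama.cpp's sampling RNG behaves to my understanding): two generators that start from the same seed stop agreeing as soon as one of them has consumed extra draws, which is exactly the situation when only a prefix of a saved state is reused.

#include <cstdio>
#include <random>

// Two identically seeded generators diverge once one of them has consumed
// extra draws; restoring only a prefix of a saved state leaves the RNG
// "farther along" than it was at that token position during the cold run.
int main() {
    std::mt19937 cold(1);
    std::mt19937 warm(1);
    for (int i = 0; i < 3; ++i) {
        (void) warm(); // pretend the warm run already sampled 3 extra tokens
    }
    std::uniform_real_distribution<float> u(0.0f, 1.0f);
    std::printf("cold: %f, warm: %f\n", u(cold), u(warm)); // values differ
}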

@ivanstepanovftw
Collaborator Author

ivanstepanovftw commented May 18, 2023

I have tested it with greedy sampling (temp=0)
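
Worth noting for anyone following along: at temp = 0 the sampler reduces to an argmax over the logits of the last evaluated position, so no random numbers are drawn and the RNG offset cannot explain the divergence above. A minimal sketch of what greedy selection amounts to with the C API of this era; the helper name is illustrative:

#include <algorithm>
#include "llama.h"

// Greedy (temp = 0) decoding: pick the highest-logit token for the last
// evaluated position. No RNG is involved, so cold and warm runs should agree
// as long as n_past and the KV cache line up.
static llama_token sample_greedy(llama_context * ctx) {
    const float * logits  = llama_get_logits(ctx);
    const int     n_vocab = llama_n_vocab(ctx);
    const float * best    = std::max_element(logits, logits + n_vocab);
    return (llama_token) (best - logits);
}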
