llama : fix session saving/loading #3400

Merged: 5 commits into master on Oct 3, 2023

ggerganov
Owner

ref #3397

I think this should fix the issue with saving/loading session data after #3228.
Make sure to delete any old chat data before testing.

@jluisreymejias Can you give this branch a try?

@ggerganov added the "need feedback" label (testing and feedback with results are needed) on Sep 29, 2023
@BarfingLemurs
Contributor

Confirmed on Termux: --prompt-cache-all and --prompt-cache-ro now work, whereas on master loading a previously created cache file led to a segmentation fault.

@Senemu
Contributor

Senemu commented Oct 1, 2023

This fixes the crash for me, but it does not seem to use or update the cache file properly when the prompt changes. It is as if the previous prompt is still there, influencing the generation.

For example, if I generate a kanji mnemonic by running ./main -m llama-2-70b.Q5_K_M.gguf --file mnemonics.txt -r $'\nKanji:' --prompt-cache mnemonics.bin -c 0 -n -2 -t 8, the first run (with a new cache file) works as expected:

main: build = 1294 (b0670db)
main: built with cc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0 for x86_64-linux-gnu
main: seed  = 1696154328
llama_model_loader: loaded meta data with 19 key-value pairs and 723 tensors from llama-2-70b.Q5_K_M.gguf (version GGUF V2 (latest))
[…]
llama_model_loader: - type  f32:  161 tensors
llama_model_loader: - type q5_K:  481 tensors
llama_model_loader: - type q6_K:   81 tensors
llm_load_print_meta: format           = GGUF V2 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 4096
llm_load_print_meta: n_embd           = 8192
llm_load_print_meta: n_head           = 64
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 80
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_gqa            = 8
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: n_ff             = 28672
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: model type       = 70B
llm_load_print_meta: model ftype      = mostly Q5_K - Medium
llm_load_print_meta: model params     = 68.98 B
llm_load_print_meta: model size       = 45.40 GiB (5.65 BPW) 
llm_load_print_meta: general.name   = LLaMA v2
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token  = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.23 MB
llm_load_tensors: mem required  = 46494.72 MB
....................................................................................................
llama_new_context_with_model: n_ctx      = 4096
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: kv self size  = 1280.00 MB
llama_new_context_with_model: compute buffer total size = 573.88 MB

system_info: n_threads = 8 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | 
main: attempting to load saved session from 'mnemonics.bin'
main: session file does not exist, will create
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 4096, n_batch = 512, n_predict = -2, n_keep = 0


 For each kanji character, write a Markdown‐formatted mnemonic that uses its keyword and the keyword of all its components.

[…]

Kanji: 謝 (apologize)
Components: 言 (say), 射 (shoot)
Mnemonic: **Shot** first, ***apologize*** (**say** you are sorry) later.

Kanji: 提 (propose)
Components: 扌 (left hand), 是 (go with)
Mnemonic: When you **propose** to someone, put a ring on the **left hand** and say “I ***go with*** you.” It’s how it works in some countries.

Kanji:
llama_print_timings:        load time =  3490.10 ms
llama_print_timings:      sample time =    31.98 ms /    44 runs   (    0.73 ms per token,  1375.77 tokens per second)
llama_print_timings: prompt eval time = 678353.51 ms /  1782 tokens (  380.67 ms per token,     2.63 tokens per second)
llama_print_timings:        eval time = 58432.71 ms /    43 runs   ( 1358.90 ms per token,     0.74 tokens per second)
llama_print_timings:       total time = 737161.46 ms

Generating another one for the same kanji (same prompt) works fine:

main: build = 1294 (b0670db)
main: built with cc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0 for x86_64-linux-gnu
main: seed  = 1696150253
llama_model_loader: loaded meta data with 19 key-value pairs and 723 tensors from llama-2-70b.Q5_K_M.gguf (version GGUF V2 (latest))
[…]
llama_model_loader: - type  f32:  161 tensors
llama_model_loader: - type q5_K:  481 tensors
llama_model_loader: - type q6_K:   81 tensors
llm_load_print_meta: format           = GGUF V2 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 4096
llm_load_print_meta: n_embd           = 8192
llm_load_print_meta: n_head           = 64
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 80
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_gqa            = 8
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: n_ff             = 28672
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: model type       = 70B
llm_load_print_meta: model ftype      = mostly Q5_K - Medium
llm_load_print_meta: model params     = 68.98 B
llm_load_print_meta: model size       = 45.40 GiB (5.65 BPW) 
llm_load_print_meta: general.name   = LLaMA v2
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token  = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.23 MB
llm_load_tensors: mem required  = 46494.72 MB
....................................................................................................
llama_new_context_with_model: n_ctx      = 4096
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: kv self size  = 1280.00 MB
llama_new_context_with_model: compute buffer total size = 573.88 MB

system_info: n_threads = 8 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | 
main: attempting to load saved session from 'mnemonics.bin'
main: loaded a session with prompt size of 1782 tokens
main: session file has exact match for prompt!
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 4096, n_batch = 512, n_predict = -2, n_keep = 0


 For each kanji character, write a Markdown‐formatted mnemonic that uses its keyword and the keyword of all its components.

[…]

Kanji: 謝 (apologize)
Components: 言 (say), 射 (shoot)
Mnemonic: **Shot** first, ***apologize*** (**say** you are sorry) later.

Kanji: 提 (propose)
Components: 扌 (left hand), 是 (go with)
Mnemonic: When someone ***proposes*** something to you, you will either **go with it** or not. It’s like your left hand is saying: “this way!” and your right hand saying: “that way!”. You need to pick one.

Kanji:
llama_print_timings:        load time =  3586.75 ms
llama_print_timings:      sample time =    40.55 ms /    58 runs   (    0.70 ms per token,  1430.44 tokens per second)
llama_print_timings: prompt eval time =     0.00 ms /     1 tokens (    0.00 ms per token,      inf tokens per second)
llama_print_timings:        eval time = 72142.00 ms /    57 runs   ( 1265.65 ms per token,     0.79 tokens per second)
llama_print_timings:       total time = 83733.50 ms

But if I change the last paragraph of the prompt (the kanji for which I want a mnemonic), this happens:

main: build = 1294 (b0670db)
main: built with cc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0 for x86_64-linux-gnu
main: seed  = 1696152733
llama_model_loader: loaded meta data with 19 key-value pairs and 723 tensors from llama-2-70b.Q5_K_M.gguf (version GGUF V2 (latest))
[…]
llama_model_loader: - type  f32:  161 tensors
llama_model_loader: - type q5_K:  481 tensors
llama_model_loader: - type q6_K:   81 tensors
llm_load_print_meta: format           = GGUF V2 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 4096
llm_load_print_meta: n_embd           = 8192
llm_load_print_meta: n_head           = 64
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 80
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_gqa            = 8
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: n_ff             = 28672
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: model type       = 70B
llm_load_print_meta: model ftype      = mostly Q5_K - Medium
llm_load_print_meta: model params     = 68.98 B
llm_load_print_meta: model size       = 45.40 GiB (5.65 BPW) 
llm_load_print_meta: general.name   = LLaMA v2
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token  = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.23 MB
llm_load_tensors: mem required  = 46494.72 MB
....................................................................................................
llama_new_context_with_model: n_ctx      = 4096
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: kv self size  = 1280.00 MB
llama_new_context_with_model: compute buffer total size = 573.88 MB

system_info: n_threads = 8 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | 
main: attempting to load saved session from 'mnemonics.bin'
main: loaded a session with prompt size of 1782 tokens
main: session file matches 1755 / 1786 tokens of prompt
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 4096, n_batch = 512, n_predict = -2, n_keep = 0


 For each kanji character, write a Markdown‐formatted mnemonic that uses its keyword and the keyword of all its components.

[…]

Kanji: 謝 (apologize)
Components: 言 (say), 射 (shoot)
Mnemonic: **Shot** first, ***apologize*** (**say** you are sorry) later.

Kanji: 配 (hand out)
Components: 酉 (sign of the bird), 己 (oneself)
Mnemonic: When one needs to **propose something**, he should make sure that this proposal is really his (**oneself**) before bringing up the subject in front of others. The ***sign of the bird*** is a sign of peace, so it’s best if the matter can be settled amicably.

Kanji:
llama_print_timings:        load time =  3509.82 ms
llama_print_timings:      sample time =    48.35 ms /    69 runs   (    0.70 ms per token,  1427.12 tokens per second)
llama_print_timings: prompt eval time = 18194.90 ms /    31 tokens (  586.93 ms per token,     1.70 tokens per second)
llama_print_timings:        eval time = 90051.99 ms /    68 runs   ( 1324.29 ms per token,     0.76 tokens per second)
llama_print_timings:       total time = 109755.13 ms

The output references the kanji of the previous generation (“propose”), even though it is nowhere to be found in the new prompt!

Subsequent runs with the same prompt would say that the session file matches the prompt exactly, but “propose” and its keywords would keep reappearing.
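
One reading of the three logs above is that the session file is reused up to the longest common token prefix. In the third run, 1755 of the 1786 prompt tokens match the cache, which leaves 31 tokens to evaluate (the count shown in the prompt eval timing). The cache itself held 1782 tokens, so 27 cached tokens lie past the match point and belong to the old prompt. The commit titles in this PR mention clearing "future" tokens from the KV cache, which fits that picture. The Python sketch below only reproduces this arithmetic on synthetic token ids; `common_prefix_len` and `plan_reuse` are hypothetical names, not llama.cpp functions.

```python
from typing import Sequence, Tuple


def common_prefix_len(cached: Sequence[int], prompt: Sequence[int]) -> int:
    """Number of leading tokens shared by the cached session and the new prompt."""
    n = 0
    for a, b in zip(cached, prompt):
        if a != b:
            break
        n += 1
    return n


def plan_reuse(cached: Sequence[int], prompt: Sequence[int]) -> Tuple[int, int, int]:
    """Return (n_reused, n_to_evaluate, n_stale).

    n_stale counts cached tokens past the matching prefix. They come from the
    old prompt and would have to be dropped before the new suffix is evaluated.
    """
    n_match = common_prefix_len(cached, prompt)
    return n_match, len(prompt) - n_match, len(cached) - n_match


# Synthetic token ids with the sizes reported in the log (assumed, not real tokens).
shared = list(range(1755))
cached = shared + [-1] * 27   # 1782 tokens in the session file
prompt = shared + [-2] * 31   # 1786 tokens in the new prompt

print(plan_reuse(cached, prompt))  # (1755, 31, 27)
```

If the 27 stale entries were left in place, the model could still attend to them, which would be consistent with "propose" leaking into the output for a prompt that no longer contains it.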

@ggerganov
Owner Author

@Senemu Could you please try your test with the latest version of this branch and see if the issue is resolved?

@Senemu
Contributor

Senemu commented Oct 2, 2023

The issue is resolved in the current version of this branch! 👏

llama.h Outdated
Comment on lines 333 to 334
// c0 < -1 : [0, c1]
// c1 < -1 : [c0, inf)
Collaborator

Shouldn't this be c0 < 0?
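
To make the question concrete: with the condition written as `c0 < -1`, the value -1 itself would not count as "unbounded", which is presumably what the reviewer is pointing at. The Python sketch below assumes the reviewer's suggested reading, where any negative bound is open-ended on that side, and it assumes a half-open interval for the bounded case. `resolve_range` and `in_range` are hypothetical helpers for illustration, not part of the llama.cpp API.

```python
import math
from typing import Tuple


def resolve_range(c0: int, c1: int) -> Tuple[int, float]:
    """Map possibly-negative bounds to a half-open interval [lo, hi).

    Assumption: c0 < 0 means "from the start" and c1 < 0 means "to the end".
    """
    lo = 0 if c0 < 0 else c0
    hi = math.inf if c1 < 0 else c1
    return lo, hi


def in_range(pos: int, c0: int, c1: int) -> bool:
    """True if position pos falls inside the resolved interval."""
    lo, hi = resolve_range(c0, c1)
    return lo <= pos < hi


print(resolve_range(-1, 10))  # (0, 10)
print(resolve_range(5, -1))   # (5, inf)
```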

@ggerganov
Owner Author

@Senemu I made some more changes; hopefully I didn't break it again. I will merge it now without testing, but if you spot any issues again, let us know.

@ggerganov merged commit ac2219f into master on Oct 3, 2023
32 of 33 checks passed
@Senemu
Contributor

Senemu commented Oct 3, 2023

ac2219f breaks the session cache even when using exactly the same prompt.

The first run (without a cache file) works as expected, but a rerun outputs garbage:

main: build = 1315 (ac2219f)
main: built with cc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0 for x86_64-linux-gnu
main: seed  = 1696150253
llama_model_loader: loaded meta data with 19 key-value pairs and 723 tensors from llama-2-70b.Q5_K_M.gguf (version GGUF V2 (latest))
[…]
llama_model_loader: - type  f32:  161 tensors
llama_model_loader: - type q5_K:  481 tensors
llama_model_loader: - type q6_K:   81 tensors
llm_load_print_meta: format           = GGUF V2 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 4096
llm_load_print_meta: n_embd           = 8192
llm_load_print_meta: n_head           = 64
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 80
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_gqa            = 8
llm_load_print_meta: f_norm_eps       = 0,0e+00
llm_load_print_meta: f_norm_rms_eps   = 1,0e-05
llm_load_print_meta: n_ff             = 28672
llm_load_print_meta: freq_base_train  = 10000,0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: model type       = 70B
llm_load_print_meta: model ftype      = mostly Q5_K - Medium
llm_load_print_meta: model params     = 68,98 B
llm_load_print_meta: model size       = 45,40 GiB (5,65 BPW) 
llm_load_print_meta: general.name   = LLaMA v2
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token  = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0,23 MB
llm_load_tensors: mem required  = 46494,72 MB
....................................................................................................
llama_new_context_with_model: n_ctx      = 4096
llama_new_context_with_model: freq_base  = 10000,0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: kv self size  = 1280,00 MB
llama_new_context_with_model: compute buffer total size = 573,88 MB

system_info: n_threads = 8 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | 
main: attempting to load saved session from 'mnemonics.bin'
main: loaded a session with prompt size of 1782 tokens
main: session file has exact match for prompt!
sampling: repeat_last_n = 64, repeat_penalty = 1,100000, presence_penalty = 0,000000, frequency_penalty = 0,000000, top_k = 40, tfs_z = 1,000000, top_p = 0,950000, typical_p = 1,000000, temp = 0,800000, mirostat = 0, mirostat_lr = 0,100000, mirostat_ent = 5,000000
generate: n_ctx = 4096, n_batch = 512, n_predict = -2, n_keep = 0


 For each kanji character, write a Markdown‐formatted mnemonic that uses its keyword and the keyword of all its components.

[…]

Kanji: 謝 (apologize)
Components: 言 (say), 射 (shoot)
Mnemonic: **Shot** first, ***apologize*** (**say** you are sorry) later.

Kanji: 提 (propose)
Components: 扌 (left hand), 是 (go with)
Mnemonic: When What Where Why





## Markdown [end of text]

llama_print_timings:        load time =  3831,54 ms
llama_print_timings:      sample time =    10,80 ms /    14 runs   (    0,77 ms per token,  1295,94 tokens per second)
llama_print_timings: prompt eval time =     0,00 ms /     1 tokens (    0,00 ms per token,      inf tokens per second)
llama_print_timings:        eval time = 17569,35 ms /    13 runs   ( 1351,49 ms per token,     0,74 tokens per second)
llama_print_timings:       total time = 18710,15 ms

@cebtenzzre
Collaborator

> ac2219f breaks the session cache even when using exactly the same prompt.

If this doesn't get resolved soon, open a new issue (or reopen an old one, if there is one that applies) so this doesn't get missed.

joelkuiper added a commit to vortext/llama.cpp that referenced this pull request Oct 5, 2023
…example

* 'master' of github.com:ggerganov/llama.cpp: (24 commits)
  convert : fix Baichuan2 models by using vocab size in config.json (ggerganov#3299)
  readme : add project status link
  ggml : fix build after ggerganov#3329
  llm : add Refact model (ggerganov#3329)
  sync : ggml (conv 1d + 2d updates, UB fixes) (ggerganov#3468)
  finetune : readme fix typo (ggerganov#3465)
  ggml : add RISC-V Vector Support for K-Quants and improved the existing intrinsics (ggerganov#3453)
  main : consistent prefix/suffix coloring (ggerganov#3425)
  llama : fix session saving/loading (ggerganov#3400)
  llama : expose model's rope_freq_scale in the API (ggerganov#3418)
  metal : alibi for arbitrary number of heads (ggerganov#3426)
  cmake : make LLAMA_NATIVE flag actually use the instructions supported by the processor (ggerganov#3273)
  Work on the BPE tokenizer (ggerganov#3252)
  convert : fix vocab size when not defined in hparams (ggerganov#3421)
  cmake : increase minimum version for add_link_options (ggerganov#3444)
  CLBlast: Add broadcast support for matrix multiplication (ggerganov#3402)
  gguf : add BERT, MPT, and GPT-J arch info (ggerganov#3408)
  gguf : general usability improvements (ggerganov#3409)
  cmake : make CUDA flags more similar to the Makefile (ggerganov#3420)
  finetune : fix ggerganov#3404 (ggerganov#3437)
  ...
yusiwen pushed a commit to yusiwen/llama.cpp that referenced this pull request Oct 7, 2023
* llama : fix session saving/loading

* llama : temp fix for clearing "future" tokens from the KV cache

* llama : fix handling of "future" tokens when loading sessions

* llama : fix comments for llama_kv_cache API
ggerganov added a commit that referenced this pull request Oct 11, 2023
@ggerganov
Owner Author

@Senemu The issue should be fixed on latest master.

@Senemu
Contributor

Senemu commented Oct 11, 2023

It is fixed in b8fe4b5.

Thank you very much!

joelkuiper added a commit to vortext/llama.cpp that referenced this pull request Oct 12, 2023
…example

* 'master' of github.com:ggerganov/llama.cpp: (34 commits)
  examples: support LLaVA v1.5 (multimodal model) (ggerganov#3436)
  docs : fix typo GOMP_CPU_AFFINITY (ggerganov#3597)
  cmake : fix add_compile_options on macOS
  typo : it is `--n-gpu-layers` not `--gpu-layers` (ggerganov#3592)
  ci : check if there is enough VRAM (ggerganov#3596)
  server : add completion mode (no chat) (ggerganov#3582)
  prompts : add mnemonics.txt
  server : fix kv cache management (ggerganov#3588)
  main : fix session loading bug (ggerganov#3400)
  server : add parameter -tb N, --threads-batch N (ggerganov#3584)
  common : fix mirostat state when using multiple sequences (ggerganov#3543)
  batched : add bench tool (ggerganov#3545)
  examples : add batched.swift + improve CI for swift (ggerganov#3562)
  Add MPT model to supported models in README.md (ggerganov#3574)
  Minor improvements in GPT2 tokenizer (ggerganov#3567)
  readme : add bloom (ggerganov#3570)
  llm : add bloom models (ggerganov#3553)
  swift : improvements and fixes (ggerganov#3564)
  llm : add MPT support (ggerganov#3417)
  infill. : fix tokenization (ggerganov#3508)
  ...
4 participants