[Yes ] I reviewed the Discussions, and have a new bug or useful enhancement to share.
Expected Behavior
I was trying to test KV cache quantization on long-context generation to see how much VRAM it saves. I expected the code either to run properly or to produce an error trace that tells me what went wrong.
Current Behavior
The code crashes without giving an error.
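One way to get at least some information out of a silent crash is to run the failing code in a child process and inspect how it exited. The sketch below is only an illustration: `run_and_report` is a hypothetical helper, not part of any library, and it assumes a POSIX system such as the Colab VM, where a negative return code means the child was killed by a signal.

```python
import signal
import subprocess
import sys


def run_and_report(code: str):
    """Run `code` in a fresh interpreter; return (exit status, captured stderr)."""
    proc = subprocess.run(
        # -X faulthandler asks Python to dump a traceback on fatal signals.
        [sys.executable, "-X", "faulthandler", "-c", code],
        capture_output=True,
        text=True,
    )
    if proc.returncode < 0:
        # On POSIX, a negative return code is the number of the killing signal.
        status = f"killed by {signal.Signals(-proc.returncode).name}"
    else:
        status = f"exited with code {proc.returncode}"
    return status, proc.stderr


if __name__ == "__main__":
    # Stand-in for the crashing script: abort the process without an exception.
    status, stderr = run_and_report("import os; os.abort()")
    print(status)
    print(stderr)
```

Replacing the stand-in string with the real reproduction script would show whether the process was killed by a signal (for example an out-of-memory kill or an abort inside native code) rather than exiting normally.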
Environment and Context
The environment is Google Colab. The code that I tried to run is as follows:
Failure Information (for bugs)
No failure information is given, but the model information is printed before the crash:
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Meta Llama 3.1 8b Instruct
llama_model_loader: - kv 3: general.organization str = Unsloth
llama_model_loader: - kv 4: general.finetune str = instruct
llama_model_loader: - kv 5: general.basename str = meta-llama-3.1
llama_model_loader: - kv 6: general.size_label str = 8B
llama_model_loader: - kv 7: llama.block_count u32 = 32
llama_model_loader: - kv 8: llama.context_length u32 = 131072
llama_model_loader: - kv 9: llama.embedding_length u32 = 4096
llama_model_loader: - kv 10: llama.feed_forward_length u32 = 14336
llama_model_loader: - kv 11: llama.attention.head_count u32 = 32
llama_model_loader: - kv 12: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 13: llama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 14: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 15: general.file_type u32 = 2
llama_model_loader: - kv 16: llama.vocab_size u32 = 128256
llama_model_loader: - kv 17: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 18: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 19: tokenizer.ggml.pre str = llama-bpe
llama_model_loader: - kv 20: tokenizer.ggml.tokens arr[str,128256] = ["!", """, "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 21: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 22: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 23: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 24: tokenizer.ggml.eos_token_id u32 = 128009
llama_model_loader: - kv 25: tokenizer.ggml.padding_token_id u32 = 128004
llama_model_loader: - kv 26: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv 27: general.quantization_version u32 = 2
llama_model_loader: - type f32: 66 tensors
llama_model_loader: - type q4_0: 225 tensors
llama_model_loader: - type q6_K: 1 tensors
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 128256
llm_load_print_meta: n_merges = 280147
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 131072
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 4
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 14336
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 131072
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: ssm_dt_b_c_rms = 0
llm_load_print_meta: model type = 8B
llm_load_print_meta: model ftype = Q4_0
llm_load_print_meta: model params = 8.03 B
llm_load_print_meta: model size = 4.33 GiB (4.64 BPW)
llm_load_print_meta: general.name = Meta Llama 3.1 8b Instruct
llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token = 128009 '<|eot_id|>'
llm_load_print_meta: PAD token = 128004 '<|finetune_right_pad_id|>'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
llm_load_print_meta: EOM token = 128008 '<|eom_id|>'
llm_load_print_meta: EOG token = 128008 '<|eom_id|>'
llm_load_print_meta: EOG token = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
llm_load_tensors: ggml ctx size = 0.14 MiB
llm_load_tensors: CPU buffer size = 4437.80 MiB
.......................................................................................
llama_new_context_with_model: n_ctx = 512
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: freq_base = 500000.0
llama_new_context_with_model: freq_scale = 1
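For context on the VRAM question, the KV cache size can be estimated from the values in the log above (`n_layer = 32`, `n_embd_k_gqa = 1024`, `n_embd_v_gqa = 1024`). The sketch below is a rough estimate, not llama.cpp's own accounting: `kv_cache_bytes` is a hypothetical helper, and it assumes the ggml block layouts of 32 elements per block, with 64 bytes per block for f16, 34 for q8_0, and 18 for q4_0.

```python
# Bytes per 32-element block for each cache type (assumed ggml block layouts).
BLOCK_BYTES = {"f16": 64, "q8_0": 34, "q4_0": 18}
BLOCK_SIZE = 32


def kv_cache_bytes(n_ctx, n_layer, n_embd_k_gqa, n_embd_v_gqa,
                   type_k="f16", type_v="f16"):
    """Estimate K + V cache size in bytes for a given context length."""
    k_blocks = n_ctx * n_layer * n_embd_k_gqa // BLOCK_SIZE
    v_blocks = n_ctx * n_layer * n_embd_v_gqa // BLOCK_SIZE
    return k_blocks * BLOCK_BYTES[type_k] + v_blocks * BLOCK_BYTES[type_v]


if __name__ == "__main__":
    MIB = 1024 ** 2
    # n_ctx = 512 is the context size in the log; 131072 is n_ctx_train.
    for n_ctx in (512, 131072):
        for cache_type in ("f16", "q8_0", "q4_0"):
            size = kv_cache_bytes(n_ctx, 32, 1024, 1024, cache_type, cache_type)
            print(f"n_ctx={n_ctx:>6} {cache_type:>5}: {size / MIB:.1f} MiB")
```

Under these assumptions, at `n_ctx = 512` the cache is small for every type, so any measurable saving from quantization would only show up at much larger context sizes.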
Steps to Reproduce
Here are the pip installations:
Here is the code: