
Bug: llama-box crashes when setting --ctx-size to 0 #21

Closed

n00b001 opened this issue Jan 5, 2025 · 11 comments

n00b001 commented Jan 5, 2025

Hello,

Summary of issue:
I've made a Python program with quite a lot of complexity, and from time to time I see llama-box crash. I'm trying to narrow down why that happens, and this may be the cause.

Expectation:
When I set "-c" to "0", I expect the model's full context window to be used.
When a prompt larger than the context window is sent, I expect it to be truncated to the size of the context window, with 'older' tokens discarded (unless "--no-context-shift" is used).

What actually happens:
When I set "-c" to "0" and send a large prompt, llama-box crashes.

System specs:

  • 64 GB RAM
  • RTX 4090
  • 7950x3D
  • llama-box.exe --version:
    v0.0.103 (568736f)
    vendor : llama.cpp b56f079e (4418), stable-diffusion.cpp 01fec2a (197)

When this happens, CPU/RAM/VRAM are all OK, so it doesn't look like an out-of-memory (OOM) error.

Here is the minimum needed to reproduce the issue:

I am running llama-box with this command:
llama-box.exe --port 8082 -c 0 -np 2 --host 0.0.0.0 -m "models/Qwen2-VL-7B-Instruct-abliterated-Q6_K_L.gguf" --mmproj "models/mmproj-Qwen2-VL-7B-Instruct-abliterated-f16.gguf"

model: https://huggingface.co/bartowski/Qwen2-VL-7B-Instruct-abliterated-GGUF/blob/main/Qwen2-VL-7B-Instruct-abliterated-Q6_K_L.gguf
mmproj: https://huggingface.co/bartowski/Qwen2-VL-7B-Instruct-abliterated-GGUF/blob/main/mmproj-Qwen2-VL-7B-Instruct-abliterated-f16.gguf

I am sending the server this command:
curl http://localhost:8082/v1/chat/completions -H "Content-Type: application/json" -d "@lots_of_ones.txt"

And the file 'lots_of_ones.txt' contains 1,638,400 occurrences of the character '1' (along with a little JSON):
{"model": "hermes2", "messages": [{"role":"user", "content": "1[...]1"}]}

Output from llama-box when it crashes:

0.00.024.794 I ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
0.00.024.799 I ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
0.00.024.799 I ggml_cuda_init: found 1 CUDA devices:
0.00.024.808 I   Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
0.00.025.977 I
0.00.025.991 I arguments  : .\llama-box.exe --port 8082 -c 0 -np 2 --host 0.0.0.0 -m models/Qwen2-VL-7B-Instruct-abliterated-Q6_K_L.gguf --mmproj models/mmproj-Qwen2-VL-7B-Instruct-abliterated-f16.gguf --temp 0
0.00.025.992 I version    : v0.0.103 (568736f)
0.00.025.992 I compiler   : unknown
0.00.025.992 I target     : unknown
0.00.025.993 I vendor     : llama.cpp b56f079e (4418), stable-diffusion.cpp 01fec2a (197)
0.00.026.017 I system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 600,610,700,750,800,860,890,900 | F16 = 1 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | LLAMAFILE = 1 |
0.00.026.018 I
0.00.026.815 I srv                      main: listening, hostname = 0.0.0.0, port = 8082, n_threads = 4 + 2
0.00.038.864 I srv                      main: loading model
0.00.038.872 I srv                load_model: loading model 'models/Qwen2-VL-7B-Instruct-abliterated-Q6_K_L.gguf'
0.00.038.887 W srv                load_model: n_ctx is too small for multimodal projection, setting to 2048
0.00.039.576 I clip_model_load: loaded meta data with 20 key-value pairs and 521 tensors from models/mmproj-Qwen2-VL-7B-Instruct-abliterated-f16.gguf
0.00.039.582 I clip_model_load: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
0.00.039.586 I clip_model_load: - kv   0:                       general.architecture str              = clip
0.00.039.587 I clip_model_load: - kv   1:                        general.description str              = image encoder for Qwen2VL
0.00.039.590 I clip_model_load: - kv   2:                          general.file_type u32              = 1
0.00.039.590 I clip_model_load: - kv   3:                      clip.has_text_encoder bool             = false
0.00.039.591 I clip_model_load: - kv   4:                    clip.has_vision_encoder bool             = true
0.00.039.591 I clip_model_load: - kv   5:                    clip.has_qwen2vl_merger bool             = true
0.00.039.592 I clip_model_load: - kv   6:                        clip.projector_type str              = qwen2vl_merger
0.00.039.592 I clip_model_load: - kv   7:                              clip.use_silu bool             = false
0.00.039.592 I clip_model_load: - kv   8:                              clip.use_gelu bool             = false
0.00.039.593 I clip_model_load: - kv   9:                     clip.vision.patch_size u32              = 14
0.00.039.593 I clip_model_load: - kv  10:                     clip.vision.image_size u32              = 560
0.00.039.593 I clip_model_load: - kv  11:               clip.vision.embedding_length u32              = 1280
0.00.039.594 I clip_model_load: - kv  12:                 clip.vision.projection_dim u32              = 3584
0.00.039.594 I clip_model_load: - kv  13:           clip.vision.attention.head_count u32              = 16
0.00.039.605 I clip_model_load: - kv  14:   clip.vision.attention.layer_norm_epsilon f32              = 0.000001
0.00.039.605 I clip_model_load: - kv  15:                    clip.vision.block_count u32              = 32
0.00.039.606 I clip_model_load: - kv  16:            clip.vision.feed_forward_length u32              = 0
0.00.039.607 I clip_model_load: - kv  17:                               general.name str              = Qwen2-VL-7B-Instruct-abliterated
0.00.039.617 I clip_model_load: - kv  18:                     clip.vision.image_mean arr[f32,3]       = [0.481455, 0.457828, 0.408211]
0.00.039.620 I clip_model_load: - kv  19:                      clip.vision.image_std arr[f32,3]       = [0.268630, 0.261303, 0.275777]
0.00.039.620 I clip_model_load: - type  f32:  325 tensors
0.00.039.620 I clip_model_load: - type  f16:  196 tensors
0.00.040.113 I clip_model_load: CLIP using CUDA backend
0.00.040.114 W clip_model_load: Main model doesn't offload, fallback to CPU backend
0.00.040.116 I clip_model_load: params backend buffer size =  1289.95 MB (521 tensors)
0.00.814.700 E key clip.vision.image_grid_pinpoints not found in file
0.00.814.751 E key clip.vision.mm_patch_merge_type not found in file
0.00.814.757 E key clip.vision.image_crop_resolution not found in file
0.00.815.734 I clip_model_load: compute allocated memory: 198.93 MiB
0.00.881.376 I llama_load_model_from_file: using device CUDA0 (NVIDIA GeForce RTX 4090) - 22994 MiB free
0.00.902.013 I llama_model_loader: loaded meta data with 38 key-value pairs and 339 tensors from models/Qwen2-VL-7B-Instruct-abliterated-Q6_K_L.gguf (version GGUF V3 (latest))
0.00.902.023 I llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
0.00.902.025 I llama_model_loader: - kv   0:                       general.architecture str              = qwen2vl
0.00.902.025 I llama_model_loader: - kv   1:                               general.type str              = model
0.00.902.026 I llama_model_loader: - kv   2:                               general.name str              = Qwen2 VL 7B Instruct Abliterated
0.00.902.027 I llama_model_loader: - kv   3:                           general.finetune str              = Instruct-abliterated
0.00.902.027 I llama_model_loader: - kv   4:                           general.basename str              = Qwen2-VL
0.00.902.028 I llama_model_loader: - kv   5:                         general.size_label str              = 7B
0.00.902.028 I llama_model_loader: - kv   6:                            general.license str              = apache-2.0
0.00.902.029 I llama_model_loader: - kv   7:                       general.license.link str              = https://huggingface.co/huihui-ai/Qwen...
0.00.902.030 I llama_model_loader: - kv   8:                   general.base_model.count u32              = 1
0.00.902.030 I llama_model_loader: - kv   9:                  general.base_model.0.name str              = Qwen2 VL 7B Instruct
0.00.902.031 I llama_model_loader: - kv  10:          general.base_model.0.organization str              = Qwen
0.00.902.032 I llama_model_loader: - kv  11:              general.base_model.0.repo_url str              = https://huggingface.co/Qwen/Qwen2-VL-...
0.00.902.041 I llama_model_loader: - kv  12:                               general.tags arr[str,4]       = ["chat", "abliterated", "uncensored",...
0.00.902.042 I llama_model_loader: - kv  13:                          general.languages arr[str,1]       = ["en"]
0.00.902.043 I llama_model_loader: - kv  14:                        qwen2vl.block_count u32              = 28
0.00.902.043 I llama_model_loader: - kv  15:                     qwen2vl.context_length u32              = 32768
0.00.902.044 I llama_model_loader: - kv  16:                   qwen2vl.embedding_length u32              = 3584
0.00.902.044 I llama_model_loader: - kv  17:                qwen2vl.feed_forward_length u32              = 18944
0.00.902.044 I llama_model_loader: - kv  18:               qwen2vl.attention.head_count u32              = 28
0.00.902.045 I llama_model_loader: - kv  19:            qwen2vl.attention.head_count_kv u32              = 4
0.00.902.048 I llama_model_loader: - kv  20:                     qwen2vl.rope.freq_base f32              = 1000000.000000
0.00.902.050 I llama_model_loader: - kv  21:   qwen2vl.attention.layer_norm_rms_epsilon f32              = 0.000001
0.00.902.050 I llama_model_loader: - kv  22:                          general.file_type u32              = 18
0.00.902.051 I llama_model_loader: - kv  23:            qwen2vl.rope.dimension_sections arr[i32,4]       = [16, 24, 24, 0]
0.00.902.052 I llama_model_loader: - kv  24:                       tokenizer.ggml.model str              = gpt2
0.00.902.052 I llama_model_loader: - kv  25:                         tokenizer.ggml.pre str              = qwen2
0.00.923.788 I llama_model_loader: - kv  26:                      tokenizer.ggml.tokens arr[str,152064]  = ["!", "\"", "#", "$", "%", "&", "'", ...
0.00.932.652 I llama_model_loader: - kv  27:                  tokenizer.ggml.token_type arr[i32,152064]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
0.00.954.468 I llama_model_loader: - kv  28:                      tokenizer.ggml.merges arr[str,151387]  = ["─á ─á", "─á─á ─á─á", "i n", "─á t",...
0.00.954.471 I llama_model_loader: - kv  29:                tokenizer.ggml.eos_token_id u32              = 151645
0.00.954.471 I llama_model_loader: - kv  30:            tokenizer.ggml.padding_token_id u32              = 151645
0.00.954.472 I llama_model_loader: - kv  31:                tokenizer.ggml.bos_token_id u32              = 151643
0.00.954.474 I llama_model_loader: - kv  32:                    tokenizer.chat_template str              = {% set image_count = namespace(value=...
0.00.954.475 I llama_model_loader: - kv  33:               general.quantization_version u32              = 2
0.00.954.476 I llama_model_loader: - kv  34:                      quantize.imatrix.file str              = /models_out/Qwen2-VL-7B-Instruct-abli...
0.00.954.477 I llama_model_loader: - kv  35:                   quantize.imatrix.dataset str              = /training_dir/calibration_datav3.txt
0.00.954.477 I llama_model_loader: - kv  36:             quantize.imatrix.entries_count i32              = 196
0.00.954.478 I llama_model_loader: - kv  37:              quantize.imatrix.chunks_count i32              = 128
0.00.954.478 I llama_model_loader: - type  f32:  141 tensors
0.00.954.479 I llama_model_loader: - type q8_0:    2 tensors
0.00.954.479 I llama_model_loader: - type q6_K:  196 tensors
0.01.041.321 I llm_load_vocab: special tokens cache size = 14
0.01.060.136 I llm_load_vocab: token to piece cache size = 0.9309 MB
0.01.060.148 I llm_load_print_meta: format           = GGUF V3 (latest)
0.01.060.148 I llm_load_print_meta: arch             = qwen2vl
0.01.060.148 I llm_load_print_meta: vocab type       = BPE
0.01.060.149 I llm_load_print_meta: n_vocab          = 152064
0.01.060.149 I llm_load_print_meta: n_merges         = 151387
0.01.060.149 I llm_load_print_meta: vocab_only       = 0
0.01.060.150 I llm_load_print_meta: n_ctx_train      = 32768
0.01.060.150 I llm_load_print_meta: n_embd           = 3584
0.01.060.150 I llm_load_print_meta: n_layer          = 28
0.01.060.159 I llm_load_print_meta: n_head           = 28
0.01.060.160 I llm_load_print_meta: n_head_kv        = 4
0.01.060.160 I llm_load_print_meta: n_rot            = 128
0.01.060.161 I llm_load_print_meta: n_swa            = 0
0.01.060.161 I llm_load_print_meta: n_embd_head_k    = 128
0.01.060.161 I llm_load_print_meta: n_embd_head_v    = 128
0.01.060.162 I llm_load_print_meta: n_gqa            = 7
0.01.060.163 I llm_load_print_meta: n_embd_k_gqa     = 512
0.01.060.164 I llm_load_print_meta: n_embd_v_gqa     = 512
0.01.060.165 I llm_load_print_meta: f_norm_eps       = 0.0e+00
0.01.060.166 I llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
0.01.060.167 I llm_load_print_meta: f_clamp_kqv      = 0.0e+00
0.01.060.167 I llm_load_print_meta: f_max_alibi_bias = 0.0e+00
0.01.060.167 I llm_load_print_meta: f_logit_scale    = 0.0e+00
0.01.060.168 I llm_load_print_meta: n_ff             = 18944
0.01.060.169 I llm_load_print_meta: n_expert         = 0
0.01.060.169 I llm_load_print_meta: n_expert_used    = 0
0.01.060.169 I llm_load_print_meta: causal attn      = 1
0.01.060.169 I llm_load_print_meta: pooling type     = 0
0.01.060.170 I llm_load_print_meta: rope type        = 8
0.01.060.170 I llm_load_print_meta: rope scaling     = linear
0.01.060.171 I llm_load_print_meta: freq_base_train  = 1000000.0
0.01.060.172 I llm_load_print_meta: freq_scale_train = 1
0.01.060.172 I llm_load_print_meta: n_ctx_orig_yarn  = 32768
0.01.060.172 I llm_load_print_meta: rope_finetuned   = unknown
0.01.060.172 I llm_load_print_meta: ssm_d_conv       = 0
0.01.060.173 I llm_load_print_meta: ssm_d_inner      = 0
0.01.060.173 I llm_load_print_meta: ssm_d_state      = 0
0.01.060.173 I llm_load_print_meta: ssm_dt_rank      = 0
0.01.060.173 I llm_load_print_meta: ssm_dt_b_c_rms   = 0
0.01.060.174 I llm_load_print_meta: model type       = 7B
0.01.060.174 I llm_load_print_meta: model ftype      = Q6_K
0.01.060.175 I llm_load_print_meta: model params     = 7.62 B
0.01.060.176 I llm_load_print_meta: model size       = 6.06 GiB (6.84 BPW)
0.01.060.176 I llm_load_print_meta: general.name     = Qwen2 VL 7B Instruct Abliterated
0.01.060.177 I llm_load_print_meta: BOS token        = 151643 '<|endoftext|>'
0.01.060.177 I llm_load_print_meta: EOS token        = 151645 '<|im_end|>'
0.01.060.177 I llm_load_print_meta: EOT token        = 151645 '<|im_end|>'
0.01.060.178 I llm_load_print_meta: PAD token        = 151645 '<|im_end|>'
0.01.060.178 I llm_load_print_meta: LF token         = 148848 'ÄĬ'
0.01.060.178 I llm_load_print_meta: EOG token        = 151643 '<|endoftext|>'
0.01.060.178 I llm_load_print_meta: EOG token        = 151645 '<|im_end|>'
0.01.060.179 I llm_load_print_meta: max token length = 256
0.01.393.720 I llm_load_tensors: offloading 0 repeating layers to GPU
0.01.393.724 I llm_load_tensors: offloaded 0/29 layers to GPU
0.01.393.731 I llm_load_tensors:   CPU_Mapped model buffer size =  6210.54 MiB
.....................................................................................
0.01.400.120 I common_init_from_params: model requires M-RoPE, increasing batch size by 4x
0.01.400.128 I llama_new_context_with_model: n_seq_max     = 2
0.01.400.129 I llama_new_context_with_model: n_ctx         = 2048
0.01.400.129 I llama_new_context_with_model: n_ctx_per_seq = 1024
0.01.400.129 I llama_new_context_with_model: n_batch       = 2048
0.01.400.130 I llama_new_context_with_model: n_ubatch      = 512
0.01.400.130 I llama_new_context_with_model: flash_attn    = 0
0.01.400.131 I llama_new_context_with_model: freq_base     = 1000000.0
0.01.400.132 I llama_new_context_with_model: freq_scale    = 1
0.01.400.134 W llama_new_context_with_model: n_ctx_per_seq (1024) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
0.01.400.142 I llama_kv_cache_init: kv_size = 2048, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 28, can_shift = 1
0.01.409.585 I llama_kv_cache_init:        CPU KV buffer size =   112.00 MiB
0.01.409.590 I llama_new_context_with_model: KV self size  =  112.00 MiB, K (f16):   56.00 MiB, V (f16):   56.00 MiB
0.01.409.759 I llama_new_context_with_model:        CPU  output buffer size =     1.16 MiB
0.01.414.110 I llama_new_context_with_model:      CUDA0 compute buffer size =   856.23 MiB
0.01.414.114 I llama_new_context_with_model:  CUDA_Host compute buffer size =    11.01 MiB
0.01.414.115 I llama_new_context_with_model: graph nodes  = 986
0.01.414.115 I llama_new_context_with_model: graph splits = 396 (with bs=512), 1 (with bs=1)
0.01.414.117 I common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
0.01.414.117 W common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
0.02.406.722 I srv                load_model: prompt caching disabled
0.02.407.284 I srv                load_model: chat template, built_in: true, alias: chatml, tool call: supported, example:
<|im_start|>system
You are a helpful assistant.

## Tools

You CAN call functions to assist with the user query. Do not make assumptions about what values to plug into functions.

You are provided with following function tools:

### get_weather

get_weather:  Parameters: {"type":"object","properties":{"location":{"type":"string"}}}Format the arguments as a JSON object.

### get_temperature

get_temperature: Return the temperature according to the location. Parameters: {"type":"object","properties":{"location":{"type":"string"}}}Format the arguments as a JSON object.

When you can reply with your internal knowledge, reply directly without any function calls. Otherwise, for each function call, return a JSON object with function name and arguments within <tool_call></tool_call> XML tags:
<tool_call>
{"name": The name of the function to use, "arguments": The input of the function, must be a JSON object in compact format}
</tool_call>
<tool_result>
The function results.
</tool_result>
Reply based on the function results.<|im_end|>
<|im_start|>user
Hello.<|im_end|>
Hi there.<|im_end|>
<|im_start|>user
What's the weather like in Paris today?<|im_end|>
<|im_start|>assistant

0.02.407.287 I srv                      main: initializing server
0.02.407.289 I srv                      init: initializing slots, n_slots = 2
0.02.407.414 I srv                      main: starting server
0.35.185.155 I srv        log_server_request: rid 34369410 | POST /v1/chat/completions 127.0.0.1:60400
0.35.205.825 I srv oaicompat_completions_req: rid 34369410 | {"messages":"[...]","model":"hermes2"}
0.36.236.689 W slt              update_slots: rid 34369410 | id 00 | task 0 | input truncated, n_ctx = 1024, n_keep = 0, n_left = 1024, n_prompt_tokens = 520
1.23.432.412 W slt              update_slots: rid 34369410 | id 00 | task 0 | slot context shift, n_keep = 0, n_left = 1023, n_discard = 511
D:\a\llama-box\llama-box\llama.cpp\ggml\src\ggml-cpu\ggml-cpu.c:9441: GGML_ASSERT(sections[0] > 0 || sections[1] > 0 || sections[2] > 0) failed
n00b001 (Author) commented Jan 5, 2025

Update:
Running with the default context window in the README also has this issue:
-c 8192

And sending 4,096 characters also causes this issue (file attached).

Here's some other example files for testing:
2048_ones.txt
4096_ones.txt
8192_ones.txt
16384_ones.txt
32768_ones.txt
lots_of_ones.txt

n00b001 (Author) commented Jan 5, 2025

Update: setting the context size to 4096 (-c 4096) seems to work around this issue.

However, my original issue that prompted this investigation is still happening, so I will keep digging...

Regarding this issue, I do believe it is still a bug. Setting the context window higher or lower should not impact the stability of llama-box. If the user sets it to something unsupported, the program should exit with an informative error at startup.

thxCode (Collaborator) commented Jan 7, 2025

  1. When we start the box with -c 8192 -np 2, it creates 2 slots, each holding 8192/2 = 4096 context. To use the model's full training context size, keep -np at 1 (see the short sketch after this list).
  2. On non-Darwin hosts, if the box is started without -ngl, it uses CPU offloading.
  3. To check whether the old tokens are being truncated, start the box with --verbose; logs like the following can then be found (box started with -c 2048 -np 2 --verbose, chatting with 2048_ones.txt):
0.11.049.219 I slt              update_slots: rid 754660644378 | id 00 | task 0 | new prompt, n_ctx_slot = 1024, n_keep = 0, n_prompt_tokens = 2056
0.29.168.672 W slt              update_slots: rid 754660644378 | id 00 | task 0 | input truncated, n_ctx = 1024, n_keep = 0, n_left = 1024, n_prompt_tokens = 520
  4. I cannot reproduce the error on macOS (using pure CPU offloading) as described in #21 (comment), but it looks like the softmax returns -inf; my guess is that M-RoPE processing causes this. I will try to fix it in the next version.
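
To make the slot arithmetic in point 1 concrete, here is a tiny Python sketch of the per-slot split (illustrative only; the exact rounding inside llama-box may differ):

# Sketch of the per-slot context split: the total context (-c) is divided
# across the parallel slots (-np).
def ctx_per_slot(n_ctx: int, n_parallel: int) -> int:
    return n_ctx // n_parallel

print(ctx_per_slot(8192, 2))  # 4096 per slot with -c 8192 -np 2
print(ctx_per_slot(2048, 2))  # 1024, matching n_ctx_slot in the log lines above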

thxCode (Collaborator) commented Jan 8, 2025

Please test with v0.0.104.

n00b001 closed this as completed Jan 9, 2025
n00b001 reopened this Jan 9, 2025
n00b001 (Author) commented Jan 9, 2025

Running again with the latest version, sending that huge 1.6M '1's file.

server command: llama-box.exe --port 8082 -c 0 -np 2 --host 0.0.0.0 -m "models/Qwen2-VL-7B-Instruct-abliterated-Q6_K_L.gguf" --mmproj "models/mmproj-Qwen2-VL-7B-Instruct-abliterated-f16.gguf" --gpu-layers 33 --verbose

<lots of logs before this point>
0.39.191.912 D srv              update_slots: decoding batch, n_tokens = 1
0.39.200.552 D que                start_loop: waiting for new tasks
0.39.200.556 D que                start_loop: processing new tasks
0.39.200.557 D que                start_loop: processing task, id = -1
0.39.200.558 D que                start_loop: update slots
0.39.200.559 D que                      post: new task, id = -1, front = 0
0.39.200.560 D slt              update_slots: rid 32636448 | id 00 | task 0 | slot decode token, n_ctx = 1024, n_past = 1017, n_cache_tokens = 0, truncated = 1
0.39.200.561 D srv              update_slots: decoding batch, n_tokens = 1
0.39.209.202 D que                start_loop: waiting for new tasks
0.39.209.206 D que                start_loop: processing new tasks
0.39.209.207 D que                start_loop: processing task, id = -1
0.39.209.208 D que                start_loop: update slots
0.39.209.209 D que                      post: new task, id = -1, front = 0
0.39.209.210 D slt              update_slots: rid 32636448 | id 00 | task 0 | slot decode token, n_ctx = 1024, n_past = 1018, n_cache_tokens = 0, truncated = 1
0.39.209.211 D srv              update_slots: decoding batch, n_tokens = 1
0.39.217.815 D que                start_loop: waiting for new tasks
0.39.217.819 D que                start_loop: processing new tasks
0.39.217.820 D que                start_loop: processing task, id = -1
0.39.217.821 D que                start_loop: update slots
0.39.217.822 D que                      post: new task, id = -1, front = 0
0.39.217.823 D slt              update_slots: rid 32636448 | id 00 | task 0 | slot decode token, n_ctx = 1024, n_past = 1019, n_cache_tokens = 0, truncated = 1
0.39.217.824 D srv              update_slots: decoding batch, n_tokens = 1
0.39.226.436 D que                start_loop: waiting for new tasks
0.39.226.439 D que                start_loop: processing new tasks
0.39.226.440 D que                start_loop: processing task, id = -1
0.39.226.441 D que                start_loop: update slots
0.39.226.442 D que                      post: new task, id = -1, front = 0
0.39.226.443 D slt              update_slots: rid 32636448 | id 00 | task 0 | slot decode token, n_ctx = 1024, n_past = 1020, n_cache_tokens = 0, truncated = 1
0.39.226.444 D srv              update_slots: decoding batch, n_tokens = 1
0.39.235.095 D que                start_loop: waiting for new tasks
0.39.235.099 D que                start_loop: processing new tasks
0.39.235.100 D que                start_loop: processing task, id = -1
0.39.235.101 D que                start_loop: update slots
0.39.235.102 D que                      post: new task, id = -1, front = 0
0.39.235.103 D slt              update_slots: rid 32636448 | id 00 | task 0 | slot decode token, n_ctx = 1024, n_past = 1021, n_cache_tokens = 0, truncated = 1
0.39.235.104 D srv              update_slots: decoding batch, n_tokens = 1
0.39.243.911 D que                start_loop: waiting for new tasks
0.39.243.916 D que                start_loop: processing new tasks
0.39.243.917 D que                start_loop: processing task, id = -1
0.39.243.917 D que                start_loop: update slots
0.39.243.918 D que                      post: new task, id = -1, front = 0
0.39.243.921 D slt              update_slots: rid 32636448 | id 00 | task 0 | slot decode token, n_ctx = 1024, n_past = 1022, n_cache_tokens = 0, truncated = 1
0.39.243.921 D srv              update_slots: decoding batch, n_tokens = 1
0.39.252.774 D que                start_loop: waiting for new tasks
0.39.252.778 D que                start_loop: processing new tasks
0.39.252.779 D que                start_loop: processing task, id = -1
0.39.252.780 D que                start_loop: update slots
0.39.252.781 D que                      post: new task, id = -1, front = 0
0.39.252.783 D slt              update_slots: rid 32636448 | id 00 | task 0 | slot decode token, n_ctx = 1024, n_past = 1023, n_cache_tokens = 0, truncated = 1
0.39.252.783 D srv              update_slots: decoding batch, n_tokens = 1
0.39.261.517 D que                start_loop: waiting for new tasks
0.39.261.521 D que                start_loop: processing new tasks
D:\a\llama-box\llama-box\llama.cpp\ggml\src\ggml-cuda\rope.cu:423: GGML_ASSERT(sections.v[0] > 0 || sections.v[1] > 0 || sections.v[2] > 0) failed
0.39.261.522 D

Finenyaco commented

This issue can be reproduced using the same test steps on llama-box v0.0.106, although it doesn't occur every time.

root@cu3h5fcp420c739ad990-g9ipx:~/workspace# /opt/conda/lib/python3.11/site-packages/gpustack/third_party/bin/llama-box/llama-box --version
version    : v0.0.106 (ccf473b)
compiler   : cc (GCC) 9.3.1 20200408 (Red Hat 9.3.1-2)
target     : x86_64-redhat-linux
vendor     : llama.cpp b4d92a59 (4484), stable-diffusion.cpp fbbb5ee (203)
root@cu3h5fcp420c739ad990-g9ipx:~/workspace# /opt/conda/lib/python3.11/site-packages/gpustack/third_party/bin/llama-box/llama-box --host 0.0.0.0 --embeddings --gpu-layers 29 --model /root/workspace/data/cache/model_scope/bartowski/Qwen2-VL-7B-Instruct-GGUF/Qwen2-VL-7B-Instruct-Q6_K_L.gguf --alias qwen2-vl --no-mmap --no-warmup --mmproj /root/workspace/data/cache/model_scope/bartowski/Qwen2-VL-7B-Instruct-GGUF/mmproj-Qwen2-VL-7B-Instruct-f32.gguf -c 0
0.00.001.630 I
0.00.001.674 I arguments  : /opt/conda/lib/python3.11/site-packages/gpustack/third_party/bin/llama-box/llama-box --host 0.0.0.0 --embeddings --gpu-layers 29 --model /root/workspace/data/cache/model_scope/bartowski/Qwen2-VL-7B-Instruct-GGUF/Qwen2-VL-7B-Instruct-Q6_K_L.gguf --alias qwen2-vl --no-mmap --no-warmup --mmproj /root/workspace/data/cache/model_scope/bartowski/Qwen2-VL-7B-Instruct-GGUF/mmproj-Qwen2-VL-7B-Instruct-f32.gguf -c 0
0.00.001.675 I version    : v0.0.106 (ccf473b)
0.00.001.676 I compiler   : cc (GCC) 9.3.1 20200408 (Red Hat 9.3.1-2)
0.00.001.676 I target     : x86_64-redhat-linux
0.00.001.677 I vendor     : llama.cpp b4d92a59 (4484), stable-diffusion.cpp fbbb5ee (203)
0.00.029.109 I ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
0.00.029.116 I ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
0.00.029.116 I ggml_cuda_init: found 1 CUDA devices:
0.00.029.411 I   Device 0: Tesla P40, compute capability 6.1, VMM: yes
0.00.048.728 I system_info: n_threads = 28 (n_threads_batch = 28) / 56 | CUDA : ARCHS = 600,610,700,750,800,860,890,900 | F16 = 1 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | LLAMAFILE = 1 |
0.00.048.734 I
0.00.048.871 I srv                      main: listening, hostname = 0.0.0.0, port = 8080, n_threads = 3 + 2
0.00.050.081 I srv                      main: loading model
0.00.050.085 I srv                load_model: loading model '/root/workspace/data/cache/model_scope/bartowski/Qwen2-VL-7B-Instruct-GGUF/Qwen2-VL-7B-Instruct-Q6_K_L.gguf'
0.00.050.092 W srv                load_model: n_ctx is too small for multimodal projection, setting to 2048
0.00.051.660 I clip_model_load: loaded meta data with 20 key-value pairs and 521 tensors from /root/workspace/data/cache/model_scope/bartowski/Qwen2-VL-7B-Instruct-GGUF/mmproj-Qwen2-VL-7B-Instruct-f32.gguf
0.00.051.670 I clip_model_load: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
0.00.051.688 I clip_model_load: - kv   0:                       general.architecture str              = clip
0.00.051.693 I clip_model_load: - kv   1:                        general.description str              = image encoder for Qwen2VL
0.00.051.696 I clip_model_load: - kv   2:                          general.file_type u32              = 0
0.00.051.698 I clip_model_load: - kv   3:                      clip.has_text_encoder bool             = false
0.00.051.699 I clip_model_load: - kv   4:                    clip.has_vision_encoder bool             = true
0.00.051.700 I clip_model_load: - kv   5:                    clip.has_qwen2vl_merger bool             = true
0.00.051.702 I clip_model_load: - kv   6:                        clip.projector_type str              = qwen2vl_merger
0.00.051.704 I clip_model_load: - kv   7:                              clip.use_silu bool             = false
0.00.051.705 I clip_model_load: - kv   8:                              clip.use_gelu bool             = false
0.00.051.706 I clip_model_load: - kv   9:                     clip.vision.patch_size u32              = 14
0.00.051.707 I clip_model_load: - kv  10:                     clip.vision.image_size u32              = 560
0.00.051.709 I clip_model_load: - kv  11:               clip.vision.embedding_length u32              = 1280
0.00.051.710 I clip_model_load: - kv  12:                 clip.vision.projection_dim u32              = 3584
0.00.051.713 I clip_model_load: - kv  13:           clip.vision.attention.head_count u32              = 16
0.00.051.723 I clip_model_load: - kv  14:   clip.vision.attention.layer_norm_epsilon f32              = 0.000001
0.00.051.724 I clip_model_load: - kv  15:                    clip.vision.block_count u32              = 32
0.00.051.725 I clip_model_load: - kv  16:            clip.vision.feed_forward_length u32              = 0
0.00.051.727 I clip_model_load: - kv  17:                               general.name str              = Qwen2-VL-7B-Instruct
0.00.051.738 I clip_model_load: - kv  18:                     clip.vision.image_mean arr[f32,3]       = [0.481455, 0.457828, 0.408211]
0.00.051.741 I clip_model_load: - kv  19:                      clip.vision.image_std arr[f32,3]       = [0.268630, 0.261303, 0.275777]
0.00.051.742 I clip_model_load: - type  f32:  521 tensors
0.00.052.669 I clip_model_load: CLIP using CUDA backend
0.00.052.677 I clip_model_load: params backend buffer size =  2577.82 MB (521 tensors)
0.02.587.753 E key clip.vision.image_grid_pinpoints not found in file
0.02.587.857 E key clip.vision.mm_patch_merge_type not found in file
0.02.587.868 E key clip.vision.image_crop_resolution not found in file
0.02.590.900 I clip_model_load: compute allocated memory: 198.93 MiB
0.02.590.968 I llama_model_load_from_file: using device CUDA0 (Tesla P40) - 20946 MiB free
0.02.672.240 I llama_model_loader: loaded meta data with 37 key-value pairs and 339 tensors from /root/workspace/data/cache/model_scope/bartowski/Qwen2-VL-7B-Instruct-GGUF/Qwen2-VL-7B-Instruct-Q6_K_L.gguf (version GGUF V3 (latest))
0.02.672.260 I llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
0.02.672.265 I llama_model_loader: - kv   0:                       general.architecture str              = qwen2vl
0.02.672.267 I llama_model_loader: - kv   1:                               general.type str              = model
0.02.672.269 I llama_model_loader: - kv   2:                               general.name str              = Qwen2 VL 7B Instruct
0.02.672.270 I llama_model_loader: - kv   3:                           general.finetune str              = Instruct
0.02.672.272 I llama_model_loader: - kv   4:                           general.basename str              = Qwen2-VL
0.02.672.273 I llama_model_loader: - kv   5:                         general.size_label str              = 7B
0.02.672.275 I llama_model_loader: - kv   6:                            general.license str              = apache-2.0
0.02.672.278 I llama_model_loader: - kv   7:                   general.base_model.count u32              = 1
0.02.672.280 I llama_model_loader: - kv   8:                  general.base_model.0.name str              = Qwen2 VL 7B
0.02.672.281 I llama_model_loader: - kv   9:          general.base_model.0.organization str              = Qwen
0.02.672.284 I llama_model_loader: - kv  10:              general.base_model.0.repo_url str              = https://huggingface.co/Qwen/Qwen2-VL-7B
0.02.672.306 I llama_model_loader: - kv  11:                               general.tags arr[str,2]       = ["multimodal", "image-text-to-text"]
0.02.672.310 I llama_model_loader: - kv  12:                          general.languages arr[str,1]       = ["en"]
0.02.672.312 I llama_model_loader: - kv  13:                        qwen2vl.block_count u32              = 28
0.02.672.313 I llama_model_loader: - kv  14:                     qwen2vl.context_length u32              = 32768
0.02.672.316 I llama_model_loader: - kv  15:                   qwen2vl.embedding_length u32              = 3584
0.02.672.318 I llama_model_loader: - kv  16:                qwen2vl.feed_forward_length u32              = 18944
0.02.672.320 I llama_model_loader: - kv  17:               qwen2vl.attention.head_count u32              = 28
0.02.672.321 I llama_model_loader: - kv  18:            qwen2vl.attention.head_count_kv u32              = 4
0.02.672.329 I llama_model_loader: - kv  19:                     qwen2vl.rope.freq_base f32              = 1000000.000000
0.02.672.332 I llama_model_loader: - kv  20:   qwen2vl.attention.layer_norm_rms_epsilon f32              = 0.000001
0.02.672.334 I llama_model_loader: - kv  21:                          general.file_type u32              = 18
0.02.672.340 I llama_model_loader: - kv  22:            qwen2vl.rope.dimension_sections arr[i32,4]       = [16, 24, 24, 0]
0.02.672.341 I llama_model_loader: - kv  23:                       tokenizer.ggml.model str              = gpt2
0.02.672.343 I llama_model_loader: - kv  24:                         tokenizer.ggml.pre str              = qwen2
0.02.727.931 I llama_model_loader: - kv  25:                      tokenizer.ggml.tokens arr[str,152064]  = ["!", "\"", "#", "$", "%", "&", "'", ...
0.02.750.406 I llama_model_loader: - kv  26:                  tokenizer.ggml.token_type arr[i32,152064]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
0.02.808.739 I llama_model_loader: - kv  27:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
0.02.808.749 I llama_model_loader: - kv  28:                tokenizer.ggml.eos_token_id u32              = 151645
0.02.808.751 I llama_model_loader: - kv  29:            tokenizer.ggml.padding_token_id u32              = 151643
0.02.808.752 I llama_model_loader: - kv  30:                tokenizer.ggml.bos_token_id u32              = 151643
0.02.808.756 I llama_model_loader: - kv  31:                    tokenizer.chat_template str              = {% set image_count = namespace(value=...
0.02.808.757 I llama_model_loader: - kv  32:               general.quantization_version u32              = 2
0.02.808.759 I llama_model_loader: - kv  33:                      quantize.imatrix.file str              = /models_out/Qwen2-VL-7B-Instruct-GGUF...
0.02.808.761 I llama_model_loader: - kv  34:                   quantize.imatrix.dataset str              = /training_dir/calibration_datav3.txt
0.02.808.763 I llama_model_loader: - kv  35:             quantize.imatrix.entries_count i32              = 196
0.02.808.764 I llama_model_loader: - kv  36:              quantize.imatrix.chunks_count i32              = 128
0.02.808.765 I llama_model_loader: - type  f32:  141 tensors
0.02.808.766 I llama_model_loader: - type q8_0:    2 tensors
0.02.808.766 I llama_model_loader: - type q6_K:  196 tensors
0.02.808.768 I print_info: file format = GGUF V3 (latest)
0.02.808.769 I print_info: file type   = Q6_K
0.02.808.779 I print_info: file size   = 6.06 GiB (6.84 BPW)
0.03.096.180 I load: special tokens cache size = 14
0.03.346.370 I load: token to piece cache size = 0.9309 MB
0.03.346.394 I print_info: arch             = qwen2vl
0.03.346.395 I print_info: vocab_only       = 0
0.03.346.396 I print_info: n_ctx_train      = 32768
0.03.346.397 I print_info: n_embd           = 3584
0.03.346.398 I print_info: n_layer          = 28
0.03.346.411 I print_info: n_head           = 28
0.03.346.414 I print_info: n_head_kv        = 4
0.03.346.415 I print_info: n_rot            = 128
0.03.346.415 I print_info: n_swa            = 0
0.03.346.417 I print_info: n_embd_head_k    = 128
0.03.346.417 I print_info: n_embd_head_v    = 128
0.03.346.419 I print_info: n_gqa            = 7
0.03.346.422 I print_info: n_embd_k_gqa     = 512
0.03.346.424 I print_info: n_embd_v_gqa     = 512
0.03.346.429 I print_info: f_norm_eps       = 0.0e+00
0.03.346.431 I print_info: f_norm_rms_eps   = 1.0e-06
0.03.346.433 I print_info: f_clamp_kqv      = 0.0e+00
0.03.346.433 I print_info: f_max_alibi_bias = 0.0e+00
0.03.346.434 I print_info: f_logit_scale    = 0.0e+00
0.03.346.437 I print_info: n_ff             = 18944
0.03.346.438 I print_info: n_expert         = 0
0.03.346.439 I print_info: n_expert_used    = 0
0.03.346.439 I print_info: causal attn      = 1
0.03.346.440 I print_info: pooling type     = 0
0.03.346.441 I print_info: rope type        = 8
0.03.346.443 I print_info: rope scaling     = linear
0.03.346.446 I print_info: freq_base_train  = 1000000.0
0.03.346.447 I print_info: freq_scale_train = 1
0.03.346.448 I print_info: n_ctx_orig_yarn  = 32768
0.03.346.450 I print_info: rope_finetuned   = unknown
0.03.346.451 I print_info: ssm_d_conv       = 0
0.03.346.451 I print_info: ssm_d_inner      = 0
0.03.346.452 I print_info: ssm_d_state      = 0
0.03.346.452 I print_info: ssm_dt_rank      = 0
0.03.346.453 I print_info: ssm_dt_b_c_rms   = 0
0.03.346.454 I print_info: model type       = 7B
0.03.346.455 I print_info: model params     = 7.62 B
0.03.346.456 I print_info: general.name     = Qwen2 VL 7B Instruct
0.03.346.459 I print_info: vocab type       = BPE
0.03.346.460 I print_info: n_vocab          = 152064
0.03.346.461 I print_info: n_merges         = 151387
0.03.346.462 I print_info: BOS token        = 151643 '<|endoftext|>'
0.03.346.462 I print_info: EOS token        = 151645 '<|im_end|>'
0.03.346.462 I print_info: EOT token        = 151645 '<|im_end|>'
0.03.346.465 I print_info: PAD token        = 151643 '<|endoftext|>'
0.03.346.466 I print_info: LF token         = 148848 'ÄĬ'
0.03.346.466 I print_info: EOG token        = 151643 '<|endoftext|>'
0.03.346.467 I print_info: EOG token        = 151645 '<|im_end|>'
0.03.346.468 I print_info: max token length = 256
0.03.628.119 I load_tensors: offloading 28 repeating layers to GPU
0.03.628.128 I load_tensors: offloading output layer to GPU
0.03.628.129 I load_tensors: offloaded 29/29 layers to GPU
0.03.628.137 I load_tensors:    CUDA_Host model buffer size =   552.23 MiB
0.03.628.138 I load_tensors:        CUDA0 model buffer size =  5658.31 MiB
0.06.831.479 I common_init_from_params: model requires M-RoPE, increasing batch size by 4x
0.06.831.522 I llama_init_from_model: n_seq_max     = 1
0.06.831.523 I llama_init_from_model: n_ctx         = 2048
0.06.831.524 I llama_init_from_model: n_ctx_per_seq = 2048
0.06.831.525 I llama_init_from_model: n_batch       = 2048
0.06.831.525 I llama_init_from_model: n_ubatch      = 512
0.06.831.526 I llama_init_from_model: flash_attn    = 0
0.06.831.532 I llama_init_from_model: freq_base     = 1000000.0
0.06.831.533 I llama_init_from_model: freq_scale    = 1
0.06.831.536 W llama_init_from_model: n_ctx_per_seq (2048) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
0.06.831.558 I llama_kv_cache_init: kv_size = 2048, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 28, can_shift = 1
0.06.832.474 I llama_kv_cache_init:      CUDA0 KV buffer size =   112.00 MiB
0.06.832.483 I llama_init_from_model: KV self size  =  112.00 MiB, K (f16):   56.00 MiB, V (f16):   56.00 MiB
0.06.833.763 I llama_init_from_model:  CUDA_Host  output buffer size =     0.01 MiB
0.06.843.331 I llama_init_from_model:      CUDA0 compute buffer size =   304.00 MiB
0.06.843.339 I llama_init_from_model:  CUDA_Host compute buffer size =    11.01 MiB
0.06.843.340 I llama_init_from_model: graph nodes  = 986
0.06.843.341 I llama_init_from_model: graph splits = 2
0.06.843.343 I common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
0.06.843.355 I srv                load_model: prompt caching disabled
0.06.850.127 I srv                load_model: chat template, built_in: true, alias: chatml, tool call: supported, example:
<|im_start|>system
You are a helpful assistant.

## Tools

You CAN call functions to assist with the user query. Do not make assumptions about what values to plug into functions.

You are provided with following function tools:

### get_weather

get_weather:  Parameters: {"type":"object","properties":{"location":{"type":"string"}}}Format the arguments as a JSON object.

### get_temperature

get_temperature: Return the temperature according to the location. Parameters: {"type":"object","properties":{"location":{"type":"string"}}}Format the arguments as a JSON object.

When you can reply with your internal knowledge, reply directly without any function calls. Otherwise, for each function call, return a JSON object with function name and arguments within <tool_call></tool_call> XML tags:
<tool_call>
{"name": The name of the function to use, "arguments": The input of the function, must be a JSON object in compact format}
</tool_call>
<tool_result>
The function results.
</tool_result>
Reply based on the function results.<|im_end|>
<|im_start|>user
Hello.<|im_end|>
Hi there.<|im_end|>
<|im_start|>user
What's the weather like in Paris today?<|im_end|>
<|im_start|>assistant

0.06.850.439 I srv                      main: initializing server
0.06.850.485 I srv                      init: initializing slots, n_slots = 1
0.06.850.730 I srv                      main: starting server
0.14.314.824 I srv        log_server_request: rid 12795911615066 | POST /v1/chat/completions 127.0.0.1:59714
0.14.337.583 I srv oaicompat_completions_req: rid 12795911615066 | {"messages":"[...]","model":"hermes2"}
0.15.846.588 W slt              update_slots: rid 12795911615066 | id 00 | task 0 | input truncated, n_ctx = 2048, n_keep = 0, n_left = 2048, n_prompt_tokens = 1032
0.49.729.400 W slt              update_slots: rid 12795911615066 | id 00 | task 0 | slot context shift, n_keep = 0, n_left = 2047, n_discard = 1023
/home/runner/work/llama-box/llama-box/llama.cpp/ggml/src/ggml-cuda/rope.cu:423: GGML_ASSERT(sections.v[0] > 0 || sections.v[1] > 0 || sections.v[2] > 0) failed

Environment
OS: Ubuntu 22.04
GPU: Tesla P40

thxCode (Collaborator) commented Jan 16, 2025

A more reliable way to reproduce this is with the following steps.

  1. Start llama-box with a small context, e.g. -c 2048 -np 4, so each slot takes n_ctx = 512: llama-box --verbosity 3 -c 2048 -np 4 --host 0.0.0.0 --no-warmup --no-mmap -m /home/frank/bartowski/Qwen2-VL-2B-Instruct-GGUF/Qwen2-VL-2B-Instruct-f16.gguf --mmproj /home/frank/bartowski/Qwen2-VL-2B-Instruct-GGUF/mmproj-Qwen2-VL-2B-Instruct-f32.gguf -ngl 99 --visual-max-image-size 1344 --verbosity 3
  2. Construct the request data from Wikipedia (a rough Python equivalent is sketched after this list): CONTENT="$(curl https://en.wikipedia.org/w/api.php\?action\=query\&format\=json\&titles\=China\&prop\=extracts\&exintro\&explaintext | jq '.query.pages | to_entries | .[0].value.extract | gsub("\n"; "\\n") | gsub("\t"; "\\t")')"; \ echo "{\"model\": \"mistral-nemo\", \"stream\": true, \"messages\": [{\"role\":\"user\", \"content\": [{\"type\": \"text\", \"text\": \"Please read the following content and summarize the article in 5 sentences.\"}, {\"type\": \"text\", \"text\": "$CONTENT"}]}]}" > /tmp/data.json
  3. Chat with the data: chat.sh @/tmp/data.json
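
For anyone who finds the shell pipeline in step 2 hard to read, here is a rough Python equivalent (a sketch only, assuming the requests library; it writes the same payload shape to /tmp/data.json):

import json
import requests

# Fetch the intro extract of the "China" article from the Wikipedia API,
# mirroring the curl | jq pipeline in step 2.
params = {
    "action": "query",
    "format": "json",
    "titles": "China",
    "prop": "extracts",
    "exintro": "1",
    "explaintext": "1",
}
pages = requests.get("https://en.wikipedia.org/w/api.php", params=params, timeout=30).json()["query"]["pages"]
extract = next(iter(pages.values()))["extract"]

# Wrap it in the same chat-completion request body and write /tmp/data.json.
data = {
    "model": "mistral-nemo",
    "stream": True,
    "messages": [{
        "role": "user",
        "content": [
            {"type": "text", "text": "Please read the following content and summarize the article in 5 sentences."},
            {"type": "text", "text": extract},
        ],
    }],
}
with open("/tmp/data.json", "w") as f:
    json.dump(data, f)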

This issue happens the first time a long context shift occurs during decoding; we get a warning log just before the crash: slot context shift, n_keep = 0, n_left = 511, n_discard = 255.

thxCode (Collaborator) commented Jan 16, 2025

Context shifting causes chaos here; it would be better to leverage YaRN or a customized RoPE to extend the context size without fine-tuning.

ggml-org/llama.cpp#2054 (comment)
ggml-org/llama.cpp#2268 (comment)

thxCode (Collaborator) commented Jan 16, 2025

@Finenyaco, please test with v0.0.107.

n00b001 (Author) commented Jan 17, 2025

@Finenyaco

I have tested with my 1.6M-ones file three times.

In all of the tests there was no crash, so this seems fixed for me.

Finenyaco commented

Verified in v0.0.107
