Eval bug: bartowski/functionary-small-v3.2-GGUF:Q4_K_M model prepends "assistant\n" to text responses when tools are provided #12213

Closed
edmcman opened this issue Mar 5, 2025 · 2 comments


edmcman commented Mar 5, 2025

Name and Version

ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 4070 Laptop GPU, compute capability 8.9, VMM: yes
version: 4783 (a800ae4)
built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu

Operating systems

Linux

GGML backends

CUDA

Hardware

i9-13900HX + NVIDIA GeForce RTX 4070

Models

https://huggingface.co/bartowski/functionary-small-v3.2-GGUF/blob/main/functionary-small-v3.2-Q4_K_M.gguf

Problem description & steps to reproduce

docker run --gpus all --rm --name llama.cpp -p 8080:8080 -v /etc/ssl/certs:/etc/ssl/certs:ro -v /home/ed/.llama.cpp/models:/root/.cache ghcr.io/ggml-org/llama.cpp:full-cuda -s --ctx-size 0 --jinja -fa -hf bartowski/functionary-small-v3.2-GGUF:Q4_K_M --host 0.0.0.0 -ngl 10 --verbose

curl http://localhost:8080/v1/chat/completions -d '{
"model": "gpt-3.5-turbo",
"messages": [
    {
    "role": "user",
    "content": "Hi."
    }
]
}' | jq '.choices[0]'
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  2352  100  2254  100    98   2762    120 --:--:-- --:--:-- --:--:--  2882
{
  "finish_reason": "stop",
  "index": 0,
  "message": {
    "role": "assistant",
    "content": "Hello! How can I help you today?"
  }
}
curl http://localhost:8080/v1/chat/completions -d '{
"model": "gpt-3.5-turbo",
"tools": [
    {
    "type":"function",
    "function":{
        "name":"python",
        "description":"Runs code in an ipython interpreter and returns the result of the execution after 60 seconds.",
        "parameters":{
        "type":"object",
        "properties":{
            "code":{
            "type":"string",
            "description":"The code to run in the ipython interpreter."
            }
        },
        "required":["code"]
        }
    }
    }
],
"messages": [
    {
    "role": "user",
    "content": "Hi"
    }
]
}' | jq '.choices[0]'
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  4096  100  3527  100   569   4240    684 --:--:-- --:--:-- --:--:--  4923
{
  "finish_reason": "stop",
  "index": 0,
  "message": {
    "role": "assistant",
    "content": "assistant\nHello! How can I assist you today?"
  }
}
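For anyone who needs to script the comparison above or work around the symptom on the client side, here is a minimal Python sketch. It assumes the server is reachable at http://localhost:8080 as in the docker command, and the helper names (`build_payload`, `chat`, `strip_role_prefix`) are hypothetical, not part of llama.cpp. Stripping the prefix only hides the symptom on the client; it does not address whatever the server does with the chat template, and it would also alter a legitimate reply that really begins with "assistant\n".

```python
import json
import urllib.request
from typing import Optional

# The stray prefix observed in the tool-enabled response above.
ROLE_PREFIX = "assistant\n"

# Same tool definition as in the second curl request.
PYTHON_TOOL = {
    "type": "function",
    "function": {
        "name": "python",
        "description": "Runs code in an ipython interpreter and returns the "
                       "result of the execution after 60 seconds.",
        "parameters": {
            "type": "object",
            "properties": {
                "code": {
                    "type": "string",
                    "description": "The code to run in the ipython interpreter.",
                }
            },
            "required": ["code"],
        },
    },
}


def build_payload(user_text: str, tools: Optional[list] = None) -> dict:
    """Build a chat completion request body; 'tools' is omitted when not given."""
    payload = {
        "model": "gpt-3.5-turbo",
        "messages": [{"role": "user", "content": user_text}],
    }
    if tools:
        payload["tools"] = tools
    return payload


def chat(payload: dict, base_url: str = "http://localhost:8080") -> dict:
    """POST the payload to /v1/chat/completions and return choices[0].message."""
    req = urllib.request.Request(
        base_url + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]


def strip_role_prefix(content: Optional[str], prefix: str = ROLE_PREFIX) -> Optional[str]:
    """Client-side workaround: drop one leading role-name prefix if present."""
    if content and content.startswith(prefix):
        return content[len(prefix):]
    return content
```

With the server running, `chat(build_payload("Hi."))` and `chat(build_payload("Hi", tools=[PYTHON_TOOL]))` correspond to the two curl requests; passing each message's `content` through `strip_role_prefix` leaves the first unchanged and removes the leading "assistant\n" from the second.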

First Bad Commit

No response

Relevant log output

+ docker run --gpus all --rm --name llama.cpp -p 8080:8080 -v /etc/ssl/certs:/etc/ssl/certs:ro -v /home/ed/.llama.cpp/models:/root/.cache ghcr.io/ggml-org/llama.cpp:full-cuda -s --ctx-size 0 --jinja -fa -hf bartowski/functionary-small-v3.2-GGUF:Q4_K_M --host 0.0.0.0 -ngl 10 --verbose
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4070 Laptop GPU, compute capability 8.9, VMM: yes
build: 4783 (a800ae46) with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
system info: n_threads = 8, n_threads_batch = 8, total_threads = 32

system_info: n_threads = 8 (n_threads_batch = 8) / 32 | CUDA : ARCHS = 500,610,700,750,800 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 | 

main: HTTP server is listening, hostname: 0.0.0.0, port: 8080, http threads: 31
main: loading model
srv    load_model: loading model '/root/.cache/llama.cpp/bartowski_functionary-small-v3.2-GGUF_functionary-small-v3.2-Q4_K_M.gguf'
common_download_file: previous metadata file found /root/.cache/llama.cpp/bartowski_functionary-small-v3.2-GGUF_functionary-small-v3.2-Q4_K_M.gguf.json: {"etag":"\"e0ce54ab24981f28174430665c1ed516-308\"","lastModified":"Thu, 08 Aug 2024 09:29:55 GMT","url":"https://huggingface.co/bartowski/functionary-small-v3.2-GGUF/resolve/main/functionary-small-v3.2-Q4_K_M.gguf"}
curl_perform_with_retry: Trying to download from https://huggingface.co/bartowski/functionary-small-v3.2-GGUF/resolve/main/functionary-small-v3.2-Q4_K_M.gguf (attempt 1 of 3)...
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 4070 Laptop GPU) - 7793 MiB free
llama_model_loader: loaded meta data with 34 key-value pairs and 292 tensors from /root/.cache/llama.cpp/bartowski_functionary-small-v3.2-GGUF_functionary-small-v3.2-Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Meta Llama 3.1 8B Instruct
llama_model_loader: - kv   3:                            general.version str              = v3.2
llama_model_loader: - kv   4:                       general.organization str              = Meta Llama
llama_model_loader: - kv   5:                           general.finetune str              = Instruct
llama_model_loader: - kv   6:                           general.basename str              = Meta-Llama-3.1
llama_model_loader: - kv   7:                         general.size_label str              = 8B
llama_model_loader: - kv   8:                            general.license str              = mit
llama_model_loader: - kv   9:                          llama.block_count u32              = 32
llama_model_loader: - kv  10:                       llama.context_length u32              = 131072
llama_model_loader: - kv  11:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv  12:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv  13:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv  14:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv  15:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv  16:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  17:                          general.file_type u32              = 15
llama_model_loader: - kv  18:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  19:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  20:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  21:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  22:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  23:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  24:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  25:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  26:                tokenizer.ggml.eos_token_id u32              = 128009
llama_model_loader: - kv  27:            tokenizer.ggml.padding_token_id u32              = 128009
llama_model_loader: - kv  28:                    tokenizer.chat_template str              = {% for message in messages %}\n{% if m...
llama_model_loader: - kv  29:               general.quantization_version u32              = 2
llama_model_loader: - kv  30:                      quantize.imatrix.file str              = /models_out/functionary-small-v3.2-GG...
llama_model_loader: - kv  31:                   quantize.imatrix.dataset str              = /training_dir/calibration_datav3.txt
llama_model_loader: - kv  32:             quantize.imatrix.entries_count i32              = 224
llama_model_loader: - kv  33:              quantize.imatrix.chunks_count i32              = 125
llama_model_loader: - type  f32:   66 tensors
llama_model_loader: - type q4_K:  193 tensors
llama_model_loader: - type q6_K:   33 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_K - Medium
print_info: file size   = 4.58 GiB (4.89 BPW) 
init_tokenizer: initializing tokenizer for type 2
load: control token: 128254 '<|reserved_special_token_246|>' is not marked as EOG
load: control token: 128249 '<|reserved_special_token_241|>' is not marked as EOG
load: control token: 128246 '<|reserved_special_token_238|>' is not marked as EOG
load: control token: 128243 '<|reserved_special_token_235|>' is not marked as EOG
load: control token: 128242 '<|reserved_special_token_234|>' is not marked as EOG
load: control token: 128241 '<|reserved_special_token_233|>' is not marked as EOG
load: control token: 128240 '<|reserved_special_token_232|>' is not marked as EOG
load: control token: 128235 '<|reserved_special_token_227|>' is not marked as EOG
load: control token: 128231 '<|reserved_special_token_223|>' is not marked as EOG
load: control token: 128230 '<|reserved_special_token_222|>' is not marked as EOG
load: control token: 128228 '<|reserved_special_token_220|>' is not marked as EOG
load: control token: 128225 '<|reserved_special_token_217|>' is not marked as EOG
load: control token: 128218 '<|reserved_special_token_210|>' is not marked as EOG
load: control token: 128214 '<|reserved_special_token_206|>' is not marked as EOG
load: control token: 128213 '<|reserved_special_token_205|>' is not marked as EOG
load: control token: 128207 '<|reserved_special_token_199|>' is not marked as EOG
load: control token: 128206 '<|reserved_special_token_198|>' is not marked as EOG
load: control token: 128204 '<|reserved_special_token_196|>' is not marked as EOG
load: control token: 128200 '<|reserved_special_token_192|>' is not marked as EOG
load: control token: 128199 '<|reserved_special_token_191|>' is not marked as EOG
load: control token: 128198 '<|reserved_special_token_190|>' is not marked as EOG
load: control token: 128196 '<|reserved_special_token_188|>' is not marked as EOG
load: control token: 128194 '<|reserved_special_token_186|>' is not marked as EOG
load: control token: 128193 '<|reserved_special_token_185|>' is not marked as EOG
load: control token: 128188 '<|reserved_special_token_180|>' is not marked as EOG
load: control token: 128187 '<|reserved_special_token_179|>' is not marked as EOG
load: control token: 128185 '<|reserved_special_token_177|>' is not marked as EOG
load: control token: 128184 '<|reserved_special_token_176|>' is not marked as EOG
load: control token: 128180 '<|reserved_special_token_172|>' is not marked as EOG
load: control token: 128179 '<|reserved_special_token_171|>' is not marked as EOG
load: control token: 128178 '<|reserved_special_token_170|>' is not marked as EOG
load: control token: 128177 '<|reserved_special_token_169|>' is not marked as EOG
load: control token: 128176 '<|reserved_special_token_168|>' is not marked as EOG
load: control token: 128175 '<|reserved_special_token_167|>' is not marked as EOG
load: control token: 128171 '<|reserved_special_token_163|>' is not marked as EOG
load: control token: 128170 '<|reserved_special_token_162|>' is not marked as EOG
load: control token: 128169 '<|reserved_special_token_161|>' is not marked as EOG
load: control token: 128168 '<|reserved_special_token_160|>' is not marked as EOG
load: control token: 128165 '<|reserved_special_token_157|>' is not marked as EOG
load: control token: 128162 '<|reserved_special_token_154|>' is not marked as EOG
load: control token: 128158 '<|reserved_special_token_150|>' is not marked as EOG
load: control token: 128156 '<|reserved_special_token_148|>' is not marked as EOG
load: control token: 128155 '<|reserved_special_token_147|>' is not marked as EOG
load: control token: 128154 '<|reserved_special_token_146|>' is not marked as EOG
load: control token: 128151 '<|reserved_special_token_143|>' is not marked as EOG
load: control token: 128149 '<|reserved_special_token_141|>' is not marked as EOG
load: control token: 128147 '<|reserved_special_token_139|>' is not marked as EOG
load: control token: 128146 '<|reserved_special_token_138|>' is not marked as EOG
load: control token: 128144 '<|reserved_special_token_136|>' is not marked as EOG
load: control token: 128142 '<|reserved_special_token_134|>' is not marked as EOG
load: control token: 128141 '<|reserved_special_token_133|>' is not marked as EOG
load: control token: 128138 '<|reserved_special_token_130|>' is not marked as EOG
load: control token: 128136 '<|reserved_special_token_128|>' is not marked as EOG
load: control token: 128135 '<|reserved_special_token_127|>' is not marked as EOG
load: control token: 128134 '<|reserved_special_token_126|>' is not marked as EOG
load: control token: 128133 '<|reserved_special_token_125|>' is not marked as EOG
load: control token: 128131 '<|reserved_special_token_123|>' is not marked as EOG
load: control token: 128128 '<|reserved_special_token_120|>' is not marked as EOG
load: control token: 128124 '<|reserved_special_token_116|>' is not marked as EOG
load: control token: 128123 '<|reserved_special_token_115|>' is not marked as EOG
load: control token: 128122 '<|reserved_special_token_114|>' is not marked as EOG
load: control token: 128119 '<|reserved_special_token_111|>' is not marked as EOG
load: control token: 128115 '<|reserved_special_token_107|>' is not marked as EOG
load: control token: 128112 '<|reserved_special_token_104|>' is not marked as EOG
load: control token: 128110 '<|reserved_special_token_102|>' is not marked as EOG
load: control token: 128109 '<|reserved_special_token_101|>' is not marked as EOG
load: control token: 128108 '<|reserved_special_token_100|>' is not marked as EOG
load: control token: 128106 '<|reserved_special_token_98|>' is not marked as EOG
load: control token: 128103 '<|reserved_special_token_95|>' is not marked as EOG
load: control token: 128102 '<|reserved_special_token_94|>' is not marked as EOG
load: control token: 128101 '<|reserved_special_token_93|>' is not marked as EOG
load: control token: 128097 '<|reserved_special_token_89|>' is not marked as EOG
load: control token: 128091 '<|reserved_special_token_83|>' is not marked as EOG
load: control token: 128090 '<|reserved_special_token_82|>' is not marked as EOG
load: control token: 128089 '<|reserved_special_token_81|>' is not marked as EOG
load: control token: 128087 '<|reserved_special_token_79|>' is not marked as EOG
load: control token: 128085 '<|reserved_special_token_77|>' is not marked as EOG
load: control token: 128081 '<|reserved_special_token_73|>' is not marked as EOG
load: control token: 128078 '<|reserved_special_token_70|>' is not marked as EOG
load: control token: 128076 '<|reserved_special_token_68|>' is not marked as EOG
load: control token: 128075 '<|reserved_special_token_67|>' is not marked as EOG
load: control token: 128073 '<|reserved_special_token_65|>' is not marked as EOG
load: control token: 128068 '<|reserved_special_token_60|>' is not marked as EOG
load: control token: 128067 '<|reserved_special_token_59|>' is not marked as EOG
load: control token: 128065 '<|reserved_special_token_57|>' is not marked as EOG
load: control token: 128063 '<|reserved_special_token_55|>' is not marked as EOG
load: control token: 128062 '<|reserved_special_token_54|>' is not marked as EOG
load: control token: 128060 '<|reserved_special_token_52|>' is not marked as EOG
load: control token: 128059 '<|reserved_special_token_51|>' is not marked as EOG
load: control token: 128057 '<|reserved_special_token_49|>' is not marked as EOG
load: control token: 128054 '<|reserved_special_token_46|>' is not marked as EOG
load: control token: 128046 '<|reserved_special_token_38|>' is not marked as EOG
load: control token: 128045 '<|reserved_special_token_37|>' is not marked as EOG
load: control token: 128044 '<|reserved_special_token_36|>' is not marked as EOG
load: control token: 128043 '<|reserved_special_token_35|>' is not marked as EOG
load: control token: 128038 '<|reserved_special_token_30|>' is not marked as EOG
load: control token: 128036 '<|reserved_special_token_28|>' is not marked as EOG
load: control token: 128035 '<|reserved_special_token_27|>' is not marked as EOG
load: control token: 128032 '<|reserved_special_token_24|>' is not marked as EOG
load: control token: 128028 '<|reserved_special_token_20|>' is not marked as EOG
load: control token: 128027 '<|reserved_special_token_19|>' is not marked as EOG
load: control token: 128024 '<|reserved_special_token_16|>' is not marked as EOG
load: control token: 128023 '<|reserved_special_token_15|>' is not marked as EOG
load: control token: 128022 '<|reserved_special_token_14|>' is not marked as EOG
load: control token: 128021 '<|reserved_special_token_13|>' is not marked as EOG
load: control token: 128018 '<|reserved_special_token_10|>' is not marked as EOG
load: control token: 128016 '<|reserved_special_token_8|>' is not marked as EOG
load: control token: 128015 '<|reserved_special_token_7|>' is not marked as EOG
load: control token: 128013 '<|reserved_special_token_5|>' is not marked as EOG
load: control token: 128011 '<|reserved_special_token_3|>' is not marked as EOG
load: control token: 128005 '<|reserved_special_token_2|>' is not marked as EOG
load: control token: 128004 '<|finetune_right_pad_id|>' is not marked as EOG
load: control token: 128002 '<|reserved_special_token_0|>' is not marked as EOG
load: control token: 128252 '<|reserved_special_token_244|>' is not marked as EOG
load: control token: 128190 '<|reserved_special_token_182|>' is not marked as EOG
load: control token: 128183 '<|reserved_special_token_175|>' is not marked as EOG
load: control token: 128137 '<|reserved_special_token_129|>' is not marked as EOG
load: control token: 128182 '<|reserved_special_token_174|>' is not marked as EOG
load: control token: 128040 '<|reserved_special_token_32|>' is not marked as EOG
load: control token: 128048 '<|reserved_special_token_40|>' is not marked as EOG
load: control token: 128092 '<|reserved_special_token_84|>' is not marked as EOG
load: control token: 128215 '<|reserved_special_token_207|>' is not marked as EOG
load: control token: 128107 '<|reserved_special_token_99|>' is not marked as EOG
load: control token: 128208 '<|reserved_special_token_200|>' is not marked as EOG
load: control token: 128145 '<|reserved_special_token_137|>' is not marked as EOG
load: control token: 128031 '<|reserved_special_token_23|>' is not marked as EOG
load: control token: 128129 '<|reserved_special_token_121|>' is not marked as EOG
load: control token: 128201 '<|reserved_special_token_193|>' is not marked as EOG
load: control token: 128074 '<|reserved_special_token_66|>' is not marked as EOG
load: control token: 128095 '<|reserved_special_token_87|>' is not marked as EOG
load: control token: 128186 '<|reserved_special_token_178|>' is not marked as EOG
load: control token: 128143 '<|reserved_special_token_135|>' is not marked as EOG
load: control token: 128229 '<|reserved_special_token_221|>' is not marked as EOG
load: control token: 128007 '<|end_header_id|>' is not marked as EOG
load: control token: 128055 '<|reserved_special_token_47|>' is not marked as EOG
load: control token: 128056 '<|reserved_special_token_48|>' is not marked as EOG
load: control token: 128061 '<|reserved_special_token_53|>' is not marked as EOG
load: control token: 128153 '<|reserved_special_token_145|>' is not marked as EOG
load: control token: 128152 '<|reserved_special_token_144|>' is not marked as EOG
load: control token: 128212 '<|reserved_special_token_204|>' is not marked as EOG
load: control token: 128172 '<|reserved_special_token_164|>' is not marked as EOG
load: control token: 128160 '<|reserved_special_token_152|>' is not marked as EOG
load: control token: 128041 '<|reserved_special_token_33|>' is not marked as EOG
load: control token: 128181 '<|reserved_special_token_173|>' is not marked as EOG
load: control token: 128094 '<|reserved_special_token_86|>' is not marked as EOG
load: control token: 128118 '<|reserved_special_token_110|>' is not marked as EOG
load: control token: 128236 '<|reserved_special_token_228|>' is not marked as EOG
load: control token: 128148 '<|reserved_special_token_140|>' is not marked as EOG
load: control token: 128042 '<|reserved_special_token_34|>' is not marked as EOG
load: control token: 128139 '<|reserved_special_token_131|>' is not marked as EOG
load: control token: 128173 '<|reserved_special_token_165|>' is not marked as EOG
load: control token: 128239 '<|reserved_special_token_231|>' is not marked as EOG
load: control token: 128157 '<|reserved_special_token_149|>' is not marked as EOG
load: control token: 128052 '<|reserved_special_token_44|>' is not marked as EOG
load: control token: 128026 '<|reserved_special_token_18|>' is not marked as EOG
load: control token: 128003 '<|reserved_special_token_1|>' is not marked as EOG
load: control token: 128019 '<|reserved_special_token_11|>' is not marked as EOG
load: control token: 128116 '<|reserved_special_token_108|>' is not marked as EOG
load: control token: 128161 '<|reserved_special_token_153|>' is not marked as EOG
load: control token: 128226 '<|reserved_special_token_218|>' is not marked as EOG
load: control token: 128159 '<|reserved_special_token_151|>' is not marked as EOG
load: control token: 128012 '<|reserved_special_token_4|>' is not marked as EOG
load: control token: 128088 '<|reserved_special_token_80|>' is not marked as EOG
load: control token: 128163 '<|reserved_special_token_155|>' is not marked as EOG
load: control token: 128001 '<|end_of_text|>' is not marked as EOG
load: control token: 128113 '<|reserved_special_token_105|>' is not marked as EOG
load: control token: 128250 '<|reserved_special_token_242|>' is not marked as EOG
load: control token: 128125 '<|reserved_special_token_117|>' is not marked as EOG
load: control token: 128053 '<|reserved_special_token_45|>' is not marked as EOG
load: control token: 128224 '<|reserved_special_token_216|>' is not marked as EOG
load: control token: 128247 '<|reserved_special_token_239|>' is not marked as EOG
load: control token: 128251 '<|reserved_special_token_243|>' is not marked as EOG
load: control token: 128216 '<|reserved_special_token_208|>' is not marked as EOG
load: control token: 128006 '<|start_header_id|>' is not marked as EOG
load: control token: 128211 '<|reserved_special_token_203|>' is not marked as EOG
load: control token: 128077 '<|reserved_special_token_69|>' is not marked as EOG
load: control token: 128237 '<|reserved_special_token_229|>' is not marked as EOG
load: control token: 128086 '<|reserved_special_token_78|>' is not marked as EOG
load: control token: 128227 '<|reserved_special_token_219|>' is not marked as EOG
load: control token: 128058 '<|reserved_special_token_50|>' is not marked as EOG
load: control token: 128100 '<|reserved_special_token_92|>' is not marked as EOG
load: control token: 128209 '<|reserved_special_token_201|>' is not marked as EOG
load: control token: 128084 '<|reserved_special_token_76|>' is not marked as EOG
load: control token: 128071 '<|reserved_special_token_63|>' is not marked as EOG
load: control token: 128070 '<|reserved_special_token_62|>' is not marked as EOG
load: control token: 128049 '<|reserved_special_token_41|>' is not marked as EOG
load: control token: 128197 '<|reserved_special_token_189|>' is not marked as EOG
load: control token: 128072 '<|reserved_special_token_64|>' is not marked as EOG
load: control token: 128000 '<|begin_of_text|>' is not marked as EOG
load: control token: 128223 '<|reserved_special_token_215|>' is not marked as EOG
load: control token: 128217 '<|reserved_special_token_209|>' is not marked as EOG
load: control token: 128111 '<|reserved_special_token_103|>' is not marked as EOG
load: control token: 128203 '<|reserved_special_token_195|>' is not marked as EOG
load: control token: 128051 '<|reserved_special_token_43|>' is not marked as EOG
load: control token: 128030 '<|reserved_special_token_22|>' is not marked as EOG
load: control token: 128117 '<|reserved_special_token_109|>' is not marked as EOG
load: control token: 128010 '<|python_tag|>' is not marked as EOG
load: control token: 128238 '<|reserved_special_token_230|>' is not marked as EOG
load: control token: 128255 '<|reserved_special_token_247|>' is not marked as EOG
load: control token: 128202 '<|reserved_special_token_194|>' is not marked as EOG
load: control token: 128132 '<|reserved_special_token_124|>' is not marked as EOG
load: control token: 128248 '<|reserved_special_token_240|>' is not marked as EOG
load: control token: 128167 '<|reserved_special_token_159|>' is not marked as EOG
load: control token: 128127 '<|reserved_special_token_119|>' is not marked as EOG
load: control token: 128105 '<|reserved_special_token_97|>' is not marked as EOG
load: control token: 128039 '<|reserved_special_token_31|>' is not marked as EOG
load: control token: 128232 '<|reserved_special_token_224|>' is not marked as EOG
load: control token: 128166 '<|reserved_special_token_158|>' is not marked as EOG
load: control token: 128130 '<|reserved_special_token_122|>' is not marked as EOG
load: control token: 128114 '<|reserved_special_token_106|>' is not marked as EOG
load: control token: 128234 '<|reserved_special_token_226|>' is not marked as EOG
load: control token: 128191 '<|reserved_special_token_183|>' is not marked as EOG
load: control token: 128064 '<|reserved_special_token_56|>' is not marked as EOG
load: control token: 128140 '<|reserved_special_token_132|>' is not marked as EOG
load: control token: 128096 '<|reserved_special_token_88|>' is not marked as EOG
load: control token: 128098 '<|reserved_special_token_90|>' is not marked as EOG
load: control token: 128192 '<|reserved_special_token_184|>' is not marked as EOG
load: control token: 128093 '<|reserved_special_token_85|>' is not marked as EOG
load: control token: 128150 '<|reserved_special_token_142|>' is not marked as EOG
load: control token: 128222 '<|reserved_special_token_214|>' is not marked as EOG
load: control token: 128233 '<|reserved_special_token_225|>' is not marked as EOG
load: control token: 128220 '<|reserved_special_token_212|>' is not marked as EOG
load: control token: 128034 '<|reserved_special_token_26|>' is not marked as EOG
load: control token: 128033 '<|reserved_special_token_25|>' is not marked as EOG
load: control token: 128253 '<|reserved_special_token_245|>' is not marked as EOG
load: control token: 128195 '<|reserved_special_token_187|>' is not marked as EOG
load: control token: 128099 '<|reserved_special_token_91|>' is not marked as EOG
load: control token: 128189 '<|reserved_special_token_181|>' is not marked as EOG
load: control token: 128210 '<|reserved_special_token_202|>' is not marked as EOG
load: control token: 128174 '<|reserved_special_token_166|>' is not marked as EOG
load: control token: 128083 '<|reserved_special_token_75|>' is not marked as EOG
load: control token: 128080 '<|reserved_special_token_72|>' is not marked as EOG
load: control token: 128104 '<|reserved_special_token_96|>' is not marked as EOG
load: control token: 128082 '<|reserved_special_token_74|>' is not marked as EOG
load: control token: 128219 '<|reserved_special_token_211|>' is not marked as EOG
load: control token: 128017 '<|reserved_special_token_9|>' is not marked as EOG
load: control token: 128050 '<|reserved_special_token_42|>' is not marked as EOG
load: control token: 128205 '<|reserved_special_token_197|>' is not marked as EOG
load: control token: 128047 '<|reserved_special_token_39|>' is not marked as EOG
load: control token: 128164 '<|reserved_special_token_156|>' is not marked as EOG
load: control token: 128020 '<|reserved_special_token_12|>' is not marked as EOG
load: control token: 128069 '<|reserved_special_token_61|>' is not marked as EOG
load: control token: 128245 '<|reserved_special_token_237|>' is not marked as EOG
load: control token: 128121 '<|reserved_special_token_113|>' is not marked as EOG
load: control token: 128079 '<|reserved_special_token_71|>' is not marked as EOG
load: control token: 128037 '<|reserved_special_token_29|>' is not marked as EOG
load: control token: 128244 '<|reserved_special_token_236|>' is not marked as EOG
load: control token: 128029 '<|reserved_special_token_21|>' is not marked as EOG
load: control token: 128221 '<|reserved_special_token_213|>' is not marked as EOG
load: control token: 128066 '<|reserved_special_token_58|>' is not marked as EOG
load: control token: 128120 '<|reserved_special_token_112|>' is not marked as EOG
load: control token: 128014 '<|reserved_special_token_6|>' is not marked as EOG
load: control token: 128025 '<|reserved_special_token_17|>' is not marked as EOG
load: control token: 128126 '<|reserved_special_token_118|>' is not marked as EOG
load: special tokens cache size = 256
load: token to piece cache size = 0.7999 MB
print_info: arch             = llama
print_info: vocab_only       = 0
print_info: n_ctx_train      = 131072
print_info: n_embd           = 4096
print_info: n_layer          = 32
print_info: n_head           = 32
print_info: n_head_kv        = 8
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 4
print_info: n_embd_k_gqa     = 1024
print_info: n_embd_v_gqa     = 1024
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-05
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: n_ff             = 14336
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 0
print_info: rope scaling     = linear
print_info: freq_base_train  = 500000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 131072
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = 8B
print_info: model params     = 8.03 B
print_info: general.name     = Meta Llama 3.1 8B Instruct
print_info: vocab type       = BPE
print_info: n_vocab          = 128256
print_info: n_merges         = 280147
print_info: BOS token        = 128000 '<|begin_of_text|>'
print_info: EOS token        = 128009 '<|eot_id|>'
print_info: EOT token        = 128009 '<|eot_id|>'
print_info: EOM token        = 128008 '<|eom_id|>'
print_info: PAD token        = 128009 '<|eot_id|>'
print_info: LF token         = 198 'Ċ'
print_info: EOG token        = 128008 '<|eom_id|>'
print_info: EOG token        = 128009 '<|eot_id|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: layer   0 assigned to device CPU
load_tensors: layer   1 assigned to device CPU
load_tensors: layer   2 assigned to device CPU
load_tensors: layer   3 assigned to device CPU
load_tensors: layer   4 assigned to device CPU
load_tensors: layer   5 assigned to device CPU
load_tensors: layer   6 assigned to device CPU
load_tensors: layer   7 assigned to device CPU
load_tensors: layer   8 assigned to device CPU
load_tensors: layer   9 assigned to device CPU
load_tensors: layer  10 assigned to device CPU
load_tensors: layer  11 assigned to device CPU
load_tensors: layer  12 assigned to device CPU
load_tensors: layer  13 assigned to device CPU
load_tensors: layer  14 assigned to device CPU
load_tensors: layer  15 assigned to device CPU
load_tensors: layer  16 assigned to device CPU
load_tensors: layer  17 assigned to device CPU
load_tensors: layer  18 assigned to device CPU
load_tensors: layer  19 assigned to device CPU
load_tensors: layer  20 assigned to device CPU
load_tensors: layer  21 assigned to device CPU
load_tensors: layer  22 assigned to device CUDA0
load_tensors: layer  23 assigned to device CUDA0
load_tensors: layer  24 assigned to device CUDA0
load_tensors: layer  25 assigned to device CUDA0
load_tensors: layer  26 assigned to device CUDA0
load_tensors: layer  27 assigned to device CUDA0
load_tensors: layer  28 assigned to device CUDA0
load_tensors: layer  29 assigned to device CUDA0
load_tensors: layer  30 assigned to device CUDA0
load_tensors: layer  31 assigned to device CUDA0
load_tensors: layer  32 assigned to device CPU
load_tensors: tensor 'token_embd.weight' (q4_K) (and 222 others) cannot be used with preferred buffer type CPU_AARCH64, using CPU instead
load_tensors: offloading 10 repeating layers to GPU
load_tensors: offloaded 10/33 layers to GPU
load_tensors:        CUDA0 model buffer size =  1263.13 MiB
load_tensors:   CPU_Mapped model buffer size =  4685.30 MiB
.......................................................................................
llama_init_from_model: n_seq_max     = 1
llama_init_from_model: n_ctx         = 131072
llama_init_from_model: n_ctx_per_seq = 131072
llama_init_from_model: n_batch       = 2048
llama_init_from_model: n_ubatch      = 512
llama_init_from_model: flash_attn    = 1
llama_init_from_model: freq_base     = 500000.0
llama_init_from_model: freq_scale    = 1
llama_kv_cache_init: kv_size = 131072, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 32, can_shift = 1
llama_kv_cache_init: layer 0: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 1: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 2: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 3: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 4: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 5: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 6: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 7: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 8: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 9: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 10: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 11: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 12: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 13: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 14: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 15: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 16: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 17: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 18: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 19: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 20: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 21: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 22: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 23: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 24: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 25: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 26: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 27: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 28: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 29: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 30: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 31: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init:      CUDA0 KV buffer size =  5120.00 MiB
llama_kv_cache_init:        CPU KV buffer size = 11264.00 MiB
llama_init_from_model: KV self size  = 16384.00 MiB, K (f16): 8192.00 MiB, V (f16): 8192.00 MiB
llama_init_from_model:        CPU  output buffer size =     0.49 MiB
llama_init_from_model:      CUDA0 compute buffer size =   920.00 MiB
llama_init_from_model:  CUDA_Host compute buffer size =   264.01 MiB
llama_init_from_model: graph nodes  = 903
llama_init_from_model: graph splits = 247 (with bs=512), 3 (with bs=1)
common_init_from_params: setting dry_penalty_last_n to ctx_size = 131072
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
srv          init: initializing slots, n_slots = 1
slot         init: id  0 | task -1 | new slot n_ctx_slot = 131072
slot        reset: id  0 | task -1 | 
main: model loaded
main: chat template, chat_template: {% for message in messages %}
{% if message['role'] == 'user' or message['role'] == 'system' %}
{{ '<|start_header_id|>' + message['role'] + '<|end_header_id|>

' + message['content'] + '<|eot_id|>' }}{% elif message['role'] == 'tool' %}
{{ '<|start_header_id|>' + message['role'] + '<|end_header_id|>

' + message['content'] + '<|eot_id|>' }}{% else %}
{{ '<|start_header_id|>' + message['role'] + '<|end_header_id|>

'}}{% if message['content'] is not none %}
{{ '>>>all
' + message['content'] }}{% endif %}
{% if 'tool_calls' in message and message['tool_calls'] is not none %}
{% for tool_call in message['tool_calls'] %}
{{ '>>>' + tool_call['function']['name'] + '
' + tool_call['function']['arguments'] }}{% endfor %}
{% endif %}
{{ '<|eot_id|>' }}{% endif %}
{% endfor %}
{% if add_generation_prompt %}{{ '<|start_header_id|>{role}<|end_header_id|>

' }}{% endif %}, example_format: '<|start_header_id|>system<|end_header_id|>

You are a helpful assistant<|eot_id|><|start_header_id|>user<|end_header_id|>

Hello<|eot_id|><|start_header_id|>assistant<|end_header_id|>

>>>all
Hi there<|eot_id|><|start_header_id|>user<|end_header_id|>

How are you?<|eot_id|><|start_header_id|>{role}<|end_header_id|>

'
main: server is listening on http://0.0.0.0:8080 - starting the main loop
que    start_loop: processing new tasks
que    start_loop: update slots
srv  update_slots: all slots are idle
srv  kv_cache_cle: clearing KV cache
que    start_loop: waiting for new tasks
request: {
"model": "gpt-3.5-turbo",
"tools": [
    {
    "type":"function",
    "function":{
        "name":"python",
        "description":"Runs code in an ipython interpreter and returns the result of the execution after 60 seconds.",
        "parameters":{
        "type":"object",
        "properties":{
            "code":{
            "type":"string",
            "description":"The code to run in the ipython interpreter."
            }
        },
        "required":["code"]
        }
    }
    }
],
"messages": [
    {
    "role": "user",
    "content": "Hi"
    }
]
}
Template supports tool calls but does not natively describe tools. The fallback behaviour used may produce bad results, inspect prompt w/ --verbose & consider overriding the template.
srv  params_from_: Grammar: char ::= [^"\\\x7F\x00-\x1F] | [\\] (["\\bfnrt] | "u" [0-9a-fA-F]{4})
first-tool-call ::= python-call
python-args ::= "{" space python-args-code-kv "}" space
python-args-code-kv ::= "\"code\"" space ":" space string
python-call ::= "python\n" python-args
python-call2 ::= ">>>python\n" python-args
root ::= first-tool-call space
space ::= | " " | "\n" [ \t]{0,20}
string ::= "\"" char* "\"" space

srv  params_from_: Grammar lazy: true
srv  params_from_: Chat format: Functionary v3.2
srv  params_from_: Grammar trigger token: 12958 (`python`)
srv  params_from_: Grammar trigger word: `>>>python`
srv  add_waiting_: add task 0 to waiting list. current waiting = 0 (before add)
que          post: new task, id = 0/1, front = 0
que    start_loop: processing new tasks
que    start_loop: processing task, id = 0
slot get_availabl: id  0 | task -1 | selected slot by lru, t_last = -1
slot        reset: id  0 | task -1 | 
slot launch_slot_: id  0 | task 0 | launching slot : {"id":0,"id_task":0,"n_ctx":131072,"speculative":false,"is_processing":false,"non_causal":false,"params":{"n_predict":-1,"seed":4294967295,"temperature":0.800000011920929,"dynatemp_range":0.0,"dynatemp_exponent":1.0,"top_k":40,"top_p":0.949999988079071,"min_p":0.05000000074505806,"xtc_probability":0.0,"xtc_threshold":0.10000000149011612,"typical_p":1.0,"repeat_last_n":64,"repeat_penalty":1.0,"presence_penalty":0.0,"frequency_penalty":0.0,"dry_multiplier":0.0,"dry_base":1.75,"dry_allowed_length":2,"dry_penalty_last_n":131072,"dry_sequence_breakers":["\n",":","\"","*"],"mirostat":0,"mirostat_tau":5.0,"mirostat_eta":0.10000000149011612,"stop":[],"max_tokens":-1,"n_keep":0,"n_discard":0,"ignore_eos":false,"stream":false,"logit_bias":[],"n_probs":0,"min_keep":0,"grammar":"char ::= [^\"\\\\\\x7F\\x00-\\x1F] | [\\\\] ([\"\\\\bfnrt] | \"u\" [0-9a-fA-F]{4})\nfirst-tool-call ::= python-call\npython-args ::= \"{\" space python-args-code-kv \"}\" space\npython-args-code-kv ::= \"\\\"code\\\"\" space \":\" space string\npython-call ::= \"python\\n\" python-args\npython-call2 ::= \">>>python\\n\" python-args\nroot ::= first-tool-call space\nspace ::= | \" \" | \"\\n\" [ \\t]{0,20}\nstring ::= \"\\\"\" char* \"\\\"\" space\n","grammar_trigger_words":[">>>python"],"grammar_trigger_tokens":[12958],"preserved_tokens":[12958],"chat_format":"Functionary v3.2","samplers":["penalties","dry","top_k","typ_p","top_p","min_p","xtc","temperature"],"speculative.n_max":16,"speculative.n_min":0,"speculative.p_min":0.75,"timings_per_token":false,"post_sampling_probs":false,"lora":[]},"prompt":"<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou can call any of the following tools to satisfy the user's requests: [\n  {\n    \"type\": \"function\",\n    \"function\": {\n      \"name\": \"python\",\n      \"description\": \"Runs code in an ipython interpreter and returns the result of the execution after 60 seconds.\",\n      
\"parameters\": {\n        \"type\": \"object\",\n        \"properties\": {\n          \"code\": {\n            \"type\": \"string\",\n            \"description\": \"The code to run in the ipython interpreter.\"\n          }\n        },\n        \"required\": [\n          \"code\"\n        ]\n      }\n    }\n  }\n]\n\nExample tool call syntax:\n\nassistant<|end_header_id|>\n\n>>>tool_name\n{\"arg1\": \"some_value\"}<|eot_id|>\n\n<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nHi<|eot_id|><|start_header_id|>{role}<|end_header_id|>\n\n","next_token":{"has_next_token":true,"has_new_line":false,"n_remain":-1,"n_decoded":0,"stopping_word":""}}
slot launch_slot_: id  0 | task 0 | processing task
que    start_loop: update slots
srv  update_slots: posting NEXT_RESPONSE
que          post: new task, id = 1, front = 0
slot update_slots: id  0 | task 0 | new prompt, n_ctx_slot = 131072, n_keep = 0, n_prompt_tokens = 167
slot update_slots: id  0 | task 0 | prompt token   0: 128000 '<|begin_of_text|>'
slot update_slots: id  0 | task 0 | prompt token   1: 128006 '<|start_header_id|>'
slot update_slots: id  0 | task 0 | prompt token   2:   9125 'system'
slot update_slots: id  0 | task 0 | prompt token   3: 128007 '<|end_header_id|>'
slot update_slots: id  0 | task 0 | prompt token   4:    271 '

'
slot update_slots: id  0 | task 0 | prompt token   5:   2675 'You'
slot update_slots: id  0 | task 0 | prompt token   6:    649 ' can'
slot update_slots: id  0 | task 0 | prompt token   7:   1650 ' call'
slot update_slots: id  0 | task 0 | prompt token   8:    904 ' any'
slot update_slots: id  0 | task 0 | prompt token   9:    315 ' of'
slot update_slots: id  0 | task 0 | prompt token  10:    279 ' the'
slot update_slots: id  0 | task 0 | prompt token  11:   2768 ' following'
slot update_slots: id  0 | task 0 | prompt token  12:   7526 ' tools'
slot update_slots: id  0 | task 0 | prompt token  13:    311 ' to'
slot update_slots: id  0 | task 0 | prompt token  14:  27651 ' satisfy'
slot update_slots: id  0 | task 0 | prompt token  15:    279 ' the'
slot update_slots: id  0 | task 0 | kv cache rm [0, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_past = 167, n_tokens = 167, progress = 1.000000
slot update_slots: id  0 | task 0 | prompt done, n_past = 167, n_tokens = 167
srv  update_slots: decoding batch, n_tokens = 167
Grammar still awaiting trigger after token 78191 (`assistant`)
slot process_toke: id  0 | task 0 | n_decoded = 1, n_remaining = -1, next token: 78191 'assistant'
srv  update_slots: run slots completed
que    start_loop: waiting for new tasks
que    start_loop: processing new tasks
que    start_loop: processing task, id = 1
que    start_loop: update slots
srv  update_slots: posting NEXT_RESPONSE
que          post: new task, id = 2, front = 0
slot update_slots: id  0 | task 0 | slot decode token, n_ctx = 131072, n_past = 168, n_cache_tokens = 168, truncated = 0
srv  update_slots: decoding batch, n_tokens = 1
Grammar still awaiting trigger after token 198 (`
`)
slot process_toke: id  0 | task 0 | n_decoded = 2, n_remaining = -1, next token:   198 '
'
srv  update_slots: run slots completed
que    start_loop: waiting for new tasks
que    start_loop: processing new tasks
que    start_loop: processing task, id = 2
que    start_loop: update slots
srv  update_slots: posting NEXT_RESPONSE
que          post: new task, id = 3, front = 0
slot update_slots: id  0 | task 0 | slot decode token, n_ctx = 131072, n_past = 169, n_cache_tokens = 169, truncated = 0
srv  update_slots: decoding batch, n_tokens = 1
Grammar still awaiting trigger after token 9906 (`Hello`)
slot process_toke: id  0 | task 0 | n_decoded = 3, n_remaining = -1, next token:  9906 'Hello'
srv  update_slots: run slots completed
que    start_loop: waiting for new tasks
que    start_loop: processing new tasks
que    start_loop: processing task, id = 3
que    start_loop: update slots
srv  update_slots: posting NEXT_RESPONSE
que          post: new task, id = 4, front = 0
slot update_slots: id  0 | task 0 | slot decode token, n_ctx = 131072, n_past = 170, n_cache_tokens = 170, truncated = 0
srv  update_slots: decoding batch, n_tokens = 1
Grammar still awaiting trigger after token 0 (`!`)
slot process_toke: id  0 | task 0 | n_decoded = 4, n_remaining = -1, next token:     0 '!'
srv  update_slots: run slots completed
que    start_loop: waiting for new tasks
que    start_loop: processing new tasks
que    start_loop: processing task, id = 4
que    start_loop: update slots
srv  update_slots: posting NEXT_RESPONSE
que          post: new task, id = 5, front = 0
slot update_slots: id  0 | task 0 | slot decode token, n_ctx = 131072, n_past = 171, n_cache_tokens = 171, truncated = 0
srv  update_slots: decoding batch, n_tokens = 1
Grammar still awaiting trigger after token 2650 (` How`)
slot process_toke: id  0 | task 0 | n_decoded = 5, n_remaining = -1, next token:  2650 ' How'
srv  update_slots: run slots completed
que    start_loop: waiting for new tasks
que    start_loop: processing new tasks
que    start_loop: processing task, id = 5
que    start_loop: update slots
srv  update_slots: posting NEXT_RESPONSE
que          post: new task, id = 6, front = 0
slot update_slots: id  0 | task 0 | slot decode token, n_ctx = 131072, n_past = 172, n_cache_tokens = 172, truncated = 0
srv  update_slots: decoding batch, n_tokens = 1
Grammar still awaiting trigger after token 649 (` can`)
slot process_toke: id  0 | task 0 | n_decoded = 6, n_remaining = -1, next token:   649 ' can'
srv  update_slots: run slots completed
que    start_loop: waiting for new tasks
que    start_loop: processing new tasks
que    start_loop: processing task, id = 6
que    start_loop: update slots
srv  update_slots: posting NEXT_RESPONSE
que          post: new task, id = 7, front = 0
slot update_slots: id  0 | task 0 | slot decode token, n_ctx = 131072, n_past = 173, n_cache_tokens = 173, truncated = 0
srv  update_slots: decoding batch, n_tokens = 1
Grammar still awaiting trigger after token 358 (` I`)
slot process_toke: id  0 | task 0 | n_decoded = 7, n_remaining = -1, next token:   358 ' I'
srv  update_slots: run slots completed
que    start_loop: waiting for new tasks
que    start_loop: processing new tasks
que    start_loop: processing task, id = 7
que    start_loop: update slots
srv  update_slots: posting NEXT_RESPONSE
que          post: new task, id = 8, front = 0
slot update_slots: id  0 | task 0 | slot decode token, n_ctx = 131072, n_past = 174, n_cache_tokens = 174, truncated = 0
srv  update_slots: decoding batch, n_tokens = 1
Grammar still awaiting trigger after token 7945 (` assist`)
slot process_toke: id  0 | task 0 | n_decoded = 8, n_remaining = -1, next token:  7945 ' assist'
srv  update_slots: run slots completed
que    start_loop: waiting for new tasks
que    start_loop: processing new tasks
que    start_loop: processing task, id = 8
que    start_loop: update slots
srv  update_slots: posting NEXT_RESPONSE
que          post: new task, id = 9, front = 0
slot update_slots: id  0 | task 0 | slot decode token, n_ctx = 131072, n_past = 175, n_cache_tokens = 175, truncated = 0
srv  update_slots: decoding batch, n_tokens = 1
Grammar still awaiting trigger after token 499 (` you`)
slot process_toke: id  0 | task 0 | n_decoded = 9, n_remaining = -1, next token:   499 ' you'
srv  update_slots: run slots completed
que    start_loop: waiting for new tasks
que    start_loop: processing new tasks
que    start_loop: processing task, id = 9
que    start_loop: update slots
srv  update_slots: posting NEXT_RESPONSE
que          post: new task, id = 10, front = 0
slot update_slots: id  0 | task 0 | slot decode token, n_ctx = 131072, n_past = 176, n_cache_tokens = 176, truncated = 0
srv  update_slots: decoding batch, n_tokens = 1
Grammar still awaiting trigger after token 3432 (` today`)
slot process_toke: id  0 | task 0 | n_decoded = 10, n_remaining = -1, next token:  3432 ' today'
srv  update_slots: run slots completed
que    start_loop: waiting for new tasks
que    start_loop: processing new tasks
que    start_loop: processing task, id = 10
que    start_loop: update slots
srv  update_slots: posting NEXT_RESPONSE
que          post: new task, id = 11, front = 0
slot update_slots: id  0 | task 0 | slot decode token, n_ctx = 131072, n_past = 177, n_cache_tokens = 177, truncated = 0
srv  update_slots: decoding batch, n_tokens = 1
Grammar still awaiting trigger after token 30 (`?`)
slot process_toke: id  0 | task 0 | n_decoded = 11, n_remaining = -1, next token:    30 '?'
srv  update_slots: run slots completed
que    start_loop: waiting for new tasks
que    start_loop: processing new tasks
que    start_loop: processing task, id = 11
que    start_loop: update slots
srv  update_slots: posting NEXT_RESPONSE
que          post: new task, id = 12, front = 0
slot update_slots: id  0 | task 0 | slot decode token, n_ctx = 131072, n_past = 178, n_cache_tokens = 178, truncated = 0
srv  update_slots: decoding batch, n_tokens = 1
Grammar still awaiting trigger after token 128009 (`<|eot_id|>`)
slot process_toke: id  0 | task 0 | stopped by EOS
slot process_toke: id  0 | task 0 | n_decoded = 12, n_remaining = -1, next token: 128009 ''
slot      release: id  0 | task 0 | stop processing: n_past = 178, truncated = 0
slot print_timing: id  0 | task 0 | 
prompt eval time =     352.32 ms /   167 tokens (    2.11 ms per token,   474.00 tokens per second)
       eval time =     656.33 ms /    12 tokens (   54.69 ms per token,    18.28 tokens per second)
      total time =    1008.65 ms /   179 tokens
srv          send: sending result for task id = 0
srv          send: task id = 0 pushed to result queue
srv  update_slots: run slots completed
que    start_loop: waiting for new tasks
que    start_loop: processing new tasks
que    start_loop: processing task, id = 12
que    start_loop: update slots
srv  update_slots: all slots are idle
que    start_loop: waiting for new tasks
srv  to_json_oaic: Parsing chat message: assistant
Hello! How can I assist you today?
Failed to parse functionary v3.2 input: Failed to parse json tool call arguments: assistant
Hello! How can I assist you today?
srv  remove_waiti: remove task 0 from waiting list. current waiting = 1 (before remove)
srv  log_server_r: request: POST /v1/chat/completions 172.17.0.1 200
srv  log_server_r: request:  {
"model": "gpt-3.5-turbo",
"tools": [
    {
    "type":"function",
    "function":{
        "name":"python",
        "description":"Runs code in an ipython interpreter and returns the result of the execution after 60 seconds.",
        "parameters":{
        "type":"object",
        "properties":{
            "code":{
            "type":"string",
            "description":"The code to run in the ipython interpreter."
            }
        },
        "required":["code"]
        }
    }
    }
],
"messages": [
    {
    "role": "user",
    "content": "Hi"
    }
]
}
srv  log_server_r: response: {"choices":[{"finish_reason":"stop","index":0,"message":{"role":"assistant","content":"assistant\nHello! How can I assist you today?"}}],"created":1741214074,"model":"gpt-3.5-turbo","system_fingerprint":"b4783-a800ae46","object":"chat.completion","usage":{"completion_tokens":12,"prompt_tokens":167,"total_tokens":179},"id":"chatcmpl-mjWGBGon48tiZPymyQkXF1obPF9DKskh","__verbose":{"index":0,"content":"assistant\nHello! How can I assist you today?","tokens":[],"id_slot":0,"stop":true,"model":"gpt-3.5-turbo","tokens_predicted":12,"tokens_evaluated":167,"generation_settings":{"n_predict":-1,"seed":4294967295,"temperature":0.800000011920929,"dynatemp_range":0.0,"dynatemp_exponent":1.0,"top_k":40,"top_p":0.949999988079071,"min_p":0.05000000074505806,"xtc_probability":0.0,"xtc_threshold":0.10000000149011612,"typical_p":1.0,"repeat_last_n":64,"repeat_penalty":1.0,"presence_penalty":0.0,"frequency_penalty":0.0,"dry_multiplier":0.0,"dry_base":1.75,"dry_allowed_length":2,"dry_penalty_last_n":131072,"dry_sequence_breakers":["\n",":","\"","*"],"mirostat":0,"mirostat_tau":5.0,"mirostat_eta":0.10000000149011612,"stop":[],"max_tokens":-1,"n_keep":0,"n_discard":0,"ignore_eos":false,"stream":false,"logit_bias":[],"n_probs":0,"min_keep":0,"grammar":"char ::= [^\"\\\\\\x7F\\x00-\\x1F] | [\\\\] ([\"\\\\bfnrt] | \"u\" [0-9a-fA-F]{4})\nfirst-tool-call ::= python-call\npython-args ::= \"{\" space python-args-code-kv \"}\" space\npython-args-code-kv ::= \"\\\"code\\\"\" space \":\" space string\npython-call ::= \"python\\n\" python-args\npython-call2 ::= \">>>python\\n\" python-args\nroot ::= first-tool-call space\nspace ::= | \" \" | \"\\n\" [ \\t]{0,20}\nstring ::= \"\\\"\" char* \"\\\"\" space\n","grammar_trigger_words":[">>>python"],"grammar_trigger_tokens":[12958],"preserved_tokens":[12958],"chat_format":"Functionary 
v3.2","samplers":["penalties","dry","top_k","typ_p","top_p","min_p","xtc","temperature"],"speculative.n_max":16,"speculative.n_min":0,"speculative.p_min":0.75,"timings_per_token":false,"post_sampling_probs":false,"lora":[]},"prompt":"<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou can call any of the following tools to satisfy the user's requests: [\n  {\n    \"type\": \"function\",\n    \"function\": {\n      \"name\": \"python\",\n      \"description\": \"Runs code in an ipython interpreter and returns the result of the execution after 60 seconds.\",\n      \"parameters\": {\n        \"type\": \"object\",\n        \"properties\": {\n          \"code\": {\n            \"type\": \"string\",\n            \"description\": \"The code to run in the ipython interpreter.\"\n          }\n        },\n        \"required\": [\n          \"code\"\n        ]\n      }\n    }\n  }\n]\n\nExample tool call syntax:\n\nassistant<|end_header_id|>\n\n>>>tool_name\n{\"arg1\": \"some_value\"}<|eot_id|>\n\n<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nHi<|eot_id|><|start_header_id|>{role}<|end_header_id|>\n\n","has_new_line":true,"truncated":false,"stop_type":"eos","stopping_word":"","tokens_cached":178,"timings":{"prompt_n":167,"prompt_ms":352.317,"prompt_per_token_ms":2.109682634730539,"prompt_per_second":474.00494441085726,"predicted_n":12,"predicted_ms":656.331,"predicted_per_token_ms":54.694250000000004,"predicted_per_second":18.283457584663836}},"timings":{"prompt_n":167,"prompt_ms":352.317,"prompt_per_token_ms":2.109682634730539,"prompt_per_second":474.00494441085726,"predicted_n":12,"predicted_ms":656.331,"predicted_per_token_ms":54.694250000000004,"predicted_per_second":18.283457584663836}}
@edmcman
Author

edmcman commented Mar 5, 2025

@ochafik

@ochafik
Collaborator

ochafik commented Mar 6, 2025

Hi @edmcman, thanks for reporting this!

So, the template stored in this GGUF model is bogus: it uses a {role} placeholder that has no meaning to Jinja, so it is emitted into the prompt as literal text.

Compare this bit of the verbose log:

{% if add_generation_prompt %}{{ '<|start_header_id|>{role}<|end_header_id|>

' }}{% endif %}

with the end of the output of the official / original (maybe updated) template:

python scripts/get_chat_template.py meetkai/functionary-small-v3.2 tool_use 
...
{% if add_generation_prompt %}{{ '<|start_header_id|>assistant<|end_header_id|>\n\n>>>' }}{% endif %}

You can see (if you don't filter your jq output) that the prompt ends with "<|start_header_id|>{role}<|end_header_id|>\n\n", which just confuses the model. It adds an assistant\n prefix to try and set the record straight, which is... sweet I guess?
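If you want to check a template (or a rendered prompt) for this kind of leftover, here is a minimal sketch using only the Python standard library. The helper name `find_stray_placeholders` and the regex are my own assumptions for illustration, not part of llama.cpp; it simply looks for single-brace `{name}` tokens that are not Jinja `{{ ... }}`, `{% ... %}` or `{# ... #}` delimiters.

```python
import re

# Single-brace placeholders such as {role}. The lookbehind/lookahead exclude
# Jinja's own delimiters: {{ ... }}, {% ... %} and {# ... #}.
# (Hypothetical helper for illustration; not a llama.cpp API.)
PLACEHOLDER = re.compile(r"(?<![{%#])\{([A-Za-z_]\w*)\}(?!\})")


def find_stray_placeholders(text: str) -> list[str]:
    """Return the names of single-brace placeholders found in text."""
    return PLACEHOLDER.findall(text)


# Generation-prompt line from the GGUF's embedded template (verbose log above).
broken = "{% if add_generation_prompt %}{{ '<|start_header_id|>{role}<|end_header_id|>\n\n' }}{% endif %}"
# Same line from the original template.
fixed = "{% if add_generation_prompt %}{{ '<|start_header_id|>assistant<|end_header_id|>\n\n>>>' }}{% endif %}"

print(find_stray_placeholders(broken))  # ['role']
print(find_stray_placeholders(fixed))   # []
```

The same function can be pointed at the `prompt` field of the verbose response, which is where the literal `{role}` shows up in this report.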

Try this instead:

llama-server --jinja -fa \
  -hf bartowski/functionary-small-v3.2-GGUF:Q4_K_M \
  --chat-template-file models/templates/meetkai-functionary-medium-v3.2.jinja

(Most of the functionary v3.2 tests in test_tool_call.py do use a chat template override, but somehow I forgot to mention the override in the docs. I'll fix this.)
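To confirm the override took effect, one option is to check the response body for the leaked prefix. This is only a sketch: `leaked_role_prefix` is a hypothetical helper name, and the two sample dicts are trimmed copies of the responses quoted in this issue (with and without `tools`).

```python
def leaked_role_prefix(response: dict, role: str = "assistant") -> bool:
    """Return True if the first choice's content starts with the role name and a newline."""
    # content can be null when the reply is a tool call, so fall back to "".
    content = response["choices"][0]["message"].get("content") or ""
    return content.startswith(role + "\n")


# Trimmed from the response logged when tools were provided.
with_tools = {
    "choices": [{"message": {"role": "assistant",
                             "content": "assistant\nHello! How can I assist you today?"}}]
}
# Trimmed from the response to the request without tools.
without_tools = {
    "choices": [{"message": {"role": "assistant",
                             "content": "Hello! How can I help you today?"}}]
}

print(leaked_role_prefix(with_tools))     # True
print(leaked_role_prefix(without_tools))  # False
```

Feed it the parsed JSON from the same curl request as in the report; with the template override in place it should presumably return False for plain text replies.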
