Conversation

@SamuelOliveirads

Hey @F1LM1, it's been my turn to be busy lately, so I haven't made much progress. However, this gave me time to rethink the future of this PR, and I decided to clean up the code for a proper review. Here are my thoughts:

  • The core idea of the original PR was the MTP implementation. Although performance isn't currently better than having it disabled, it is functional, which I see as a solid contribution to the project.
  • This PR has been open for almost four months, and the next set of improvements could take just as long to finish.

I took the latest improvements we made and tested each one with small prompts. This new PR is the result. With my configuration, I'm getting about 84% of the tokens per second compared to running without MTP. I also reviewed the entire project to remove unused code and references, and finally added a command-line argument so everyone can easily try it with or without MTP.

My suggestion is as follows:

  1. You review this cleanup, and if it looks good, merge it into the original PR branch.
  2. Mark the original PR as ready for review. If necessary, I can tag some of the frequent reviewers to take a look.
  3. I will be available to fix any issues found in the original PR and focus on getting it merged. In parallel, I will keep working on the server loop optimization and multi-token MTP in PR #4 (Glm4 mtp optimizations).

F1LM1 and others added 30 commits August 10, 2025 23:52
@F1LM1
Owner

F1LM1 commented Dec 7, 2025

Sounds good to me, I'll look over it more formally over the next 2-3 days.

@wishstudio

I was working on the server loop optimization and just noticed that ngxson has recently created a PR for it: ggml-org#17808

It looks very good and will hopefully be merged soon. After that, perhaps you can start rebasing this entire repo. There has been a lot of traffic since this repo was created, so just merging the changes will already take some work.

By the way, could you share your test method (cmdline, gguf, system configuration)? Since I already have a positive gain using this branch it puzzles me that it is still a regression on your system. If you could share the info I can try to reproduce and see if I can find anything suspicious.

@SamuelOliveirads SamuelOliveirads changed the title from Fix/improve mt performance to Fix/improve mtp performance on Dec 7, 2025
@SamuelOliveirads SamuelOliveirads deleted the fix/improve-mt-performance branch December 7, 2025 22:06
@SamuelOliveirads SamuelOliveirads restored the fix/improve-mt-performance branch December 7, 2025 22:07
@SamuelOliveirads
Author

SamuelOliveirads commented Dec 7, 2025

(The name of the branch bothered me, but unfortunately the PR closes if I try to rename it... well, it's going to be MT for now)

By the way, could you share your test method (cmdline, gguf, system configuration)? Since I already have a positive gain using this branch it puzzles me that it is still a regression on your system. If you could share the info I can try to reproduce and see if I can find anything suspicious.

Sure. I usually don't test under best-case scenarios. For development I use Windows, which incurs a slight performance drop. Right now my specs are a Threadripper 5965WX and two RTX 3090s. I tried months ago with a Ryzen 7 5700X and noticed the same pattern. I wasn't expecting an offload setup to be great, but it still bothers me that performance without MTP is higher. Here are the commands:

cmake -B build -DCMAKE_BUILD_TYPE=Debug -DLLAMA_CURL=OFF -DGGML_CUDA=ON

cmake --build build --config Release -j

.\build\bin\release\llama-server.exe ^
    --model "F:\llm_models\glm-4.5-air_Q4_general\GLM-4.5-Air-IQ4_XS-00001-of-00002.gguf" ^
    --alias GLM-4.5-Air ^
    --ctx-size 36864 ^
    -ctk q8_0 -ctv q8_0 ^
    -fa --verbose ^
    --n-gpu-layers 99 ^
    -b 2048 -ub 1500 ^
    -ot "blk\.(3|4|5|6|7|8|9|10|11|12|13|14|15|16|17|18)\.ffn_.*=CUDA0" ^
    -ot "blk\.(20|21|22|23|24|25|26|27|28|29|30|31|32|33|34)\.ffn_.*=CUDA1" ^
    --override-tensor exps=CPU ^
    -mtp ^
    --threads 24 --threads-batch 36 ^
    --host 127.0.0.1 ^
    --port 8080

I haven't tried using Linux for this branch or with GLM 4.5/4.6 yet, and I also didn't fully optimize for NUMA. That's why I'm eager to see how it performs for others.
llama-bench is not the best tool for testing, at least for me, so I typically try with small prompts and sometimes with code generation or creative writing. On average, small prompts gave 12.19 T/s. Using this branch originally gave me 10.52 T/s, and now it reaches up to 11.05 T/s, which is about 90% of the original performance.

I was working on the server loop optimization and just noticed that ngxson has recently created a PR for it: ggml-org#17808

It looks very good and will hopefully be merged soon. After that, perhaps you can start rebasing this entire repo. There has been a lot of traffic since this repo was created, so just merging the changes will already take some work.

That's great to hear! I made negligible improvements to the server loop—nothing that yielded significant gains. I will update the branch, take a look at the mentioned PR, and then create a new one that will allow more than one draft per loop.

@wishstudio

wishstudio commented Dec 7, 2025

Thank you for the information! I will take a look. Here are my first intuitions:

In a --cpu-moe setup the CPU part is typically memory bound. IIRC the IQ quants are computationally heavier than the regular Qx_0 or Qx_K quants, so using them may shift the workload to being compute bound, which incurs a larger performance penalty.

To validate this, you can either:

  • Try a computationally lighter quant, like Q4_K, or better yet Q8_0.
  • Log the time spent in the MoE FFNs and check whether it is memory bound. You can measure this easily by timing the loop for (int cur_a = 0; cur_a < n_as; ++cur_a) in the ggml_compute_forward_mul_mat_id function in ggml-cpu.c. You can then compare that time with the theoretical time for an FFN block of your quant (calculate the size in bytes of the corresponding weight matrix, then divide it by your actual memory bandwidth, which can be measured with the Intel MLC tool). You may also need to log the dimensions of the ids tensors to do this correctly, since there will be some shared experts across consecutive tokens. In my experiments on my gaming rig, the actual time spent in the FFN blocks is quite close to the theoretical time. (See the timing sketch right after this list.)
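
For reference, here is a minimal sketch of that timing hook, assuming it is pasted around the cur_a loop inside ggml_compute_forward_mul_mat_id in ggml-cpu.c; the logging and the example numbers are illustrative, not existing code. ggml_time_us() is declared in ggml.h, and the fprintf needs <stdio.h> if that is not already included.

    /* Illustrative timing hook around the expert loop; only n_as and cur_a are
       names taken from ggml_compute_forward_mul_mat_id. */
    const int64_t t_start_us = ggml_time_us();

    for (int cur_a = 0; cur_a < n_as; ++cur_a) {
        /* ... existing per-expert mul_mat work, unchanged ... */
    }

    const int64_t t_elapsed_us = ggml_time_us() - t_start_us;

    /* Theoretical floor = bytes of expert weights actually read / measured
       memory bandwidth. E.g. (hypothetical numbers) 60 MiB of expert weights
       at 60 GB/s is roughly 1.05 ms; elapsed times far above that suggest the
       loop is compute bound rather than memory bound. */
    fprintf(stderr, "mul_mat_id: n_as = %d, %lld us\n", n_as, (long long) t_elapsed_us);

Comparing the printed time against that estimate (and logging the ids dimensions, as suggested above) should show whether IQ4_XS is pushing the CPU side into compute-bound territory.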

You mentioned NUMA, which I don't think is the major problem here: although llama.cpp is generally not NUMA-aware, the inter-CCD bandwidth within a single CPU is still much higher than typical inter-socket bandwidth.

The major additional bottleneck on Windows, to my knowledge, is much higher kernel launch overhead. CUDA graphs can eliminate this, but they are currently disabled for split graphs. Anyway, it's irrelevant here because you have basically the same overhead with or without MTP.

Another important limitation on Windows is that it only allows half of physical RAM to be used as pinned memory. That typically hurts prefill performance in --cpu-moe but not tg. But I guess it's also irrelevant to the MTP performance comparison. You can try whether adding --no-mmap makes any difference; without that option the model weights will not use pinned memory.

@F1LM1
Owner

F1LM1 commented Dec 8, 2025

I will need to test more carefully, but I'm also using a CPU-MoE setup and my tg goes from ~10.2 t/s with MTP off to ~11.5 t/s with it on. I haven't yet tried to optimize settings for the on case.

One very loose thought is:

ctx-size 36864

This is a weird number; does that mean you're usually absolutely maxing out RAM/VRAM in base configuration? Loading the MTP layer will require a bit of memory so maybe you're filling up VRAM or something?
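
If it helps, a quick way to sanity-check the headroom is to estimate the KV cache for that context. A rough sketch with placeholder model parameters (the real n_layer and per-layer KV width should be taken from the server startup log, not from these numbers):

    /* Rough KV-cache size estimate; every model parameter below is a
       placeholder, not GLM-4.5-Air's real configuration. */
    #include <stdio.h>

    int main(void) {
        const double n_ctx        = 36864.0;
        const double n_layer      = 48.0;         /* placeholder */
        const double n_embd_kv    = 2.0 * 1024.0; /* K + V elements per layer per token, placeholder */
        const double bytes_per_el = 34.0 / 32.0;  /* q8_0: 34 bytes per block of 32 values */

        const double kv_gib = n_ctx * n_layer * n_embd_kv * bytes_per_el / (1024.0 * 1024.0 * 1024.0);
        printf("approx KV cache: %.2f GiB\n", kv_gib);
        return 0;
    }

Comparing that against the KV buffer sizes the server prints at startup, plus the extra MTP layer weights, should show whether the -mtp run is the one that tips VRAM over the edge.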

@SamuelOliveirads
Author

  • Try a computationally lighter quant, like Q4_K, or better yet Q8_0.
  • Log the time spent in the MoE FFNs and check whether it is memory bound. You can measure this easily by timing the loop for (int cur_a = 0; cur_a < n_as; ++cur_a) in the ggml_compute_forward_mul_mat_id function in ggml-cpu.c. You can then compare that time with the theoretical time for an FFN block of your quant (calculate the size in bytes of the corresponding weight matrix, then divide it by your actual memory bandwidth, which can be measured with the Intel MLC tool). You may also need to log the dimensions of the ids tensors to do this correctly, since there will be some shared experts across consecutive tokens. In my experiments on my gaming rig, the actual time spent in the FFN blocks is quite close to the theoretical time.

That's good to know; I will measure it after I finish the next PR. Right now the performance with the Air variant doesn't bother me, and my go-to is still GLM 4.6 on Linux; I just prefer to code on Windows. Still, it's worth checking whether something is hurting my performance and preventing MTP from beating the non-MTP baseline. I'm also curious to try a Q1 quant just to fit the model fully in GPU and see how it performs.

Another important limitation on Windows is that it only allows half of physical RAM to be used as pinned memory. That typically hurts prefill performance in --cpu-moe but not tg. But I guess it's also irrelevant to the MTP performance comparison. You can try whether adding --no-mmap makes any difference; without that option the model weights will not use pinned memory.

True, and I saw that when I used GLM 4.5 for the first week: it was around 3 t/s, which jumped to 5 t/s after disabling mmap.

This is a weird number; does that mean you're usually absolutely maxing out RAM/VRAM in base configuration? Loading the MTP layer will require a bit of memory so maybe you're filling up VRAM or something?

Yes, when the model was released I used it quite a lot, and that context filled the entire VRAM. Today I don't need it anymore, but because I reused the old args I didn't change them, and as far as I remember I removed some layers from both GPUs to free memory. But I need to double-check whether it is spilling into system memory again and hurting performance.

SamuelOliveirads pushed a commit to SamuelOliveirads/llama.cpp that referenced this pull request Dec 10, 2025
…gml-org#16038)

Initializing RESERVED_NAME in is_reserved_name() is not thread-safe and leads to corrupted memory when used from multiple threads, as can be seen in the ASan trace below. This fixes the initialization to make it thread-safe.

    #0 0x000100abd018 in std::__1::pair<std::__1::__hash_iterator<std::__1::__hash_node<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, void*>*>, bool> std::__1::__hash_table<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::hash<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>, std::__1::equal_to<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>>::__emplace_unique_key_args<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&>(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&) __hash_table:1565
    F1LM1#1 0x000100ab0320 in SchemaConverter::visit(nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&) json-schema-to-grammar.cpp:802
    F1LM1#2 0x000100aafc48 in std::__1::__function::__func<build_grammar(std::__1::function<void (common_grammar_builder const&)> const&, common_grammar_options const&)::$_2, std::__1::allocator<build_grammar(std::__1::function<void (common_grammar_builder const&)> const&, common_grammar_options const&)::$_2>, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> (std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&)>::operator()(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&) function.h:319
    F1LM1#3 0x000100a2c938 in std::__1::__function::__func<common_chat_params_init_llama_3_x(minja::chat_template const&, templates_params const&, bool)::$_0::operator()(common_grammar_builder const&) const::'lambda'(nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&), std::__1::allocator<common_chat_params_init_llama_3_x(minja::chat_template const&, templates_params const&, bool)::$_0::operator()(common_grammar_builder const&) const::'lambda'(nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&)>, void (nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&)>::operator()(nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&) function.h:319
    F1LM1#4 0x000100a139f8 in foreach_function(nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&, std::__1::function<void (nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&)> const&) chat.cpp:762
    F1LM1#5 0x000100a2a7f4 in std::__1::__function::__func<common_chat_params_init_llama_3_x(minja::chat_template const&, templates_params const&, bool)::$_0, std::__1::allocator<common_chat_params_init_llama_3_x(minja::chat_template const&, templates_params const&, bool)::$_0>, void (common_grammar_builder const&)>::operator()(common_grammar_builder const&) function.h:319
    F1LM1#6 0x000100aa98f4 in build_grammar(std::__1::function<void (common_grammar_builder const&)> const&, common_grammar_options const&) json-schema-to-grammar.cpp:982
    F1LM1#7 0x0001009c9314 in common_chat_params_init_llama_3_x(minja::chat_template const&, templates_params const&, bool) chat.cpp:1110
    F1LM1#8 0x0001009b8afc in common_chat_templates_apply_jinja(common_chat_templates const*, common_chat_templates_inputs const&) chat.cpp:1992
    ggml-org#9 0x0001009b533c in common_chat_templates_apply(common_chat_templates const*, common_chat_templates_inputs const&) chat.cpp:2074
    ggml-org#10 0x000100810120 in llamacpp_apply_chat_template+0x724 (predict_oai-98384e17fb94e863:arm64+0x100090120)
    ...

==45482==Register values:
 x[0] = 0x00006020004147f8   x[1] = 0x00006080000013c8   x[2] = 0x0000000000000000   x[3] = 0x0000604006289738
 x[4] = 0x0000000000000002   x[5] = 0x0000000000000001   x[6] = 0x04034000004b4000   x[7] = 0x0000000000000001
 x[8] = 0xbebebebebebebebe   x[9] = 0x17d7d7d7d7d7d7d7  x[10] = 0x00000c04000828ff  x[11] = 0x0000000000000001
x[12] = 0x000000002018d383  x[13] = 0x0000000000000000  x[14] = 0xfa0000000000fafa  x[15] = 0x000010700001ffff
x[16] = 0x000000019dc012c0  x[17] = 0x00000001021284f8  x[18] = 0x0000000000000000  x[19] = 0x00000001700acdc0
x[20] = 0x0000000000000002  x[21] = 0x000000002018d384  x[22] = 0x16dd16fd2e731151  x[23] = 0x0000007000020000
x[24] = 0x0000000100c69c08  x[25] = 0x0000000100c69c20  x[26] = 0x00006080000013c7  x[27] = 0x0000000100c69c00
x[28] = 0x00000001700acd60     fp = 0x00000001700aceb0     lr = 0x0000000100abce30     sp = 0x00000001700acd60
AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: SEGV __hash_table:1565 in std::__1::pair<std::__1::__hash_iterator<std::__1::__hash_node<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, void*>*>, bool> std::__1::__hash_table<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::hash<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>, std::__1::equal_to<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>>::__emplace_unique_key_args<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&>(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&)
Thread T5 created by T0 here:
    #0 0x0001020b99d4 in pthread_create+0x5c (libclang_rt.asan_osx_dynamic.dylib:arm64e+0x359d4)
    F1LM1#1 0x000100873910 in std::sys::pal::unix::thread::Thread::new::h77254fdd87a28e05+0x118 (predict_oai-98384e17fb94e863:arm64+0x1000f3910)
    F1LM1#2 0x0001007c7a1c in test::run_test::haeb3c2bcd5ed6cf6+0x76c (predict_oai-98384e17fb94e863:arm64+0x100047a1c)
    F1LM1#3 0x0001007aedb0 in test::console::run_tests_console::he9d142d704f3a986+0x149c (predict_oai-98384e17fb94e863:arm64+0x10002edb0)
    F1LM1#4 0x0001007c5758 in test::test_main::hf86a5e20735245b9+0x118 (predict_oai-98384e17fb94e863:arm64+0x100045758)
    F1LM1#5 0x0001007c5da0 in test::test_main_static::h61ee9c8fd30abca0+0x54 (predict_oai-98384e17fb94e863:arm64+0x100045da0)
    ...

==45482==ABORTING
@SamuelOliveirads
Author

@F1LM1, how is the review going? I hope everything is alright. I’ve submitted #5 and #6, which should hopefully finalize the MTP implementation—or at least provide a solid foundation for future contributions. If you need any clarification or further changes, please let me know and I will gladly help!

@F1LM1
Owner

F1LM1 commented Dec 19, 2025

@F1LM1, how is the review going? I hope everything is alright. I’ve submitted #5 and #6, which should hopefully finalize the MTP implementation—or at least provide a solid foundation for future contributions. If you need any clarification or further changes, please let me know and I will gladly help!

Looks really good, super polished compared to what we had before, and I'm also getting reproducible uplift when drafting N = 1 tokens, in the range of 6-10% over baseline (and notably the baseline is better than what it was on the old base). For whatever reason #5 was a bit buggy for me but #6 is not. In most of my test prompts I see the 1st, 2nd, and 3rd speculative token with around 75%, 50%, and 25% acceptance, respectively. I'm not sure if it has to do with overhead in CPU-only or hybrid inference modes but I'm a bit surprised that N > 1 doesn't perform better than it does (on my machine) especially given your overhead optimizations. But maybe it's a sign that things can still improve from here.

Anyway, I have two hopefully very easy suggestions. The first is more important:

  • When MTP is off, we still load the MTP layer, which is of course just a waste of memory for anyone who doesn't want to use MTP. And if someone is running their setup "on the edge" in terms of RAM/VRAM this might just tip them over the edge which is a bit annoying; best that we don't accidentally cause any regressions.
  • At least on my machine and test prompts, the added overhead of heavy sampling (when drafting) takes this feature from a net positive to a negative again, while in most cases "always greedy sampling" doesn't seem to drop draft acceptance rates by too much. Maybe let's add an override option to allow using the greedy sampler even when repetition/presence/etc. penalties are turned on. Honestly, maybe even default to the greedy sampler, since in the vast majority of my test cases either the greedy sampler is better, or (with extreme sampler settings) draft acceptance rates are so bad that there's no point in using MTP to begin with. (A quick sketch of what greedy drafting means follows this list.)
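
For what it's worth, here is a minimal sketch of what "always greedy" drafting would mean, purely as an illustration; the function name and signature are made up, not the PR's actual API.

    /* Greedy draft sampling: take the argmax of the MTP head's logits and skip
       the full sampler chain (penalties, top-k, temperature, ...). The names
       here are hypothetical. */
    #include <stddef.h>

    static int mtp_draft_token_greedy(const float * logits, size_t n_vocab) {
        size_t best = 0;
        for (size_t i = 1; i < n_vocab; ++i) {
            if (logits[i] > logits[best]) {
                best = i;
            }
        }
        return (int) best;
    }

Since the target model still verifies every drafted token, a greedy draft should only affect the acceptance rate, not the generated output, which is why defaulting to it seems safe.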

@SamuelOliveirads
Author

Glad to hear that it's working well for your tests!

  • When MTP is off, we still load the MTP layer, which is of course just a waste of memory for anyone who doesn't want to use MTP. And if someone is running their setup "on the edge" in terms of RAM/VRAM this might just tip them over the edge which is a bit annoying; best that we don't accidentally cause any regressions.
  • At least on my machine and test prompts, the added overhead of heavy sampling (when drafting) takes this feature from a net positive to a negative again, while in most cases "always greedy sampling" doesn't seem to drop draft acceptance rates by too much. Maybe let's add an override option to allow using the greedy sampler even when repetition/presence/etc. penalties are turned on. Honestly, maybe even default to the greedy sampler, since in the vast majority of my test cases either the greedy sampler is better, or (with extreme sampler settings) draft acceptance rates are so bad that there's no point in using MTP to begin with.

I always had the feeling that sampling might not give us enough benefit, but at least in my testing with GLM 4.5 Air and 4.6 it wasn't detrimental to performance. For Air it's inconclusive (basically the same), but with 4.6 it's an 18% improvement in draft acceptance rate and almost 6% better t/s. Still, because not everyone will see a gain, I preferred to remove it.

To make things easier, I applied both changes in the last PR.

Looks really good, super polished compared to what we had before, and I'm also getting reproducible uplift when drafting N = 1 tokens, in the range of 6-10% over baseline (and notably the baseline is better than what it was on the old base). For whatever reason #5 was a bit buggy for me but #6 is not. In most of my test prompts I see the 1st, 2nd, and 3rd speculative token with around 75%, 50%, and 25% acceptance, respectively. I'm not sure if it has to do with overhead in CPU-only or hybrid inference modes but I'm a bit surprised that N > 1 doesn't perform better than it does (on my machine) especially given your overhead optimizations. But maybe it's a sign that things can still improve from here.

Many things can impact performance. For me, drafting 2 tokens runs at around a 66% acceptance rate, which gives me better performance than 1. Still, MTP is a new architecture in llama.cpp, so it's expected to be missing small but important improvements. Think about the difference between the old and new baselines: after the GLM 4.5 implementation, many folks made a lot of small improvements that helped us get better performance in #6. This will definitely happen for MTP as well once ggml-org#15225 is merged.
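
As a back-of-the-envelope check (assuming the acceptance percentages quoted above are per-position rates, and using the numbers from this thread rather than a new benchmark), the expected tokens emitted per verification pass work out roughly like this:

    /* Rough expected-throughput estimate for speculative drafting; the rates
       below are the ones quoted in this conversation, not measurements. */
    #include <stdio.h>

    int main(void) {
        const double acc[3] = { 0.75, 0.50, 0.25 };  /* 1st/2nd/3rd drafted token */
        double expected = 1.0;                       /* token produced by the verify pass itself */
        for (int i = 0; i < 3; ++i) {
            expected += acc[i];
        }
        printf("expected tokens per pass, N = 3: %.2f\n", expected);      /* ~2.50 */
        printf("expected tokens per pass, N = 1: %.2f\n", 1.0 + acc[0]);  /* ~1.75 */
        return 0;
    }

So drafting 3 tokens only pays off if the extra draft passes plus the larger verification batch cost less than that ~0.75-token gap per pass, which matches the observation that N > 1 isn't automatically faster.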

I have finished my to-do list for the MTP implementation, and I hope this will be enough to get the original PR merged.

SamuelOliveirads pushed a commit to SamuelOliveirads/llama.cpp that referenced this pull request Dec 29, 2025
ggml-org#958)

* port upstream ggml-org#16932

* Add fixed chat templates.

* fix grammar when tool have no argument

* Insert additional stops for Kimi-K2

* Fix `no triggers set for lazy grammar!` for GLM4.5/4.6

* update chat.cpp

* fix grammar for GLM 4.5/4.6

* chat: Fix streaming parser for granite models (ggml-org#15682)

* fix(chat): fix streaming parser for granite models

* tests: add test cases for Granite models chat parser

* common : Fix corrupted memory error on json grammar initialization (ggml-org#16038)

Initializing RESERVED_NAME in is_reserved_name() is not thread-safe and leads to corrupted memory when used from multiple threads, as can be seen in the ASan trace included with the earlier commit above. This fixes the initialization to make it thread-safe.

* common : fix reasoning before forced tool call via tool_choice = required (ggml-org#16264)

* common : fix reasoning before forced tool call via tool_choice = required

* common : improve reasoning and commentary handling when tool_choice is required

(cherry picked from commit c746984)

---------

Co-authored-by: Alde Rojas <hello@alde.dev>

* Try fix Jinja template for GLM

* Improve Kimi-K2 chat template

* Fix "Invalid tool call arguments passed" in a rare case.

In a rare case, the model may emit a raw string that begins with a valid JSON string. This commit adds unit tests to cover that scenario and fixes the regression introduced during the Kimi-K2 adaptation.

---------

Co-authored-by: shun095 <8069181+shun095@users.noreply.github.com>
Co-authored-by: David Ribeiro Alves <davidralves@gmail.com>
Co-authored-by: crat0z <11581854+crat0z@users.noreply.github.com>
Co-authored-by: Alde Rojas <hello@alde.dev>