Conversation

@SamuelOliveirads

Hey @F1LM1, it's been my turn to be busy lately, so I haven't made much progress. However, this gave me time to rethink the future of this PR, and I decided to clean up the code for a proper review. Here are my thoughts:

  • The core idea of the original PR was the MTP implementation. Although performance isn't currently better than having it disabled, it is functional, which I see as a solid contribution to the project.
  • This PR has been open for almost four months, and the next set of improvements could take just as long to finish.

I took the latest improvements we made and tested each one with small prompts. This new PR is the result. With my configuration, I'm getting about 84% of the tokens per second compared to running without MTP. I also reviewed the entire project to remove unused code and references, and finally added a command-line argument so everyone can easily try it with or without MTP.

My suggestion is as follows:

  1. You review this cleanup, and if it looks good, merge it into the original PR branch.
  2. Mark the original PR as ready for review. If necessary, I can tag some of the frequent reviewers to take a look.
  3. I will be available to fix any issues found in the original PR and focus on getting it merged. In parallel, I will keep working on the server loop optimization and multi-token MTP in PR #4 (Glm4 mtp optimizations).

F1LM1 and others added 30 commits August 10, 2025 23:52
@F1LM1
Owner

F1LM1 commented Dec 7, 2025

Sounds good to me, I'll look over it more formally over the next 2-3 days.

@wishstudio

I was working on the server loop optimization and just noticed that ngxson has recently created a PR for it: ggml-org#17808

It looks very good and will hopefully be merged soon. After that, perhaps you can start rebasing this entire repo. There has been a lot of traffic since this repo was created, so just merging the changes will already take some work.

By the way, could you share your test method (cmdline, gguf, system configuration)? Since I already have a positive gain using this branch it puzzles me that it is still a regression on your system. If you could share the info I can try to reproduce and see if I can find anything suspicious.

@SamuelOliveirads SamuelOliveirads changed the title from Fix/improve mt performance to Fix/improve mtp performance on Dec 7, 2025
@SamuelOliveirads SamuelOliveirads deleted the fix/improve-mt-performance branch December 7, 2025 22:06
@SamuelOliveirads SamuelOliveirads restored the fix/improve-mt-performance branch December 7, 2025 22:07
@SamuelOliveirads
Author

SamuelOliveirads commented Dec 7, 2025

(The name of the branch bothered me, but unfortunately the PR closes if I try to rename it... well, it's going to be MT for now)

By the way, could you share your test method (cmdline, gguf, system configuration)? Since I already have a positive gain using this branch it puzzles me that it is still a regression on your system. If you could share the info I can try to reproduce and see if I can find anything suspicious.

Sure. I usually don't test under best-case scenarios. For development I use Windows, which incurs a slight performance drop. Right now my specs are a Threadripper 5965WX and two RTX 3090s. I tried months ago with a Ryzen 7 5700X and noticed the same pattern. I wasn't expecting an offload setup to be great, but it still bothers me that performance without MTP is higher. Here are the commands:

cmake -B build -DCMAKE_BUILD_TYPE=Debug -DLLAMA_CURL=OFF -DGGML_CUDA=ON

cmake --build build --config Release -j

.\build\bin\release\llama-server.exe ^
    --model "F:\llm_models\glm-4.5-air_Q4_general\GLM-4.5-Air-IQ4_XS-00001-of-00002.gguf" ^
    --alias GLM-4.5-Air ^
    --ctx-size 36864 ^
    -ctk q8_0 -ctv q8_0 ^
    -fa --verbose ^
    --n-gpu-layers 99 ^
    -b 2048 -ub 1500 ^
    -ot "blk\.(3|4|5|6|7|8|9|10|11|12|13|14|15|16|17|18)\.ffn_.*=CUDA0" ^
    -ot "blk\.(20|21|22|23|24|25|26|27|28|29|30|31|32|33|34)\.ffn_.*=CUDA1" ^
    --override-tensor exps=CPU ^
    -mtp ^
    --threads 24 --threads-batch 36 ^
    --host 127.0.0.1 ^
    --port 8080

I haven't tried using Linux for this branch or with GLM 4.5/4.6 yet, and I also didn't fully optimize for NUMA. That's why I'm eager to see how it performs for others.
llama-bench is not the best tool for testing, at least for me, so I typically try with small prompts and sometimes with code generation or creative writing. On average, small prompts gave 12.19 T/s. Using this branch originally gave me 10.52 T/s, and now it reaches up to 11.05 T/s, which is about 90% of the original performance.

I was working on the server loop optimization and just noticed that ngxson has recently created a PR for it: ggml-org#17808

It looks very good and will hopefully be merged soon. After that, perhaps you can start rebasing this entire repo. There has been a lot of traffic since this repo was created, so just merging the changes will already take some work.

That's great to hear! I made negligible improvements to the server loop—nothing that yielded significant gains. I will update the branch, take a look at the mentioned PR, and then create a new one that will allow more than one draft per loop.

@wishstudio

wishstudio commented Dec 7, 2025

Thank you for the information! I will take a look. Here are my first intuitions:

In a --cpu-moe setup the CPU part is typically memory bound. IIRC the IQ quants are computationally heavier than the regular Qx_0 or Qx_K quants, so using them may shift the workload to being compute bound, which incurs a larger performance penalty.

To validate this, you can either:

  • Try a computationally lighter quant, like Q4_K, or better yet Q8_0.
  • Log the time spent in the MoE FFNs and check whether it is memory bound. You can measure this easily by timing the loop for (int cur_a = 0; cur_a < n_as; ++cur_a) in the ggml_compute_forward_mul_mat_id function in ggml-cpu.c. You can then compare that time with the theoretical time for an FFN block of your quant (calculate the size in bytes of the corresponding weight matrix, then divide it by your actual memory bandwidth, which can be measured with the Intel MLC tool). You may also need to log the dimensions of the ids tensors to do this correctly, since there will be some shared experts across consecutive tokens. In my experiments on my gaming rig, the actual time spent in the FFN blocks is quite close to the theoretical time. (See the timing sketch right after this list.)
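
For reference, here is a minimal sketch of that timing hook, assuming it is pasted around the cur_a loop inside ggml_compute_forward_mul_mat_id in ggml-cpu.c; the logging and the example numbers are illustrative, not existing code. ggml_time_us() is declared in ggml.h, and the fprintf needs <stdio.h> if that is not already included.

    /* Illustrative timing hook around the expert loop; only n_as and cur_a are
       names taken from ggml_compute_forward_mul_mat_id. */
    const int64_t t_start_us = ggml_time_us();

    for (int cur_a = 0; cur_a < n_as; ++cur_a) {
        /* ... existing per-expert mul_mat work, unchanged ... */
    }

    const int64_t t_elapsed_us = ggml_time_us() - t_start_us;

    /* Theoretical floor = bytes of expert weights actually read / measured
       memory bandwidth. E.g. (hypothetical numbers) 60 MiB of expert weights
       at 60 GB/s is roughly 1.05 ms; elapsed times far above that suggest the
       loop is compute bound rather than memory bound. */
    fprintf(stderr, "mul_mat_id: n_as = %d, %lld us\n", n_as, (long long) t_elapsed_us);

Comparing the printed time against that estimate (and logging the ids dimensions, as suggested above) should show whether IQ4_XS is pushing the CPU side into compute-bound territory.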

You mentioned NUMA, which I don't think is the major problem here: although llama.cpp is generally not NUMA-aware, the inter-CCD bandwidth within a single CPU is still much higher than typical inter-socket bandwidth.

The major additional bottleneck on Windows, to my knowledge, is much higher kernel launch overhead. CUDA graphs can eliminate this, but they are currently disabled for split graphs. Anyway, it's irrelevant here because you have basically the same overhead with or without MTP.

Another important limitation on Windows is that it only allows half of physical RAM to be used as pinned memory. That typically hurts prefill performance in --cpu-moe but not tg. But I guess it's also irrelevant to the MTP performance comparison. You can try whether adding --no-mmap makes any difference; without that option the model weights will not use pinned memory.

@F1LM1
Owner

F1LM1 commented Dec 8, 2025

I will need to test more carefully, but I'm also using a CPU-MoE setup and my tg goes from ~10.2 t/s with MTP off to ~11.5 t/s with it on. I haven't yet tried to optimize settings for the on case.

One very loose thought is:

ctx-size 36864

This is a weird number; does that mean you're usually absolutely maxing out RAM/VRAM in base configuration? Loading the MTP layer will require a bit of memory so maybe you're filling up VRAM or something?
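
If it helps, a quick way to sanity-check the headroom is to estimate the KV cache for that context. A rough sketch with placeholder model parameters (the real n_layer and per-layer KV width should be taken from the server startup log, not from these numbers):

    /* Rough KV-cache size estimate; every model parameter below is a
       placeholder, not GLM-4.5-Air's real configuration. */
    #include <stdio.h>

    int main(void) {
        const double n_ctx        = 36864.0;
        const double n_layer      = 48.0;         /* placeholder */
        const double n_embd_kv    = 2.0 * 1024.0; /* K + V elements per layer per token, placeholder */
        const double bytes_per_el = 34.0 / 32.0;  /* q8_0: 34 bytes per block of 32 values */

        const double kv_gib = n_ctx * n_layer * n_embd_kv * bytes_per_el / (1024.0 * 1024.0 * 1024.0);
        printf("approx KV cache: %.2f GiB\n", kv_gib);
        return 0;
    }

Comparing that against the KV buffer sizes the server prints at startup, plus the extra MTP layer weights, should show whether the -mtp run is the one that tips VRAM over the edge.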

@SamuelOliveirads
Author

  • Try a computationally lighter quant, like Q4_K, or better yet Q8_0.
  • Log the time spent in the MoE FFNs and check whether it is memory bound. You can measure this easily by timing the loop for (int cur_a = 0; cur_a < n_as; ++cur_a) in the ggml_compute_forward_mul_mat_id function in ggml-cpu.c. You can then compare that time with the theoretical time for an FFN block of your quant (calculate the size in bytes of the corresponding weight matrix, then divide it by your actual memory bandwidth, which can be measured with the Intel MLC tool). You may also need to log the dimensions of the ids tensors to do this correctly, since there will be some shared experts across consecutive tokens. In my experiments on my gaming rig, the actual time spent in the FFN blocks is quite close to the theoretical time.

That's good to know; I will measure it after I finish the next PR. Right now the performance with the Air variant doesn't bother me, and my go-to is still GLM 4.6 on Linux; I just prefer to code on Windows. Still, it's worth checking whether something is hurting my performance and preventing MTP from beating the non-MTP baseline. I'm also curious to try a Q1 quant just to fit the model fully in GPU and see how it performs.

Another important limitation on Windows is that it only allows half of physical RAM to be used as pinned memory. That typically hurts prefill performance in --cpu-moe but not tg. But I guess it's also irrelevant to the MTP performance comparison. You can try whether adding --no-mmap makes any difference; without that option the model weights will not use pinned memory.

True, and I saw that when I used GLM 4.5 for the first week: it was around 3 t/s, which jumped to 5 t/s after disabling mmap.

This is a weird number; does that mean you're usually absolutely maxing out RAM/VRAM in base configuration? Loading the MTP layer will require a bit of memory so maybe you're filling up VRAM or something?

Yes, when the model was released I used it quite a lot, and that context filled the entire VRAM. Today I don't need it anymore, but because I reused the old args I didn't change them, and as far as I remember I removed some layers from both GPUs to free memory. But I need to double-check whether it is spilling into system memory again and hurting performance.

SamuelOliveirads pushed a commit to SamuelOliveirads/llama.cpp that referenced this pull request Dec 10, 2025
…gml-org#16038)

Initializing RESERVED_NAME in is_reserved_name() is not thread-safe and leads to corrupted memory when used from multiple threads, as can be seen in the ASan trace below. This fixes the initialization to make it thread-safe.

    #0 0x000100abd018 in std::__1::pair<std::__1::__hash_iterator<std::__1::__hash_node<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, void*>*>, bool> std::__1::__hash_table<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::hash<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>, std::__1::equal_to<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>>::__emplace_unique_key_args<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&>(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&) __hash_table:1565
    F1LM1#1 0x000100ab0320 in SchemaConverter::visit(nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&) json-schema-to-grammar.cpp:802
    F1LM1#2 0x000100aafc48 in std::__1::__function::__func<build_grammar(std::__1::function<void (common_grammar_builder const&)> const&, common_grammar_options const&)::$_2, std::__1::allocator<build_grammar(std::__1::function<void (common_grammar_builder const&)> const&, common_grammar_options const&)::$_2>, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> (std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&)>::operator()(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&) function.h:319
    F1LM1#3 0x000100a2c938 in std::__1::__function::__func<common_chat_params_init_llama_3_x(minja::chat_template const&, templates_params const&, bool)::$_0::operator()(common_grammar_builder const&) const::'lambda'(nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&), std::__1::allocator<common_chat_params_init_llama_3_x(minja::chat_template const&, templates_params const&, bool)::$_0::operator()(common_grammar_builder const&) const::'lambda'(nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&)>, void (nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&)>::operator()(nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&) function.h:319
    F1LM1#4 0x000100a139f8 in foreach_function(nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&, std::__1::function<void (nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&)> const&) chat.cpp:762
    F1LM1#5 0x000100a2a7f4 in std::__1::__function::__func<common_chat_params_init_llama_3_x(minja::chat_template const&, templates_params const&, bool)::$_0, std::__1::allocator<common_chat_params_init_llama_3_x(minja::chat_template const&, templates_params const&, bool)::$_0>, void (common_grammar_builder const&)>::operator()(common_grammar_builder const&) function.h:319
    F1LM1#6 0x000100aa98f4 in build_grammar(std::__1::function<void (common_grammar_builder const&)> const&, common_grammar_options const&) json-schema-to-grammar.cpp:982
    F1LM1#7 0x0001009c9314 in common_chat_params_init_llama_3_x(minja::chat_template const&, templates_params const&, bool) chat.cpp:1110
    F1LM1#8 0x0001009b8afc in common_chat_templates_apply_jinja(common_chat_templates const*, common_chat_templates_inputs const&) chat.cpp:1992
    ggml-org#9 0x0001009b533c in common_chat_templates_apply(common_chat_templates const*, common_chat_templates_inputs const&) chat.cpp:2074
    ggml-org#10 0x000100810120 in llamacpp_apply_chat_template+0x724 (predict_oai-98384e17fb94e863:arm64+0x100090120)
    ...

==45482==Register values:
 x[0] = 0x00006020004147f8   x[1] = 0x00006080000013c8   x[2] = 0x0000000000000000   x[3] = 0x0000604006289738
 x[4] = 0x0000000000000002   x[5] = 0x0000000000000001   x[6] = 0x04034000004b4000   x[7] = 0x0000000000000001
 x[8] = 0xbebebebebebebebe   x[9] = 0x17d7d7d7d7d7d7d7  x[10] = 0x00000c04000828ff  x[11] = 0x0000000000000001
x[12] = 0x000000002018d383  x[13] = 0x0000000000000000  x[14] = 0xfa0000000000fafa  x[15] = 0x000010700001ffff
x[16] = 0x000000019dc012c0  x[17] = 0x00000001021284f8  x[18] = 0x0000000000000000  x[19] = 0x00000001700acdc0
x[20] = 0x0000000000000002  x[21] = 0x000000002018d384  x[22] = 0x16dd16fd2e731151  x[23] = 0x0000007000020000
x[24] = 0x0000000100c69c08  x[25] = 0x0000000100c69c20  x[26] = 0x00006080000013c7  x[27] = 0x0000000100c69c00
x[28] = 0x00000001700acd60     fp = 0x00000001700aceb0     lr = 0x0000000100abce30     sp = 0x00000001700acd60
AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: SEGV __hash_table:1565 in std::__1::pair<std::__1::__hash_iterator<std::__1::__hash_node<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, void*>*>, bool> std::__1::__hash_table<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::hash<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>, std::__1::equal_to<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>>::__emplace_unique_key_args<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&>(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&)
Thread T5 created by T0 here:
    #0 0x0001020b99d4 in pthread_create+0x5c (libclang_rt.asan_osx_dynamic.dylib:arm64e+0x359d4)
    F1LM1#1 0x000100873910 in std::sys::pal::unix::thread::Thread::new::h77254fdd87a28e05+0x118 (predict_oai-98384e17fb94e863:arm64+0x1000f3910)
    F1LM1#2 0x0001007c7a1c in test::run_test::haeb3c2bcd5ed6cf6+0x76c (predict_oai-98384e17fb94e863:arm64+0x100047a1c)
    F1LM1#3 0x0001007aedb0 in test::console::run_tests_console::he9d142d704f3a986+0x149c (predict_oai-98384e17fb94e863:arm64+0x10002edb0)
    F1LM1#4 0x0001007c5758 in test::test_main::hf86a5e20735245b9+0x118 (predict_oai-98384e17fb94e863:arm64+0x100045758)
    F1LM1#5 0x0001007c5da0 in test::test_main_static::h61ee9c8fd30abca0+0x54 (predict_oai-98384e17fb94e863:arm64+0x100045da0)
    ...

==45482==ABORTING
@SamuelOliveirads
Author

@F1LM1, how is the review going? I hope everything is alright. I’ve submitted #5 and #6, which should hopefully finalize the MTP implementation—or at least provide a solid foundation for future contributions. If you need any clarification or further changes, please let me know and I will gladly help!

@F1LM1
Owner

F1LM1 commented Dec 19, 2025

@F1LM1, how is the review going? I hope everything is alright. I’ve submitted #5 and #6, which should hopefully finalize the MTP implementation—or at least provide a solid foundation for future contributions. If you need any clarification or further changes, please let me know and I will gladly help!

Looks really good, super polished compared to what we had before, and I'm also getting reproducible uplift when drafting N = 1 tokens, in the range of 6-10% over baseline (and notably the baseline is better than what it was on the old base). For whatever reason #5 was a bit buggy for me but #6 is not. In most of my test prompts I see the 1st, 2nd, and 3rd speculative token with around 75%, 50%, and 25% acceptance, respectively. I'm not sure if it has to do with overhead in CPU-only or hybrid inference modes but I'm a bit surprised that N > 1 doesn't perform better than it does (on my machine) especially given your overhead optimizations. But maybe it's a sign that things can still improve from here.

Anyway, I have two hopefully very easy suggestions. The first is more important:

  • When MTP is off, we still load the MTP layer, which is of course just a waste of memory for anyone who doesn't want to use MTP. And if someone is running their setup "on the edge" in terms of RAM/VRAM this might just tip them over the edge which is a bit annoying; best that we don't accidentally cause any regressions.
  • At least on my machine and test prompts, the added overhead of heavy sampling (when drafting) takes this feature from a net positive to a negative again, while in most cases "always greedy sampling" doesn't seem to drop draft acceptance rates by too much. Maybe let's add an override option to allow using the greedy sampler even when repetition/presence/etc. penalties are turned on. Honestly, maybe even default to the greedy sampler, since in the vast majority of my test cases either the greedy sampler is better, or (with extreme sampler settings) draft acceptance rates are so bad that there's no point in using MTP to begin with. (A quick sketch of what greedy drafting means follows this list.)
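
For what it's worth, here is a minimal sketch of what "always greedy" drafting would mean, purely as an illustration; the function name and signature are made up, not the PR's actual API.

    /* Greedy draft sampling: take the argmax of the MTP head's logits and skip
       the full sampler chain (penalties, top-k, temperature, ...). The names
       here are hypothetical. */
    #include <stddef.h>

    static int mtp_draft_token_greedy(const float * logits, size_t n_vocab) {
        size_t best = 0;
        for (size_t i = 1; i < n_vocab; ++i) {
            if (logits[i] > logits[best]) {
                best = i;
            }
        }
        return (int) best;
    }

Since the target model still verifies every drafted token, a greedy draft should only affect the acceptance rate, not the generated output, which is why defaulting to it seems safe.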

@SamuelOliveirads
Author

Glad to hear that it's working well for your tests!

  • When MTP is off, we still load the MTP layer, which is of course just a waste of memory for anyone who doesn't want to use MTP. And if someone is running their setup "on the edge" in terms of RAM/VRAM this might just tip them over the edge which is a bit annoying; best that we don't accidentally cause any regressions.
  • At least on my machine and test prompts, the added overhead of heavy sampling (when drafting) takes this feature from a net positive to a negative again, while in most cases "always greedy sampling" doesn't seem to drop draft acceptance rates by too much. Maybe let's add an override option to allow using the greedy sampler even when repetition/presence/etc. penalties are turned on. Honestly, maybe even default to the greedy sampler, since in the vast majority of my test cases either the greedy sampler is better, or (with extreme sampler settings) draft acceptance rates are so bad that there's no point in using MTP to begin with.

I always had the feeling that sampling might not give us enough benefit, but at least in my testing with GLM 4.5 Air and 4.6 it wasn't detrimental to performance. For Air it's inconclusive (basically the same), but with 4.6 it's an 18% improvement in draft acceptance rate and almost 6% better t/s. Still, because not everyone will see a gain, I preferred to remove it.

To make things easier, I applied both changes in the last PR.

Looks really good, super polished compared to what we had before, and I'm also getting reproducible uplift when drafting N = 1 tokens, in the range of 6-10% over baseline (and notably the baseline is better than what it was on the old base). For whatever reason #5 was a bit buggy for me but #6 is not. In most of my test prompts I see the 1st, 2nd, and 3rd speculative token with around 75%, 50%, and 25% acceptance, respectively. I'm not sure if it has to do with overhead in CPU-only or hybrid inference modes but I'm a bit surprised that N > 1 doesn't perform better than it does (on my machine) especially given your overhead optimizations. But maybe it's a sign that things can still improve from here.

Many things can impact performance. For me, drafting 2 tokens runs at around a 66% acceptance rate, which gives me better performance than 1. Still, MTP is a new architecture in llama.cpp, so it's expected to be missing small but important improvements. Think about the difference between the old and new baselines: after the GLM 4.5 implementation, many folks made a lot of small improvements that helped us get better performance in #6. This will definitely happen for MTP as well once ggml-org#15225 is merged.
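
As a back-of-the-envelope check (assuming the acceptance percentages quoted above are per-position rates, and using the numbers from this thread rather than a new benchmark), the expected tokens emitted per verification pass work out roughly like this:

    /* Rough expected-throughput estimate for speculative drafting; the rates
       below are the ones quoted in this conversation, not measurements. */
    #include <stdio.h>

    int main(void) {
        const double acc[3] = { 0.75, 0.50, 0.25 };  /* 1st/2nd/3rd drafted token */
        double expected = 1.0;                       /* token produced by the verify pass itself */
        for (int i = 0; i < 3; ++i) {
            expected += acc[i];
        }
        printf("expected tokens per pass, N = 3: %.2f\n", expected);      /* ~2.50 */
        printf("expected tokens per pass, N = 1: %.2f\n", 1.0 + acc[0]);  /* ~1.75 */
        return 0;
    }

So drafting 3 tokens only pays off if the extra draft passes plus the larger verification batch cost less than that ~0.75-token gap per pass, which matches the observation that N > 1 isn't automatically faster.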

I have finished my to-do list for the MTP implementation, and I hope this will be enough to get the original PR merged.

SamuelOliveirads pushed a commit to SamuelOliveirads/llama.cpp that referenced this pull request Dec 29, 2025
ggml-org#958)

* port upstream ggml-org#16932

* Add fixed chat templates.

* fix grammar when tool have no argument

* Insert additional stops for Kimi-K2

* Fix `no triggers set for lazy grammar!` for GLM4.5/4.6

* update chat.cpp

* fix grammar for GLM 4.5/4.6

* chat: Fix streaming parser for granite models (ggml-org#15682)

* fix(chat): fix streaming parser for granite models

* tests: add test cases for Granite models chat parser

* common : Fix corrupted memory error on json grammar initialization (ggml-org#16038)

Initializing RESERVED_NAME in is_reserved_name() is not thread-safe and leads to corrupted memory when used from multiple threads, as can be seen in the ASan trace included with the earlier commit above. This fixes the initialization to make it thread-safe.

* common : fix reasoning before forced tool call via tool_choice = required (ggml-org#16264)

* common : fix reasoning before forced tool call via tool_choice = required

* common : improve reasoning and commentary handling when tool_choice is required

(cherry picked from commit c746984)

---------

Co-authored-by: Alde Rojas <hello@alde.dev>

* Try fix Jinja template for GLM

* Improve Kimi-K2 chat template

* Fix "Invalid tool call arguments passed" in a rare case.

In a rare case, the model may emit a raw string that begins with a valid JSON string. This commit adds unit tests to cover that scenario and fixes the regression introduced during the Kimi-K2 adaptation.

---------

Co-authored-by: shun095 <8069181+shun095@users.noreply.github.com>
Co-authored-by: David Ribeiro Alves <davidralves@gmail.com>
Co-authored-by: crat0z <11581854+crat0z@users.noreply.github.com>
Co-authored-by: Alde Rojas <hello@alde.dev>