Fix ChatGLMModel for glm-4-9b cannot find tokenizer merges in model file #13058

Open · glide-the wants to merge 5 commits into master

Conversation

@glide-the commented Apr 22, 2025

Fix: Resolved "Cannot find tokenizer merges in model file" Issue

This PR addresses the tokenizer merge issue (cannot find tokenizer merges in model file) when loading certain models, especially those converted from Hugging Face. The solution is based on insights from earlier related discussions and PRs.

Verification Steps

1. Build

cmake llama.cpp -B llama.cpp/build \
  -DBUILD_SHARED_LIBS=OFF \
  -DGGML_CUDA=ON \
  -DLLAMA_CURL=ON \
  -DCMAKE_C_COMPILER=gcc-13 \
  -DCMAKE_CXX_COMPILER=g++-13 \
  -DCMAKE_CUDA_COMPILER=/usr/local/cuda-12.1/bin/nvcc

cmake --build llama.cpp/build --config Release -j --clean-first --target llama-quantize llama-cli llama-gguf-split

2. Convert HF Weights

python convert_hf_to_gguf.py THUDM/glm-4-9b \
  --outfile glm-4-9b.gguf \
  --outtype q8_0

3. Run Inference

./llama-cli -m /mnt/ceph/develop/jiawei/model_checkpoint/glm-4-9b.gguf -ngl 200000 -p "你好啊"
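
If step 3 still fails with "cannot find tokenizer merges in model file", the converted GGUF can be inspected directly for the merges key. Below is a minimal sketch using ggml's gguf C API (the key name is the standard tokenizer.ggml.merges; the header location and exact signatures are assumed from recent llama.cpp and may differ between versions; compile against the ggml headers/library from the build in step 1):

#include <cstdio>

#include "gguf.h" // ggml's GGUF reader, shipped with llama.cpp

int main() {
    // open the file metadata-only, without allocating tensor data
    struct gguf_init_params params = { /*no_alloc =*/ true, /*ctx =*/ nullptr };
    struct gguf_context * ctx = gguf_init_from_file("glm-4-9b.gguf", params);
    if (ctx == nullptr) {
        std::fprintf(stderr, "failed to open GGUF file\n");
        return 1;
    }

    // the loader error in question is raised when this key is absent
    const auto key = gguf_find_key(ctx, "tokenizer.ggml.merges");
    std::printf("tokenizer.ggml.merges is %s\n", key < 0 ? "MISSING" : "present");

    gguf_free(ctx);
    return key < 0 ? 1 : 0;
}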

Known Issue

Refer to: #7441

In llama.cpp, special tokens (e.g., eos_token_id) are currently mapped one-to-one (one role → one token ID). However, in actual transformer models, a single logical stop token such as eos may correspond to several distinct token IDs, or require multi-token representations.

This mismatch can cause issues where the model doesn't terminate generation correctly.

The exact handling logic and call chain for special_token in llama.cpp remain unclear and may require further investigation.

A temporary workaround is described in #9606.

@github-actions bot added the python (python script changes) label on Apr 22, 2025

@glide-the (Author) commented Apr 23, 2025

The code now supports GLM model variants with both LLaMA-style and GPT-2-style vocabularies.

Tested inference compatibility for the following base models:

Current Status

Inference works with the base models listed above.
However, there are known issues with stop-token handling, as discussed in llama.cpp issue #9606:

llama.cpp uses <|endoftext|> and <|im_end|> as stop tokens across all modes, but ideally generation should stop as soon as either of them appears. Therefore, it's not as simple as redefining eos_token_id or eot_token_id.

To address this, llama.cpp now keeps track of all EOG tokens in the vocab (#9609) and stops generation on whichever of them is encountered.
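
Conceptually, that change amounts to collecting every end-of-generation id into a single set and stopping on the first match; in llama.cpp itself the set lives inside the vocab and is queried through llama_vocab_is_eog() (see the main.cpp snippet quoted further down). A toy, self-contained sketch of the idea (the ids are illustrative placeholders, not the real GLM ids):

#include <cstdint>
#include <cstdio>
#include <set>
#include <vector>

// stand-in for llama_token, to keep the sketch independent of llama.h
using token_id = std::int32_t;

int main() {
    // placeholder ids for <|endoftext|>, <|im_end|>, <|observation|>
    const std::set<token_id> eog_ids = { 151329, 151336, 151338 };

    // pretend these were sampled, one per step
    const std::vector<token_id> sampled = { 1021, 77, 151338, 42 };
    for (token_id id : sampled) {
        if (eog_ids.count(id) > 0) {
            std::printf("stop: token %d is end-of-generation\n", id);
            break; // stop on whichever EOG token appears first
        }
        std::printf("emit: token %d\n", id);
    }
    return 0;
}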


Function Call Token Support

Special token handling for function call-style generation will be submitted in a separate PR.

@johnpyp commented Apr 23, 2025

> Function Call Token Support
> Special token handling for function call-style generation will be submitted in a separate PR.

Excellent! I've been using GLM-4-32B, and its tool-calling format is non-standard (the GLM-4 sample code maps to and from a custom newline-delimited format rather than native JSON). Are you saying that in another PR you'll add a mapping from their custom format to the standard JSON format?

@glide-the (Author)

> Excellent! I've been using GLM-4-32B, and its tool-calling format is non-standard (the GLM-4 sample code maps to and from a custom newline-delimited format rather than native JSON). Are you saying that in another PR you'll add a mapping from their custom format to the standard JSON format?

Function Call Compatibility for GLM

The main modification I made to support function call capabilities in GLM is ensuring that the special token <|observation|> correctly triggers its intended behavior. In GLM models, this token acts as a special stop word.

Example usage pattern:

<|user|>
text<|assistant|>test<|observation|>

In llama.cpp, there is no built-in mapping for <|observation|>, so I added it to the EOG (End of Generation) tokens list.

Loading Chain

When a model file is loaded, the following call chain is involved:

main()
→ llama_model_load_from_file()
→ llama_model_load()
→ llama_model_loader()
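
For reference, a minimal sketch of driving that chain from the public API (hedged: it uses llama_model_load_from_file() from llama.h as in current llama.cpp, with error handling trimmed):

#include "llama.h"

int main() {
    llama_backend_init();

    // public entry point; internally it goes through llama_model_load(),
    // which uses llama_model_loader to read the GGUF metadata and tensors
    llama_model_params mparams = llama_model_default_params();
    llama_model * model = llama_model_load_from_file("glm-4-9b.gguf", mparams);
    if (model == nullptr) {
        llama_backend_free();
        return 1;
    }

    llama_model_free(model);
    llama_backend_free();
    return 0;
}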

Sampling Chain

During sampling, llama_sampler_chain_apply runs each sampler in the chain (llama_sampler_chain, via smpl->ctx) over the candidate tokens.
At the end of each sampling step, the calling code checks for eos, eot, or eog tokens and triggers the corresponding stop behavior.

Refer to the implementation in main.cpp:

if (need_insert_eot && format_chat) {
    llama_token eot = llama_vocab_eot(vocab);
    embd_inp.push_back(eot == LLAMA_TOKEN_NULL ? llama_vocab_eos(vocab) : eot);
    need_insert_eot = false;
}
if (!embd.empty() && llama_vocab_is_eog(vocab, embd.back()) && !(params.interactive)) {
    LOG(" [end of text]\n");
    break;
}

Regarding Function Call Token Implementation

To support tokens like <|observation|> for function-call behavior, it may be sufficient to simply include them in the EOG detection logic (a sketch follows the snippet below).
I haven't yet fully validated this approach, but it appears promising.

for (const auto & t : token_to_id) {
    // find EOT token: "<|eot_id|>", "<|im_end|>", "<end_of_turn>", etc.
    if (special_eot_id == LLAMA_TOKEN_NULL) {
        if (false
                || t.first == "<|eot_id|>"
                || t.first == "<|im_end|>"
                || t.first == "<|end|>"
                || t.first == "<end_of_turn>"
                || t.first == "<|endoftext|>"
                || t.first == "<EOT>"
                || t.first == "_<EOT>"
                || t.first == "<|end▁of▁sentence|>" // DeepSeek
           ) {
            special_eot_id = t.second;
            if ((id_to_token[t.second].attr & LLAMA_TOKEN_ATTR_CONTROL) == 0) {
                LLAMA_LOG_WARN("%s: control-looking token: %6d '%s' was not control-type; this is probably a bug in the model. its type will be overridden\n",
                        __func__, t.second, t.first.c_str());
                id_to_token[t.second].attr = LLAMA_TOKEN_ATTR_CONTROL;
            }
        }
    }
}
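
To make that concrete, here is a rough, unvalidated sketch of how <|observation|> could be folded into the EOG set, mirroring the scan above (token_to_id, id_to_token, and the special_eog_ids set introduced by #9609 are assumed from the surrounding vocab-loading code; names may differ in the actual source):

// mark GLM's <|observation|> as end-of-generation
for (const auto & t : token_to_id) {
    if (t.first == "<|observation|>") {
        special_eog_ids.insert(t.second);
        if ((id_to_token[t.second].attr & LLAMA_TOKEN_ATTR_CONTROL) == 0) {
            LLAMA_LOG_WARN("%s: control-looking token: %6d '%s' was not control-type; overriding to control\n",
                    __func__, t.second, t.first.c_str());
            id_to_token[t.second].attr = LLAMA_TOKEN_ATTR_CONTROL;
        }
    }
}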
