Fix ChatGLMModel for glm-4-9b cannot find tokenizer merges in model file #13058

Open · glide-the wants to merge 5 commits into master

Conversation

@glide-the commented Apr 22, 2025

Fix: Resolved "Cannot find tokenizer merges in model file" Issue

This PR addresses the tokenizer merge issue (cannot find tokenizer merges in model file) when loading certain models, especially those converted from Hugging Face. The solution is based on insights from earlier related discussions and PRs.

Verification Steps

1. Build

cmake llama.cpp -B llama.cpp/build \
  -DBUILD_SHARED_LIBS=OFF \
  -DGGML_CUDA=ON \
  -DLLAMA_CURL=ON \
  -DCMAKE_C_COMPILER=gcc-13 \
  -DCMAKE_CXX_COMPILER=g++-13 \
  -DCMAKE_CUDA_COMPILER=/usr/local/cuda-12.1/bin/nvcc

cmake --build llama.cpp/build --config Release -j --clean-first --target llama-quantize llama-cli llama-gguf-split

2. Convert HF Weights

python convert_hf_to_gguf.py THUDM/glm-4-9b \
  --outfile glm-4-9b.gguf \
  --outtype q8_0

3. Run Inference

./llama-cli -m /mnt/ceph/develop/jiawei/model_checkpoint/glm-4-9b.gguf -ngl 200000 -p "你好啊"
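
If step 3 still fails with "cannot find tokenizer merges in model file", the converted GGUF can be inspected directly for the merges key. Below is a minimal sketch using ggml's gguf C API (the key name is the standard tokenizer.ggml.merges; the header location and exact signatures are assumed from recent llama.cpp and may differ between versions; compile against the ggml headers/library from the build in step 1):

#include <cstdio>

#include "gguf.h" // ggml's GGUF reader, shipped with llama.cpp

int main() {
    // open the file metadata-only, without allocating tensor data
    struct gguf_init_params params = { /*no_alloc =*/ true, /*ctx =*/ nullptr };
    struct gguf_context * ctx = gguf_init_from_file("glm-4-9b.gguf", params);
    if (ctx == nullptr) {
        std::fprintf(stderr, "failed to open GGUF file\n");
        return 1;
    }

    // the loader error in question is raised when this key is absent
    const auto key = gguf_find_key(ctx, "tokenizer.ggml.merges");
    std::printf("tokenizer.ggml.merges is %s\n", key < 0 ? "MISSING" : "present");

    gguf_free(ctx);
    return key < 0 ? 1 : 0;
}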

Known Issue

Refer to: #7441

In llama.cpp, special tokens (e.g., eos_token_id) are currently mapped one-to-one (one role → one token ID). However, in actual transformer models, a single logical stop token such as eos may correspond to several distinct token IDs, or require multi-token representations.

This mismatch can cause issues where the model doesn't terminate generation correctly.

The exact handling logic and call chain for special_token in llama.cpp remain unclear and may require further investigation.

A temporary workaround is described in #9606.

@github-actions bot added the python (python script changes) label on Apr 22, 2025

@glide-the (Author) commented Apr 23, 2025

The code now supports GLM model variants with both LLaMA-style and GPT-2-style vocabularies.

Tested inference compatibility for the following base models:

Current Status

Inference works with the base models listed above.
However, there are known issues with stop-token handling, as discussed in llama.cpp issue #9606:

llama.cpp uses <|endoftext|> and <|im_end|> as stop tokens across all modes, but ideally generation should stop as soon as either of them appears. Therefore, it's not as simple as redefining eos_token_id or eot_token_id.

To address this, llama.cpp now keeps track of all EOG tokens in the vocab (#9609) and stops generation on whichever of them is encountered.
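
Conceptually, that change amounts to collecting every end-of-generation id into a single set and stopping on the first match; in llama.cpp itself the set lives inside the vocab and is queried through llama_vocab_is_eog() (see the main.cpp snippet quoted further down). A toy, self-contained sketch of the idea (the ids are illustrative placeholders, not the real GLM ids):

#include <cstdint>
#include <cstdio>
#include <set>
#include <vector>

// stand-in for llama_token, to keep the sketch independent of llama.h
using token_id = std::int32_t;

int main() {
    // placeholder ids for <|endoftext|>, <|im_end|>, <|observation|>
    const std::set<token_id> eog_ids = { 151329, 151336, 151338 };

    // pretend these were sampled, one per step
    const std::vector<token_id> sampled = { 1021, 77, 151338, 42 };
    for (token_id id : sampled) {
        if (eog_ids.count(id) > 0) {
            std::printf("stop: token %d is end-of-generation\n", id);
            break; // stop on whichever EOG token appears first
        }
        std::printf("emit: token %d\n", id);
    }
    return 0;
}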


Function Call Token Support

Special token handling for function call-style generation will be submitted in a separate PR.

@johnpyp commented Apr 23, 2025

> Function Call Token Support
> Special token handling for function call-style generation will be submitted in a separate PR.

Excellent! I've been using GLM-4-32B, and its tool-calling format is non-standard (the GLM-4 sample code maps to and from a custom newline-delimited format rather than native JSON). Are you saying that in another PR you'll add a mapping from their custom format to the standard JSON format?

@glide-the (Author)

> Excellent! I've been using GLM-4-32B, and its tool-calling format is non-standard (the GLM-4 sample code maps to and from a custom newline-delimited format rather than native JSON). Are you saying that in another PR you'll add a mapping from their custom format to the standard JSON format?

Function Call Compatibility for GLM

The main modification I made to support function call capabilities in GLM is ensuring that the special token <|observation|> correctly triggers its intended behavior. In GLM models, this token acts as a special stop word.

Example usage pattern:

<|user|>
text<|assistant|>test<|observation|>

In llama.cpp, there is no built-in mapping for <|observation|>, so I added it to the EOG (End of Generation) tokens list.

Loading Chain

When a model file is loaded, the following call chain is involved:

main()
→ llama_model_load_from_file()
→ llama_model_load()
→ llama_model_loader()
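
For reference, a minimal sketch of driving that chain from the public API (hedged: it uses llama_model_load_from_file() from llama.h as in current llama.cpp, with error handling trimmed):

#include "llama.h"

int main() {
    llama_backend_init();

    // public entry point; internally it goes through llama_model_load(),
    // which uses llama_model_loader to read the GGUF metadata and tensors
    llama_model_params mparams = llama_model_default_params();
    llama_model * model = llama_model_load_from_file("glm-4-9b.gguf", mparams);
    if (model == nullptr) {
        llama_backend_free();
        return 1;
    }

    llama_model_free(model);
    llama_backend_free();
    return 0;
}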

Sampling Chain

During sampling, llama_sampler_chain_apply runs each sampler in the chain (llama_sampler_chain, via smpl->ctx) over the candidate tokens.
At the end of each sampling step, the calling code checks for eos, eot, or eog tokens and triggers the corresponding stop behavior.

Refer to the implementation in main.cpp:

if (need_insert_eot && format_chat) {
    llama_token eot = llama_vocab_eot(vocab);
    embd_inp.push_back(eot == LLAMA_TOKEN_NULL ? llama_vocab_eos(vocab) : eot);
    need_insert_eot = false;
}
if (!embd.empty() && llama_vocab_is_eog(vocab, embd.back()) && !(params.interactive)) {
    LOG(" [end of text]\n");
    break;
}

Regarding Function Call Token Implementation

To support tokens like <|observation|> for function-call behavior, it may be sufficient to simply include them in the EOG detection logic (a sketch follows the snippet below).
I haven't yet fully validated this approach, but it appears promising.

for (const auto & t : token_to_id) {
    // find EOT token: "<|eot_id|>", "<|im_end|>", "<end_of_turn>", etc.
    if (special_eot_id == LLAMA_TOKEN_NULL) {
        if (false
                || t.first == "<|eot_id|>"
                || t.first == "<|im_end|>"
                || t.first == "<|end|>"
                || t.first == "<end_of_turn>"
                || t.first == "<|endoftext|>"
                || t.first == "<EOT>"
                || t.first == "_<EOT>"
                || t.first == "<|end▁of▁sentence|>" // DeepSeek
           ) {
            special_eot_id = t.second;
            if ((id_to_token[t.second].attr & LLAMA_TOKEN_ATTR_CONTROL) == 0) {
                LLAMA_LOG_WARN("%s: control-looking token: %6d '%s' was not control-type; this is probably a bug in the model. its type will be overridden\n",
                        __func__, t.second, t.first.c_str());
                id_to_token[t.second].attr = LLAMA_TOKEN_ATTR_CONTROL;
            }
        }
    }
}
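
To make that concrete, here is a rough, unvalidated sketch of how <|observation|> could be folded into the EOG set, mirroring the scan above (token_to_id, id_to_token, and the special_eog_ids set introduced by #9609 are assumed from the surrounding vocab-loading code; names may differ in the actual source):

// mark GLM's <|observation|> as end-of-generation
for (const auto & t : token_to_id) {
    if (t.first == "<|observation|>") {
        special_eog_ids.insert(t.second);
        if ((id_to_token[t.second].attr & LLAMA_TOKEN_ATTR_CONTROL) == 0) {
            LLAMA_LOG_WARN("%s: control-looking token: %6d '%s' was not control-type; overriding to control\n",
                    __func__, t.second, t.first.c_str());
            id_to_token[t.second].attr = LLAMA_TOKEN_ATTR_CONTROL;
        }
    }
}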
