
bug: Duplicate BOS Token in Hugging Face Chat Templates #618

Closed
Tracked by #1151
Van-QA opened this issue May 28, 2024 · 4 comments
Assignees
Labels
  • category: engine management (Related to engine abstraction)
  • category: model running (Inference ux, handling context/parameters, runtime)
  • P1: important (Important feature / fix)
  • type: bug (Something isn't working)
Milestone

Comments


Van-QA commented May 28, 2024

Description:
When using chat templates from Hugging Face, the Beginning-of-Sentence (BOS) token is often already included in the template. However, llama.cpp also automatically adds the BOS token, resulting in a duplicate BOS token.

Expected behavior:
The system should automatically detect and remove any duplicate BOS tokens in the chat template. This would ensure proper functioning of the chat system without causing errors due to redundant tokens.

Additional context:
This issue may cause unexpected behavior or errors in the chat system. It is recommended that Cortex check for and deduplicate the BOS token when it is already present in the user's template, to maintain a consistent and error-free chat experience.
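As a rough illustration of the expected behavior (not Cortex's actual implementation), here is a minimal C++ sketch of stripping a leading BOS literal from a template string before the engine prepends its own. The `strip_leading_bos` helper and the hard-coded BOS literal are hypothetical:

```cpp
#include <iostream>
#include <string>

// Hypothetical helper: remove a leading BOS literal from a chat template so
// the engine's automatic BOS insertion does not produce a duplicate.
std::string strip_leading_bos(const std::string& tmpl, const std::string& bos) {
    if (tmpl.rfind(bos, 0) == 0) {  // template starts with the BOS literal
        return tmpl.substr(bos.size());
    }
    return tmpl;  // nothing to strip
}

int main() {
    const std::string bos  = "<|begin_of_text|>";  // Llama 3 style BOS, as in this report
    const std::string tmpl = "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n";
    std::cout << strip_leading_bos(tmpl, bos) << "\n";  // BOS literal removed
}
```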

@Van-QA Van-QA added the type: bug Something isn't working label May 28, 2024
@louis-menlo louis-menlo self-assigned this May 31, 2024
@imtuyethan imtuyethan moved this from Icebox to Need Investigation in Menlo Sep 2, 2024
@dan-menlo dan-menlo moved this from Need Investigation to Scheduled in Menlo Sep 3, 2024
@dan-menlo dan-menlo added category: model running Inference ux, handling context/parameters, runtime category: engine management Related to engine abstraction labels Sep 6, 2024
@freelerobot freelerobot added the P1: important Important feature / fix label Sep 6, 2024
@dan-menlo

@nguyenhoangthuan99 I am putting this as a sub-issue of #1151. This issue may be stale if HF has fixed it upstream.


nguyenhoangthuan99 commented Sep 25, 2024

Problem
This is an example of a Hugging Face chat template that we are using: <|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n{system_message}<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n{prompt}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n

The token <|begin_of_text|> is the BOS token. However, when we use this chat template, the llama.cpp engine automatically adds <|begin_of_text|> to the prompt, so there are two BOS tokens in the prompt, e.g. <|begin_of_text|><|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n{system_message}<|eot_id|>.

This issue doesn't affect the response much, and most models still work well with it. We are currently running cortex with a doubled BOS token on the llama.cpp engine without any errors.

Solution
There are two possible solutions right now:

  • Handle it in the cortex.llamacpp engine. On every inference run, check whether the model's BOS token appears in the chat template; the BOS token is parsed from the GGUF file by llama.cpp.
  • Handle it on the cortex.cpp side. Save the BOS token to model.yml and let cortex.cpp remove it before sending the request to the cortex.llamacpp engine.

I think the first solution is better because it avoids changes to cortex.cpp and the model.yml file, which are tied to many other parts of cortex.cpp. A sketch of the engine-side check follows.
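As a rough sketch of the engine-side check (the first solution), assuming llama.cpp's llama_token_bos and llama_token_to_piece APIs; their exact signatures have changed across llama.cpp versions, so treat this as illustrative rather than as the actual cortex.llamacpp code:

```cpp
#include <string>
#include "llama.h"  // llama.cpp public API

// Sketch: ask llama.cpp for the model's BOS token (parsed from the GGUF file),
// render it to text, and strip it from the front of the chat template so the
// engine's automatic BOS insertion does not duplicate it.
// Note: llama_token_bos / llama_token_to_piece exist in llama.cpp, but their
// signatures vary by version.
static std::string strip_model_bos(const llama_model * model, std::string tmpl) {
    const llama_token bos = llama_token_bos(model);
    if (bos < 0) {
        return tmpl;  // model defines no BOS token
    }

    char buf[64];
    const int n = llama_token_to_piece(model, bos, buf, sizeof(buf),
                                       /*lstrip=*/0, /*special=*/true);
    if (n <= 0) {
        return tmpl;  // could not render the BOS token to text
    }

    const std::string bos_text(buf, n);   // e.g. "<|begin_of_text|>"
    if (tmpl.rfind(bos_text, 0) == 0) {   // template starts with the BOS text
        tmpl.erase(0, bos_text.size());
    }
    return tmpl;
}
```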

This issue is not critical, and handling it requires some effort to read through the llama.cpp source.

@nguyenhoangthuan99 nguyenhoangthuan99 moved this from Scheduled to Need Investigation in Menlo Sep 25, 2024

nguyenhoangthuan99 commented Sep 30, 2024

I want to close this issue because llama.cpp sets tokenizer_add_bos to false by default as of this PR:
[screenshot of the llama.cpp PR]

For that reason, this bug no longer appears when running inference with llama.cpp.
Tested with cortex.cpp: when running inference, no warning log about a duplicated BOS token appeared.
[screenshot of the cortex.cpp inference log]
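For context, here is a minimal sketch of how the add-BOS behavior is controlled at tokenization time, assuming llama.cpp's llama_tokenize API with its add_special flag (signatures vary across llama.cpp versions); the tokenize wrapper below is hypothetical:

```cpp
#include <algorithm>
#include <string>
#include <vector>
#include "llama.h"

// Sketch: llama.cpp's tokenizer takes a flag controlling whether special
// tokens such as BOS are added automatically. When the chat template already
// contains the BOS text, tokenizing with add_special = false avoids the
// duplicate that this issue describes.
std::vector<llama_token> tokenize(const llama_model * model,
                                  const std::string & text,
                                  bool add_special) {
    std::vector<llama_token> tokens(text.size() + 8);  // generous upper bound
    const int n = llama_tokenize(model, text.c_str(), (int) text.size(),
                                 tokens.data(), (int) tokens.size(),
                                 add_special, /*parse_special=*/true);
    tokens.resize(std::max(n, 0));
    return tokens;
}
```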

@nguyenhoangthuan99 nguyenhoangthuan99 moved this from Investigating to In Review in Menlo Sep 30, 2024
@nguyenhoangthuan99 nguyenhoangthuan99 moved this from In Review to Review + QA in Menlo Sep 30, 2024
@gabrielle-ong

Solved by llama.cpp upstream (thanks @nguyenhoangthuan99 for investigating)

@gabrielle-ong gabrielle-ong closed this as not planned Oct 3, 2024
@gabrielle-ong gabrielle-ong moved this from Review + QA to Completed in Menlo Oct 3, 2024
@gabrielle-ong gabrielle-ong added this to the v1.0.0 milestone Oct 3, 2024