@ngxson Could you take a look to see if this is caused by the chat template changes?
I know we verify whether a custom template is supported (if it's provided on the command line), but I don't think we check whether the model's built-in template is supported, and this might be causing the crash.
We should check that and either fall back to some standard template with a noticeable warning, and/or write short instructions for adding support for new chat templates in llama.cpp so that people can submit PRs.
Thanks for the detailed bug report, I'll look into this today. It seems like it's really because we don't support these templates yet. In this case, we can fall back to chatml.
I'll add them to the list of supported templates too.
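For reference, a rough sketch of the ChatML layout that the fallback would produce (illustrative Python, not the actual llama.cpp code; the helper name is made up):

```python
# Illustrative sketch of the ChatML prompt format used as a fallback.
def format_chatml(messages):
    """Render a list of {"role", "content"} dicts as a ChatML prompt."""
    prompt = ""
    for m in messages:
        prompt += f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
    # Open the assistant turn so the model generates the reply next.
    prompt += "<|im_start|>assistant\n"
    return prompt

print(format_chatml([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello"},
]))
```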
This problem started happening yesterday when I switched to the new image after that change was merged; before that, the models did not give good results because of the inadequate chat template, but the server did not crash.
Thank you very much for the fast response :)
Description
The server returns "500 Internal Server Error\nvector::_M_default_append" with certain models when trying to use the model's built-in chat template with the Docker CUDA image.
Steps to Reproduce
I'm using the OpenAI client in Python:
```python
def api_openai(placeholder, system_prompt, user_prompt, temperature, logit_bias):
    full_response = ""
    # Stream the chat completion and update the Streamlit placeholder as tokens arrive.
    for response in openai_client.chat.completions.create(
            model=st.session_state["openai_model"],
            messages=[{"role": "system", "content": system_prompt},
                      {"role": "user", "content": user_prompt}],
            stream=True,
            temperature=temperature,
            frequency_penalty=1,
            logit_bias=logit_bias):
        full_response += response.choices[0].delta.content or ""
        placeholder.info(full_response + "▌")
```
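For context, a minimal sketch of how `openai_client` and `st` are set up on my side (the base URL, model name, and API key here are assumptions matching the compose file below):

```python
import streamlit as st
from openai import OpenAI

# Assumed setup: the llama.cpp server exposes its OpenAI-compatible API on port 8080,
# and "key" matches the --api-key passed to the server in the compose file below.
openai_client = OpenAI(base_url="http://localhost:8080/v1", api_key="key")
st.session_state["openai_model"] = "alphamonarch-7b"  # model name is illustrative
```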
Actual Behavior
"500 Internal Server Error\nvector::_M_default_append"
Screenshots
Environment
Operating System: Docker
Docker compose:
```yaml
api-server:
  container_name: api-server
  image: ghcr.io/ggerganov/llama.cpp:server-cuda
  command: >
    -m models/alphamonarch-7b.Q5_K_M.gguf
    --ctx-size 8192
    --host 0.0.0.0
    --port 8080
    --n-gpu-layers 1000
    -np 1
    -cb
    --grp-attn-n 4
    --grp-attn-w 2048
    --api-key key
    --verbose
  ports:
    - "8080:8080"
```
Models that failed:
https://huggingface.co/mlabonne/AlphaMonarch-7B-GGUF
https://huggingface.co/CultriX/OmniBeagle-7B-GGUF
Additional Information
Models that I've tried that work:
https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GGUF
https://huggingface.co/brittlewis12/NeuralDaredevil-7B-GGUF
Related Issues
I used #5593.
Proposed Solution
I think the problem could be related to the extracted chat_template. On the Hugging Face side, "tokenizer.apply_chat_template" works without problems, but I don't know if the llama.cpp implementation works the same way.
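For comparison, this is roughly what I mean on the Hugging Face side (a minimal sketch; I'm assuming the base model id mlabonne/AlphaMonarch-7B carries the same chat_template as the GGUF conversion):

```python
from transformers import AutoTokenizer

# Load the tokenizer of one of the failing models and render its built-in chat template.
tokenizer = AutoTokenizer.from_pretrained("mlabonne/AlphaMonarch-7B")
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Hello"}],
    tokenize=False,
    add_generation_prompt=True,
)
print(prompt)
```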