Add minicpm-o and qwen2-vl to the list of supported multimodal models. #1904

Open · kseyhan opened this issue Jan 24, 2025 · 9 comments


kseyhan commented Jan 24, 2025

Support for the Qwen2-VL and MiniCPM-o models would be nice. They have already been merged into the llava subproject of llama.cpp.

@lelefontaa

+1


kseyhan commented Feb 5, 2025

Hmm, just tested again. Maybe it was me, or I pulled an outdated llama.cpp last time. MiniCPM-o seems to work with the "minicpm-v-2.6" chat handler.
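For what it's worth, this is roughly what worked for me; a minimal sketch, where the file names are placeholders for whichever language-model GGUF and mmproj (vision projector) files you downloaded:

# Minimal sketch: MiniCPM-o via the registered MiniCPM-V 2.6 chat handler.
# File names below are placeholders for the GGUF files on disk.
from llama_cpp import Llama
from llama_cpp.llama_chat_format import MiniCPMv26ChatHandler

chat_handler = MiniCPMv26ChatHandler(clip_model_path="mmproj-model-f16.gguf")
llm = Llama(
    model_path="minicpm-o-2_6-q4_k_m.gguf",
    chat_handler=chat_handler,
    n_ctx=4096,  # leave room for the image embeddings plus the reply
)

response = llm.create_chat_completion(
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image."},
                {"type": "image_url", "image_url": {"url": "https://example.com/cat.png"}},
            ],
        }
    ]
)
print(response["choices"][0]["message"]["content"])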


la1ty commented Feb 9, 2025

Yes, minicpm-o-2.6 works with the minicpm-v-2.6 chat handler. But Qwen2-VL does not seem to work with any existing chat handler.

I tried to use the example chat template from llama.cpp, but it still generates random characters...
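In case it helps to reproduce: this is roughly how I wired the raw template in (a sketch from memory, with the actual template string elided; constructor arguments may differ slightly between versions). Note that a plain Jinja2 formatter only shapes the prompt text and never runs the vision projector, so images are effectively ignored with this approach anyway:

# Rough sketch: feeding a raw Jinja chat template to llama-cpp-python.
# This only controls the text layout of the prompt; it does not produce
# image embeddings, so it cannot be the whole answer for a vision model.
from llama_cpp import Llama
from llama_cpp.llama_chat_format import Jinja2ChatFormatter

QWEN2_VL_TEMPLATE = "..."  # the chat_template string from the model's tokenizer config

formatter = Jinja2ChatFormatter(
    template=QWEN2_VL_TEMPLATE,
    eos_token="<|im_end|>",
    bos_token="",
)

llm = Llama(
    model_path="Qwen2-VL-7B-Instruct-Q4_K_M.gguf",  # placeholder path
    chat_handler=formatter.to_chat_handler(),
    n_ctx=4096,
)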

@samkoesnadi

> Yes, minicpm-o-2.6 works with the minicpm-v-2.6 chat handler. But Qwen2-VL does not seem to work with any existing chat handler.
>
> I tried to use the example chat template from llama.cpp, but it still generates random characters...

This is interesting. Could you give us the GGUF model URLs you are using?


la1ty commented Feb 9, 2025

@samkoesnadi I downloaded them from HuggingFace. Hope you have some good news.


kseyhan commented Feb 9, 2025

@samkoesnadi I tried my luck with Qwen2-VL-7B-Instruct-GGUF and tried almost every registered chat handler that includes the <|im_start|> and <|im_end|> tokens in its template, and got the same result as @la1ty: random words in random languages as the reply.

I also tried to implement the chat template myself but unfortunately failed, since I didn't really understand the Jinja template:

{
"chat_template": "{% set image_count = namespace(value=0) %}{% set video_count = namespace(value=0) %}{% for message in messages %}{% if loop.first and message['role'] != 'system' %}<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n{% endif %}<|im_start|>{{ message['role'] }}\n{% if message['content'] is string %}{{ message['content'] }}<|im_end|>\n{% else %}{% for content in message['content'] %}{% if content['type'] == 'image' or 'image' in content or 'image_url' in content %}{% set image_count.value = image_count.value + 1 %}{% if add_vision_id %}Picture {{ image_count.value }}: {% endif %}<|vision_start|><|image_pad|><|vision_end|>{% elif content['type'] == 'video' or 'video' in content %}{% set video_count.value = video_count.value + 1 %}{% if add_vision_id %}Video {{ video_count.value }}: {% endif %}<|vision_start|><|video_pad|><|vision_end|>{% elif 'text' in content %}{{ content['text'] }}{% endif %}{% endfor %}<|im_end|>\n{% endif %}{% endfor %}{% if add_generation_prompt %}<|im_start|>assistant\n{% endif %}"
}

The template expects <|vision_start|><|image_pad|><|vision_end|> (which is unique to this model and not registered in any chat handler so far), and to be honest I didn't really see where the base64-encoded string / image_url should go in this template.
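My best guess so far is something like the sketch below: subclass the existing LLaVA handler and override its CHAT_FORMAT so the image_url text lands between the vision tokens, where the handler should then splice in the embeddings. Completely untested, the class itself is hypothetical, and I may well be wrong about where <|image_pad|> belongs:

# Untested sketch of a hypothetical Qwen2-VL chat handler for llama-cpp-python.
# Assumption: like the other LLaVA-style handlers, the image_url text rendered
# into the prompt is later replaced by image embeddings, so it has to appear
# between <|vision_start|> and <|vision_end|>.
from llama_cpp.llama_chat_format import Llava15ChatHandler


class Qwen2VLChatHandler(Llava15ChatHandler):
    DEFAULT_SYSTEM_MESSAGE = "You are a helpful assistant."

    CHAT_FORMAT = (
        "{% for message in messages %}"
        "<|im_start|>{{ message.role }}\n"
        "{% if message.content is string %}"
        "{{ message.content }}"
        "{% else %}"
        "{% for content in message.content %}"
        "{% if content.type == 'text' %}"
        "{{ content.text }}"
        "{% elif content.type == 'image_url' %}"
        "<|vision_start|>{{ content.image_url.url }}<|vision_end|>"
        "{% endif %}"
        "{% endfor %}"
        "{% endif %}"
        "<|im_end|>\n"
        "{% endfor %}"
        "{% if add_generation_prompt %}<|im_start|>assistant\n{% endif %}"
    )

Using it would then just be a matter of passing an instance (with the Qwen2-VL mmproj file as clip_model_path) as chat_handler to Llama, like the other handlers. No idea yet whether the model also needs extra handling on the llama.cpp side, the way llama-qwen2vl-cli provides.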

@samkoesnadi

> @samkoesnadi I tried my luck with Qwen2-VL-7B-Instruct-GGUF and tried almost every registered chat handler that includes the <|im_start|> and <|im_end|> tokens in its template, and got the same result as @la1ty: random words in random languages as the reply.
> […]

@la1ty could you guys try the 2B and see if it works? That's the one I tested...


kseyhan commented Feb 9, 2025

@samkoesnadi Which chat handler did you use, if I may ask? The exact URL to the model you used would be useful as well.


la1ty commented Feb 10, 2025

@kseyhan Yes, that's exactly what I experienced.

And I don't know if I made errors while compiling, but I found that text responses generated by Qwen2-VL-7B with llama-cpp-python v0.3.7 are mostly nonsense, which does not match the behavior of llama-cli.exe. Maybe I need to recompile it against the latest version of llama.cpp.

@samkoesnadi Yes, it works with llama-cli.exe and llama-qwen2vl-cli.exe in llama.cpp, though llama-qwen2vl-cli.exe seems to have an encoding problem with non-ASCII characters on Windows.
