
Generic Chat Formats for Multimodal Models (Obsidian, LLaVA1.6, Moondream) #1147

Merged · 26 commits into main · Apr 30, 2024

Conversation

abetlen
Owner

@abetlen abetlen commented Jan 31, 2024

The LLaVA 1.5 chat format is hard-coded, which makes it difficult to extend to new VLMs.

Goals:

  • Easily configurable chat templates so we can quickly add support for models with the same projector but different chat formats
  • Ability to use external library for projections (transformers, pytorch, etc) and just load the token embeddings into a llama.cpp model
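
Under this design, supporting a new model with the same projector should reduce to subclassing an existing handler and overriding the template string. A sketch of how that could look (MyVLMChatHandler and its template are made up for illustration; the subclass-and-override pattern mirrors the handlers added in this PR):

# Illustrative sketch of goal one: a hypothetical VLM that reuses the
# LLaVA 1.5 projector but needs a different prompt format only has to
# subclass the handler and swap the jinja2 template string.
from llama_cpp.llama_chat_format import Llava15ChatHandler

class MyVLMChatHandler(Llava15ChatHandler):
    # String-content messages only, for brevity.
    CHAT_FORMAT = (
        "{% for message in messages %}"
        "<|{{ message.role }}|>\n{{ message.content }}\n"
        "{% endfor %}"
        "{% if add_generation_prompt %}<|assistant|>\n{% endif %}"
    )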

Usage

Python

>>> from llama_cpp import Llama
>>> from llama_cpp.llama_chat_format import MoondreamChatHandler
>>> chat_handler = MoondreamChatHandler.from_pretrained(
  repo_id="vikhyatk/moondream2",
  filename="*mmproj*",
)
>>> llm = Llama.from_pretrained(
  repo_id="vikhyatk/moondream2"
  filename="*text-model*",
  chat_handler=chat_handler,
  n_ctx=2048, # n_ctx should be increased to accommodate the image embedding
)
>>> llm.create_chat_completion(
    messages = [
        {
            "role": "user",
            "content": [
                {"type" : "text", "text": "What's in this image?"},
                {"type": "image_url", "image_url": {"url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg" } }

            ]
        }
    ]
)
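
If the GGUF files are already on disk, the handler and model can also be constructed from local paths instead of pulling from the Hub (the file paths below are placeholders):

>>> from llama_cpp import Llama
>>> from llama_cpp.llama_chat_format import MoondreamChatHandler
>>> chat_handler = MoondreamChatHandler(clip_model_path="path/to/moondream2-mmproj.gguf")
>>> llm = Llama(
  model_path="path/to/moondream2-text-model.gguf",
  chat_handler=chat_handler,
  n_ctx=2048, # increased to accommodate the image embedding
)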

Server Config

host: "0.0.0.0"
models:
  - model: "*text-model*"
    clip_model_path: "*mmproj*"
    hf_model_repo_id: vikhyatk/moondream2
    model_alias: "gpt-4-turbo"
    chat_format: moondream
    n_threads_batch: -1
    n_gpu_layers: -1
    verbose: true
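
With the above saved as config.yaml, the server can then be started with `python3 -m llama_cpp.server --config_file config.yaml` (assuming the server's config-file option) and queried through the OpenAI-compatible endpoint. A sketch using the openai client; host and port assume the server defaults, and the image URL is a placeholder:

# Sketch: querying the multimodal server via its OpenAI-compatible API.
# "gpt-4-turbo" matches the model_alias in the config above; the api_key
# is a dummy value since the local server does not check it by default.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="sk-no-key-required")
response = client.chat.completions.create(
    model="gpt-4-turbo",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What's in this image?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}},
        ],
    }],
)
print(response.choices[0].message.content)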

Overview

The approach I'm taking here is to map directly from OpenAI ChatCompletionRequestMessage lists to a chat format using jinja2. For images, each image URL is rendered into the prompt, and the rendered prompt is then split on those URLs. The chat completion handler then decodes the text and image segments in order and finally starts generation.
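
A minimal sketch of that render-and-split flow (the function name and splitting details are illustrative, not the PR's actual implementation):

# Minimal sketch of the render-and-split flow described above.
import re
from jinja2 import Environment

def render_and_split(template_str, messages):
    # 1. Render the OpenAI-style message list to a single prompt string;
    #    image_url content parts are rendered inline as their URLs.
    prompt = Environment().from_string(template_str).render(
        messages=messages, add_generation_prompt=True
    )
    # 2. Collect the image URLs that occur in the messages.
    urls = [
        part["image_url"]["url"]
        if isinstance(part["image_url"], dict) else part["image_url"]
        for m in messages if isinstance(m.get("content"), list)
        for part in m["content"] if part.get("type") == "image_url"
    ]
    if not urls:
        return [("text", prompt)]
    # 3. Split the prompt on the URLs: text segments get tokenized and
    #    evaluated normally, image segments go through the CLIP projector
    #    before generation starts.
    pattern = "(" + "|".join(re.escape(u) for u in urls) + ")"
    return [
        ("image" if seg in urls else "text", seg)
        for seg in re.split(pattern, prompt) if seg
    ]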

Example LLaVA 1.5 Jinja Chat Format

    CHAT_FORMAT = (
        "{% for message in messages %}"
        "{% if message.role == 'system' %}"
        "{{ message.content }}"
        "{% endif %}"
        "{% if message.role == 'user' %}"
        "{% if message.content is string %}"
        "\nUSER: {{ message.content }}"
        "{% elif message.content is iterable %}"
        "\nUSER: "
        "{% for content in message.content %}"
        "{% if content.type == 'text' %}"
        "{{ content.text }}"
        "{% endif %}"
        "{% if content.type == 'image_url' and content.image_url is string %}"
        "{{ content.image_url }}"
        "{% endif %}"
        "{% if content.type == 'image_url' and content.image_url is mapping %}"
        "{{ content.image_url.url }}"
        "{% endif %}"
        "{% endfor %}"
        "{% endif %}"
        "{% endif %}"
        "{% if message.role == 'assistant' and message.content is not none %}"
        "\nASSISTANT: {{ message.content }}"
        "{% endif %}"
        "{% endfor %}"
        "{% if add_generation_prompt %}"
        "\nASSISTANT: "
        "{% endif %}"
    )
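
Rendering that template with plain jinja2 shows the prompt the handler would then split on the embedded image URL (a quick check, not part of the PR; the image URL is a placeholder):

# Quick check: rendering CHAT_FORMAT (defined above) with plain jinja2.
from jinja2 import Environment

template = Environment().from_string(CHAT_FORMAT)
print(template.render(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": [
            {"type": "text", "text": "What's in this image?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/cat.png"}},
        ]},
    ],
    add_generation_prompt=True,
))
# Output:
# You are a helpful assistant.
# USER: What's in this image?https://example.com/cat.png
# ASSISTANT: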

Progress:

  • Refactored llava chat format to use jinja2 template string
  • (extra) Added ability to pull image model directly from huggingface via from_pretrained for model and server
  • Add Moondream chat format
  • (extra) Add tool and function calling support
  • Cache image encoding between requests (see the sketch after this list)
  • Add Llava1.6 chat format
  • Add NanoLlava chat format
  • Add MobileVLM chat format
  • Add Obsidian chat format
  • Cleanup implementation
  • Update docs
  • (extra) prompt-prefix caching
  • (extra) convert unsupported image formats using optional Pillow dependency
  • (extra) bring-your-own image encoder option
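
One way the image-encoding cache could work (a sketch under my own assumptions; the PR caches the last image embed, but the byte-hash keying and single-slot eviction here are illustrative):

# Sketch: cache the most recent image embedding between requests, keyed
# by a hash of the raw image bytes.
import hashlib

class LastImageEmbedCache:
    def __init__(self):
        self._key = None
        self._embed = None

    def get_or_compute(self, image_bytes, compute_embed):
        key = hashlib.sha256(image_bytes).hexdigest()
        if key != self._key:
            # Cache miss: run the (expensive) CLIP projector and keep
            # only the most recent result.
            self._key = key
            self._embed = compute_embed(image_bytes)
        return self._embed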

Closes #1301
Closes #1204

@abetlen abetlen marked this pull request as ready for review April 27, 2024 17:00
@abetlen abetlen changed the title Generic Chat Formats for Multimodal Models (WIP) Generic Chat Formats for Multimodal Models (Obsidian, LLaVA1.6, Moondream) Apr 30, 2024
@abetlen abetlen merged commit fe2da09 into main Apr 30, 2024
16 checks passed