
[Feature]: Support HF-style chat template for multi-modal data in offline chat #17551

@DarkLight1337

Description

🚀 The feature, motivation and pitch

Currently, we expect image_url, audio_url, etc. to be inside the messages that are passed to the chat template. We would like to expand this to also support image, audio, etc. inputs, just like in Hugging Face Transformers:

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Can you describe this image?"}
        ]
    },
]

To avoid having to pass multi-modal inputs separately, we propose the following extension:

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "Can you describe this image?"}
        ]
    },
]

This lets us pass multi-modal data such as PIL images to LLM.chat directly, without having to encode them as base64 URLs.
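For illustration, a hypothetical sketch of what an LLM.chat call could look like once the extension is in place. The model name and inputs are placeholders, and the {"type": "image", "image": ...} content form is the proposal itself, not a currently documented API:

from PIL import Image
from vllm import LLM

llm = LLM(model="llava-hf/llava-1.5-7b-hf")  # example multi-modal model
image = Image.open("example.jpg")

messages = [
    {
        "role": "user",
        "content": [
            # Proposed: embed the PIL image directly instead of a base64 image_url.
            {"type": "image", "image": image},
            {"type": "text", "text": "Can you describe this image?"},
        ],
    },
]

outputs = llm.chat(messages)
print(outputs[0].outputs[0].text)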

Alternatives

No response

Additional context

cc @ywang96 @Isotr0py @hmellor

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
