
[Feature]: Support HF-style chat template for multi-modal data in offline chat #17551

@DarkLight1337

Description

🚀 The feature, motivation and pitch

Currently, we expect image_url, audio_url, etc. to be inside the messages that are passed to the chat template. We would like to expand this to also support image, audio, etc. inputs, just like in Hugging Face Transformers:

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Can you describe this image?"}
        ]
    },
]

To avoid having to pass multi-modal inputs separately, we propose the following extension:

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "Can you describe this image?"}
        ]
    },
]

This lets us pass multi-modal data such as PIL images to LLM.chat directly, without having to encode them as base64 URLs.
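For illustration, a hypothetical sketch of what an LLM.chat call could look like once the extension is in place. The model name and inputs are placeholders, and the {"type": "image", "image": ...} content form is the proposal itself, not a currently documented API:

from PIL import Image
from vllm import LLM

llm = LLM(model="llava-hf/llava-1.5-7b-hf")  # example multi-modal model
image = Image.open("example.jpg")

messages = [
    {
        "role": "user",
        "content": [
            # Proposed: embed the PIL image directly instead of a base64 image_url.
            {"type": "image", "image": image},
            {"type": "text", "text": "Can you describe this image?"},
        ],
    },
]

outputs = llm.chat(messages)
print(outputs[0].outputs[0].text)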

Alternatives

No response

Additional context

cc @ywang96 @Isotr0py @hmellor

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
