diff --git a/docs/source/models/supported_models.md b/docs/source/models/supported_models.md
index 3ba34c77205e5..acbe27a22a679 100644
--- a/docs/source/models/supported_models.md
+++ b/docs/source/models/supported_models.md
@@ -322,7 +322,7 @@ See [this page](#generative-models) for more information on how to use generativ
- ✅︎
- ✅︎
* - `Qwen2ForCausalLM`
- - Qwen2
+ - QwQ, Qwen2
- `Qwen/QwQ-32B-Preview`, `Qwen/Qwen2-7B-Instruct`, `Qwen/Qwen2-7B`, etc.
- ✅︎
- ✅︎
@@ -436,7 +436,7 @@ loaded. See [relevant issue on HF Transformers](https://github.com/huggingface/t
```
If your model is not in the above list, we will try to automatically convert the model using
-{func}`vllm.model_executor.models.adapters.as_embedding_model`. By default, the embeddings
+{func}`~vllm.model_executor.models.adapters.as_embedding_model`. By default, the embeddings
of the whole prompt are extracted from the normalized hidden state corresponding to the last token.
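+
+For illustration, loading such a checkpoint as an embedding model for offline inference looks roughly like this (a minimal sketch; the model name is only an example, and it assumes a vLLM version that provides `LLM.embed()`):
+
+```python
+from vllm import LLM
+
+# A generative checkpoint is converted on the fly via `as_embedding_model`;
+# the pooled embedding comes from the normalized last-token hidden state.
+llm = LLM(model="Qwen/Qwen2-7B-Instruct", task="embed")
+(output,) = llm.embed("Hello, my name is")
+print(len(output.outputs.embedding))  # embedding dimensionality
+```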
#### Reward Modeling (`--task reward`)
@@ -468,7 +468,7 @@ of the whole prompt are extracted from the normalized hidden state corresponding
```
If your model is not in the above list, we will try to automatically convert the model using
-{func}`vllm.model_executor.models.adapters.as_reward_model`. By default, we return the hidden states of each token directly.
+{func}`~vllm.model_executor.models.adapters.as_reward_model`. By default, we return the hidden states of each token directly.
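+
+For illustration, running a reward checkpoint for offline inference looks roughly like this (a minimal sketch; the model name is only an example):
+
+```python
+from vllm import LLM
+
+# Under the default reward pooling, each prompt token yields one hidden-state vector.
+llm = LLM(model="Qwen/Qwen2.5-Math-RM-72B", task="reward")
+(output,) = llm.encode("Hello, my name is")
+print(output.outputs)  # per-token hidden states
+```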
```{important}
For process-supervised reward models such as `peiyi9979/math-shepherd-mistral-7b-prm`, the pooling config should be set explicitly,
@@ -499,7 +499,7 @@ e.g.: `--override-pooler-config '{"pooling_type": "STEP", "step_tag_id": 123, "r
```
If your model is not in the above list, we will try to automatically convert the model using
-{func}`vllm.model_executor.models.adapters.as_classification_model`. By default, the class probabilities are extracted from the softmaxed hidden state corresponding to the last token.
+{func}`~vllm.model_executor.models.adapters.as_classification_model`. By default, the class probabilities are extracted from the softmaxed hidden state corresponding to the last token.
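+
+For illustration, running a sequence-classification checkpoint for offline inference looks roughly like this (a minimal sketch; the model name is only an example, and it assumes a vLLM version that provides `LLM.classify()`):
+
+```python
+from vllm import LLM
+
+# The class probabilities come from the softmaxed last-token hidden state.
+llm = LLM(model="jason9693/Qwen2.5-1.5B-apeach", task="classify")
+(output,) = llm.classify("Hello, my name is")
+print(output.outputs.probs)  # per-class probabilities
+```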
#### Sentence Pair Scoring (`--task score`)
@@ -550,6 +550,28 @@ On the other hand, modalities separated by `/` are mutually exclusive.
See [this page](#multimodal-inputs) on how to pass multi-modal inputs to the model.
+````{important}
+To enable multiple multi-modal items per text prompt, you have to set `limit_mm_per_prompt` (offline inference)
+or `--limit-mm-per-prompt` (online inference). For example, to enable passing up to 4 images per text prompt:
+
+Offline inference:
+```python
+llm = LLM(
+ model="Qwen/Qwen2-VL-7B-Instruct",
+ limit_mm_per_prompt={"image": 4},
+)
+```
+
+Online inference:
+```bash
+vllm serve Qwen/Qwen2-VL-7B-Instruct --limit-mm-per-prompt image=4
+```
+````
+
+```{note}
+vLLM currently only supports adding LoRA to the language backbone of multimodal models.
+```
+
### Generative Models
See [this page](#generative-models) for more information on how to use generative models.
@@ -689,14 +711,14 @@ See [this page](#generative-models) for more information on how to use generativ
* - `Phi3VForCausalLM`
- Phi-3-Vision, Phi-3.5-Vision
- T + IE+
- - `microsoft/Phi-3-vision-128k-instruct`, `microsoft/Phi-3.5-vision-instruct` etc.
+ - `microsoft/Phi-3-vision-128k-instruct`, `microsoft/Phi-3.5-vision-instruct`, etc.
-
- ✅︎
- ✅︎
* - `PixtralForConditionalGeneration`
- Pixtral
- T + I+
- - `mistralai/Pixtral-12B-2409`, `mistral-community/pixtral-12b` etc.
+ - `mistralai/Pixtral-12B-2409`, `mistral-community/pixtral-12b` (see note), etc.
-
- ✅︎
- ✅︎
@@ -715,7 +737,7 @@ See [this page](#generative-models) for more information on how to use generativ
- ✅︎
- ✅︎
* - `Qwen2VLForConditionalGeneration`
- - Qwen2-VL
+ - QVQ, Qwen2-VL
- T + IE+ + VE+
- `Qwen/QVQ-72B-Preview`, `Qwen/Qwen2-VL-7B-Instruct`, `Qwen/Qwen2-VL-72B-Instruct`, etc.
- ✅︎
@@ -733,26 +755,6 @@ See [this page](#generative-models) for more information on how to use generativ
E Pre-computed embeddings can be inputted for this modality.
+ Multiple items can be inputted per text prompt for this modality.
-````{important}
-To enable multiple multi-modal items per text prompt, you have to set `limit_mm_per_prompt` (offline inference)
-or `--limit-mm-per-prompt` (online inference). For example, to enable passing up to 4 images per text prompt:
-
-```python
-llm = LLM(
- model="Qwen/Qwen2-VL-7B-Instruct",
- limit_mm_per_prompt={"image": 4},
-)
-```
-
-```bash
-vllm serve Qwen/Qwen2-VL-7B-Instruct --limit-mm-per-prompt image=4
-```
-````
-
-```{note}
-vLLM currently only supports adding LoRA to the language backbone of multimodal models.
-```
-
```{note}
To use `TIGER-Lab/Mantis-8B-siglip-llama3`, you have to pass `--hf_overrides '{"architectures": ["MantisForConditionalGeneration"]}'` when running vLLM, as shown below.
```
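+
+For example, to serve Mantis online:
+
+```bash
+vllm serve TIGER-Lab/Mantis-8B-siglip-llama3 \
+  --hf_overrides '{"architectures": ["MantisForConditionalGeneration"]}'
+```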
@@ -762,6 +764,11 @@ The official `openbmb/MiniCPM-V-2` doesn't work yet, so we need to use a fork (`
For more details, please see:
```
+```{note}
+The chat template for Pixtral-HF is incorrect (see [discussion](https://huggingface.co/mistral-community/pixtral-12b/discussions/22)).
+A corrected version is available at `examples/template_pixtral_hf.jinja`.
+```
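+
+For example, to serve Pixtral-HF with the corrected template (adjust the path to match your checkout):
+
+```bash
+vllm serve mistral-community/pixtral-12b \
+  --chat-template examples/template_pixtral_hf.jinja
+```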
+
### Pooling Models
See [this page](pooling-models) for more information on how to use pooling models.
diff --git a/examples/template_pixtral_hf.jinja b/examples/template_pixtral_hf.jinja
new file mode 100644
index 0000000000000..e94661cb39071
--- /dev/null
+++ b/examples/template_pixtral_hf.jinja
@@ -0,0 +1,38 @@
+{%- if messages[0]["role"] == "system" %}
+ {%- set system_message = messages[0]["content"] %}
+ {%- set loop_messages = messages[1:] %}
+{%- else %}
+ {%- set loop_messages = messages %}
+{%- endif %}
+
+{{- bos_token }}
+{%- for message in loop_messages %}
+ {%- if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}
+ {{- raise_exception('After the optional system message, conversation roles must alternate user/assistant/user/assistant/...') }}
+ {%- endif %}
+ {%- if message["role"] == "user" %}
+ {%- if loop.last and system_message is defined %}
+ {{- "[INST]" + system_message + "\n" }}
+ {%- else %}
+ {{- "[INST]" }}
+ {%- endif %}
+ {%- if message["content"] is not string %}
+ {%- for chunk in message["content"] %}
+ {%- if chunk["type"] == "text" %}
+ {{- chunk["text"] }}
+ {%- elif chunk["type"] == "image" %}
+ {{- "[IMG]" }}
+ {%- else %}
+ {{- raise_exception("Unrecognized content type!") }}
+ {%- endif %}
+ {%- endfor %}
+ {%- else %}
+ {{- message["content"] }}
+ {%- endif %}
+ {{- "[/INST]" }}
+ {%- elif message["role"] == "assistant" %}
+ {{- message["content"] + eos_token}}
+ {%- else %}
+ {{- raise_exception("Only user and assistant roles are supported, with the exception of an initial optional system message!") }}
+ {%- endif %}
+{%- endfor %}
diff --git a/tests/entrypoints/test_chat_utils.py b/tests/entrypoints/test_chat_utils.py
index d63b963522e73..8f242df4a60e3 100644
--- a/tests/entrypoints/test_chat_utils.py
+++ b/tests/entrypoints/test_chat_utils.py
@@ -758,6 +758,7 @@ def test_resolve_content_format_hf_defined(model, expected_format):
("template_falcon.jinja", "string"),
("template_inkbot.jinja", "string"),
("template_llava.jinja", "string"),
+ ("template_pixtral_hf.jinja", "openai"),
("template_vlm2vec.jinja", "openai"),
("tool_chat_template_granite_20b_fc.jinja", "string"),
("tool_chat_template_hermes.jinja", "string"),