From f78a67d648b3c806fa7e12a2e7ae1a810c5e0bef Mon Sep 17 00:00:00 2001
From: BryanBradfo
Date: Mon, 25 Aug 2025 22:27:03 +0200
Subject: [PATCH 01/10] docs(pixtral): Update Pixtral model card to new format

---
 docs/source/en/model_doc/pixtral.md | 48 ++++++++++++-----------------
 1 file changed, 19 insertions(+), 29 deletions(-)

diff --git a/docs/source/en/model_doc/pixtral.md b/docs/source/en/model_doc/pixtral.md
index f287170a0e0f..b87b58f0ef8d 100644
--- a/docs/source/en/model_doc/pixtral.md
+++ b/docs/source/en/model_doc/pixtral.md
@@ -14,43 +14,26 @@ rendered properly in your Markdown viewer.
 -->
 
-# Pixtral
-
-PyTorch +
+
+ PyTorch +
-## Overview
-
-The Pixtral model was released by the Mistral AI team in a [blog post](https://mistral.ai/news/pixtral-12b/). Pixtral is a multimodal version of [Mistral](mistral), incorporating a 400 million parameter vision encoder trained from scratch.
-
-The intro from the blog says the following:
-
-*Pixtral is trained to understand both natural images and documents, achieving 52.5% on the MMMU reasoning benchmark, surpassing a number of larger models. The model shows strong abilities in tasks such as chart and figure understanding, document question answering, multimodal reasoning and instruction following. Pixtral is able to ingest images at their natural resolution and aspect ratio, giving the user flexibility on the number of tokens used to process an image. Pixtral is also able to process any number of images in its long context window of 128K tokens. Unlike previous open-source models, Pixtral does not compromise on text benchmark performance to excel in multimodal tasks.*
-
-
-
- Pixtral architecture. Taken from the blog post.
-
-Tips:
-
-- Pixtral is a multimodal model, taking images and text as input, and producing text as output.
-- This model follows the [Llava](llava) architecture. The model uses [`PixtralVisionModel`] for its vision encoder, and [`MistralForCausalLM`] for its language decoder.
-- The main contribution is the 2d ROPE (rotary position embeddings) on the images, and support for arbitrary image sizes (the images are not padded together nor are they resized).
-- Similar to [Llava](llava), the model internally replaces the `[IMG]` token placeholders by image embeddings from the vision encoder. The format for one or multiple prompts is the following:
-```
-"[INST][IMG]\nWhat are the things I should be cautious about when I visit this place?[/INST]"
-```
-Then, the processor will replace each `[IMG]` token with a number of `[IMG]` tokens that depend on the height and the width of each image. Each *row* of the image is separated by an `[IMG_BREAK]` token, and each image is separated by an `[IMG_END]` token. It's advised to use the `apply_chat_template` method of the processor, which takes care of all of this and formats the text for you. If you're using `transformers>=4.49.0`, you can also get a vectorized output from `apply_chat_template`. See the [usage section](#usage) for more info.
+# Pixtral
+[Pixtral](https://huggingface.co/papers/2410.07073) model was released by the Mistral AI team in a [blog post](https://mistral.ai/news/pixtral-12b/). It couples a 400 M-parameter vision encoder with a 12 B-parameter Mistral Nemo decoder and can ingest an arbitrary number of images at their native resolution (no resizing, no common padding) thanks to **2-D RoPE** position embeddings.
 
-This model was contributed by [amyeroberts](https://huggingface.co/amyeroberts) and [ArthurZ](https://huggingface.co/ArthurZ). The original code can be found [here](https://github.com/vllm-project/vllm/pull/8377).
+You can find all the original Pixtral checkpoints under the [mistral-community](https://huggingface.co/mistral-community) collection.
 
+> [!TIP]
+> This model was contributed by [amyeroberts](https://huggingface.co/amyeroberts) and [ArthurZ](https://huggingface.co/ArthurZ).
+> Click any Pixtral model in the right-hand sidebar to explore additional multimodal examples (VQA, document Q&A, chart understanding, etc.).
-## Usage
+
 
-At inference time, it's advised to use the processor's `apply_chat_template` method, which correctly formats the prompt for the model:
+
 
 ```python
 from transformers import AutoProcessor, LlavaForConditionalGeneration
@@ -82,6 +65,13 @@ generate_ids = model.generate(**inputs, max_new_tokens=500)
 output = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
 ```
+
+
+## Notes
+
+- Pixtral is a multimodal model that follows the [Llava](llava) architecture, using a [`PixtralVisionModel`] and a [`MistralForCausalLM`] decoder.
+- The model internally replaces `[IMG]` token placeholders with image embeddings. To correctly format the prompt with text and images, it is highly recommended to use the `apply_chat_template` method of the processor, which handles the complex formatting automatically.
+
 ## PixtralVisionConfig
 
 [[autodoc]] PixtralVisionConfig

From 13ad41d673898d6623a056f0ea94797fafe96291 Mon Sep 17 00:00:00 2001
From: BryanBradfo
Date: Mon, 25 Aug 2025 22:34:15 +0200
Subject: [PATCH 02/10] docs(pixtral): Change cuda into auto for device_map

---
 docs/source/en/model_doc/pixtral.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/source/en/model_doc/pixtral.md b/docs/source/en/model_doc/pixtral.md
index b87b58f0ef8d..a383c1ca24b9 100644
--- a/docs/source/en/model_doc/pixtral.md
+++ b/docs/source/en/model_doc/pixtral.md
@@ -40,7 +40,7 @@ from transformers import AutoProcessor, LlavaForConditionalGeneration
 
 model_id = "mistral-community/pixtral-12b"
 processor = AutoProcessor.from_pretrained(model_id)
-model = LlavaForConditionalGeneration.from_pretrained(model_id, device_map="cuda")
+model = LlavaForConditionalGeneration.from_pretrained(model_id, device_map="auto")
 
 chat = [
     {

From 648c8aa94f53868c0ba400cd953c677615d0691b Mon Sep 17 00:00:00 2001
From: Bryan <101939095+BryanBradfo@users.noreply.github.com>
Date: Tue, 26 Aug 2025 21:45:34 +0200
Subject: [PATCH 03/10] docs(pixtral): Apply suggestions from review

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
---
 docs/source/en/model_doc/pixtral.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/source/en/model_doc/pixtral.md b/docs/source/en/model_doc/pixtral.md
index a7e3096858f0..36bc3d712239 100644
--- a/docs/source/en/model_doc/pixtral.md
+++ b/docs/source/en/model_doc/pixtral.md
@@ -24,7 +24,7 @@ rendered properly in your Markdown viewer.
 
 # Pixtral
 
-[Pixtral](https://huggingface.co/papers/2410.07073) model was released by the Mistral AI team in a [blog post](https://mistral.ai/news/pixtral-12b/). It couples a 400 M-parameter vision encoder with a 12 B-parameter Mistral Nemo decoder and can ingest an arbitrary number of images at their native resolution (no resizing, no common padding) thanks to **2-D RoPE** position embeddings.
+[Pixtral](https://huggingface.co/papers/2410.07073) is a multimodal model trained to understand natural images and documents. It accepts images in their natural resolution and aspect ratio without resizing or padding due to its 2D RoPE embeddings. In addition, Pixtral has a long 128K token context window for processing a large number of images. Pixtral couples a 400M vision encoder with a 12B Mistral Nemo decoder.
 
 You can find all the original Pixtral checkpoints under the [mistral-community](https://huggingface.co/mistral-community) collection.
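The `[IMG]` expansion described in the notes above can be checked with the processor alone. The sketch below is only an illustration, not part of the patches: it assumes the `mistral-community/pixtral-12b` checkpoint and network access for the sample image, and it simply counts the image special tokens in the processed `input_ids`.

```python
import requests
from PIL import Image
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("mistral-community/pixtral-12b")

# Any image works; the number of image tokens depends on its height and width.
image = Image.open(requests.get("https://picsum.photos/id/237/200/300", stream=True).raw)
prompt = "[INST][IMG]\nDescribe this image.[/INST]"

inputs = processor(text=prompt, images=[image], return_tensors="pt")
decoded = processor.batch_decode(inputs["input_ids"])[0]

# The single [IMG] placeholder is expanded into one token per image patch,
# with [IMG_BREAK] between rows and [IMG_END] closing the image.
print(decoded.count("[IMG]"), decoded.count("[IMG_BREAK]"), decoded.count("[IMG_END]"))
```

Re-running the sketch with a higher-resolution image should produce a larger `[IMG]` count, which is what the note about height and width refers to.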
From 810c95b5031058e5e9a7c853e75c9931e4ca3b65 Mon Sep 17 00:00:00 2001
From: Bryan <101939095+BryanBradfo@users.noreply.github.com>
Date: Tue, 26 Aug 2025 21:46:19 +0200
Subject: [PATCH 04/10] docs(pixtral): Apply suggestions from review, changing mistral-community into Mistral AI

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
---
 docs/source/en/model_doc/pixtral.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/source/en/model_doc/pixtral.md b/docs/source/en/model_doc/pixtral.md
index 36bc3d712239..4fbdefbd605e 100644
--- a/docs/source/en/model_doc/pixtral.md
+++ b/docs/source/en/model_doc/pixtral.md
@@ -26,7 +26,7 @@ rendered properly in your Markdown viewer.
 
 [Pixtral](https://huggingface.co/papers/2410.07073) is a multimodal model trained to understand natural images and documents. It accepts images in their natural resolution and aspect ratio without resizing or padding due to its 2D RoPE embeddings. In addition, Pixtral has a long 128K token context window for processing a large number of images. Pixtral couples a 400M vision encoder with a 12B Mistral Nemo decoder.
 
-You can find all the original Pixtral checkpoints under the [mistral-community](https://huggingface.co/mistral-community) collection.
+You can find all the original Pixtral checkpoints under the [Mistral AI](https://huggingface.co/mistralai/models?search=pixtral) organization.
 
 > [!TIP]
 > This model was contributed by [amyeroberts](https://huggingface.co/amyeroberts) and [ArthurZ](https://huggingface.co/ArthurZ).

From eed4f1bc20170abf69bf5313fa28b0c36fdaf80c Mon Sep 17 00:00:00 2001
From: Bryan <101939095+BryanBradfo@users.noreply.github.com>
Date: Tue, 26 Aug 2025 21:46:44 +0200
Subject: [PATCH 05/10] docs(pixtral): Apply suggestions from review [!TIP] part

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
---
 docs/source/en/model_doc/pixtral.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/source/en/model_doc/pixtral.md b/docs/source/en/model_doc/pixtral.md
index 4fbdefbd605e..388666d7539f 100644
--- a/docs/source/en/model_doc/pixtral.md
+++ b/docs/source/en/model_doc/pixtral.md
@@ -30,7 +30,7 @@ You can find all the original Pixtral checkpoints under the [Mistral AI](https:/
 
 > [!TIP]
 > This model was contributed by [amyeroberts](https://huggingface.co/amyeroberts) and [ArthurZ](https://huggingface.co/ArthurZ).
-> Click any Pixtral model in the right-hand sidebar to explore additional multimodal examples (VQA, document Q&A, chart understanding, etc.).
+> Click on the Pixtral models in the right sidebar for more examples of how to apply Pixtral to different vision and language tasks.

From 511fe4b9bdaa8706354c19c158a92d64a27022da Mon Sep 17 00:00:00 2001
From: BryanBradfo
Date: Tue, 26 Aug 2025 23:05:59 +0200
Subject: [PATCH 06/10] docs(pixtral): Finalize model card with tested code examples

This commit finalizes the update for the Pixtral model card.
---
 docs/source/en/model_doc/pixtral.md | 89 ++++++++++++++++++++++++-----
 1 file changed, 76 insertions(+), 13 deletions(-)

diff --git a/docs/source/en/model_doc/pixtral.md b/docs/source/en/model_doc/pixtral.md
index 388666d7539f..aef6c28d9eee 100644
--- a/docs/source/en/model_doc/pixtral.md
+++ b/docs/source/en/model_doc/pixtral.md
@@ -37,41 +37,104 @@ You can find all the original Pixtral checkpoints under the [Mistral AI](https:/
 
 ```python
+import torch
 from transformers import AutoProcessor, LlavaForConditionalGeneration
 
 model_id = "mistral-community/pixtral-12b"
+model = LlavaForConditionalGeneration.from_pretrained(model_id, dtype="auto", device_map="auto")
 processor = AutoProcessor.from_pretrained(model_id)
-model = LlavaForConditionalGeneration.from_pretrained(model_id, device_map="auto")
+
+url_dog = "https://picsum.photos/id/237/200/300"
+url_mountain = "https://picsum.photos/seed/picsum/200/300"
 
 chat = [
     {
       "role": "user", "content": [
        {"type": "text", "content": "Can this animal"},
-       {"type": "image", "url": "https://picsum.photos/id/237/200/300"},
+       {"type": "image", "url": url_dog},
        {"type": "text", "content": "live here?"},
-       {"type": "image", "url": "https://picsum.photos/seed/picsum/200/300"}
+       {"type": "image", "url": url_mountain}
       ]
     }
 ]
 
-inputs = processor.apply_chat_template(
-    chat,
-    add_generation_prompt=True,
-    tokenize=True,
-    return_dict=True,
-    return_tensors="pt"
-).to(model.device)
-
+inputs = processor.apply_chat_template(chat, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt").to(model.device)
 generate_ids = model.generate(**inputs, max_new_tokens=500)
 output = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
 ```
 
+
+
+Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the [Quantization](../quantization/overview) overview for more available quantization backends.
+
+The example below uses [bitsandbytes](https://huggingface.co/docs/transformers/main/en/quantization#bitsandbytes) to quantize the model to 4-bits, making it runnable on consumer GPUs.
+
+```python
+import torch
+import requests
+from PIL import Image
+from transformers import AutoProcessor, LlavaForConditionalGeneration, BitsAndBytesConfig
+
+model_id = "mistral-community/pixtral-12b"
+
+quantization_config = BitsAndBytesConfig(
+    load_in_4bit=True,
+    bnb_4bit_quant_type="nf4",
+    bnb_4bit_compute_dtype=torch.bfloat16
+)
+
+model = LlavaForConditionalGeneration.from_pretrained(
+    model_id,
+    quantization_config=quantization_config,
+    device_map="auto"
+)
+processor = AutoProcessor.from_pretrained(model_id)
+
+dog_url = "https://picsum.photos/id/237/200/300"
+mountain_url = "https://picsum.photos/seed/picsum/200/300"
+dog_image = Image.open(requests.get(dog_url, stream=True).raw)
+mountain_image = Image.open(requests.get(mountain_url, stream=True).raw)
+
+chat = [
+    {
+      "role": "user", "content": [
+       {"type": "text", "text": "Can this animal"},
+       {"type": "image"},
+       {"type": "text", "text": "live here?"},
+       {"type": "image"}
+      ]
+    }
+]
+
+prompt = processor.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
+inputs = processor(text=prompt, images=[dog_image, mountain_image], return_tensors="pt")
+
+inputs["pixel_values"] = inputs["pixel_values"].to(model.dtype)
+inputs = {k: v.to(model.device) for k, v in inputs.items()}
+
+generate_ids = model.generate(**inputs, max_new_tokens=100)
+output = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)
+print(output)
+```
+
+
+
+
+ Pixtral architecture. Taken from the blog post.
+
 ## Notes
 
-- Pixtral is a multimodal model that follows the [Llava](llava) architecture, using a [`PixtralVisionModel`] and a [`MistralForCausalLM`] decoder.
-- The model internally replaces `[IMG]` token placeholders with image embeddings. To correctly format the prompt with text and images, it is highly recommended to use the `apply_chat_template` method of the processor, which handles the complex formatting automatically.
+- Pixtral uses [`PixtralVisionModel`] as the vision encoder and [`MistralForCausalLM`] for its language decoder.
+- The model internally replaces `[IMG]` token placeholders with image embeddings.
+
+  ```py
+  "[INST][IMG]\nWhat are the things I should be cautious about when I visit this place?[/INST]"
+  ```
+
+  The [IMG] tokens are replaced with a number of [IMG] tokens that depend on the height and width of each image. Each row of the image is separated by a [IMG_BREAK] token and each image is separated by a [IMG_END] token. Use the [~Processor.apply_chat_template] method to handle these tokens for you.
 ## PixtralVisionConfig

From 5d42bcf96d7a08f20b1e9c5acbacb33a6842085b Mon Sep 17 00:00:00 2001
From: BryanBradfo
Date: Wed, 27 Aug 2025 10:24:46 +0200
Subject: [PATCH 07/10] Fix the hfoption by the right one

---
 docs/source/en/model_doc/pixtral.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/source/en/model_doc/pixtral.md b/docs/source/en/model_doc/pixtral.md
index aef6c28d9eee..481a3f51029a 100644
--- a/docs/source/en/model_doc/pixtral.md
+++ b/docs/source/en/model_doc/pixtral.md
@@ -63,7 +63,7 @@ generate_ids = model.generate(**inputs, max_new_tokens=500)
 output = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
 ```
 
-
+

From 67b608c9c48e564f73f04e52285b5213ab092c3c Mon Sep 17 00:00:00 2001
From: Bryan <101939095+BryanBradfo@users.noreply.github.com>
Date: Wed, 27 Aug 2025 19:54:14 +0200
Subject: [PATCH 08/10] @BryanBradfo docs(pixtral): Changing the redirection of bitsandbytes

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
---
 docs/source/en/model_doc/pixtral.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/source/en/model_doc/pixtral.md b/docs/source/en/model_doc/pixtral.md
index 481a3f51029a..147190782009 100644
--- a/docs/source/en/model_doc/pixtral.md
+++ b/docs/source/en/model_doc/pixtral.md
@@ -69,7 +69,7 @@ output = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up
 
 Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the [Quantization](../quantization/overview) overview for more available quantization backends.
 
-The example below uses [bitsandbytes](https://huggingface.co/docs/transformers/main/en/quantization#bitsandbytes) to quantize the model to 4-bits, making it runnable on consumer GPUs.
+The example below uses [bitsandbytes](../quantization/bitsandbytes) to quantize the model to 4-bits.
 
 ```python
 import torch

From 77b8053d7959fd9727970a826afb000f00209598 Mon Sep 17 00:00:00 2001
From: Bryan <101939095+BryanBradfo@users.noreply.github.com>
Date: Wed, 27 Aug 2025 19:55:36 +0200
Subject: [PATCH 09/10] docs(pixtral): Add of ` to highlight the tokens

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
---
 docs/source/en/model_doc/pixtral.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/source/en/model_doc/pixtral.md b/docs/source/en/model_doc/pixtral.md
index 147190782009..dd35d78d3be4 100644
--- a/docs/source/en/model_doc/pixtral.md
+++ b/docs/source/en/model_doc/pixtral.md
@@ -134,7 +134,7 @@ alt="drawing" width="600"/>
   "[INST][IMG]\nWhat are the things I should be cautious about when I visit this place?[/INST]"
   ```
 
-  The [IMG] tokens are replaced with a number of [IMG] tokens that depend on the height and width of each image. Each row of the image is separated by a [IMG_BREAK] token and each image is separated by a [IMG_END] token. Use the [~Processor.apply_chat_template] method to handle these tokens for you.
+  The `[IMG]` tokens are replaced with a number of `[IMG]` tokens that depend on the height and width of each image. Each row of the image is separated by a `[IMG_BREAK]` token and each image is separated by a `[IMG_END]` token. Use the [`~Processor.apply_chat_template`] method to handle these tokens for you.
 ## PixtralVisionConfig

From c1397d61694450baf9f64cce05f0d2b2ed2dc27b Mon Sep 17 00:00:00 2001
From: BryanBradfo
Date: Wed, 27 Aug 2025 19:58:24 +0200
Subject: [PATCH 10/10] docs(pixtral): Move image block per final review

---
 docs/source/en/model_doc/pixtral.md | 11 +++++------
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/docs/source/en/model_doc/pixtral.md b/docs/source/en/model_doc/pixtral.md
index dd35d78d3be4..55ba09084292 100644
--- a/docs/source/en/model_doc/pixtral.md
+++ b/docs/source/en/model_doc/pixtral.md
@@ -26,6 +26,11 @@ rendered properly in your Markdown viewer.
 
 [Pixtral](https://huggingface.co/papers/2410.07073) is a multimodal model trained to understand natural images and documents. It accepts images in their natural resolution and aspect ratio without resizing or padding due to its 2D RoPE embeddings. In addition, Pixtral has a long 128K token context window for processing a large number of images. Pixtral couples a 400M vision encoder with a 12B Mistral Nemo decoder.
 
+
+
+ Pixtral architecture. Taken from the blog post.
+
 You can find all the original Pixtral checkpoints under the [Mistral AI](https://huggingface.co/mistralai/models?search=pixtral) organization.
 
 > [!TIP]
@@ -119,12 +124,6 @@ output = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up
 print(output)
 ```
 
-
-
- Pixtral architecture. Taken from the blog post.
-
-
 ## Notes
 
 - Pixtral uses [`PixtralVisionModel`] as the vision encoder and [`MistralForCausalLM`] for its language decoder.
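The final card keeps the `PixtralVisionConfig` and `PixtralVisionModel` API sections referenced above. As a minimal sketch of those classes (using default configuration values rather than the released 12B checkpoint, so the weights here are randomly initialized):

```python
from transformers import PixtralVisionConfig, PixtralVisionModel

# Default configuration; the released checkpoint's encoder may use different sizes.
config = PixtralVisionConfig()
model = PixtralVisionModel(config)  # randomly initialized standalone vision encoder

print(config.hidden_size, config.patch_size)
print(sum(p.numel() for p in model.parameters()))  # rough parameter count
```

This only illustrates how the standalone vision encoder is configured; for actual inference the full `LlavaForConditionalGeneration` examples in the patches above are the intended path.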