Commit 4ce5b68

docs: create MistralOCRDocumentConverter docs (#9952)
* converter-docs
* upd-reference-link
1 parent 2d0fd42 commit 4ce5b68

6 files changed: +341 -1 lines changed

docs-website/docs/pipeline-components/converters.mdx

Lines changed: 2 additions & 1 deletion
@@ -20,6 +20,7 @@ Use various Converters to extract data from files in different formats and cast
 | [ImageFileToImageContent](converters/imagefiletoimagecontent.mdx) | Reads local image files and converts them into `ImageContent` objects. |
 | [JSONConverter](converters/jsonconverter.mdx) | Converts JSON files to text documents. |
 | [MarkdownToDocument](converters/markdowntodocument.mdx) | Converts markdown files to documents. |
+| [MistralOCRDocumentConverter](converters/mistralocrdocumentconverter.mdx) | Extracts text from documents using Mistral's OCR API, with optional structured annotations. |
 | [MSGToDocument](converters/msgtodocument.mdx) | Converts Microsoft Outlook .msg files to documents. |
 | [MultiFileConverter](converters/multifileconverter.mdx) | Converts CSV, DOCX, HTML, JSON, MD, PPTX, PDF, TXT, and XLSX files to documents. |
 | [OpenAPIServiceToFunctions](converters/openapiservicetofunctions.mdx) | Transforms OpenAPI service specifications into a format compatible with OpenAI's function calling mechanism. |
@@ -31,4 +32,4 @@ Use various Converters to extract data from files in different formats and cast
 | [TikaDocumentConverter](converters/tikadocumentconverter.mdx) | Converts various file types to documents using Apache Tika. |
 | [TextFileToDocument](converters/textfiletodocument.mdx) | Converts text files to documents. |
 | [UnstructuredFileConverter](converters/unstructuredfileconverter.mdx) | Converts text files and directories to a document. |
-| [XLSXToDocument](converters/xlsxtodocument.mdx) | Converts Excel files into documents. |
+| [XLSXToDocument](converters/xlsxtodocument.mdx) | Converts Excel files into documents. |
Lines changed: 168 additions & 0 deletions
@@ -0,0 +1,168 @@
---
title: "MistralOCRDocumentConverter"
id: mistralocrdocumentconverter
slug: "/mistralocrdocumentconverter"
description: "`MistralOCRDocumentConverter` extracts text from documents using Mistral's OCR API, with optional structured annotations for both individual image regions and full documents. It supports various input formats including local files, URLs, and Mistral file IDs."
---

# MistralOCRDocumentConverter

`MistralOCRDocumentConverter` extracts text from documents using Mistral's OCR API, with optional structured annotations for both individual image regions and full documents. It supports various input formats including local files, URLs, and Mistral file IDs.

| | |
| --- | --- |
| **Most common position in a pipeline** | Before [PreProcessors](../preprocessors.mdx), or right at the beginning of an indexing pipeline |
| **Mandatory init variables** | "api_key": The Mistral API key. Can be set with the `MISTRAL_API_KEY` environment variable. |
| **Mandatory run variables** | "sources": A list of document sources (file paths, ByteStreams, URLs, or Mistral chunks) |
| **Output variables** | "documents": A list of documents <br /> <br />"raw_mistral_response": A list of raw OCR responses from the Mistral API |
| **API reference** | [Mistral](/reference/integrations-mistral) |
| **GitHub link** | https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/mistral |

## Overview

The `MistralOCRDocumentConverter` takes a list of document sources and uses Mistral's OCR API to extract text from images and PDFs. It supports multiple input formats:

- **Local files**: File paths (`str` or `Path`) or [`ByteStream`](../../concepts/data-classes.mdx#bytestream) objects
- **Remote resources**: Document URLs and image URLs, passed as Mistral's `DocumentURLChunk` and `ImageURLChunk`
- **Mistral storage**: File IDs, passed as Mistral's `FileChunk`, for files previously uploaded to Mistral

The component returns one Haystack [`Document`](../../concepts/data-classes.mdx#document) per source, with all pages concatenated using form feed characters (`\f`) as separators. This keeps the output compatible with Haystack's [`DocumentSplitter`](../preprocessors/documentsplitter.mdx) for accurate page-wise splitting and overlap handling. The content is returned in markdown format, with images represented as `![img-id](img-id)` tags.

By default, the component uses the `MISTRAL_API_KEY` environment variable for authentication. You can also pass an `api_key` at initialization. Local files are automatically uploaded to Mistral's storage for processing and deleted afterward (configurable with `cleanup_uploaded_files`).

When you initialize the component, you can optionally specify which pages to process, set limits on image extraction, configure minimum image sizes, or include base64-encoded images in the response. The default model is `"mistral-ocr-2505"`. See the [Mistral models documentation](https://docs.mistral.ai/getting-started/models/models_overview/) for available models.

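As a rough illustration of the output format described above, the following sketch splits a content string on form feed characters and lists the image tags found on each page. It runs on a hand-written sample string rather than on real converter output; `split_pages`, `IMAGE_TAG`, and `sample_content` are hypothetical names introduced here and are not part of the integration.

```python
import re
from typing import List

# Matches markdown image tags of the form ![img-id](img-id), where both ids are identical.
IMAGE_TAG = re.compile(r"!\[([^\]]+)\]\(\1\)")


def split_pages(content: str) -> List[str]:
    """Split a content string into per-page markdown strings on form feeds."""
    return content.split("\f")


# Hypothetical sample standing in for `documents[0].content`.
sample_content = (
    "# Page one\n\nIntro text."
    "\f## Page two\n\n![img-0.jpeg](img-0.jpeg)"
    "\fClosing page."
)

pages = split_pages(sample_content)
image_ids_per_page = [IMAGE_TAG.findall(page) for page in pages]

for number, (page, image_ids) in enumerate(zip(pages, image_ids_per_page), start=1):
    print(f"page {number}: {len(page)} characters, images: {image_ids}")
```

In an actual pipeline you would normally leave page splitting to `DocumentSplitter`; the sketch only shows what the separator convention looks like in the raw content.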
### Structured Annotations

A distinguishing feature of `MistralOCRDocumentConverter` is its support for structured annotations using Pydantic schemas:

- **Bounding box annotations** (`bbox_annotation_schema`): Annotate individual image regions with structured data (for example, image type, description, summary). These annotations are inserted inline after the corresponding image tags in the markdown content.
- **Document annotations** (`document_annotation_schema`): Annotate the full document with structured data (for example, language, chapter titles, URLs). These annotations are unpacked into the document's metadata with a `source_` prefix (for example, `source_language`, `source_chapter_titles`).

When annotation schemas are provided, the OCR model first extracts text and structure, then a vision LLM analyzes the content and generates structured annotations according to your Pydantic schemas. Document annotation is limited to a maximum of 8 pages. For more details, see the [Mistral documentation on annotations](https://docs.mistral.ai/capabilities/document_ai/annotations/).

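The `source_` prefix convention for document annotations can be pictured with a small, self-contained sketch. This is not the integration's actual code: `prefix_annotation_keys` and the sample `annotation` dictionary are hypothetical, and the sketch only assumes that each annotation field ends up as a metadata key with the prefix prepended.

```python
from typing import Any, Dict


def prefix_annotation_keys(annotation: Dict[str, Any], prefix: str = "source_") -> Dict[str, Any]:
    """Return a copy of the annotation with every key prefixed (hypothetical helper)."""
    return {f"{prefix}{key}": value for key, value in annotation.items()}


# Hypothetical document-level annotation, shaped like the fields mentioned above.
annotation = {
    "language": "en",
    "chapter_titles": ["Introduction", "Results"],
    "urls": ["https://example.com"],
}

meta = prefix_annotation_keys(annotation)
for key, value in meta.items():
    print(f"{key}: {value}")
```

Prefixing the keys this way keeps annotation fields from colliding with other metadata keys that may already be present on a document.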
## Usage

You need to install the `mistral-haystack` integration to use `MistralOCRDocumentConverter`:

```shell
pip install mistral-haystack
```

### On its own

Basic usage with a local file:

```python
from pathlib import Path
from haystack.utils import Secret
from haystack_integrations.components.converters.mistral import MistralOCRDocumentConverter

# Read the API key from the MISTRAL_API_KEY environment variable
converter = MistralOCRDocumentConverter(
    api_key=Secret.from_env_var("MISTRAL_API_KEY"),
    model="mistral-ocr-2505",
)

# Local files are uploaded to Mistral for processing
result = converter.run(sources=[Path("my_document.pdf")])
documents = result["documents"]
```

Processing multiple sources with different types:

```python
from pathlib import Path
from haystack.utils import Secret
from haystack_integrations.components.converters.mistral import MistralOCRDocumentConverter
from mistralai.models import DocumentURLChunk, ImageURLChunk

converter = MistralOCRDocumentConverter(
    api_key=Secret.from_env_var("MISTRAL_API_KEY"),
    model="mistral-ocr-2505",
)

sources = [
    Path("local_document.pdf"),
    DocumentURLChunk(document_url="https://example.com/document.pdf"),
    ImageURLChunk(image_url="https://example.com/receipt.jpg"),
]

result = converter.run(sources=sources)
documents = result["documents"]  # List of 3 Documents
raw_responses = result["raw_mistral_response"]  # List of 3 raw responses
```

Using structured annotations:

```python
from pathlib import Path
from typing import List
from pydantic import BaseModel, Field
from haystack.utils import Secret
from haystack_integrations.components.converters.mistral import MistralOCRDocumentConverter
from mistralai.models import DocumentURLChunk

# Define schema for image region annotations
class ImageAnnotation(BaseModel):
    image_type: str = Field(..., description="The type of image content")
    short_description: str = Field(..., description="Short natural-language description")
    summary: str = Field(..., description="Detailed summary of the image content")

# Define schema for document-level annotations
class DocumentAnnotation(BaseModel):
    language: str = Field(..., description="Primary language of the document")
    chapter_titles: List[str] = Field(..., description="Detected chapter or section titles")
    urls: List[str] = Field(..., description="URLs found in the text")

converter = MistralOCRDocumentConverter(
    api_key=Secret.from_env_var("MISTRAL_API_KEY"),
    model="mistral-ocr-2505",
)

sources = [DocumentURLChunk(document_url="https://example.com/report.pdf")]
result = converter.run(
    sources=sources,
    bbox_annotation_schema=ImageAnnotation,
    document_annotation_schema=DocumentAnnotation,
)

documents = result["documents"]
# Document metadata will include:
# - source_language: extracted from DocumentAnnotation
# - source_chapter_titles: extracted from DocumentAnnotation
# - source_urls: extracted from DocumentAnnotation
# Document content will include inline image annotations
```

### In a pipeline

Here's an example of an indexing pipeline that processes PDFs and images with OCR and writes the results to a Document Store:

```python
from haystack import Pipeline
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.preprocessors import DocumentCleaner, DocumentSplitter
from haystack.components.writers import DocumentWriter
from haystack.utils import Secret
from haystack_integrations.components.converters.mistral import MistralOCRDocumentConverter

document_store = InMemoryDocumentStore()

pipeline = Pipeline()
pipeline.add_component(
    "converter",
    MistralOCRDocumentConverter(
        api_key=Secret.from_env_var("MISTRAL_API_KEY"),
        model="mistral-ocr-2505",
    ),
)
pipeline.add_component("cleaner", DocumentCleaner())
# Pages are separated by form feeds, so split_by="page" yields one chunk per page
pipeline.add_component("splitter", DocumentSplitter(split_by="page", split_length=1))
pipeline.add_component("writer", DocumentWriter(document_store=document_store))

# The converter has two outputs, so name the "documents" socket explicitly
pipeline.connect("converter.documents", "cleaner.documents")
pipeline.connect("cleaner.documents", "splitter.documents")
pipeline.connect("splitter.documents", "writer.documents")

file_paths = ["invoice.pdf", "receipt.jpg", "contract.pdf"]
pipeline.run({"converter": {"sources": file_paths}})
```

docs-website/sidebars.js

Lines changed: 1 addition & 0 deletions
@@ -239,6 +239,7 @@ export default {
 'pipeline-components/converters/imagefiletoimagecontent',
 'pipeline-components/converters/jsonconverter',
 'pipeline-components/converters/markdowntodocument',
+'pipeline-components/converters/mistralocrdocumentconverter',
 'pipeline-components/converters/msgtodocument',
 'pipeline-components/converters/multifileconverter',
 'pipeline-components/converters/openapiservicetofunctions',

docs-website/versioned_docs/version-2.19/pipeline-components/converters.mdx

Lines changed: 1 addition & 0 deletions
@@ -20,6 +20,7 @@ Use various Converters to extract data from files in different formats and cast
 | [ImageFileToImageContent](doc:imagefiletoimagecontent) | Reads local image files and converts them into `ImageContent` objects. |
 | [JSONConverter](doc:jsonconverter) | Converts JSON files to text documents. |
 | [MarkdownToDocument](/docs/markdowntodocument) | Converts markdown files to documents. |
+| [MistralOCRDocumentConverter](/docs/mistralocrdocumentconverter) | Extracts text from documents using Mistral's OCR API, with optional structured annotations. |
 | [MSGToDocument](doc:msgtodocument) | Converts Microsoft Outlook .msg files to documents. |
 | [MultiFileConverter](doc:multifileconverter) | Converts CSV, DOCX, HTML, JSON, MD, PPTX, PDF, TXT, and XLSX files to documents. |
 | [OpenAPIServiceToFunctions](/docs/openapiservicetofunctions) | Transforms OpenAPI service specifications into a format compatible with OpenAI's function calling mechanism. |
