|
# To run this example, you will need to:
# 1. Set a `MISTRAL_API_KEY` environment variable
# 2. Place a PDF file named `sample.pdf` in the same directory as this script
#
# This example demonstrates OCR document processing with structured annotations,
# embedding the extracted documents using Mistral embeddings, and storing them
# in an InMemoryDocumentStore for later retrieval.
#
# You can customize the ImageAnnotation and DocumentAnnotation schemas below
# to extract different structured information from your documents.
from typing import List

from haystack import Pipeline
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore
from mistralai.models import DocumentURLChunk
from pydantic import BaseModel, Field

from haystack_integrations.components.converters.mistral.ocr_document_converter import (
    MistralOCRDocumentConverter,
)
from haystack_integrations.components.embedders.mistral.document_embedder import (
    MistralDocumentEmbedder,
)
| 26 | + |
| 27 | + |
# Define schema for structured image annotations (bbox): the OCR converter
# asks the model to fill these fields for every image region it detects.
class ImageAnnotation(BaseModel):
    # Category of the image content (e.g. chart, photo, diagram).
    image_type: str = Field(..., description="The type of image content")
    # Short free-text summary of what the image shows.
    description: str = Field(..., description="Brief description of the image")
| 32 | + |
| 33 | + |
# Define schema for structured document annotations: the OCR converter asks
# the model to fill these fields once per processed document.
class DocumentAnnotation(BaseModel):
    # Primary language the document is written in.
    language: str = Field(..., description="Primary language of the document")
    # Any URLs the model finds in the document body.
    urls: List[str] = Field(..., description="URLs found in the document")
    # High-level subjects the document covers.
    topics: List[str] = Field(..., description="Main topics covered in the document")
| 39 | + |
| 40 | + |
# Initialize document store
document_store = InMemoryDocumentStore()

# Create indexing pipeline: OCR conversion -> embedding -> write to store
indexing_pipeline = Pipeline()

# Add components to the pipeline.
# The converter is limited to the first two pages (0-indexed) of each source.
indexing_pipeline.add_component(
    "converter",
    MistralOCRDocumentConverter(pages=[0, 1]),
)
indexing_pipeline.add_component(
    "embedder",
    MistralDocumentEmbedder(),
)
indexing_pipeline.add_component(
    "writer",
    DocumentWriter(document_store=document_store),
)

# Connect components
indexing_pipeline.connect("converter.documents", "embedder.documents")
indexing_pipeline.connect("embedder.documents", "writer.documents")

# Prepare sources: a remote URL and a local file path are both accepted
sources = [
    DocumentURLChunk(document_url="https://arxiv.org/pdf/1706.03762"),
    "./sample.pdf",  # Local PDF file
]

# Run the pipeline, passing the annotation schemas so the converter can
# request structured output for images (bbox) and whole documents.
result = indexing_pipeline.run(
    {
        "converter": {
            "sources": sources,
            "bbox_annotation_schema": ImageAnnotation,
            "document_annotation_schema": DocumentAnnotation,
        }
    }
)


# Check out documents processed by OCR.
# Optionally with enriched content (from bbox annotation) and semantic metadata
# (from document annotation).
# NOTE(review): `.storage` looks like an internal attribute of
# InMemoryDocumentStore — consider `document_store.filter_documents()` for the
# public API; verify against the installed haystack version.
documents = document_store.storage
# Check out the Mistral API response for unprocessed data and usage_info
raw_mistral_response = result["converter"]["raw_mistral_response"]