Commit 890eb5c

Hansehart and anakin87 authored
feat: mistral ocr converter (#2376)
* Revise MCPTool usage example for Streamable HTTP. Updated example usage in MCPTool documentation to reflect Streamable HTTP usage and mentioned deprecated SSE.
* Clarify connection types in MCPToolset documentation. Updated documentation to reflect changes in connection types for MCPToolset.
* fix: Align with hatch run fmt requirements
* add: MistralOCRDocumentConverter
* add: project files
* fix: example lib usage
* move: ocr document converter into child /mistral
* add: example usage with annotations
* add: hatch run fmt
* add: mistralai
* fix: python3.9 compatibility with using Union, List, Optional
* add: new comments and their position
* add: moved schemas from init into run to bypass problems with serializing
* add: docstring convention
* add: process multiple documents
* add: robust api handling with catching mistral errors
* add: Union[str, Path, ByteStream] as input
* add: comment for new inputs
* add: pipeline example
* fix: example ocr component
* fix: mistral file upload and pydantic v2 models
* add: pipeline example
* add: hint on document annotation page limit
* add: mistralai as project dependency
* fix: hatch run fmt
* fix: hatch run docs
* add: exclude mistral from compliance workflow (its apache 2.0)
* add: 3 initialization tests
* add: 4 se test
* add: test w/ proper mocking
* add: real api test when env is set
* add: delete files by default from mistral if uploaded
* fix: mock file deletion
* fix: hatch run fmt
* Apply suggestion from @anakin87 Co-authored-by: Stefano Fiorucci <stefanofiorucci@gmail.com>
* Update integrations/mistral/src/haystack_integrations/components/converters/mistral/ocr_document_converter.py Co-authored-by: Stefano Fiorucci <stefanofiorucci@gmail.com>
* fix: nested try excepts
* add: mention file upload
* Update integrations/mistral/tests/test_ocr_document_converter.py Co-authored-by: Stefano Fiorucci <stefanofiorucci@gmail.com>
* Update integrations/mistral/tests/test_ocr_document_converter.py Co-authored-by: Stefano Fiorucci <stefanofiorucci@gmail.com>
* add: less test code due to pytest.mark.parametrize
* add: less tests and const class type
* fix: format
* add: ocr document converter to docusaurus
* add: converter to mistral

---------

Co-authored-by: Stefano Fiorucci <stefanofiorucci@gmail.com>
1 parent e8e0ebc commit 890eb5c

10 files changed: +1119 -5 lines changed

.github/workflows/CI_license_compliance.yml

Lines changed: 2 additions & 1 deletion
@@ -13,13 +13,14 @@ on:
 env:
   CORE_DATADOG_API_KEY: ${{ secrets.CORE_DATADOG_API_KEY }}
   PYTHON_VERSION: "3.10"
-  EXCLUDE_PACKAGES: "(?i)^(azure-identity|fastembed|ragas|tqdm|psycopg).*"
+  EXCLUDE_PACKAGES: "(?i)^(azure-identity|fastembed|ragas|tqdm|psycopg|mistralai).*"
 
   # Exclusions must be explicitly motivated
   #
   # - azure-identity is MIT but the license is not available on PyPI
   # - fastembed is Apache 2.0 but the license on PyPI is unclear ("Other/Proprietary License (Apache License)")
   # - ragas is Apache 2.0 but the license is not available on PyPI
+  # - mistralai is Apache 2.0 but the license is not available on PyPI
 
   # - tqdm is MLP but there are no better alternatives
   # - psycopg is LGPL-3.0 but FOSSA is fine with it
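
The effect of adding `mistralai` to the pattern can be checked locally. The sketch below assumes the workflow applies `EXCLUDE_PACKAGES` as a regular expression against package names (the diff does not show how the variable is consumed); `is_excluded` is a hypothetical helper written for illustration.

```python
import re

# Pattern copied from the updated EXCLUDE_PACKAGES value in the workflow.
EXCLUDE_PACKAGES = r"(?i)^(azure-identity|fastembed|ragas|tqdm|psycopg|mistralai).*"


def is_excluded(package_name: str) -> bool:
    """Return True if the package name matches the exclusion pattern.

    Hypothetical helper: assumes the pattern is matched against package names.
    """
    # re.match only matches at the start of the string, mirroring the leading ^.
    return re.match(EXCLUDE_PACKAGES, package_name) is not None


for name in ["mistralai", "MistralAI", "psycopg-binary", "haystack-ai"]:
    status = "excluded" if is_excluded(name) else "checked"
    print(f"{name}: {status}")
```

Because of the `(?i)` flag and the trailing `.*`, the pattern is case-insensitive and matches by prefix, so any package whose name merely starts with `mistralai` would also be skipped.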

README.md

Lines changed: 1 addition & 1 deletion
@@ -46,7 +46,7 @@ Please check out our [Contribution Guidelines](CONTRIBUTING.md) for all the deta
 | [llama-stack-haystack](integrations/llama_stack/) | Generator | [![PyPI - Version](https://img.shields.io/pypi/v/llama-stack-haystack.svg?color=orange)](https://pypi.org/project/llama-stack-haystack) | [![Test / llama-stack](https://github.com/deepset-ai/haystack-core-integrations/actions/workflows/llama_stack.yml/badge.svg)](https://github.com/deepset-ai/haystack-core-integrations/actions/workflows/llama_stack.yml) |
 | [mcp-haystack](integrations/mcp/) | Tool | [![PyPI - Version](https://img.shields.io/pypi/v/mcp-haystack.svg?color=orange)](https://pypi.org/project/mcp-haystack) | [![Test / mcp](https://github.com/deepset-ai/haystack-core-integrations/actions/workflows/mcp.yml/badge.svg)](https://github.com/deepset-ai/haystack-core-integrations/actions/workflows/mcp.yml) |
 | [meta-llama-haystack](integrations/meta_llama/) | Generator | [![PyPI - Version](https://img.shields.io/pypi/v/meta-llama-haystack.svg?color=orange)](https://pypi.org/project/meta-llama-haystack) | [![Test / meta_llama](https://github.com/deepset-ai/haystack-core-integrations/actions/workflows/meta_llama.yml/badge.svg)](https://github.com/deepset-ai/haystack-core-integrations/actions/workflows/meta_llama.yml) |
-| [mistral-haystack](integrations/mistral/) | Embedder, Generator | [![PyPI - Version](https://img.shields.io/pypi/v/mistral-haystack.svg)](https://pypi.org/project/mistral-haystack) | [![Test / mistral](https://github.com/deepset-ai/haystack-core-integrations/actions/workflows/mistral.yml/badge.svg)](https://github.com/deepset-ai/haystack-core-integrations/actions/workflows/mistral.yml) |
+| [mistral-haystack](integrations/mistral/) | Converter, Embedder, Generator | [![PyPI - Version](https://img.shields.io/pypi/v/mistral-haystack.svg)](https://pypi.org/project/mistral-haystack) | [![Test / mistral](https://github.com/deepset-ai/haystack-core-integrations/actions/workflows/mistral.yml/badge.svg)](https://github.com/deepset-ai/haystack-core-integrations/actions/workflows/mistral.yml) |
 | [mongodb-atlas-haystack](integrations/mongodb_atlas/) | Document Store | [![PyPI - Version](https://img.shields.io/pypi/v/mongodb-atlas-haystack.svg?color=orange)](https://pypi.org/project/mongodb-atlas-haystack) | [![Test / mongodb-atlas](https://github.com/deepset-ai/haystack-core-integrations/actions/workflows/mongodb_atlas.yml/badge.svg)](https://github.com/deepset-ai/haystack-core-integrations/actions/workflows/mongodb_atlas.yml) |
 | [nvidia-haystack](integrations/nvidia/) | Embedder, Generator, Ranker | [![PyPI - Version](https://img.shields.io/pypi/v/nvidia-haystack.svg?color=orange)](https://pypi.org/project/nvidia-haystack) | [![Test / nvidia](https://github.com/deepset-ai/haystack-core-integrations/actions/workflows/nvidia.yml/badge.svg)](https://github.com/deepset-ai/haystack-core-integrations/actions/workflows/nvidia.yml) |
 | [ollama-haystack](integrations/ollama/) | Embedder, Generator | [![PyPI - Version](https://img.shields.io/pypi/v/ollama-haystack.svg?color=orange)](https://pypi.org/project/ollama-haystack) | [![Test / ollama](https://github.com/deepset-ai/haystack-core-integrations/actions/workflows/ollama.yml/badge.svg)](https://github.com/deepset-ai/haystack-core-integrations/actions/workflows/ollama.yml) |

Lines changed: 87 additions & 0 deletions
@@ -0,0 +1,87 @@
+# To run this example, you will need to:
+# 1. Set a `MISTRAL_API_KEY` environment variable
+# 2. Place a PDF file named `sample.pdf` in the same directory as this script
+#
+# This example demonstrates OCR document processing with structured annotations,
+# embedding the extracted documents using Mistral embeddings, and storing them
+# in an InMemoryDocumentStore for later retrieval.
+#
+# You can customize the ImageAnnotation and DocumentAnnotation schemas below
+# to extract different structured information from your documents.
+
+from typing import List
+
+from haystack import Pipeline
+from haystack.components.writers import DocumentWriter
+from haystack.document_stores.in_memory import InMemoryDocumentStore
+from mistralai.models import DocumentURLChunk
+from pydantic import BaseModel, Field
+
+from haystack_integrations.components.converters.mistral.ocr_document_converter import (
+    MistralOCRDocumentConverter,
+)
+from haystack_integrations.components.embedders.mistral.document_embedder import (
+    MistralDocumentEmbedder,
+)
+
+
+# Define schema for structured image annotations (bbox)
+class ImageAnnotation(BaseModel):
+    image_type: str = Field(..., description="The type of image content")
+    description: str = Field(..., description="Brief description of the image")
+
+
+# Define schema for structured document annotations
+class DocumentAnnotation(BaseModel):
+    language: str = Field(..., description="Primary language of the document")
+    urls: List[str] = Field(..., description="URLs found in the document")
+    topics: List[str] = Field(..., description="Main topics covered in the document")
+
+
+# Initialize document store
+document_store = InMemoryDocumentStore()
+
+# Create indexing pipeline
+indexing_pipeline = Pipeline()
+
+# Add components to the pipeline
+indexing_pipeline.add_component(
+    "converter",
+    MistralOCRDocumentConverter(pages=[0, 1]),
+)
+indexing_pipeline.add_component(
+    "embedder",
+    MistralDocumentEmbedder(),
+)
+indexing_pipeline.add_component(
+    "writer",
+    DocumentWriter(document_store=document_store),
+)
+
+# Connect components
+indexing_pipeline.connect("converter.documents", "embedder.documents")
+indexing_pipeline.connect("embedder.documents", "writer.documents")
+
+# Prepare sources: URL and local file
+sources = [
+    DocumentURLChunk(document_url="https://arxiv.org/pdf/1706.03762"),
+    "./sample.pdf",  # Local PDF file
+]
+
+# Run the pipeline with annotation schemas
+result = indexing_pipeline.run(
+    {
+        "converter": {
+            "sources": sources,
+            "bbox_annotation_schema": ImageAnnotation,
+            "document_annotation_schema": DocumentAnnotation,
+        }
+    }
+)
+
+
+# Check out documents processed by OCR.
+# Optional with enriched content (from bbox annotation) and semantic meta data (from document annotation)
+documents = document_store.storage
+# Check out mistral api response for unprocessed data and with usage_info
+raw_mistral_response = result["converter"]["raw_mistral_response"]
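
The data flow that the two `connect` calls set up can be traced without any API key. The following is a dependency-free sketch, not the Haystack or Mistral implementation: `converter`, `embedder`, and `writer` are hypothetical stand-in functions, and the fake embedding is a placeholder. It only illustrates which output feeds which input, and why `raw_mistral_response` can still be read from the result under `"converter"` even though `documents` has been passed downstream.

```python
from typing import Any, Dict, List

# Hypothetical stand-ins for the three pipeline components.
# None of these functions call Mistral or Haystack.


def converter(sources: List[str]) -> Dict[str, Any]:
    """Pretend OCR step: one document and one raw response per source."""
    documents = [{"content": f"text of {s}", "meta": {"source": s}} for s in sources]
    raw = [{"source": s, "usage_info": {"pages_processed": 1}} for s in sources]
    return {"documents": documents, "raw_mistral_response": raw}


def embedder(documents: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Pretend embedding step: attach a placeholder one-dimensional vector."""
    return {"documents": [dict(d, embedding=[float(len(d["content"]))]) for d in documents]}


def writer(documents: List[Dict[str, Any]], store: List[Dict[str, Any]]) -> Dict[str, int]:
    """Pretend writer step: append documents to a list acting as the store."""
    store.extend(documents)
    return {"documents_written": len(documents)}


store: List[Dict[str, Any]] = []

# converter.documents -> embedder.documents -> writer.documents
converted = converter(["https://example.com/a.pdf", "./sample.pdf"])
embedded = embedder(converted["documents"])
written = writer(embedded["documents"], store)

# The raw response is not connected to any downstream input,
# so it remains available to the caller alongside the writer's count.
print(written)
print(len(converted["raw_mistral_response"]))
```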

integrations/mistral/pydoc/config.yml

Lines changed: 1 addition & 0 deletions
@@ -5,6 +5,7 @@ loaders:
       "haystack_integrations.components.embedders.mistral.document_embedder",
       "haystack_integrations.components.embedders.mistral.text_embedder",
       "haystack_integrations.components.generators.mistral.chat.chat_generator",
+      "haystack_integrations.components.converters.mistral.ocr_document_converter",
     ]
     ignore_when_discovered: ["__init__"]
 processors:

integrations/mistral/pydoc/config_docusaurus.yml

Lines changed: 1 addition & 0 deletions
@@ -5,6 +5,7 @@ loaders:
   - haystack_integrations.components.embedders.mistral.document_embedder
   - haystack_integrations.components.embedders.mistral.text_embedder
   - haystack_integrations.components.generators.mistral.chat.chat_generator
+  - haystack_integrations.components.converters.mistral.ocr_document_converter
   search_path:
   - ../src
   type: haystack_pydoc_tools.loaders.CustomPythonLoader

integrations/mistral/pyproject.toml

Lines changed: 4 additions & 3 deletions
@@ -23,7 +23,7 @@ classifiers = [
   "Programming Language :: Python :: Implementation :: CPython",
   "Programming Language :: Python :: Implementation :: PyPy",
 ]
-dependencies = ["haystack-ai>=2.19.0"]
+dependencies = ["haystack-ai>=2.19.0", "mistralai>=1.9.11"]
 
 [project.urls]
 Documentation = "https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/mistral#readme"
@@ -58,7 +58,7 @@ dependencies = [
   "pytest-rerunfailures",
   "mypy",
   "pip",
-  "pytz"
+  "pytz",
 ]
 
 [tool.hatch.envs.test.scripts]
@@ -68,7 +68,8 @@ all = 'pytest {args:tests}'
 cov-retry = 'all --cov=haystack_integrations --reruns 3 --reruns-delay 30 -x'
 
 types = """mypy -p haystack_integrations.components.embedders.mistral \
-    -p haystack_integrations.components.generators.mistral {args}"""
+    -p haystack_integrations.components.generators.mistral \
+    -p haystack_integrations.components.converters {args}"""
 
 [tool.mypy]
 install_types = true
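
The new `mistralai>=1.9.11` floor can be checked against a local environment with the standard library alone. This is a minimal sketch that assumes purely numeric dotted versions (it does not implement full PEP 440 parsing); `parse_version`, `meets_floor`, and `installed_version` are hypothetical helpers, not part of the integration.

```python
from importlib import metadata
from typing import Optional, Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Parse a purely numeric dotted version such as '1.9.11'.

    Raises ValueError for pre-release or local version suffixes.
    """
    return tuple(int(part) for part in version.split("."))


def meets_floor(installed: str, floor: str = "1.9.11") -> bool:
    """Compare versions numerically, component by component."""
    return parse_version(installed) >= parse_version(floor)


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


found = installed_version("mistralai")
if found is None:
    print("mistralai is not installed")
else:
    try:
        print(f"mistralai {found} meets the floor: {meets_floor(found)}")
    except ValueError:
        print(f"mistralai {found} has a non-numeric version; check it manually")
```

Comparing integer tuples rather than strings matters here: as plain strings, `"1.10.0"` sorts before `"1.9.11"`, which would wrongly reject a newer release.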
Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
+from .ocr_document_converter import MistralOCRDocumentConverter
+
+__all__ = ["MistralOCRDocumentConverter"]
