ColPali leverages VLMs to construct efficient multi-vector embeddings in the visual space for document retrieval. By feeding the ViT output patches from PaliGemma-3B to a linear projection, ColPali create a multi-vector representation of documents. The model is trained to maximize the similarity between these document embeddings and the query embeddings, following the ColBERT method.
Using ColPali removes the need for potentially complex and brittle layout recognition and OCR pipelines with a single model that can take into account both the textual and visual content (layout, charts, ...) of a document.
This is a BentoML example project, demonstrating how to build a ColPali inference API server for ColPali. See here for a full list of BentoML example projects.
Note
The recommended ColPali checkpoint for this repository is vidore/colpali-v1.2
.
Fore more information on ColPali, please refer to:
- The original ColPali arXiv paper: ColPali: Efficient Document Retrieval with Vision Language Models 📝
- The official ColPali blog post: HuggingFace Blog 🤗
- The code/package for ColPali: colpali-engine. 🧑🏻💻
git clone https://github.com/bentoml/BentoColPali.git
cd BentoColPali
# Supports Python 3.9+
pip install -r requirements.txt
Before running the BentoML service, you need to download the ColPali model checkpoint and build the model using the following command:
python bentocolpali/models.py --model-name vidore/colpali-v1.2 --hf-token <YOUR_TOKEN>
Important
Because ColPali uses the PaliGemma (Gemma-licensed) as its VLM backbone, the account associated to the input HuggingFace token must have accepted the terms and conditions of google/paligemma-3b-mix-448
.
We have defined a BentoML Service in service.py
. Run bentoml serve
in your project directory to start the Service.
bentoml serve .
The Service is accessible at http://localhost:3000. You can interact with it using the Swagger UI or in other different ways detailed in the Examples section.
Route | Input | Output | Description |
---|---|---|---|
/embed_images |
- items : List of ImagePayload |
Multi-vector embeddings | Generates image embeddings with shape (batch_size, sequence_length, embedding_dim). |
/embed_queries |
- items : List of strings |
Multi-vector embeddings | Generates query embeddings with shape (batch_size, sequence_length, embedding_dim). |
/score_embeddings |
- image_embeddings : List of 2D-arrays- query_embeddings : List of 2D-arrays |
Scores | Computes late-interaction/MaxSim scores between pre-computed embeddings. Returns scores with shape (num_queries, num_images). |
/score |
- images : List of ImagePayload - queries : List of strings |
Scores | One-shot computation of similarity scores between images and queries, i.e. run the 3 routes above in the right order. Returns scores with shape (num_queries, num_images). |
An ImagePayload
is a JSON object with a single field url
that contains a base64-encoded image. The url
field should be formatted like this:
{
"url": "data:image/png;base64,iVBORw0KGgoAAAANSUhEU..."
}
import bentoml
from PIL import Image
from bentocolpali.interfaces import ImagePayload
from bentocolpali.utils import convert_pil_to_b64_image
image_filepaths = ["page_1.jpg", "page_2.jpg"]
image_payloads = []
for filepath in image_filepaths:
image = Image.open(filepath)
image_payloads.append(ImagePayload(url=convert_pil_to_b64_image(image)))
queries = [
"How does the positional encoding work?",
"How does the scaled dot attention product work?",
]
with bentoml.SyncHTTPClient("http://localhost:3000") as client:
image_embeddings = client.embed_images(items=image_payloads)
query_embeddings = client.embed_queries(items=queries)
scores = client.score_embeddings(
image_embeddings=image_embeddings,
query_embeddings=query_embeddings,
)
print("Scores:", scores)
You should get a response similar to:
{
"score": [
[15.25727272, 6.47964382],
[11.67781448, 16.54862022]
]
}
Note: the strings in the base_64
fields are dummy examples.
curl -X POST -H "content-type: application/json" --data '{
"queries": [
"How does the positional encoding work?",
"How does the scaled dot attention product work?"
],
"images": [
{
"url": "data:image/png;base64,iVBORw0KGgoAAAANSUhEU..."
},
{
"url": "data:image/png;base64,iVBORw0KGFEWAAAANSUhU..."
}
]
}' http://localhost:3000/score
After the Service is ready, you can deploy the application to BentoCloud for better management and scalability. Sign up if you haven't got a BentoCloud account.
Make sure you have logged in to BentoCloud, then run the following command to deploy it.
bentoml deploy bento
Once the application is up and running on BentoCloud, you can access it via the exposed URL.
Note: For custom deployment in your own infrastructure, use BentoML to generate an OCI-compliant image.
ColPali: Efficient Document Retrieval with Vision Language Models
Authors: Manuel Faysse*, Hugues Sibille*, Tony Wu*, Bilel Omrani, Gautier Viaud, Céline Hudelot, Pierre Colombo (* denotes equal contribution)
@misc{faysse2024colpaliefficientdocumentretrieval,
title={ColPali: Efficient Document Retrieval with Vision Language Models},
author={Manuel Faysse and Hugues Sibille and Tony Wu and Bilel Omrani and Gautier Viaud and Céline Hudelot and Pierre Colombo},
year={2024},
eprint={2407.01449},
archivePrefix={arXiv},
primaryClass={cs.IR},
url={https://arxiv.org/abs/2407.01449},
}