An implementation of "M3DOCRAG: Multi-modal Retrieval is What You Need for Multi-page Multi-document Understanding" by Jaemin Cho, Debanjan Mahata, Ozan Irsoy, Yujie He, and Mohit Bansal (UNC Chapel Hill & Bloomberg).
Document visual question answering (DocVQA) pipelines, which answer questions posed over documents, have broad applications. Existing methods either handle single-page documents with multi-modal language models (MLMs) or rely on text-based retrieval-augmented generation (RAG) built on text extraction tools such as optical character recognition (OCR). Both approaches struggle in real-world scenarios: (a) questions often require information spread across different pages or documents, which is more context than an MLM can handle; and (b) documents often carry important information in visual elements such as figures, which text extraction tools ignore.
This implementation provides:
- Multi-modal RAG framework for document understanding
- Support for both closed-domain and open-domain settings
- Efficient handling of multi-page documents
- Preservation of visual information in documents
- 🔍 Multi-modal document retrieval using ColPali
- 🤖 Visual question answering using Qwen2-VL
- 📄 Support for multi-page PDF documents
- 💾 Efficient FAISS indexing for fast retrieval
- 🎯 Optimized for multi-GPU environments
- 💡 Interactive command-line interface
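At a high level, the pieces fit together as in the sketch below. The `retriever` and `qa_model` wrappers are illustrative names used only for this overview, not classes exposed by this repository; each component is described in its own section further down.

```python
# Conceptual flow: PDF pages -> ColPali embeddings -> FAISS -> top-K pages -> Qwen2-VL.
# `retriever` and `qa_model` are hypothetical wrappers, not the repo's actual classes.
from pdf2image import convert_from_path

def answer_question(pdf_paths, question, retriever, qa_model, top_k=4):
    # 1. Render every page of every PDF as an image.
    pages = [img for path in pdf_paths for img in convert_from_path(path)]
    # 2. Embed the pages with ColPali and index the embeddings in FAISS.
    retriever.index(pages)
    # 3. Retrieve the top-K pages most relevant to the question.
    top_pages = retriever.search(question, k=top_k)
    # 4. Let Qwen2-VL answer from the retrieved page images.
    return qa_model.answer(question, top_pages)
```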
- Create a virtual environment:
python -m venv venv
source venv/bin/activate # Linux/Mac
# or
.\venv\Scripts\activate # Windows
- Install dependencies:
pip install -r requirements.txt
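The authoritative dependency list is the repo's requirements.txt; the names below are only a plausible sketch of what a ColPali + Qwen2-VL + FAISS stack pulls in, with version pins omitted.

```text
# Illustrative only -- see requirements.txt for the real, pinned list.
torch             # model inference
transformers      # Qwen2-VL model and processors
colpali-engine    # ColPali retrieval model
faiss-gpu         # vector index (faiss-cpu works without CUDA)
pdf2image         # PDF -> page images (needs poppler, installed below)
pillow            # image handling
optimum           # GPTQ loading support for the quantized QA checkpoint
```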
- Install system dependencies for PDF handling:
# Ubuntu/Debian
sudo apt-get install poppler-utils
# macOS
brew install poppler
# Windows
# Download and install poppler from: http://blog.alivate.com.au/poppler-windows/
The system uses the following models:
- Retrieval: vidore/colpaligemma-3b-mix-448-base with the vidore/colpali adapter
- QA: Qwen/Qwen2-VL-7B-Instruct-GPTQ-Int4 (quantized for memory efficiency)
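The exact calls in m3-doc-rag.py may differ (import paths have shifted across colpali-engine releases), but loading these two checkpoints generally looks like the following sketch:

```python
import torch
from colpali_engine.models import ColPali
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

# Retrieval: ColPali base weights plus the LoRA adapter, pinned to GPU 0.
retriever = ColPali.from_pretrained(
    "vidore/colpaligemma-3b-mix-448-base",
    torch_dtype=torch.bfloat16,
    device_map="cuda:0",
).eval()
retriever.load_adapter("vidore/colpali")
retrieval_processor = AutoProcessor.from_pretrained("vidore/colpali")

# QA: GPTQ Int4 quantized Qwen2-VL, pinned to GPU 1.
qa_model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct-GPTQ-Int4",
    torch_dtype="auto",
    device_map="cuda:1",
).eval()
qa_processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct-GPTQ-Int4")
```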
- Start the interactive shell:
python m3-doc-rag.py
- Initialize the system:
(M3DOCRAG) init
- Add PDF documents:
(M3DOCRAG) add path/to/document.pdf
- Build the search index:
(M3DOCRAG) build
- Ask questions:
(M3DOCRAG) ask "What is the commercial franchising program?"
- List loaded documents:
(M3DOCRAG) list
- Exit the system:
(M3DOCRAG) exit
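The "(M3DOCRAG)" prompt suggests a shell in the style of Python's built-in cmd module; a stripped-down skeleton with the same command names is sketched below. The handlers are stubs for illustration only; the real logic lives in m3-doc-rag.py.

```python
import cmd

class M3DocRagShell(cmd.Cmd):
    """Minimal illustration of the interactive commands listed above."""
    prompt = "(M3DOCRAG) "

    def do_init(self, arg):
        """Load the retrieval and QA models."""
        print("models loaded")

    def do_add(self, path):
        """Add a PDF document by path."""
        print(f"added {path}")

    def do_build(self, arg):
        """Build the search index over all added pages."""
        print("index built")

    def do_ask(self, question):
        """Answer a question over the indexed documents."""
        print(f"answering: {question}")

    def do_list(self, arg):
        """List loaded documents."""
        print("no documents loaded")

    def do_exit(self, arg):
        """Exit the shell."""
        return True

if __name__ == "__main__":
    M3DocRagShell().cmdloop()
```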
- CUDA-capable GPU with at least 16GB VRAM (recommended)
- 16GB+ RAM
- Python 3.8+
- Storage space for models and document index
The system is configured to use multiple GPUs efficiently:
- GPU 0: ColPali retrieval model
- GPU 1: Qwen2-VL QA model
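A minimal sketch of that placement, assuming the constant names below (they are illustrative, not taken from the code):

```python
import torch

# Pin each model to its own GPU; fall back gracefully when fewer GPUs exist.
RETRIEVAL_DEVICE = "cuda:0" if torch.cuda.is_available() else "cpu"          # ColPali
QA_DEVICE = "cuda:1" if torch.cuda.device_count() > 1 else RETRIEVAL_DEVICE  # Qwen2-VL
```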
Memory optimization features:
- Quantized QA model (GPTQ Int4)
- Batch size optimization
- Aggressive cache clearing
- Memory-efficient attention
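The quantized checkpoint is a load-time choice (see the models section above); the remaining items are runtime habits. Helpers of the kind involved look roughly like this (names are illustrative):

```python
import gc
import torch

def clear_gpu_cache():
    """Aggressively release cached GPU memory between retrieval and QA steps."""
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()

def batched(items, batch_size=4):
    """Yield small batches of pages so embedding/generation stays within VRAM."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]
```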
- Document Processing
  - PDF to image conversion
  - Page-level processing
  - Multi-modal content handling
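Page rendering with pdf2image reduces to something like the sketch below (function name and DPI are illustrative):

```python
from pdf2image import convert_from_path  # requires poppler, see the installation notes

def pdf_to_pages(pdf_path, dpi=144):
    """Render every page of a PDF as a PIL image, keeping its page index."""
    images = convert_from_path(pdf_path, dpi=dpi)
    return list(enumerate(images))
```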
- Retrieval System
  - ColPali-based visual-text embeddings
  - FAISS indexing for efficient search
  - Approximate/exact index options
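ColPali emits multiple embedding vectors per page; once those are pooled or flattened into a float32 matrix, the exact-versus-approximate choice maps onto FAISS roughly as follows (the nlist value of 256 is illustrative):

```python
import faiss
import numpy as np

def build_index(embeddings: np.ndarray, exact: bool = True) -> faiss.Index:
    """Index a float32 matrix of embeddings with inner-product similarity.

    exact=True  -> brute-force IndexFlatIP (exhaustive search)
    exact=False -> approximate IVF index, faster on large page collections
    """
    dim = embeddings.shape[1]
    if exact:
        index = faiss.IndexFlatIP(dim)
    else:
        quantizer = faiss.IndexFlatIP(dim)
        index = faiss.IndexIVFFlat(quantizer, dim, 256, faiss.METRIC_INNER_PRODUCT)
        index.train(embeddings)
    index.add(embeddings)
    return index
```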
- Question Answering
  - Qwen2-VL visual language model
  - Memory-efficient processing
  - Batch-wise page handling
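Answer generation over the retrieved page images follows the usual Qwen2-VL chat-template pattern; a hedged sketch (the real code handles batching and prompt details differently):

```python
def answer(question, page_images, qa_model, qa_processor, max_new_tokens=128):
    """Ask Qwen2-VL a question over the retrieved page images (PIL images)."""
    content = [{"type": "image"} for _ in page_images]
    content.append({"type": "text", "text": question})
    messages = [{"role": "user", "content": content}]
    prompt = qa_processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = qa_processor(text=[prompt], images=page_images, return_tensors="pt").to(qa_model.device)
    output_ids = qa_model.generate(**inputs, max_new_tokens=max_new_tokens)
    new_tokens = output_ids[:, inputs.input_ids.shape[1]:]  # strip the prompt tokens
    return qa_processor.batch_decode(new_tokens, skip_special_tokens=True)[0]
```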
If you use this code in your research, please cite:
@article{cho2024m3docrag,
  title={M3DOCRAG: Multi-modal Retrieval is What You Need for Multi-page Multi-document Understanding},
  author={Jaemin Cho and Debanjan Mahata and Ozan Irsoy and Yujie He and Mohit Bansal},
  journal={arXiv preprint arXiv:2411.04952},
  year={2024}
}
This is an unofficial implementation of the M3DOCRAG paper. The original paper was authored by researchers from UNC Chapel Hill and Bloomberg. This implementation uses:
- ColPali for multi-modal retrieval
- Qwen2-VL for visual question answering
- FAISS for efficient similarity search
- pdf2image for PDF processing
- Original Paper Authors:
- Jaemin Cho (UNC Chapel Hill)
- Debanjan Mahata (Bloomberg)
- Ozan Irsoy (Bloomberg)
- Yujie He (Bloomberg)
- Mohit Bansal (UNC Chapel Hill)
- Open-source communities behind the various libraries used in this implementation