An implementation of "M3DOCRAG: Multi-modal Retrieval is What You Need for Multi-page Multi-document Understanding" by Jaemin Cho, Debanjan Mahata, Ozan Irsoy, Yujie He, and Mohit Bansal (UNC Chapel Hill & Bloomberg).
Document visual question answering (DocVQA) pipelines, which answer questions posed over documents, have broad applications. Existing methods either handle single-page documents with multi-modal language models (MLMs) or rely on text-based retrieval-augmented generation (RAG) built on text extraction tools such as optical character recognition (OCR). Both approaches struggle in real-world scenarios: (a) questions often require information spread across different pages or documents, which is more context than an MLM can handle; and (b) documents often carry important information in visual elements such as figures, which text extraction tools ignore.
This implementation provides:
- Multi-modal RAG framework for document understanding
- Support for both closed-domain and open-domain settings
- Efficient handling of multi-page documents
- Preservation of visual information in documents
- 🔍 Multi-modal document retrieval using ColPali
- 🤖 Visual question answering using Qwen2-VL
- 📄 Support for multi-page PDF documents
- 💾 Efficient FAISS indexing for fast retrieval
- 🎯 Optimized for multi-GPU environments
- 💡 Interactive command-line interface
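At a high level, the pieces fit together as in the sketch below. The `retriever` and `qa_model` wrappers are illustrative names used only for this overview, not classes exposed by this repository; each component is described in its own section further down.

```python
# Conceptual flow: PDF pages -> ColPali embeddings -> FAISS -> top-K pages -> Qwen2-VL.
# `retriever` and `qa_model` are hypothetical wrappers, not the repo's actual classes.
from pdf2image import convert_from_path

def answer_question(pdf_paths, question, retriever, qa_model, top_k=4):
    # 1. Render every page of every PDF as an image.
    pages = [img for path in pdf_paths for img in convert_from_path(path)]
    # 2. Embed the pages with ColPali and index the embeddings in FAISS.
    retriever.index(pages)
    # 3. Retrieve the top-K pages most relevant to the question.
    top_pages = retriever.search(question, k=top_k)
    # 4. Let Qwen2-VL answer from the retrieved page images.
    return qa_model.answer(question, top_pages)
```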
- Create a virtual environment:
python -m venv venv
source venv/bin/activate # Linux/Mac
# or
.\venv\Scripts\activate # Windows
- Install dependencies:
pip install -r requirements.txt
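The authoritative dependency list is the repo's requirements.txt; the names below are only a plausible sketch of what a ColPali + Qwen2-VL + FAISS stack pulls in, with version pins omitted.

```text
# Illustrative only -- see requirements.txt for the real, pinned list.
torch             # model inference
transformers      # Qwen2-VL model and processors
colpali-engine    # ColPali retrieval model
faiss-gpu         # vector index (faiss-cpu works without CUDA)
pdf2image         # PDF -> page images (needs poppler, installed below)
pillow            # image handling
optimum           # GPTQ loading support for the quantized QA checkpoint
```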
- Install system dependencies for PDF handling:
# Ubuntu/Debian
sudo apt-get install poppler-utils
# macOS
brew install poppler
# Windows
# Download and install poppler from: http://blog.alivate.com.au/poppler-windows/
The system uses the following models:
- Retrieval: vidore/colpaligemma-3b-mix-448-base with the vidore/colpali adapter
- QA: Qwen/Qwen2-VL-7B-Instruct-GPTQ-Int4 (quantized for memory efficiency)
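The exact calls in m3-doc-rag.py may differ (import paths have shifted across colpali-engine releases), but loading these two checkpoints generally looks like the following sketch:

```python
import torch
from colpali_engine.models import ColPali
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

# Retrieval: ColPali base weights plus the LoRA adapter, pinned to GPU 0.
retriever = ColPali.from_pretrained(
    "vidore/colpaligemma-3b-mix-448-base",
    torch_dtype=torch.bfloat16,
    device_map="cuda:0",
).eval()
retriever.load_adapter("vidore/colpali")
retrieval_processor = AutoProcessor.from_pretrained("vidore/colpali")

# QA: GPTQ Int4 quantized Qwen2-VL, pinned to GPU 1.
qa_model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct-GPTQ-Int4",
    torch_dtype="auto",
    device_map="cuda:1",
).eval()
qa_processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct-GPTQ-Int4")
```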
- Start the interactive shell:
python m3-doc-rag.py
- Initialize the system:
(M3DOCRAG) init
- Add PDF documents:
(M3DOCRAG) add path/to/document.pdf
- Build the search index:
(M3DOCRAG) build
- Ask questions:
(M3DOCRAG) ask "What is the commercial franchising program?"
- List loaded documents:
(M3DOCRAG) list
- Exit the system:
(M3DOCRAG) exit
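The "(M3DOCRAG)" prompt suggests a shell in the style of Python's built-in cmd module; a stripped-down skeleton with the same command names is sketched below. The handlers are stubs for illustration only; the real logic lives in m3-doc-rag.py.

```python
import cmd

class M3DocRagShell(cmd.Cmd):
    """Minimal illustration of the interactive commands listed above."""
    prompt = "(M3DOCRAG) "

    def do_init(self, arg):
        """Load the retrieval and QA models."""
        print("models loaded")

    def do_add(self, path):
        """Add a PDF document by path."""
        print(f"added {path}")

    def do_build(self, arg):
        """Build the search index over all added pages."""
        print("index built")

    def do_ask(self, question):
        """Answer a question over the indexed documents."""
        print(f"answering: {question}")

    def do_list(self, arg):
        """List loaded documents."""
        print("no documents loaded")

    def do_exit(self, arg):
        """Exit the shell."""
        return True

if __name__ == "__main__":
    M3DocRagShell().cmdloop()
```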
- CUDA-capable GPU with at least 16GB VRAM (recommended)
- 16GB+ RAM
- Python 3.8+
- Storage space for models and document index
The system is configured to use multiple GPUs efficiently:
- GPU 0: ColPali retrieval model
- GPU 1: Qwen2-VL QA model
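A minimal sketch of that placement, assuming the constant names below (they are illustrative, not taken from the code):

```python
import torch

# Pin each model to its own GPU; fall back gracefully when fewer GPUs exist.
RETRIEVAL_DEVICE = "cuda:0" if torch.cuda.is_available() else "cpu"          # ColPali
QA_DEVICE = "cuda:1" if torch.cuda.device_count() > 1 else RETRIEVAL_DEVICE  # Qwen2-VL
```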
Memory optimization features:
- Quantized QA model (GPTQ Int4)
- Batch size optimization
- Aggressive cache clearing
- Memory-efficient attention
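The quantized checkpoint is a load-time choice (see the models section above); the remaining items are runtime habits. Helpers of the kind involved look roughly like this (names are illustrative):

```python
import gc
import torch

def clear_gpu_cache():
    """Aggressively release cached GPU memory between retrieval and QA steps."""
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()

def batched(items, batch_size=4):
    """Yield small batches of pages so embedding/generation stays within VRAM."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]
```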
- Document Processing
  - PDF to image conversion
  - Page-level processing
  - Multi-modal content handling
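Page rendering with pdf2image reduces to something like the sketch below (function name and DPI are illustrative):

```python
from pdf2image import convert_from_path  # requires poppler, see the installation notes

def pdf_to_pages(pdf_path, dpi=144):
    """Render every page of a PDF as a PIL image, keeping its page index."""
    images = convert_from_path(pdf_path, dpi=dpi)
    return list(enumerate(images))
```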
- Retrieval System
  - ColPali-based visual-text embeddings
  - FAISS indexing for efficient search
  - Approximate/exact index options
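ColPali emits multiple embedding vectors per page; once those are pooled or flattened into a float32 matrix, the exact-versus-approximate choice maps onto FAISS roughly as follows (the nlist value of 256 is illustrative):

```python
import faiss
import numpy as np

def build_index(embeddings: np.ndarray, exact: bool = True) -> faiss.Index:
    """Index a float32 matrix of embeddings with inner-product similarity.

    exact=True  -> brute-force IndexFlatIP (exhaustive search)
    exact=False -> approximate IVF index, faster on large page collections
    """
    dim = embeddings.shape[1]
    if exact:
        index = faiss.IndexFlatIP(dim)
    else:
        quantizer = faiss.IndexFlatIP(dim)
        index = faiss.IndexIVFFlat(quantizer, dim, 256, faiss.METRIC_INNER_PRODUCT)
        index.train(embeddings)
    index.add(embeddings)
    return index
```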
- Question Answering
  - Qwen2-VL visual language model
  - Memory-efficient processing
  - Batch-wise page handling
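Answer generation over the retrieved page images follows the usual Qwen2-VL chat-template pattern; a hedged sketch (the real code handles batching and prompt details differently):

```python
def answer(question, page_images, qa_model, qa_processor, max_new_tokens=128):
    """Ask Qwen2-VL a question over the retrieved page images (PIL images)."""
    content = [{"type": "image"} for _ in page_images]
    content.append({"type": "text", "text": question})
    messages = [{"role": "user", "content": content}]
    prompt = qa_processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = qa_processor(text=[prompt], images=page_images, return_tensors="pt").to(qa_model.device)
    output_ids = qa_model.generate(**inputs, max_new_tokens=max_new_tokens)
    new_tokens = output_ids[:, inputs.input_ids.shape[1]:]  # strip the prompt tokens
    return qa_processor.batch_decode(new_tokens, skip_special_tokens=True)[0]
```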
If you use this code in your research, please cite:
@article{cho2024m3docrag,
  title={M3DOCRAG: Multi-modal Retrieval is What You Need for Multi-page Multi-document Understanding},
  author={Jaemin Cho and Debanjan Mahata and Ozan Irsoy and Yujie He and Mohit Bansal},
  journal={arXiv preprint arXiv:2411.04952},
  year={2024}
}
This is an unofficial implementation of the M3DOCRAG paper. The original paper was authored by researchers from UNC Chapel Hill and Bloomberg. This implementation uses:
- ColPali for multi-modal retrieval
- Qwen2-VL for visual question answering
- FAISS for efficient similarity search
- pdf2image for PDF processing
- Original Paper Authors:
- Jaemin Cho (UNC Chapel Hill)
- Debanjan Mahata (Bloomberg)
- Ozan Irsoy (Bloomberg)
- Yujie He (Bloomberg)
- Mohit Bansal (UNC Chapel Hill)
- Open-source communities behind the various libraries used in this implementation