From 9dd0e0b26ffeb7eb15f12b69465fabb89319b3ca Mon Sep 17 00:00:00 2001 From: aakash Date: Mon, 10 Nov 2025 13:31:58 -0800 Subject: [PATCH 1/7] feat: Add ColQwen multimodal PDF retrieval integration MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - Add ColQwenRAG class with easy-to-use CLI for multimodal PDF retrieval - Support for both ColQwen2 and ColPali models with automatic device selection - MPS optimization for Apple Silicon with memory-efficient loading - Complete pipeline: PDFβ†’imagesβ†’embeddingsβ†’HNSW indexβ†’search - Multi-vector indexing for fine-grained document matching - Comprehensive user guide and reproduction test script - Resolves #119: ColQwen Doc and Support Management Features: - python -m apps.colqwen_rag build --pdfs ./pdfs/ --index my_index - python -m apps.colqwen_rag search my_index "query text" - python -m apps.colqwen_rag ask my_index --interactive - Automatic CPU fallback for memory constraints - Robust error handling and progress tracking --- COLQWEN_GUIDE.md | 200 +++++++++++++++++++ apps/colqwen_rag.py | 364 +++++++++++++++++++++++++++++++++++ test_colqwen_reproduction.py | 156 +++++++++++++++ 3 files changed, 720 insertions(+) create mode 100644 COLQWEN_GUIDE.md create mode 100644 apps/colqwen_rag.py create mode 100644 test_colqwen_reproduction.py diff --git a/COLQWEN_GUIDE.md b/COLQWEN_GUIDE.md new file mode 100644 index 00000000..42772f62 --- /dev/null +++ b/COLQWEN_GUIDE.md @@ -0,0 +1,200 @@ +# ColQwen Integration Guide + +Easy-to-use multimodal PDF retrieval with ColQwen2/ColPali models. + +## Quick Start + +> **🍎 Mac Users**: ColQwen is optimized for Apple Silicon with MPS acceleration for faster inference! + +### 1. Install Dependencies +```bash +uv pip install colpali_engine pdf2image pillow matplotlib qwen_vl_utils einops seaborn +brew install poppler # macOS only, for PDF processing +``` + +### 2. Basic Usage +```bash +# Build index from PDFs +python -m apps.colqwen_rag build --pdfs ./my_papers/ --index research_papers + +# Search with text queries +python -m apps.colqwen_rag search research_papers "How does attention mechanism work?" + +# Interactive Q&A +python -m apps.colqwen_rag ask research_papers --interactive +``` + +## Commands + +### Build Index +```bash +python -m apps.colqwen_rag build \ + --pdfs ./pdf_directory/ \ + --index my_index \ + --model colqwen2 \ + --pages-dir ./page_images/ # Optional: save page images +``` + +**Options:** +- `--pdfs`: Directory containing PDF files (or single PDF path) +- `--index`: Name for the index (required) +- `--model`: `colqwen2` (default) or `colpali` +- `--pages-dir`: Directory to save page images (optional) + +### Search Index +```bash +python -m apps.colqwen_rag search my_index "your question here" --top-k 5 +``` + +**Options:** +- `--top-k`: Number of results to return (default: 5) +- `--model`: Model used for search (should match build model) + +### Interactive Q&A +```bash +python -m apps.colqwen_rag ask my_index --interactive +``` + +**Commands in interactive mode:** +- Type your questions naturally +- `help`: Show available commands +- `quit`/`exit`/`q`: Exit interactive mode + +## πŸ§ͺ Test & Reproduce Results + +Run the reproduction test for issue #119: +```bash +python test_colqwen_reproduction.py +``` + +This will: +1. βœ… Check dependencies +2. πŸ“₯ Download sample PDF (Attention Is All You Need paper) +3. πŸ—οΈ Build test index +4. πŸ” Run sample queries +5. 
πŸ“Š Show how to generate similarity maps + +## 🎨 Advanced: Similarity Maps + +For visual similarity analysis, use the existing advanced script: +```bash +cd apps/multimodal/vision-based-pdf-multi-vector/ +python multi-vector-leann-similarity-map.py +``` + +Edit the script to customize: +- `QUERY`: Your question +- `MODEL`: "colqwen2" or "colpali" +- `USE_HF_DATASET`: Use HuggingFace dataset or local PDFs +- `SIMILARITY_MAP`: Generate heatmaps +- `ANSWER`: Enable Qwen-VL answer generation + +## πŸ”§ How It Works + +### ColQwen2 vs ColPali +- **ColQwen2** (`vidore/colqwen2-v1.0`): Latest vision-language model +- **ColPali** (`vidore/colpali-v1.2`): Proven multimodal retriever + +### Architecture +1. **PDF β†’ Images**: Convert PDF pages to images (150 DPI) +2. **Vision Encoding**: Process images with ColQwen2/ColPali +3. **Multi-Vector Index**: Build LEANN HNSW index with multiple embeddings per page +4. **Query Processing**: Encode text queries with same model +5. **Similarity Search**: Find most relevant pages/regions +6. **Visual Maps**: Generate attention heatmaps (optional) + +### Device Support +- **CUDA**: Best performance with GPU acceleration +- **MPS**: Apple Silicon Mac support +- **CPU**: Fallback for any system (slower) + +Auto-detection: CUDA > MPS > CPU + +## πŸ“Š Performance Tips + +### For Best Performance: +```bash +# Use ColQwen2 for latest features +--model colqwen2 + +# Save page images for reuse +--pages-dir ./cached_pages/ + +# Adjust batch size based on GPU memory +# (automatically handled) +``` + +### For Large Document Sets: +- Process PDFs in batches +- Use SSD storage for index files +- Consider using CUDA if available + +## πŸ”— Related Resources + +- **Fast-PLAID**: https://github.com/lightonai/fast-plaid +- **Pylate**: https://github.com/lightonai/pylate +- **ColBERT**: https://github.com/stanford-futuredata/ColBERT +- **ColPali Paper**: Vision-Language Models for Document Retrieval +- **Issue #119**: https://github.com/yichuan-w/LEANN/issues/119 + +## πŸ› Troubleshooting + +### PDF Conversion Issues (macOS) +```bash +# Install poppler +brew install poppler +which pdfinfo && pdfinfo -v +``` + +### Memory Issues +- Reduce batch size (automatically handled) +- Use CPU instead of GPU: `export CUDA_VISIBLE_DEVICES=""` +- Process fewer PDFs at once + +### Model Download Issues +- Ensure internet connection for first run +- Models are cached after first download +- Use HuggingFace mirrors if needed + +### Import Errors +```bash +# Ensure all dependencies installed +uv pip install colpali_engine pdf2image pillow matplotlib qwen_vl_utils einops seaborn + +# Check PyTorch installation +python -c "import torch; print(torch.__version__)" +``` + +## πŸ’‘ Examples + +### Research Paper Analysis +```bash +# Index your research papers +python -m apps.colqwen_rag build --pdfs ~/Papers/AI/ --index ai_papers + +# Ask research questions +python -m apps.colqwen_rag search ai_papers "What are the limitations of transformer models?" +python -m apps.colqwen_rag search ai_papers "How does BERT compare to GPT?" 
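+
+# Optional (example index name): build a second index with ColPali and query it the same way,
+# to compare the two supported models on your own papers
+python -m apps.colqwen_rag build --pdfs ~/Papers/AI/ --index ai_papers_colpali --model colpali
+python -m apps.colqwen_rag search ai_papers_colpali "What are the limitations of transformer models?" --model colpali --top-k 3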
+``` + +### Document Q&A +```bash +# Index business documents +python -m apps.colqwen_rag build --pdfs ~/Documents/Reports/ --index reports + +# Interactive analysis +python -m apps.colqwen_rag ask reports --interactive +``` + +### Visual Analysis +```bash +# Generate similarity maps for specific queries +cd apps/multimodal/vision-based-pdf-multi-vector/ +# Edit multi-vector-leann-similarity-map.py with your query +python multi-vector-leann-similarity-map.py +# Check ./figures/ for generated heatmaps +``` + +--- + +**🎯 This integration makes ColQwen as easy to use as other LEANN features while maintaining the full power of multimodal document understanding!** diff --git a/apps/colqwen_rag.py b/apps/colqwen_rag.py new file mode 100644 index 00000000..5c61487e --- /dev/null +++ b/apps/colqwen_rag.py @@ -0,0 +1,364 @@ +#!/usr/bin/env python3 +""" +ColQwen RAG - Easy-to-use multimodal PDF retrieval with ColQwen2/ColPali + +Usage: + python -m apps.colqwen_rag build --pdfs ./my_pdfs/ --index my_index + python -m apps.colqwen_rag search my_index "How does attention work?" + python -m apps.colqwen_rag ask my_index --interactive +""" + +import argparse +import os +import sys +from pathlib import Path +from typing import Optional, cast + +# Add LEANN packages to path +_repo_root = Path(__file__).resolve().parents[1] +_leann_core_src = _repo_root / "packages" / "leann-core" / "src" +_leann_hnsw_pkg = _repo_root / "packages" / "leann-backend-hnsw" +if str(_leann_core_src) not in sys.path: + sys.path.append(str(_leann_core_src)) +if str(_leann_hnsw_pkg) not in sys.path: + sys.path.append(str(_leann_hnsw_pkg)) + +import torch +from colpali_engine import ColPali, ColPaliProcessor, ColQwen2, ColQwen2Processor +from colpali_engine.utils.torch_utils import ListDataset +from pdf2image import convert_from_path +from PIL import Image +from torch.utils.data import DataLoader +from tqdm import tqdm + +# Import the existing multi-vector implementation +sys.path.append(str(_repo_root / "apps" / "multimodal" / "vision-based-pdf-multi-vector")) +from leann_multi_vector import LeannMultiVector + + +class ColQwenRAG: + """Easy-to-use ColQwen RAG system for multimodal PDF retrieval.""" + + def __init__(self, model_type: str = "colpali"): + """ + Initialize ColQwen RAG system. 
+ + Args: + model_type: "colqwen2" or "colpali" + """ + self.model_type = model_type + self.device = self._get_device() + # Use float32 on MPS to avoid memory issues, float16 on CUDA, bfloat16 on CPU + if self.device.type == "mps": + self.dtype = torch.float32 + elif self.device.type == "cuda": + self.dtype = torch.float16 + else: + self.dtype = torch.bfloat16 + + print(f"πŸš€ Initializing {model_type.upper()} on {self.device} with {self.dtype}") + + # Load model and processor with MPS-optimized settings + try: + if model_type == "colqwen2": + self.model_name = "vidore/colqwen2-v1.0" + if self.device.type == "mps": + # For MPS, load on CPU first then move to avoid memory allocation issues + self.model = ColQwen2.from_pretrained( + self.model_name, + torch_dtype=self.dtype, + device_map="cpu", + low_cpu_mem_usage=True, + ).eval() + self.model = self.model.to(self.device) + else: + self.model = ColQwen2.from_pretrained( + self.model_name, + torch_dtype=self.dtype, + device_map=self.device, + low_cpu_mem_usage=True, + ).eval() + self.processor = ColQwen2Processor.from_pretrained(self.model_name) + else: # colpali + self.model_name = "vidore/colpali-v1.2" + if self.device.type == "mps": + # For MPS, load on CPU first then move to avoid memory allocation issues + self.model = ColPali.from_pretrained( + self.model_name, + torch_dtype=self.dtype, + device_map="cpu", + low_cpu_mem_usage=True, + ).eval() + self.model = self.model.to(self.device) + else: + self.model = ColPali.from_pretrained( + self.model_name, + torch_dtype=self.dtype, + device_map=self.device, + low_cpu_mem_usage=True, + ).eval() + self.processor = ColPaliProcessor.from_pretrained(self.model_name) + except Exception as e: + if "memory" in str(e).lower() or "offload" in str(e).lower(): + print(f"⚠️ Memory constraint on {self.device}, using CPU with optimizations...") + self.device = torch.device("cpu") + self.dtype = torch.float32 + + if model_type == "colqwen2": + self.model = ColQwen2.from_pretrained( + self.model_name, + torch_dtype=self.dtype, + device_map="cpu", + low_cpu_mem_usage=True, + ).eval() + else: + self.model = ColPali.from_pretrained( + self.model_name, + torch_dtype=self.dtype, + device_map="cpu", + low_cpu_mem_usage=True, + ).eval() + else: + raise + + def _get_device(self): + """Auto-select best available device.""" + if torch.cuda.is_available(): + return torch.device("cuda") + elif hasattr(torch.backends, "mps") and torch.backends.mps.is_available(): + return torch.device("mps") + else: + return torch.device("cpu") + + def build_index(self, pdf_paths: list[str], index_name: str, pages_dir: Optional[str] = None): + """ + Build multimodal index from PDF files. 
+ + Args: + pdf_paths: List of PDF file paths + index_name: Name for the index + pages_dir: Directory to save page images (optional) + """ + print(f"Building index '{index_name}' from {len(pdf_paths)} PDFs...") + + # Convert PDFs to images + all_images = [] + all_metadata = [] + + if pages_dir: + os.makedirs(pages_dir, exist_ok=True) + + for pdf_path in tqdm(pdf_paths, desc="Converting PDFs"): + try: + images = convert_from_path(pdf_path, dpi=150) + pdf_name = Path(pdf_path).stem + + for i, image in enumerate(images): + # Save image if pages_dir specified + if pages_dir: + image_path = Path(pages_dir) / f"{pdf_name}_page_{i + 1}.png" + image.save(image_path) + + all_images.append(image) + all_metadata.append( + { + "pdf_path": pdf_path, + "pdf_name": pdf_name, + "page_number": i + 1, + "image_path": str(image_path) if pages_dir else None, + } + ) + + except Exception as e: + print(f"❌ Error processing {pdf_path}: {e}") + continue + + print(f"πŸ“„ Converted {len(all_images)} pages from {len(pdf_paths)} PDFs") + print(f"All metadata: {all_metadata}") + + # Generate embeddings + print("🧠 Generating embeddings...") + embeddings = self._embed_images(all_images) + + # Build LEANN index + print("πŸ” Building LEANN index...") + leann_mv = LeannMultiVector( + index_path=index_name, + dim=embeddings.shape[-1], + embedding_model_name=self.model_type, + ) + + # Create collection and insert data + leann_mv.create_collection() + for i, (embedding, metadata) in enumerate(zip(embeddings, all_metadata)): + data = { + "doc_id": i, + "filepath": metadata.get("image_path", ""), + "colbert_vecs": embedding.numpy(), # Convert tensor to numpy + } + leann_mv.insert(data) + + # Build the index + leann_mv.create_index() + print(f"βœ… Index '{index_name}' built successfully!") + + return leann_mv + + def search(self, index_name: str, query: str, top_k: int = 5): + """ + Search the index with a text query. + + Args: + index_name: Name of the index to search + query: Text query + top_k: Number of results to return + """ + print(f"πŸ” Searching '{index_name}' for: '{query}'") + + # Load index + leann_mv = LeannMultiVector( + index_path=index_name, + dim=128, # Will be updated when loading + embedding_model_name=self.model_type, + ) + + # Generate query embedding + query_embedding = self._embed_query(query) + + # Search (returns list of (score, doc_id) tuples) + search_results = leann_mv.search(query_embedding.numpy(), topk=top_k) + + # Display results + print(f"\nπŸ“‹ Top {len(search_results)} results:") + for i, (score, doc_id) in enumerate(search_results, 1): + # Get metadata for this doc_id (we need to load the metadata) + print(f"{i}. Score: {score:.3f} | Doc ID: {doc_id}") + + return search_results + + def ask(self, index_name: str, interactive: bool = False): + """ + Interactive Q&A with the indexed documents. 
+ + Args: + index_name: Name of the index to query + interactive: Whether to run in interactive mode + """ + print(f"πŸ’¬ ColQwen Chat with '{index_name}'") + + if interactive: + print("Type 'quit' to exit, 'help' for commands") + while True: + try: + query = input("\nπŸ€” Your question: ").strip() + if query.lower() in ["quit", "exit", "q"]: + break + elif query.lower() == "help": + print("Commands: quit/exit/q (exit), help (this message)") + continue + elif not query: + continue + + results = self.search(index_name, query, top_k=3) + + # TODO: Add answer generation with Qwen-VL + print("\nπŸ’‘ For detailed answers, we can integrate Qwen-VL here!") + + except KeyboardInterrupt: + print("\nπŸ‘‹ Goodbye!") + break + else: + query = input("πŸ€” Your question: ").strip() + if query: + self.search(index_name, query) + + def _embed_images(self, images: list[Image.Image]) -> torch.Tensor: + """Generate embeddings for a list of images.""" + dataset = ListDataset(images) + dataloader = DataLoader(dataset, batch_size=1, shuffle=False, collate_fn=lambda x: x) + + embeddings = [] + with torch.no_grad(): + for batch in tqdm(dataloader, desc="Embedding images"): + batch_images = cast(list, batch) + batch_inputs = self.processor.process_images(batch_images).to(self.device) + batch_embeddings = self.model(**batch_inputs) + embeddings.append(batch_embeddings.cpu()) + + return torch.cat(embeddings, dim=0) + + def _embed_query(self, query: str) -> torch.Tensor: + """Generate embedding for a text query.""" + with torch.no_grad(): + query_inputs = self.processor.process_queries([query]).to(self.device) + query_embedding = self.model(**query_inputs) + return query_embedding.cpu() + + +def main(): + parser = argparse.ArgumentParser(description="ColQwen RAG - Easy multimodal PDF retrieval") + subparsers = parser.add_subparsers(dest="command", help="Available commands") + + # Build command + build_parser = subparsers.add_parser("build", help="Build index from PDFs") + build_parser.add_argument("--pdfs", required=True, help="Directory containing PDF files") + build_parser.add_argument("--index", required=True, help="Index name") + build_parser.add_argument( + "--model", choices=["colqwen2", "colpali"], default="colqwen2", help="Model to use" + ) + build_parser.add_argument("--pages-dir", help="Directory to save page images") + + # Search command + search_parser = subparsers.add_parser("search", help="Search the index") + search_parser.add_argument("index", help="Index name") + search_parser.add_argument("query", help="Search query") + search_parser.add_argument("--top-k", type=int, default=5, help="Number of results") + search_parser.add_argument( + "--model", choices=["colqwen2", "colpali"], default="colqwen2", help="Model to use" + ) + + # Ask command + ask_parser = subparsers.add_parser("ask", help="Interactive Q&A") + ask_parser.add_argument("index", help="Index name") + ask_parser.add_argument("--interactive", action="store_true", help="Interactive mode") + ask_parser.add_argument( + "--model", choices=["colqwen2", "colpali"], default="colqwen2", help="Model to use" + ) + + args = parser.parse_args() + + if not args.command: + parser.print_help() + return + + # Initialize ColQwen RAG + if args.command == "build": + colqwen = ColQwenRAG(args.model) + + # Get PDF files + pdf_dir = Path(args.pdfs) + if pdf_dir.is_file() and pdf_dir.suffix.lower() == ".pdf": + pdf_paths = [str(pdf_dir)] + elif pdf_dir.is_dir(): + pdf_paths = [str(p) for p in pdf_dir.glob("*.pdf")] + else: + print(f"❌ Invalid PDF path: 
{args.pdfs}") + return + + if not pdf_paths: + print(f"❌ No PDF files found in {args.pdfs}") + return + + colqwen.build_index(pdf_paths, args.index, args.pages_dir) + + elif args.command == "search": + colqwen = ColQwenRAG(args.model) + colqwen.search(args.index, args.query, args.top_k) + + elif args.command == "ask": + colqwen = ColQwenRAG(args.model) + colqwen.ask(args.index, args.interactive) + + +if __name__ == "__main__": + main() diff --git a/test_colqwen_reproduction.py b/test_colqwen_reproduction.py new file mode 100644 index 00000000..2af8d9c8 --- /dev/null +++ b/test_colqwen_reproduction.py @@ -0,0 +1,156 @@ +#!/usr/bin/env python3 +""" +Test script to reproduce ColQwen results from issue #119 +https://github.com/yichuan-w/LEANN/issues/119 + +This script demonstrates the ColQwen workflow: +1. Download sample PDF +2. Convert to images +3. Build multimodal index +4. Run test queries +5. Generate similarity maps +""" + +import os +from pathlib import Path + + +def main(): + print("πŸ§ͺ ColQwen Reproduction Test - Issue #119") + print("=" * 50) + + # Check if we're in the right directory + repo_root = Path.cwd() + if not (repo_root / "apps" / "colqwen_rag.py").exists(): + print("❌ Please run this script from the LEANN repository root") + print(" cd /path/to/LEANN && python test_colqwen_reproduction.py") + return + + print("βœ… Repository structure looks good") + + # Step 1: Check dependencies + print("\nπŸ“¦ Checking dependencies...") + try: + import pdf2image + import torch + from colpali_engine.models import ColQwen2 + + print("βœ… Core dependencies available") + print(f" - PyTorch: {torch.__version__}") + print(f" - CUDA available: {torch.cuda.is_available()}") + print( + f" - MPS available: {hasattr(torch.backends, 'mps') and torch.backends.mps.is_available()}" + ) + except ImportError as e: + print(f"❌ Missing dependency: {e}") + print("\nπŸ“₯ Install missing dependencies:") + print( + " uv pip install colpali_engine pdf2image pillow matplotlib qwen_vl_utils einops seaborn" + ) + return + + # Step 2: Download sample PDF + print("\nπŸ“„ Setting up sample PDF...") + pdf_dir = repo_root / "test_pdfs" + pdf_dir.mkdir(exist_ok=True) + sample_pdf = pdf_dir / "attention_paper.pdf" + + if not sample_pdf.exists(): + print("πŸ“₯ Downloading sample paper (Attention Is All You Need)...") + import urllib.request + + try: + urllib.request.urlretrieve("https://arxiv.org/pdf/1706.03762.pdf", sample_pdf) + print(f"βœ… Downloaded: {sample_pdf}") + except Exception as e: + print(f"❌ Download failed: {e}") + print(" Please manually download a PDF to test_pdfs/attention_paper.pdf") + return + else: + print(f"βœ… Using existing PDF: {sample_pdf}") + + # Step 3: Test ColQwen RAG + print("\nπŸš€ Testing ColQwen RAG...") + + # Build index + print("\n1️⃣ Building multimodal index...") + build_cmd = f"python -m apps.colqwen_rag build --pdfs {pdf_dir} --index test_attention --model colqwen2 --pages-dir test_pages" + print(f" Command: {build_cmd}") + + try: + result = os.system(build_cmd) + if result == 0: + print("βœ… Index built successfully!") + else: + print("❌ Index building failed") + return + except Exception as e: + print(f"❌ Error building index: {e}") + return + + # Test search + print("\n2️⃣ Testing search...") + test_queries = [ + "How does attention mechanism work?", + "What is the transformer architecture?", + "How do you compute self-attention?", + ] + + for query in test_queries: + print(f"\nπŸ” Query: '{query}'") + search_cmd = f'python -m apps.colqwen_rag search test_attention "{query}" 
--top-k 3' + print(f" Command: {search_cmd}") + + try: + result = os.system(search_cmd) + if result == 0: + print("βœ… Search completed") + else: + print("❌ Search failed") + except Exception as e: + print(f"❌ Search error: {e}") + + # Test interactive mode (briefly) + print("\n3️⃣ Testing interactive mode...") + print(" You can test interactive mode with:") + print(" python -m apps.colqwen_rag ask test_attention --interactive") + + # Step 4: Test similarity maps (using existing script) + print("\n4️⃣ Testing similarity maps...") + similarity_script = ( + repo_root + / "apps" + / "multimodal" + / "vision-based-pdf-multi-vector" + / "multi-vector-leann-similarity-map.py" + ) + + if similarity_script.exists(): + print(" You can generate similarity maps with:") + print(f" cd {similarity_script.parent}") + print(" python multi-vector-leann-similarity-map.py") + print(" (Edit the script to use your local PDF)") + + print("\nπŸŽ‰ ColQwen reproduction test completed!") + print("\nπŸ“‹ Summary:") + print(" βœ… Dependencies checked") + print(" βœ… Sample PDF prepared") + print(" βœ… Index building tested") + print(" βœ… Search functionality tested") + print(" βœ… Interactive mode available") + print(" βœ… Similarity maps available") + + print("\nπŸ”— Related repositories to check:") + print(" - https://github.com/lightonai/fast-plaid") + print(" - https://github.com/lightonai/pylate") + print(" - https://github.com/stanford-futuredata/ColBERT") + + print("\nπŸ“ Next steps:") + print(" 1. Test with your own PDFs") + print(" 2. Experiment with different queries") + print(" 3. Generate similarity maps for visual analysis") + print(" 4. Compare ColQwen2 vs ColPali performance") + + +if __name__ == "__main__": + main() From 9b7353f33676869a156a46c303110e5cf47e08d4 Mon Sep 17 00:00:00 2001 From: aakash Date: Tue, 11 Nov 2025 05:12:49 -0800 Subject: [PATCH 2/7] Fix linting errors in colqwen_rag.py and test_colqwen_reproduction.py - Add noqa comments for E402 errors (imports after sys.path modifications) - Remove unused variable assignment in colqwen_rag.py - Use importlib.util.find_spec for dependency checks instead of unused imports - Fix import ordering in test_colqwen_reproduction.py --- README.md | 2 +- apps/colqwen_rag.py | 18 +++++++++--------- packages/leann-backend-hnsw/third_party/faiss | 2 +- test_colqwen_reproduction.py | 10 ++++++++-- 4 files changed, 19 insertions(+), 13 deletions(-) diff --git a/README.md b/README.md index 841485b4..0b0e59f4 100755 --- a/README.md +++ b/README.md @@ -24,7 +24,7 @@ LEANN is an innovative vector database that democratizes personal AI. Transform LEANN achieves this through *graph-based selective recomputation* with *high-degree preserving pruning*, computing embeddings on-demand instead of storing them all. 
[Illustration Fig β†’](#️-architecture--how-it-works) | [Paper β†’](https://arxiv.org/abs/2506.08276) -**Ready to RAG Everything?** Transform your laptop into a personal AI assistant that can semantic search your **[file system](#-personal-data-manager-process-any-documents-pdf-txt-md)**, **[emails](#-your-personal-email-secretary-rag-on-apple-mail)**, **[browser history](#-time-machine-for-the-web-rag-your-entire-browser-history)**, **[chat history](#-wechat-detective-unlock-your-golden-memories)** ([WeChat](#-wechat-detective-unlock-your-golden-memories), [iMessage](#-imessage-history-your-personal-conversation-archive)), **[agent memory](#-chatgpt-chat-history-your-personal-ai-conversation-archive)** ([ChatGPT](#-chatgpt-chat-history-your-personal-ai-conversation-archive), [Claude](#-claude-chat-history-your-personal-ai-conversation-archive)), **[live data](#mcp-integration-rag-on-live-data-from-any-platform)** ([Slack](#mcp-integration-rag-on-live-data-from-any-platform), [Twitter](#mcp-integration-rag-on-live-data-from-any-platform)), **[codebase](#-claude-code-integration-transform-your-development-workflow)**\* , or external knowledge bases (i.e., 60M documents) - all on your laptop, with zero cloud costs and complete privacy. +**Ready to RAG Everything?** Transform your laptop into a personal AI assistant that can semantic search your **[file system](#-personal-data-manager-process-any-documents-pdf-txt-md)**, **[emails](#-your-personal-email-secretary-rag-on-apple-mail)**, **[browser history](#-time-machine-for-the-web-rag-your-entire-browser-history)**, **[chat history](#-wechat-detective-unlock-your-golden-memories)** ([WeChat](#-wechat-detective-unlock-your-golden-memories), [iMessage](#-imessage-history-your-personal-conversation-archive)), **[agent memory](#-chatgpt-chat-history-your-personal-ai-conversation-archive)** ([ChatGPT](#-chatgpt-chat-history-your-personal-ai-conversation-archive), [Claude](#-claude-chat-history-your-personal-ai-conversation-archive)), **[live data](#mcp-integration-rag-on-live-data-from-any-platform)** ([Slack](#slack-messages-search-your-team-conversations), [Twitter](#-twitter-bookmarks-your-personal-tweet-library)), **[codebase](#-claude-code-integration-transform-your-development-workflow)**\* , or external knowledge bases (i.e., 60M documents) - all on your laptop, with zero cloud costs and complete privacy. \* Claude Code only supports basic `grep`-style keyword search. **LEANN** is a drop-in **semantic search MCP service fully compatible with Claude Code**, unlocking intelligent retrieval without changing your workflow. 
πŸ”₯ Check out [the easy setup β†’](packages/leann-mcp/README.md) diff --git a/apps/colqwen_rag.py b/apps/colqwen_rag.py index 5c61487e..a30058f0 100644 --- a/apps/colqwen_rag.py +++ b/apps/colqwen_rag.py @@ -23,17 +23,17 @@ if str(_leann_hnsw_pkg) not in sys.path: sys.path.append(str(_leann_hnsw_pkg)) -import torch -from colpali_engine import ColPali, ColPaliProcessor, ColQwen2, ColQwen2Processor -from colpali_engine.utils.torch_utils import ListDataset -from pdf2image import convert_from_path -from PIL import Image -from torch.utils.data import DataLoader -from tqdm import tqdm +import torch # noqa: E402 +from colpali_engine import ColPali, ColPaliProcessor, ColQwen2, ColQwen2Processor # noqa: E402 +from colpali_engine.utils.torch_utils import ListDataset # noqa: E402 +from pdf2image import convert_from_path # noqa: E402 +from PIL import Image # noqa: E402 +from torch.utils.data import DataLoader # noqa: E402 +from tqdm import tqdm # noqa: E402 # Import the existing multi-vector implementation sys.path.append(str(_repo_root / "apps" / "multimodal" / "vision-based-pdf-multi-vector")) -from leann_multi_vector import LeannMultiVector +from leann_multi_vector import LeannMultiVector # noqa: E402 class ColQwenRAG: @@ -259,7 +259,7 @@ def ask(self, index_name: str, interactive: bool = False): elif not query: continue - results = self.search(index_name, query, top_k=3) + self.search(index_name, query, top_k=3) # TODO: Add answer generation with Qwen-VL print("\nπŸ’‘ For detailed answers, we can integrate Qwen-VL here!") diff --git a/packages/leann-backend-hnsw/third_party/faiss b/packages/leann-backend-hnsw/third_party/faiss index e2d243c4..59527452 160000 --- a/packages/leann-backend-hnsw/third_party/faiss +++ b/packages/leann-backend-hnsw/third_party/faiss @@ -1 +1 @@ -Subproject commit e2d243c40ddc142b8c57c067c0441694f3c22121 +Subproject commit 595274523790e3bb5991437c3fc6032f170ebad9 diff --git a/test_colqwen_reproduction.py b/test_colqwen_reproduction.py index 2af8d9c8..1e38d304 100644 --- a/test_colqwen_reproduction.py +++ b/test_colqwen_reproduction.py @@ -11,6 +11,7 @@ 5. 
Generate similarity maps """ +import importlib.util import os from pathlib import Path @@ -31,9 +32,14 @@ def main(): # Step 1: Check dependencies print("\nπŸ“¦ Checking dependencies...") try: - import pdf2image import torch - from colpali_engine.models import ColQwen2 + + # Check if pdf2image is available + if importlib.util.find_spec("pdf2image") is None: + raise ImportError("pdf2image not found") + # Check if colpali_engine is available + if importlib.util.find_spec("colpali_engine") is None: + raise ImportError("colpali_engine not found") print("βœ… Core dependencies available") print(f" - PyTorch: {torch.__version__}") From 13beb98164cbf16cf43277fa1f2f14b8c73ad622 Mon Sep 17 00:00:00 2001 From: aakash Date: Mon, 17 Nov 2025 13:52:44 -0800 Subject: [PATCH 3/7] Add CLIP-based image RAG application - Add apps/image_rag.py for indexing and searching images using CLIP embeddings - Supports text-based image search queries - Uses CLIP ViT-L/14 model via sentence-transformers - Follows the same pattern as other RAG apps in the apps directory - Addresses feature request for CLIP support in apps (issue #94) --- apps/image_rag.py | 218 ++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 218 insertions(+) create mode 100644 apps/image_rag.py diff --git a/apps/image_rag.py b/apps/image_rag.py new file mode 100644 index 00000000..4c33b691 --- /dev/null +++ b/apps/image_rag.py @@ -0,0 +1,218 @@ +#!/usr/bin/env python3 +""" +CLIP Image RAG Application + +This application enables RAG (Retrieval-Augmented Generation) on images using CLIP embeddings. +You can index a directory of images and search them using text queries. + +Usage: + python -m apps.image_rag --image-dir ./my_images/ --query "a sunset over mountains" + python -m apps.image_rag --image-dir ./my_images/ --interactive +""" + +import argparse +import pickle +import tempfile +from pathlib import Path + +import numpy as np +from PIL import Image +from sentence_transformers import SentenceTransformer +from tqdm import tqdm + +from apps.base_rag_example import BaseRAGExample + + +class ImageRAG(BaseRAGExample): + """ + RAG application for images using CLIP embeddings. + + This class provides a complete RAG pipeline for image data, including + CLIP embedding generation, indexing, and text-based image search. 
+ """ + + def __init__(self): + super().__init__( + name="Image RAG", + description="RAG application for images using CLIP embeddings", + default_index_name="image_index", + ) + # Override default embedding model to use CLIP + self.embedding_model_default = "clip-ViT-L-14" + self.embedding_mode_default = "sentence-transformers" + self._image_data: list[dict] = [] + + def _add_specific_arguments(self, parser: argparse.ArgumentParser): + """Add image-specific arguments.""" + image_group = parser.add_argument_group("Image Parameters") + image_group.add_argument( + "--image-dir", + type=str, + required=True, + help="Directory containing images to index", + ) + image_group.add_argument( + "--image-extensions", + type=str, + nargs="+", + default=[".jpg", ".jpeg", ".png", ".gif", ".bmp", ".webp"], + help="Image file extensions to process (default: .jpg .jpeg .png .gif .bmp .webp)", + ) + image_group.add_argument( + "--batch-size", + type=int, + default=32, + help="Batch size for CLIP embedding generation (default: 32)", + ) + + async def load_data(self, args) -> list[str]: + """Load images, generate CLIP embeddings, and return text descriptions.""" + self._image_data = self._load_images_and_embeddings(args) + return [entry["text"] for entry in self._image_data] + + def _load_images_and_embeddings(self, args) -> list[dict]: + """Helper to process images and produce embeddings/metadata.""" + image_dir = Path(args.image_dir) + if not image_dir.exists(): + raise ValueError(f"Image directory does not exist: {image_dir}") + + print(f"πŸ“Έ Loading images from {image_dir}...") + + # Find all image files + image_files = [] + for ext in args.image_extensions: + image_files.extend(image_dir.rglob(f"*{ext}")) + image_files.extend(image_dir.rglob(f"*{ext.upper()}")) + + if not image_files: + raise ValueError( + f"No images found in {image_dir} with extensions {args.image_extensions}" + ) + + print(f"βœ… Found {len(image_files)} images") + + # Limit if max_items is set + if args.max_items > 0: + image_files = image_files[: args.max_items] + print(f"πŸ“Š Processing {len(image_files)} images (limited by --max-items)") + + # Load CLIP model + print("πŸ” Loading CLIP model...") + model = SentenceTransformer(self.embedding_model_default) + + # Process images and generate embeddings + print("πŸ–ΌοΈ Processing images and generating embeddings...") + image_data = [] + batch_images = [] + batch_paths = [] + + for image_path in tqdm(image_files, desc="Processing images"): + try: + image = Image.open(image_path).convert("RGB") + batch_images.append(image) + batch_paths.append(image_path) + + # Process in batches + if len(batch_images) >= args.batch_size: + embeddings = model.encode( + batch_images, + convert_to_numpy=True, + normalize_embeddings=True, + batch_size=args.batch_size, + show_progress_bar=False, + ) + + for img_path, embedding in zip(batch_paths, embeddings): + image_data.append( + { + "text": f"Image: {img_path.name}\nPath: {img_path}", + "metadata": { + "image_path": str(img_path), + "image_name": img_path.name, + "image_dir": str(image_dir), + }, + "embedding": embedding.astype(np.float32), + } + ) + + batch_images = [] + batch_paths = [] + + except Exception as e: + print(f"⚠️ Failed to process {image_path}: {e}") + continue + + # Process remaining images + if batch_images: + embeddings = model.encode( + batch_images, + convert_to_numpy=True, + normalize_embeddings=True, + batch_size=len(batch_images), + show_progress_bar=False, + ) + + for img_path, embedding in zip(batch_paths, embeddings): + 
image_data.append( + { + "text": f"Image: {img_path.name}\nPath: {img_path}", + "metadata": { + "image_path": str(img_path), + "image_name": img_path.name, + "image_dir": str(image_dir), + }, + "embedding": embedding.astype(np.float32), + } + ) + + print(f"βœ… Processed {len(image_data)} images") + return image_data + + async def build_index(self, args, texts: list[str]) -> str: + """Build index using pre-computed CLIP embeddings.""" + from leann.api import LeannBuilder + + if not self._image_data or len(self._image_data) != len(texts): + raise RuntimeError("No image data found. Make sure load_data() ran successfully.") + + print("πŸ”¨ Building LEANN index with CLIP embeddings...") + builder = LeannBuilder( + backend_name=args.backend_name, + embedding_model=self.embedding_model_default, + embedding_mode=self.embedding_mode_default, + is_recompute=False, + distance_metric="cosine", + graph_degree=args.graph_degree, + build_complexity=args.build_complexity, + is_compact=not args.no_compact, + ) + + for text, data in zip(texts, self._image_data): + builder.add_text(text=text, metadata=data["metadata"]) + + ids = [str(i) for i in range(len(self._image_data))] + embeddings = np.array([data["embedding"] for data in self._image_data], dtype=np.float32) + + with tempfile.NamedTemporaryFile(mode="wb", suffix=".pkl", delete=False) as f: + pickle.dump((ids, embeddings), f) + pkl_path = f.name + + try: + index_path = str(Path(args.index_dir) / f"{self.default_index_name}.leann") + builder.build_index_from_embeddings(index_path, pkl_path) + print(f"βœ… Index built successfully at {index_path}") + return index_path + finally: + Path(pkl_path).unlink() + + +def main(): + """Main entry point for the image RAG application.""" + import asyncio + + app = ImageRAG() + asyncio.run(app.run()) + + +if __name__ == "__main__": + main() From 86287d88326a3ae96b98509b4960d4ea2f8057d6 Mon Sep 17 00:00:00 2001 From: aakash Date: Wed, 3 Dec 2025 18:32:04 -0800 Subject: [PATCH 4/7] Revert unnecessary faiss submodule update Reset faiss submodule to match main branch to avoid unnecessary changes --- packages/leann-backend-hnsw/third_party/faiss | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/packages/leann-backend-hnsw/third_party/faiss b/packages/leann-backend-hnsw/third_party/faiss index 59527452..e2d243c4 160000 --- a/packages/leann-backend-hnsw/third_party/faiss +++ b/packages/leann-backend-hnsw/third_party/faiss @@ -1 +1 @@ -Subproject commit 595274523790e3bb5991437c3fc6032f170ebad9 +Subproject commit e2d243c40ddc142b8c57c067c0441694f3c22121 From f13bd02fbd66348522782a114665ed5667e4dba4 Mon Sep 17 00:00:00 2001 From: aakash Date: Sat, 6 Dec 2025 03:28:08 -0800 Subject: [PATCH 5/7] docs: Add ColQwen multimodal PDF retrieval to README Add brief introduction and usage guide for ColQwen integration, similar to other RAG application sections in the README. - Quick start examples for building, searching, and interactive Q&A - Setup instructions with prerequisites - Model options (ColQwen2 vs ColPali) - Link to detailed ColQwen guide --- README.md | 48 ++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 48 insertions(+) diff --git a/README.md b/README.md index 0b0e59f4..4bc36277 100755 --- a/README.md +++ b/README.md @@ -379,6 +379,54 @@ python -m apps.code_rag --repo-dir "./my_codebase" --query "How does authenticat +### 🎨 ColQwen: Multimodal PDF Retrieval with Vision-Language Models + +Search through PDFs using both text and visual understanding with ColQwen2/ColPali models. 
Perfect for research papers, technical documents, and any PDFs with complex layouts, figures, or diagrams. + +> **🍎 Mac Users**: ColQwen is optimized for Apple Silicon with MPS acceleration for faster inference! + +```bash +# Build index from PDFs +python -m apps.colqwen_rag build --pdfs ./my_papers/ --index research_papers + +# Search with text queries +python -m apps.colqwen_rag search research_papers "How does attention mechanism work?" + +# Interactive Q&A +python -m apps.colqwen_rag ask research_papers --interactive +``` + +
+πŸ“‹ Click to expand: ColQwen Setup & Usage + +#### Prerequisites +```bash +# Install dependencies +uv pip install colpali_engine pdf2image pillow matplotlib qwen_vl_utils einops seaborn +brew install poppler # macOS only, for PDF processing +``` + +#### Build Index +```bash +python -m apps.colqwen_rag build \ + --pdfs ./pdf_directory/ \ + --index my_index \ + --model colqwen2 # or colpali +``` + +#### Search +```bash +python -m apps.colqwen_rag search my_index "your question here" --top-k 5 +``` + +#### Models +- **ColQwen2** (`colqwen2`): Latest vision-language model with improved performance +- **ColPali** (`colpali`): Proven multimodal retriever + +For detailed usage, see the [ColQwen Guide](COLQWEN_GUIDE.md). + +
+ ### πŸ“§ Your Personal Email Secretary: RAG on Apple Mail! > **Note:** The examples below currently support macOS only. Windows support coming soon. From af47dfdde7699b68381751edda698dcf56041436 Mon Sep 17 00:00:00 2001 From: aakash Date: Sat, 6 Dec 2025 03:33:02 -0800 Subject: [PATCH 6/7] fix: Update ColQwen guide link to docs/ directory --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index 4bc36277..e549739f 100755 --- a/README.md +++ b/README.md @@ -423,7 +423,7 @@ python -m apps.colqwen_rag search my_index "your question here" --top-k 5 - **ColQwen2** (`colqwen2`): Latest vision-language model with improved performance - **ColPali** (`colpali`): Proven multimodal retriever -For detailed usage, see the [ColQwen Guide](COLQWEN_GUIDE.md). +For detailed usage, see the [ColQwen Guide](docs/COLQWEN_GUIDE.md). From 0175bc9c20c4ddbfe0d28430aff97d0a13e07d84 Mon Sep 17 00:00:00 2001 From: aakash Date: Sun, 7 Dec 2025 09:57:14 -0800 Subject: [PATCH 7/7] docs: Add ColQwen guide to docs directory Add COLQWEN_GUIDE.md to docs/ directory for proper documentation structure. This file is referenced in the README and needs to be tracked in git. --- docs/COLQWEN_GUIDE.md | 200 ++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 200 insertions(+) create mode 100644 docs/COLQWEN_GUIDE.md diff --git a/docs/COLQWEN_GUIDE.md b/docs/COLQWEN_GUIDE.md new file mode 100644 index 00000000..42772f62 --- /dev/null +++ b/docs/COLQWEN_GUIDE.md @@ -0,0 +1,200 @@ +# ColQwen Integration Guide + +Easy-to-use multimodal PDF retrieval with ColQwen2/ColPali models. + +## Quick Start + +> **🍎 Mac Users**: ColQwen is optimized for Apple Silicon with MPS acceleration for faster inference! + +### 1. Install Dependencies +```bash +uv pip install colpali_engine pdf2image pillow matplotlib qwen_vl_utils einops seaborn +brew install poppler # macOS only, for PDF processing +``` + +### 2. Basic Usage +```bash +# Build index from PDFs +python -m apps.colqwen_rag build --pdfs ./my_papers/ --index research_papers + +# Search with text queries +python -m apps.colqwen_rag search research_papers "How does attention mechanism work?" + +# Interactive Q&A +python -m apps.colqwen_rag ask research_papers --interactive +``` + +## Commands + +### Build Index +```bash +python -m apps.colqwen_rag build \ + --pdfs ./pdf_directory/ \ + --index my_index \ + --model colqwen2 \ + --pages-dir ./page_images/ # Optional: save page images +``` + +**Options:** +- `--pdfs`: Directory containing PDF files (or single PDF path) +- `--index`: Name for the index (required) +- `--model`: `colqwen2` (default) or `colpali` +- `--pages-dir`: Directory to save page images (optional) + +### Search Index +```bash +python -m apps.colqwen_rag search my_index "your question here" --top-k 5 +``` + +**Options:** +- `--top-k`: Number of results to return (default: 5) +- `--model`: Model used for search (should match build model) + +### Interactive Q&A +```bash +python -m apps.colqwen_rag ask my_index --interactive +``` + +**Commands in interactive mode:** +- Type your questions naturally +- `help`: Show available commands +- `quit`/`exit`/`q`: Exit interactive mode + +## πŸ§ͺ Test & Reproduce Results + +Run the reproduction test for issue #119: +```bash +python test_colqwen_reproduction.py +``` + +This will: +1. βœ… Check dependencies +2. πŸ“₯ Download sample PDF (Attention Is All You Need paper) +3. πŸ—οΈ Build test index +4. πŸ” Run sample queries +5. 
πŸ“Š Show how to generate similarity maps + +## 🎨 Advanced: Similarity Maps + +For visual similarity analysis, use the existing advanced script: +```bash +cd apps/multimodal/vision-based-pdf-multi-vector/ +python multi-vector-leann-similarity-map.py +``` + +Edit the script to customize: +- `QUERY`: Your question +- `MODEL`: "colqwen2" or "colpali" +- `USE_HF_DATASET`: Use HuggingFace dataset or local PDFs +- `SIMILARITY_MAP`: Generate heatmaps +- `ANSWER`: Enable Qwen-VL answer generation + +## πŸ”§ How It Works + +### ColQwen2 vs ColPali +- **ColQwen2** (`vidore/colqwen2-v1.0`): Latest vision-language model +- **ColPali** (`vidore/colpali-v1.2`): Proven multimodal retriever + +### Architecture +1. **PDF β†’ Images**: Convert PDF pages to images (150 DPI) +2. **Vision Encoding**: Process images with ColQwen2/ColPali +3. **Multi-Vector Index**: Build LEANN HNSW index with multiple embeddings per page +4. **Query Processing**: Encode text queries with same model +5. **Similarity Search**: Find most relevant pages/regions +6. **Visual Maps**: Generate attention heatmaps (optional) + +### Device Support +- **CUDA**: Best performance with GPU acceleration +- **MPS**: Apple Silicon Mac support +- **CPU**: Fallback for any system (slower) + +Auto-detection: CUDA > MPS > CPU + +## πŸ“Š Performance Tips + +### For Best Performance: +```bash +# Use ColQwen2 for latest features +--model colqwen2 + +# Save page images for reuse +--pages-dir ./cached_pages/ + +# Adjust batch size based on GPU memory +# (automatically handled) +``` + +### For Large Document Sets: +- Process PDFs in batches +- Use SSD storage for index files +- Consider using CUDA if available + +## πŸ”— Related Resources + +- **Fast-PLAID**: https://github.com/lightonai/fast-plaid +- **Pylate**: https://github.com/lightonai/pylate +- **ColBERT**: https://github.com/stanford-futuredata/ColBERT +- **ColPali Paper**: Vision-Language Models for Document Retrieval +- **Issue #119**: https://github.com/yichuan-w/LEANN/issues/119 + +## πŸ› Troubleshooting + +### PDF Conversion Issues (macOS) +```bash +# Install poppler +brew install poppler +which pdfinfo && pdfinfo -v +``` + +### Memory Issues +- Reduce batch size (automatically handled) +- Use CPU instead of GPU: `export CUDA_VISIBLE_DEVICES=""` +- Process fewer PDFs at once + +### Model Download Issues +- Ensure internet connection for first run +- Models are cached after first download +- Use HuggingFace mirrors if needed + +### Import Errors +```bash +# Ensure all dependencies installed +uv pip install colpali_engine pdf2image pillow matplotlib qwen_vl_utils einops seaborn + +# Check PyTorch installation +python -c "import torch; print(torch.__version__)" +``` + +## πŸ’‘ Examples + +### Research Paper Analysis +```bash +# Index your research papers +python -m apps.colqwen_rag build --pdfs ~/Papers/AI/ --index ai_papers + +# Ask research questions +python -m apps.colqwen_rag search ai_papers "What are the limitations of transformer models?" +python -m apps.colqwen_rag search ai_papers "How does BERT compare to GPT?" 
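+
+# Optional (example index name): build a second index with ColPali and query it the same way,
+# to compare the two supported models on your own papers
+python -m apps.colqwen_rag build --pdfs ~/Papers/AI/ --index ai_papers_colpali --model colpali
+python -m apps.colqwen_rag search ai_papers_colpali "What are the limitations of transformer models?" --model colpali --top-k 3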
+``` + +### Document Q&A +```bash +# Index business documents +python -m apps.colqwen_rag build --pdfs ~/Documents/Reports/ --index reports + +# Interactive analysis +python -m apps.colqwen_rag ask reports --interactive +``` + +### Visual Analysis +```bash +# Generate similarity maps for specific queries +cd apps/multimodal/vision-based-pdf-multi-vector/ +# Edit multi-vector-leann-similarity-map.py with your query +python multi-vector-leann-similarity-map.py +# Check ./figures/ for generated heatmaps +``` + +--- + +**🎯 This integration makes ColQwen as easy to use as other LEANN features while maintaining the full power of multimodal document understanding!**
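+
+## 🧠 Appendix: How Multi-Vector Scoring Works (Sketch)
+
+The "Multi-Vector Index" and "Similarity Search" steps above compare many vectors per query (one per query token) against many vectors per page (one per image patch), rather than a single vector each. The snippet below is a minimal, illustrative sketch of the ColBERT-style MaxSim scoring used by late-interaction retrievers such as ColPali/ColQwen2 β€” the function name and tensor shapes are assumptions for illustration, not LEANN's actual API:
+
+```python
+import torch
+
+
+def maxsim_score(query_vecs: torch.Tensor, page_vecs: torch.Tensor) -> float:
+    """Late-interaction score between one query and one page.
+
+    query_vecs: (num_query_tokens, dim) multi-vector query embedding
+    page_vecs:  (num_page_patches, dim) multi-vector page embedding
+    Shapes are illustrative; embeddings are assumed to be L2-normalized.
+    """
+    sim = query_vecs @ page_vecs.T  # token-vs-patch similarity matrix
+    # Each query token keeps only its best-matching patch; summing gives the page score
+    return sim.max(dim=1).values.sum().item()
+```
+
+Pages are ranked by this score, which is why a single page can match a query on several distinct regions at once.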