Fix/pdf duplicate processing 175 #178

ASuresh0524 · 2025-11-26T01:24:40Z

Fix: Prevent duplicate PDF processing when using --file-types .pdf

Fixes #175

Problem

When --file-types .pdf is specified, PDFs were being processed twice:

Separately with PyMuPDF/pdfplumber extractors (lines 1118-1163)
Again in the 'other file types' section via SimpleDirectoryReader (line 1184)

This caused duplicate processing and potential conflicts.

Solution

Exclude .pdf from other_file_extensions when PDFs are already processed separately
Only load other file types if there are extensions to process
Prevents duplicate PDF processing

Changes

Added logic to filter out .pdf from code_extensions when loading other file types if PDFs were processed separately
Updated SimpleDirectoryReader to use filtered extensions
Added check to skip loading if no other extensions to process
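
The filtering described above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the helper names `filter_other_extensions` and `should_load_other_files` and the `pdfs_processed_separately` flag are hypothetical, and only `code_extensions` and `other_file_extensions` are names taken from the description.

```python
from typing import Iterable, List


def filter_other_extensions(
    code_extensions: Iterable[str],
    pdfs_processed_separately: bool,
) -> List[str]:
    """Return the extensions the generic loader should still handle.

    Hypothetical helper: drops ".pdf" only when PDFs were already
    handled by the dedicated PDF extractors.
    """
    extensions = list(code_extensions)
    if pdfs_processed_separately:
        # Compare case-insensitively so ".PDF" is excluded as well
        # (an assumption; the PR does not say how case is handled).
        return [ext for ext in extensions if ext.lower() != ".pdf"]
    return extensions


def should_load_other_files(other_file_extensions: List[str]) -> bool:
    """Skip the generic loader entirely when nothing is left to load."""
    return len(other_file_extensions) > 0


# Example: the user passed --file-types .pdf and PDFs were already extracted.
other_file_extensions = filter_other_extensions(
    [".pdf"], pdfs_processed_separately=True
)
if should_load_other_files(other_file_extensions):
    # In the real code this is where SimpleDirectoryReader would be given
    # the filtered list (for example via its required_exts argument).
    pass
```

With only `.pdf` requested, the filtered list is empty and the generic loading step is skipped, so each PDF is read once by the dedicated extractor.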

Testing

Tested with --file-types .pdf to ensure PDFs are processed only once.

- Add ColQwenRAG class with easy-to-use CLI for multimodal PDF retrieval
- Support for both ColQwen2 and ColPali models with automatic device selection
- MPS optimization for Apple Silicon with memory-efficient loading
- Complete pipeline: PDF→images→embeddings→HNSW index→search
- Multi-vector indexing for fine-grained document matching
- Comprehensive user guide and reproduction test script
- Resolves #119: ColQwen Doc and Support Management Features:
- python -m apps.colqwen_rag build --pdfs ./pdfs/ --index my_index
- python -m apps.colqwen_rag search my_index "query text"
- python -m apps.colqwen_rag ask my_index --interactive
- Automatic CPU fallback for memory constraints
- Robust error handling and progress tracking

- Add noqa comments for E402 errors (imports after sys.path modifications)
- Remove unused variable assignment in colqwen_rag.py
- Use importlib.util.find_spec for dependency checks instead of unused imports
- Fix import ordering in test_colqwen_reproduction.py

- Add apps/image_rag.py for indexing and searching images using CLIP embeddings
- Supports text-based image search queries
- Uses CLIP ViT-L/14 model via sentence-transformers
- Follows the same pattern as other RAG apps in the apps directory
- Addresses feature request for CLIP support in apps (issue #94)

Fixes #175

Problem: When --file-types .pdf is specified, PDFs were being processed twice:
1. Separately with PyMuPDF/pdfplumber extractors
2. Again in the 'other file types' section via SimpleDirectoryReader

This caused duplicate processing and potential conflicts.

Solution:
- Exclude .pdf from other_file_extensions when PDFs are already processed separately
- Only load other file types if there are extensions to process
- Prevents duplicate PDF processing

Changes:
- Added logic to filter out .pdf from code_extensions when loading other file types if PDFs were processed separately
- Updated SimpleDirectoryReader to use filtered extensions
- Added check to skip loading if no other extensions to process

yichuan-w · 2025-11-27T09:18:40Z

Again, please make sure this PR is separate and covers only this issue. Do not submit the ColQwen one here; you should use a new branch from main.
And do not change faiss.

yichuan-w · 2025-11-27T09:18:55Z

Thanks!!

ASuresh0524 added 4 commits November 10, 2025 13:31

3 participants
