@ASuresh0524
Collaborator

Fix: Prevent duplicate PDF processing when using --file-types .pdf

Fixes #175

Problem

When --file-types .pdf was specified, PDFs were processed twice:

  1. Separately with PyMuPDF/pdfplumber extractors (lines 1118-1163)
  2. Again in the 'other file types' section via SimpleDirectoryReader (line 1184)

This caused duplicate processing and potential conflicts.

Solution

  • Exclude .pdf from other_file_extensions when PDFs have already been processed separately
  • Load other file types only when extensions remain after that filtering
  • As a result, each PDF is processed exactly once

Changes

  • Added logic to filter out .pdf from code_extensions when loading other file types if PDFs were processed separately
  • Updated SimpleDirectoryReader to use filtered extensions
  • Added check to skip loading if no other extensions to process
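
The filtering described above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the function name `extensions_for_generic_loader` and the flag `pdfs_handled_separately` are hypothetical, and the real change operates on the project's own `code_extensions` / `other_file_extensions` variables.

```python
from typing import Iterable, List


def extensions_for_generic_loader(
    requested: Iterable[str], pdfs_handled_separately: bool
) -> List[str]:
    """Return the extensions the generic loader should still handle.

    Hypothetical helper: normalizes the requested extensions and drops
    ".pdf" when a dedicated PDF extractor has already processed them.
    """
    normalized: List[str] = []
    for ext in requested:
        ext = ext.strip().lower()
        if not ext:
            continue
        if not ext.startswith("."):
            ext = "." + ext
        if ext not in normalized:  # keep order, drop repeats
            normalized.append(ext)

    if pdfs_handled_separately:
        normalized = [e for e in normalized if e != ".pdf"]
    return normalized


remaining = extensions_for_generic_loader([".pdf"], pdfs_handled_separately=True)
if remaining:
    # Only here would the generic loader run, e.g. (assumption about the call site):
    # SimpleDirectoryReader(data_dir, required_exts=remaining, recursive=True).load_data()
    pass
else:
    print("No other extensions to load; skipping generic loader")
```

With `--file-types .pdf` the remaining list is empty, so the generic loading step is skipped entirely rather than invoked with an empty filter.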

Testing

Tested with --file-types .pdf to ensure PDFs are processed only once.
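
A regression check for this behavior might look like the sketch below. It does not exercise the real loaders; `simulate_load` is a hypothetical stand-in that only models the two loading stages (dedicated PDF extraction, then the generic loader) and counts how often each path would be picked up.

```python
from collections import Counter
from pathlib import PurePath
from typing import Iterable


def simulate_load(paths: Iterable[str], file_types: Iterable[str]) -> Counter:
    """Count how many times each path would be loaded (hypothetical model)."""
    paths = list(paths)
    wanted = {ext.lower() for ext in file_types}
    counts: Counter = Counter()

    # Stage 1: PDFs go through the dedicated extractor when requested.
    if ".pdf" in wanted:
        for p in paths:
            if PurePath(p).suffix.lower() == ".pdf":
                counts[p] += 1

    # Stage 2: the generic loader sees every requested extension except .pdf,
    # and is skipped entirely when nothing is left.
    other = wanted - {".pdf"}
    if other:
        for p in paths:
            if PurePath(p).suffix.lower() in other:
                counts[p] += 1

    return counts


counts = simulate_load(["docs/a.pdf", "docs/b.md"], [".pdf"])
assert counts["docs/a.pdf"] == 1  # processed once, not twice
assert counts["docs/b.md"] == 0   # not requested, never loaded
```

An equivalent check against the real code path would assert on the number of documents produced per source file.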

- Add ColQwenRAG class with easy-to-use CLI for multimodal PDF retrieval
- Support for both ColQwen2 and ColPali models with automatic device selection
- MPS optimization for Apple Silicon with memory-efficient loading
- Complete pipeline: PDF→images→embeddings→HNSW index→search
- Multi-vector indexing for fine-grained document matching
- Comprehensive user guide and reproduction test script
- Resolves #119: ColQwen Doc and Support Management

Features:
- python -m apps.colqwen_rag build --pdfs ./pdfs/ --index my_index
- python -m apps.colqwen_rag search my_index "query text"
- python -m apps.colqwen_rag ask my_index --interactive
- Automatic CPU fallback for memory constraints
- Robust error handling and progress tracking
- Add noqa comments for E402 errors (imports after sys.path modifications)
- Remove unused variable assignment in colqwen_rag.py
- Use importlib.util.find_spec for dependency checks instead of unused imports
- Fix import ordering in test_colqwen_reproduction.py
- Add apps/image_rag.py for indexing and searching images using CLIP embeddings
- Supports text-based image search queries
- Uses CLIP ViT-L/14 model via sentence-transformers
- Follows the same pattern as other RAG apps in the apps directory
- Addresses feature request for CLIP support in apps (issue #94)
@yichuan-w
Owner

Again, please make sure this PR addresses only this issue and does not include the ColQwen changes; you should create a new branch from main.
Also, do not change faiss.

@yichuan-w
Owner

Thanks!!

Development

Successfully merging this pull request may close these issues.

it run with pdf

3 participants