@ASuresh0524
Collaborator

Fixes #175

Problem:
When --file-types .pdf was specified, PDFs were processed twice:

  1. Separately with PyMuPDF/pdfplumber extractors
  2. Again in the 'other file types' section via SimpleDirectoryReader

This caused duplicate processing and potential conflicts.

Solution:

  • Exclude .pdf from other_file_extensions when PDFs are already processed separately
  • Only load other file types if there are extensions to process
  • Prevents duplicate PDF processing

Changes:

  • Added logic to filter out .pdf from code_extensions when loading other file types if PDFs were processed separately
  • Updated SimpleDirectoryReader to use filtered extensions
  • Added check to skip loading if no other extensions to process
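
A minimal sketch of the filtering step described above, not the actual diff. The function name `filter_other_extensions` and the flag `pdfs_processed_separately` are hypothetical names chosen for illustration, and the case-insensitive comparison is an assumption. In the real code the filtered list would presumably be handed to SimpleDirectoryReader (for example through its `required_exts` parameter), and the load would be skipped when the list is empty.

```python
from typing import Iterable, List


def filter_other_extensions(
    extensions: Iterable[str],
    pdfs_processed_separately: bool,
) -> List[str]:
    """Return the extensions the generic loader should still handle.

    Hypothetical helper: drops ".pdf" when PDFs were already handled by
    the dedicated PDF extractors, so they are not loaded a second time.
    """
    if not pdfs_processed_separately:
        return list(extensions)
    # Assumption: compare case-insensitively so ".PDF" is also excluded.
    return [ext for ext in extensions if ext.lower() != ".pdf"]


def should_load_other_files(
    extensions: Iterable[str],
    pdfs_processed_separately: bool,
) -> bool:
    """Skip the generic loading step when nothing is left to process."""
    return bool(filter_other_extensions(extensions, pdfs_processed_separately))


if __name__ == "__main__":
    # With only ".pdf" requested and PDFs already handled, nothing remains.
    print(filter_other_extensions([".pdf"], True))
    print(should_load_other_files([".pdf", ".md"], True))
```

Returning an empty list rather than raising keeps the caller's check simple: an empty result means the generic loader is not invoked at all.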

Checklist

  • Tests pass (uv run pytest)
  • Code formatted (ruff format and ruff check)
  • Pre-commit hooks pass (pre-commit run --all-files)

@yichuan-w
Owner

LGTM, this PR makes PDF processing more robust.

@yichuan-w merged commit e268392 into main on Dec 1, 2025
27 checks passed

Development

Successfully merging this pull request may close these issues.

it run with pdf
