fix(pdf): Fix garbled text in PDF loaders #29557

Draft: wants to merge 2 commits into base: master


engkimo commented Feb 3, 2025

Hi team,

This PR implements the first part of the fix for the garbled text issue (#29555), starting with PDF loaders. This is part of a larger effort to address encoding issues across multiple document loaders.

[Screenshot 2025-02-04 0 22 43]

Changes

  • Add PDFEncoding enum for standardized encoding options
  • Implement PDFEncodingConfig for centralized encoding management
  • Enhance BasePDFLoader with encoding configuration support
  • Update PDFMinerParser to handle encoding properly
  • Add fallback mechanism for encoding errors
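
The diff itself is not reproduced in this thread, so the following is only a minimal sketch of how an encoding enum, a config object, and a fallback decode step could fit together. The enum members, the `fallbacks` and `errors` fields, and the `decode_with_fallback` helper are assumptions made for illustration, not the PR's actual code.

```python
from dataclasses import dataclass
from enum import Enum


class PDFEncoding(str, Enum):
    """Hypothetical encoding options; the PR's real members are not shown here."""

    UTF8 = "utf-8"
    SHIFT_JIS = "shift_jis"
    CP1252 = "cp1252"


@dataclass(frozen=True)
class PDFEncodingConfig:
    """Hypothetical config: a primary encoding plus ordered fallbacks."""

    encoding: PDFEncoding = PDFEncoding.UTF8
    fallbacks: tuple[PDFEncoding, ...] = (PDFEncoding.SHIFT_JIS,)
    errors: str = "replace"  # used only when every strict attempt fails


def decode_with_fallback(
    raw: bytes, config: PDFEncodingConfig = PDFEncodingConfig()
) -> str:
    """Try the primary encoding, then each fallback, all in strict mode."""
    for candidate in (config.encoding, *config.fallbacks):
        try:
            return raw.decode(candidate.value)
        except UnicodeDecodeError:
            continue
    # Last resort: decode with the primary encoding and a lenient error handler.
    return raw.decode(config.encoding.value, errors=config.errors)
```

The order of `fallbacks` matters in a design like this: a single-byte codec that accepts almost any input would mask later candidates, so it belongs at the end if it is included at all.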

Next Steps

This PR focuses on PDF loaders as the first implementation. Similar fixes will be needed for:

  • Docx2txtLoader
  • PyPDFLoader
  • TextLoader
  • UnstructuredExcelLoader
  • UnstructuredWordDocumentLoader

Notes

  • No breaking changes to existing API
  • Default encoding remains UTF-8
  • Users can now specify custom encoding configurations when needed
  • Fixes regression introduced in the 0.3 release for PDF loaders

Environment

  • Tested on Python 3.13.1
  • Compatible with:
    • langchain 0.3.17
    • langchain_community 0.3.16
    • langchain_core 0.3.33

Related issue: #29555

dosubot bot added the size:L label (this PR changes 100-499 lines, ignoring generated files) Feb 3, 2025

dosubot bot added the labels community (related to langchain-community), Ɑ: doc loader (related to document loader module, not documentation), and 🤖:bug (related to a bug, vulnerability, unexpected error with an existing feature) Feb 3, 2025
engkimo marked this pull request as draft February 4, 2025 00:23
efriis (Contributor) commented Feb 8, 2025

hey there! is the issue that there isn't a way to pass in an encoding to these loaders currently, or is the issue that there isn't an encoding abstraction that is compatible with these different loaders?

I would rather just make sure we're passing through the kwargs you might want to use to configure these underlying libraries, and skip standardizing the PDF encoding definition. Different PDF libraries might have other flags you want to configure as well, and we're not aiming to be a full abstraction layer for every PDF parsing library's configuration.

efriis self-assigned this Feb 8, 2025

engkimo (Author) commented Feb 10, 2025

@efriis
Hi! Thank you for the clarification. The encoding issues affect multiple loaders (PDFMinerLoader, Docx2txtLoader, PyPDFLoader, TextLoader, UnstructuredExcelLoader, UnstructuredWordDocumentLoader, etc.).

Would this approach work better?

  1. Add **kwargs to BaseLoader.__init__
  2. Pass these kwargs through to each underlying document parsing library
  3. Document the encoding options available for each loader type

This would:

  • Maintain flexibility for each document library's unique features
  • Allow consistent encoding configuration across different loader types
  • Keep LangChain's role as a thin wrapper rather than a full abstraction layer
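
As a rough illustration of this passthrough idea, here is a minimal sketch in which the loader stores whatever keyword arguments it receives and forwards them untouched to the parsing function. `PassthroughLoader`, `fake_extract_text`, and the `codec` keyword are hypothetical stand-ins, not LangChain's or any parsing library's actual API.

```python
from typing import Any, Callable, Iterator


def fake_extract_text(path: str, codec: str = "utf-8", **_: Any) -> str:
    """Stand-in for a third-party parsing function (hypothetical)."""
    with open(path, "rb") as handle:
        return handle.read().decode(codec)


class PassthroughLoader:
    """Hypothetical thin loader that holds no encoding logic of its own."""

    def __init__(
        self,
        file_path: str,
        extract: Callable[..., str] = fake_extract_text,
        **parser_kwargs: Any,
    ) -> None:
        self.file_path = file_path
        self._extract = extract
        # Stored as-is; the loader never inspects or validates these options.
        self._parser_kwargs = parser_kwargs

    def lazy_load(self) -> Iterator[dict]:
        # Every option the caller supplied goes straight to the parser.
        text = self._extract(self.file_path, **self._parser_kwargs)
        yield {"page_content": text, "metadata": {"source": self.file_path}}
```

With this shape, a call such as `PassthroughLoader("report.pdf", codec="shift_jis")` would hand `codec` to the parser without the loader needing to know what it means, which is why the per-loader documentation in step 3 carries most of the weight.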

Does this align better with LangChain's philosophy while addressing the encoding issues across multiple loader types?

Labels: 🤖:bug, community, Ɑ: doc loader, size:L
Projects: Status: In review
2 participants