Change in PDF Extraction Results #30

Closed
TheTechromancer opened this issue Nov 18, 2024 · 3 comments

Comments

@TheTechromancer
Contributor

TheTechromancer commented Nov 18, 2024

Hi, today I noticed a sudden change in the way text is extracted from PDFs. It seems like a lot of the binary content is being included. This is causing our tests to fail:

image

We've been able to resolve this quickly on our end by downgrading the package version, but just wanted to give you guys a heads-up.

EDIT: On further investigation, it looks like a change in the Python API caused the issue:

Traceback (most recent call last):
  File "/home/bls/Downloads/code/bbot/bbot/modules/extractous.py", line 135, in extract_text
    buffer = reader.read(4096)
             ^^^^^^^^^^^
AttributeError: 'tuple' object has no attribute 'read'
@nmammeri
Contributor

Thanks @TheTechromancer for reporting this. In version 0.2.0, we changed the API to return a tuple of reader and metadata. Add this to your extract call: reader, metadata = extractor.extract_ ...
Please look at the updated docs.
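
For anyone hitting the same traceback, here is a minimal sketch of calling code that tolerates both return shapes: a bare reader (before 0.2.0) and a `(reader, metadata)` tuple (0.2.0 and later). It does not import extractous. The helper names `split_extract_result` and `read_all_text` are hypothetical, and `io.BytesIO` objects stand in for whatever the extract call returns.

```python
import io


def split_extract_result(result):
    """Return (reader, metadata) for either return shape.

    Hypothetical helper: a tuple is unpacked as (reader, metadata);
    anything else is treated as a bare reader with empty metadata.
    """
    if isinstance(result, tuple):
        reader, metadata = result
        return reader, metadata
    return result, {}


def read_all_text(reader, chunk_size=4096):
    """Read the reader in chunks until it is exhausted, then decode once."""
    chunks = []
    while True:
        buffer = reader.read(chunk_size)
        if not buffer:
            break
        chunks.append(bytes(buffer))
    # Decoding after joining avoids splitting a multi-byte character
    # across two chunks.
    return b"".join(chunks).decode("utf-8", errors="replace")


# Stand-ins for the two return shapes (assumed, not real extractor output).
old_style = io.BytesIO(b"text from a bare reader")
new_style = (io.BytesIO(b"text from a tuple"), {"example-key": ["example-value"]})

for result in (old_style, new_style):
    reader, metadata = split_extract_result(result)
    print(read_all_text(reader), metadata)
```

If you only need to support 0.2.0 and later, unpacking the tuple directly at the call site, as suggested above, is simpler than this shim.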

@TheTechromancer
Contributor Author

Thanks, yeah, we were able to fix it. Is there a chance there will be another breaking API change without a major version increase? If so, going forward we can pin the version on our side.

@nmammeri
Contributor

I don't see any breaking changes coming up. You can pin your version.
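
As a rough illustration of pinning on the consumer side, the sketch below checks an installed package version against a half-open range at startup. The bounds `(0, 2)` and `(0, 3)` are assumptions chosen for illustration, not a range recommended by the maintainers, and `parse_version`, `within_pin`, and `installed_version` are hypothetical helpers. In practice the pin would normally live in the project's dependency file; this only shows the comparison.

```python
from importlib import metadata


def parse_version(text):
    """Turn a string like '0.2.1' into (0, 2, 1), stopping at the first
    component that does not start with a digit."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def within_pin(version, lower, upper):
    """True if lower <= version < upper, comparing as integer tuples."""
    return lower <= parse_version(version) < upper


def installed_version(package):
    """Return the installed version string, or None if not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Assumed range for illustration: accept 0.2.x, reject 0.3.0 and later.
LOWER, UPPER = (0, 2), (0, 3)

found = installed_version("extractous")
if found is None:
    print("extractous is not installed")
elif within_pin(found, LOWER, UPPER):
    print(f"extractous {found} is inside the pinned range")
else:
    print(f"extractous {found} is outside the pinned range")
```

An upper bound on the minor version matters for a 0.x package because a minor bump can carry a breaking change, as the 0.2.0 release in this thread did.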
