Change in PDF Extraction Results #30

TheTechromancer · 2024-11-18T03:40:44Z

Hi, today I noticed a sudden change in the way text is extracted from PDFs. It seems like a lot of the binary content is being included. This is causing our tests to fail:

We've been able to resolve this quickly on our end by downgrading the package version, but just wanted to give you guys a heads-up.

EDIT: On further investigation, it looks like a change in the Python API caused the issue:

Traceback (most recent call last):
  File "/home/bls/Downloads/code/bbot/bbot/modules/extractous.py", line 135, in extract_text
    buffer = reader.read(4096)
             ^^^^^^^^^^^
AttributeError: 'tuple' object has no attribute 'read'

nmammeri · 2024-11-18T18:19:40Z

Thanks @TheTechromancer for reporting this. In version 0.2.0, we changed the API to return a tuple of reader and metadata. Add this to your extract call: reader, metadata = extractor.extract_ ...
Please look at the updated docs.
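
For anyone hitting the same AttributeError, here is a minimal sketch of a caller that tolerates both return shapes (a bare reader, as before 0.2.0, or a (reader, metadata) tuple, as from 0.2.0 on). The helper names `split_extract_result` and `read_all_text` are hypothetical and not part of the library, and `io.BytesIO` objects stand in for the real reader so the snippet runs without the package installed.

```python
import io


def split_extract_result(result):
    """Normalize an extract result to a (reader, metadata) pair.

    Hypothetical helper: accepts either a bare reader (old return shape)
    or a (reader, metadata) tuple (new return shape).
    """
    if isinstance(result, tuple):
        reader, metadata = result
        return reader, metadata
    # Old shape: no metadata was returned, so use an empty dict.
    return result, {}


def read_all_text(result, chunk_size=4096):
    """Read the reader in chunks, as in the traceback, and decode once at the end."""
    reader, metadata = split_extract_result(result)
    chunks = []
    buffer = reader.read(chunk_size)
    while buffer:
        chunks.append(buffer)
        buffer = reader.read(chunk_size)
    # Decoding after joining avoids splitting a multi-byte character across chunks.
    return b"".join(chunks).decode("utf-8"), metadata


# Stand-ins for the two return shapes (assumptions for illustration only).
old_style = io.BytesIO(b"hello pdf")
new_style = (io.BytesIO(b"hello pdf"), {"Content-Type": "application/pdf"})

old_text, old_meta = read_all_text(old_style)
new_text, new_meta = read_all_text(new_style)
```

Calling `reader.read(4096)` directly on the tuple is what raises `'tuple' object has no attribute 'read'`; unpacking first avoids it.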

TheTechromancer · 2024-11-18T18:47:14Z

Thanks, yeah, we were able to fix it. Is there a chance there will be another breaking API change without a major version increase? If so, going forward we can pin the version on our side.

nmammeri · 2024-11-19T08:10:03Z

I don't see any breaking changes coming up. You can pin your version.

TheTechromancer mentioned this issue Nov 18, 2024

Update Extractous with new API changes blacklanternsecurity/bbot#1976

Merged

TheTechromancer closed this as completed Nov 19, 2024
