Unable to extract text from PDF generated by Word. #217
Comments
Can you share the document?
Adding myself here. It looks like Word generates different PDFs.
Now, one generated with Word (original source URL):
I imagine that there needs to be additional decoding?
Similarly, docs generated with LibreOffice also seem not to work:
{"text":{"1":["","","","","","","","","","","","","!\"#$%&","\"#","’(\"#","\"#",")","*\"#","\"#+,","","!-./","\"#","0112\"#345267","\"#","","","*","8","","9",":","","","#","&",";","$(","&<<<<))","%","","8$=:3%","","","","’.>&","&<<<<","!015550575?.(!!\"@1",""]},"errors":[]}
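For anyone wondering what "additional decoding" could mean here: one plausible reading, not confirmed anywhere in this thread, is that the strings in the content stream are glyph codes that must be mapped through the font's ToUnicode CMap before they become readable text. The sketch below is not this library's implementation; the CMap fragment, the glyph codes, and the helper names `parse_bfchar` and `decode_glyphs` are all made up for illustration, and it handles only `bfchar` entries with fixed two-byte codes.

```python
import re

# Hypothetical ToUnicode CMap fragment. In a real PDF this text lives in the
# stream referenced by a font's /ToUnicode entry.
SAMPLE_CMAP = """
3 beginbfchar
<0003> <0020>
<0024> <0041>
<0025> <0042>
endbfchar
"""


def parse_bfchar(cmap_text):
    """Return {glyph code: Unicode string} from the bfchar sections of a CMap."""
    mapping = {}
    for section in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        for src, dst in re.findall(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>", section):
            # Destination values are UTF-16BE code units written as hex.
            mapping[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    return mapping


def decode_glyphs(raw, mapping, code_size=2):
    """Decode a shown string made of fixed-width big-endian glyph codes."""
    out = []
    for i in range(0, len(raw) - code_size + 1, code_size):
        code = int.from_bytes(raw[i:i + code_size], "big")
        # Unmapped codes become U+FFFD instead of being passed through raw.
        out.append(mapping.get(code, "\ufffd"))
    return "".join(out)


mapping = parse_bfchar(SAMPLE_CMAP)
raw = bytes.fromhex("0024000300250024")  # made-up operand of a text-showing operator

naive = raw.decode("latin-1")          # treats each byte as a character
mapped = decode_glyphs(raw, mapping)   # maps each 2-byte code through the CMap

print(repr(naive))
print(repr(mapped))
```

In this toy case the naive path yields punctuation such as `$` and `%` mixed with control characters, which at least resembles the output above, while the mapped path yields letters. Real CMaps also contain `bfrange` sections and variable-width codes, which this sketch ignores.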
Both of those PDFs now work with https://github.com/jrmuizel/pdf-extract
I have also noticed that creating the PDF from Word with the Print option (Microsoft Print to PDF) and exporting or saving the file as a PDF produce two different types of PDF; the latter works fine. Both types work fine with Python-based PDF libraries, though.
Here’s another that won’t work, taken from https://old.cbic.gov.in/htdocs-cbec/customs/cs-act/notifications/notfns-2023/cs-nt2023/csnt44-2023.pdf: csnt44-2023.pdf. I can open it in Acrobat with no issues.
Is anyone looking into fixing this?
@katzeprior the problem is probably the same as in #125, and it looks like that may be solved in the near future.
I saved a Word document as a PDF, and when I try to extract the text I get the following errors:
And the output content looks like this:
I tried using `pdfutil` with the `extract_text` subcommand and I get the same errors. Any recommendations on steps I can take to debug the code and understand why parsing fails?
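As a rough first debugging step, and only a sketch under assumptions, it can help to check whether the file's fonts advertise a Unicode mapping at all. The snippet below counts a few standard PDF name tokens in the raw bytes; `font_markers` is a hypothetical helper and the sample bytes are a made-up font dictionary, not taken from any of the PDFs in this issue.

```python
import re


def font_markers(pdf_bytes):
    """Count a few font-related name tokens in the raw bytes of a PDF."""
    names = [b"/ToUnicode", b"/Identity-H", b"/WinAnsiEncoding", b"/ObjStm"]
    counts = {}
    for name in names:
        # The lookahead avoids counting longer names that merely start with `name`.
        pattern = re.escape(name) + rb"(?![A-Za-z0-9])"
        counts[name.decode("ascii")] = len(re.findall(pattern, pdf_bytes))
    return counts


# Made-up stand-in for a font dictionary as it might appear uncompressed.
sample = b"<< /Type /Font /Subtype /Type0 /Encoding /Identity-H /ToUnicode 12 0 R >>"
print(font_markers(sample))
```

To try it on a real file, pass the result of `open(path, "rb").read()`, where `path` is wherever the PDF is stored. Fonts using `/Identity-H` with no `/ToUnicode` anywhere would hint that the text cannot be mapped back to Unicode without extra work. A nonzero `/ObjStm` count means some dictionaries may sit inside compressed object streams, in which case this byte scan can miss them and a zero count proves nothing.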