Unable to extract text from PDF generated by Word. #217

Closed
msuiche opened this issue Jan 24, 2023 · 8 comments · Fixed by #328

msuiche commented Jan 24, 2023

I saved a Word document as a PDF, and when I try to extract the text I get the following errors:

[2023-01-24T13:57:48Z ERROR lopdf::reader] Object load error: Parse { offset: 0 }
[2023-01-24T13:57:48Z ERROR lopdf::reader] Object load error: Parse { offset: 0 }
[2023-01-24T13:57:48Z ERROR lopdf::reader] Object load error: Parse { offset: 0 }
[2023-01-24T13:57:48Z ERROR lopdf::reader] Object load error: Parse { offset: 0 }
[2023-01-24T13:57:48Z ERROR lopdf::reader] Object load error: Parse { offset: 0 }
[2023-01-24T13:57:48Z ERROR lopdf::reader] Object load error: Parse { offset: 0 }

And the output content looks like this:

"R\n\"\n\"\n.0$((\" A*\" &1\" $++&’&51\" ’5\" $((\" 5’0*,\" ,*-*+&*.\" $2$&($A(*\" $’\" ($>\" 5,\" &1\" *K/&’=F\" ./A]*:’\" ’5\" $1=\" *Q%,*..\" \n*Q:(/.&51.\"5,\"(&-&’$’&51. \n\"\n&1\"’0&.\"B3,**-*1’\"’5\"’0*\":51’,$,=C \n\"\n\"\n\"\n!L\n\"\n!\n!’1$3’&%&><(&)’ \n+\n\"\n9*,2&:*\" ;,52&+*,\" .0$((\" +*4*1+F\" &1+*-1&4=\" $1+\" 05(+\" 0$,-(*..\" \n’0*\" #5-%$1=\" \n$1+\"\n&’. \n\"\n./A.&+&$,&*.F\" \n$44&(&$’*.F\" $1+\" ,*.%*:’&2*\" 544&:*,.F\" +&,*:’5,.F\" *-%(5=**.F\" $3*1’.F\" . \n/::*..5,.\" $1+\" %*,-&’’*+\" $..&31. \n\"\nG*$:0F\" $ \n\"\n7\n#5-%$1= \n\"\n@1+*-1&’**8H\" 4,5-\" $1+\" $3$&1.’\" $((\" (5..*.F\" +$-$3*.F\" (&$A&(&’&*.F\" +*4&:&*1:&*.F\" \n$:’&51.F\"]/+3-*1’.F\"&1’*,*.’F\"$>$,+.F\"%*1$(’&*.F\"4&1*.F\":5.’.\"5,\"*Q%*1.*.\"54\ (...)

I tried using pdfutil with the extract_text subcommand and got the same errors. Any recommendations on steps I can take to debug the code and understand why parsing fails?

@jrmuizel
Contributor

Can you share the document?

@dertuxmalwieder

Adding myself here. It looks like Word generates different PDFs.

; curl https://www.africau.edu/images/default/sample.pdf

%PDF-1.3
%����

1 0 obj
<<
...

Now, one generated with Word (original source URL):

; head ./Sozialismusvorstellungen-der-DKP.pdf
%����1.2
�treamr /LZWDecode
 ��P�[�������7�8����d6��+�шҸ6�ׅ�1���m����T�#���̆�(��;:Pgf3Ft�l��������=�M��Y��i:`k�A�s���,Ƞú����HO��+�
                   WRgy��������-��<lAZ��
�̰�p�2�pb�.��Z#��2����
streamr /LZWDecode    ��v5�Ø�7�Ø�9B�`¥%n@���
...

I imagine that there needs to be additional decoding?
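The comparison above can be turned into a quick byte-level check. The following is a minimal sketch, not lopdf's own logic: the function names `find_pdf_header` and `count_occurrences` and the `pdfcheck` usage string are hypothetical, and the 1024-byte search window is an assumption based on Acrobat's documented tolerance for a header that does not start at offset 0. Note that `head` output on a terminal can be mangled by control bytes, so the file on disk may differ from what the terminal showed; reading the raw bytes avoids that ambiguity.

```rust
use std::env;
use std::fs;

/// Returns the byte offset of the first `%PDF-` marker found within the
/// first 1024 bytes of `bytes`, or `None` if no marker is present there.
fn find_pdf_header(bytes: &[u8]) -> Option<usize> {
    const MARKER: &[u8] = b"%PDF-";
    let window = &bytes[..bytes.len().min(1024)];
    window.windows(MARKER.len()).position(|w| w == MARKER)
}

/// Counts how many times `needle` occurs in `bytes` (for example a filter
/// name such as `/LZWDecode`). An empty needle counts as zero matches.
fn count_occurrences(bytes: &[u8], needle: &[u8]) -> usize {
    if needle.is_empty() {
        return 0;
    }
    bytes.windows(needle.len()).filter(|w| *w == needle).count()
}

fn main() {
    // Hypothetical usage: pdfcheck <file.pdf>
    let path = match env::args().nth(1) {
        Some(p) => p,
        None => {
            eprintln!("usage: pdfcheck <file.pdf>");
            return;
        }
    };
    let bytes = match fs::read(&path) {
        Ok(b) => b,
        Err(e) => {
            eprintln!("cannot read {}: {}", path, e);
            return;
        }
    };

    match find_pdf_header(&bytes) {
        Some(0) => println!("header found at offset 0"),
        Some(n) => println!("header found at offset {} (not at start of file)", n),
        None => println!("no %PDF- header in the first 1024 bytes"),
    }

    // Report which stream filters are named anywhere in the raw bytes.
    for filter in ["/LZWDecode", "/FlateDecode"] {
        let n = count_occurrences(&bytes, filter.as_bytes());
        println!("{}: {} occurrence(s)", filter, n);
    }
}
```

If the header is missing or displaced, the problem lies in how the file starts rather than in stream decoding; if the header is fine and `/LZWDecode` appears, the filter is a more plausible suspect.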

@thespooler

Similarly, documents generated with LibreOffice also seem not to work.
For example, running this PDF from Richard Stallman's website through extract_text outputs this:

{"text":{"1":["","","","","","","","","","","","","!\"#$%&","\"#","’(\"#","\"#",")","*\"#","\"#+,","","!-./","\"#","0112\"#345267","\"#","","","*","8","","9",":","","","#","&",";","$(","&<<<<))","%","","8$=:3%","","","","’.>&","&<<<<","!015550575?.(!!\"@1",""]},"errors":[]}

@jrmuizel
Contributor

Both of those PDFs now work with https://github.com/jrmuizel/pdf-extract

jymchng referenced this issue Apr 15, 2023
Co-authored-by: Lukáš Tyrychtr <ltyrycht@redhat.com>

kinxiel commented May 12, 2023

I have also noticed that creating the PDF from Word with the Print option (Microsoft Print to PDF), versus exporting or saving the file as a PDF, produces two different types of PDF; the latter works fine.

Both types of PDF work fine with Python-based PDF libraries, though.


shivjm commented Jul 30, 2023

Here’s another that won’t work, taken from https://old.cbic.gov.in/htdocs-cbec/customs/cs-act/notifications/notfns-2023/cs-nt2023/csnt44-2023.pdf: csnt44-2023.pdf. I can open it in Acrobat with no issues.

@katzeprior

Anyone looking into fixing this?

Collaborator

Heinenen commented Aug 7, 2024

@katzeprior the problem is probably the same as in #125, and it looks like that may be solved in the near future.
