Unable to extract text from PDF generated by Word. #217

msuiche · 2023-01-24T14:07:27Z

I saved a Word document as a PDF, and when I try to extract the text I get the following errors:

[2023-01-24T13:57:48Z ERROR lopdf::reader] Object load error: Parse { offset: 0 }
[2023-01-24T13:57:48Z ERROR lopdf::reader] Object load error: Parse { offset: 0 }
[2023-01-24T13:57:48Z ERROR lopdf::reader] Object load error: Parse { offset: 0 }
[2023-01-24T13:57:48Z ERROR lopdf::reader] Object load error: Parse { offset: 0 }
[2023-01-24T13:57:48Z ERROR lopdf::reader] Object load error: Parse { offset: 0 }
[2023-01-24T13:57:48Z ERROR lopdf::reader] Object load error: Parse { offset: 0 }

And the output content looks like this:

"R\n\"\n\"\n.0$((\" A*\" &1\" $++&’&51\" ’5\" $((\" 5’0*,\" ,*-*+&*.\" $2$&($A(*\" $’\" ($>\" 5,\" &1\" *K/&’=F\" ./A]*:’\" ’5\" $1=\" *Q%,*..\" \n*Q:(/.&51.\"5,\"(&-&’$’&51. \n\"\n&1\"’0&.\"B3,**-*1’\"’5\"’0*\":51’,$,=C \n\"\n\"\n\"\n!L\n\"\n!\n!’1$3’&%&><(&)’ \n+\n\"\n9*,2&:*\" ;,52&+*,\" .0$((\" +*4*1+F\" &1+*-1&4=\" $1+\" 05(+\" 0$,-(*..\" \n’0*\" #5-%$1=\" \n$1+\"\n&’. \n\"\n./A.&+&$,&*.F\" \n$44&(&$’*.F\" $1+\" ,*.%*:’&2*\" 544&:*,.F\" +&,*:’5,.F\" *-%(5=**.F\" $3*1’.F\" . \n/::*..5,.\" $1+\" %*,-&’’*+\" $..&31. \n\"\nG*$:0F\" $ \n\"\n7\n#5-%$1= \n\"\n@1+*-1&’**8H\" 4,5-\" $1+\" $3$&1.’\" $((\" (5..*.F\" +$-$3*.F\" (&$A&(&’&*.F\" +*4&:&*1:&*.F\" \n$:’&51.F\"]/+3-*1’.F\"&1’*,*.’F\"$>$,+.F\"%*1$(’&*.F\"4&1*.F\":5.’.\"5,\"*Q%*1.*.\"54\ (...)

I tried using pdfutil with the extract_text subcommand and I get the same errors. Any recommendations on steps I can take to debug the code and understand why parsing fails?

jrmuizel · 2023-02-12T19:01:45Z

Can you share the document?

dertuxmalwieder · 2023-03-07T01:54:23Z

Adding myself here. It looks like Word generates different PDFs.

; curl https://www.africau.edu/images/default/sample.pdf

%PDF-1.3
%����

1 0 obj
<<
...

Now, one generated with Word (original source URL):

; head ./Sozialismusvorstellungen-der-DKP.pdf
%����1.2
�treamr /LZWDecode
 ��P�[�������7�8����d6��+�шҸ6�ׅ�1���m����T�#���̆�(��;:Pgf3Ft�l��������=�M��Y��i:`k�A�s���,Ƞú����HO��+�
                   WRgy��������-��<lAZ��
�̰�p�2�pb�.��Z#��2����
streamr /LZWDecode    ��v5�Ø�7�Ø�9B�`¥%n@���
...

I imagine that there needs to be additional decoding?
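
A quick way to compare the two files without trusting what the terminal renders is to look at the raw bytes: where the `%PDF-` header sits and which stream filters are named. Below is a minimal sketch using only the Rust standard library. The helper names `find_pdf_header` and `count_marker` are hypothetical and not part of lopdf, and the 1024-byte search window is an arbitrary choice for this sketch.

```rust
/// Returns the byte offset of the first `%PDF-` marker found in the
/// first 1024 bytes, or `None` if there is none in that window.
fn find_pdf_header(bytes: &[u8]) -> Option<usize> {
    let limit = bytes.len().min(1024);
    bytes[..limit].windows(5).position(|w| w == b"%PDF-")
}

/// Counts how often `marker` (for example `/LZWDecode` or `/FlateDecode`)
/// appears in the raw bytes. This is a plain byte search, so it only sees
/// names stored uncompressed in the file.
fn count_marker(bytes: &[u8], marker: &[u8]) -> usize {
    if marker.is_empty() {
        // `windows(0)` would panic, so treat an empty marker as "no matches".
        return 0;
    }
    bytes.windows(marker.len()).filter(|w| *w == marker).count()
}
```

To try it, read a file with `std::fs::read(path)` and pass the resulting bytes to both helpers. An offset other than `Some(0)`, or a filter name that only shows up in one of the two files, would be a hint about where they differ. It does not explain the parse errors by itself.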

thespooler · 2023-04-10T21:37:23Z

Similarly, docs generated with LibreOffice also seem not to work.
For example, running this PDF from Richard Stallman's website through extract_text outputs this:

{"text":{"1":["","","","","","","","","","","","","!\"#$%&","\"#","’(\"#","\"#",")","*\"#","\"#+,","","!-./","\"#","0112\"#345267","\"#","","","*","8","","9",":","","","#","&",";","$(","&<<<<))","%","","8$=:3%","","","","’.>&","&<<<<","!015550575?.(!!\"@1",""]},"errors":[]}

jrmuizel · 2023-04-11T14:28:47Z

Both of those PDFs now work with https://github.com/jrmuizel/pdf-extract

kinxiel · 2023-05-12T09:50:03Z

I have also noticed that Word produces two different types of PDF depending on how you create it: printing via the Print option (Microsoft Print to PDF) versus exporting or saving the file as a PDF. The latter works fine.

Both types work fine with Python-based PDF libraries, though.

shivjm · 2023-07-30T21:02:27Z

Here’s another that won’t work, taken from https://old.cbic.gov.in/htdocs-cbec/customs/cs-act/notifications/notfns-2023/cs-nt2023/csnt44-2023.pdf: csnt44-2023.pdf. I can open it in Acrobat with no issues.

katzeprior · 2024-08-05T13:17:25Z

Anyone looking into fixing this?

Heinenen · 2024-08-07T22:38:25Z

@katzeprior the problem is probably the same as in #125, and it looks like that may be solved in the near future.
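
For anyone wondering what that decoding involves, here is a rough sketch. It assumes the garbled output comes from font-specific character codes that were never mapped through the font's ToUnicode CMap. The sample CMap is made up for illustration: its mappings are guessed from the output in the first comment, not read from the actual file. The function names are hypothetical and this is not how lopdf implements it. It handles only whitespace-separated `bfchar` entries and single-byte codes.

```rust
use std::collections::HashMap;

/// Hypothetical ToUnicode fragment; the pairs are guesses for illustration.
const SAMPLE_CMAP: &str = "\
1 begincodespacerange <00> <FF> endcodespacerange
5 beginbfchar
<22> <0020>
<24> <0061>
<28> <006C>
<2E> <0073>
<30> <0068>
endbfchar";

/// Converts a hex string of UTF-16BE code units (e.g. "0061") to a String.
fn utf16be_hex_to_string(hex: &str) -> String {
    let units: Vec<u16> = hex
        .as_bytes()
        .chunks(4)
        .filter_map(|c| std::str::from_utf8(c).ok())
        .filter_map(|s| u16::from_str_radix(s, 16).ok())
        .collect();
    String::from_utf16_lossy(&units)
}

/// Collects `<src> <dst>` pairs found between `beginbfchar` and `endbfchar`.
/// Assumes tokens are separated by whitespace; `bfrange` is ignored.
fn parse_bfchar(cmap: &str) -> HashMap<u32, String> {
    let mut map = HashMap::new();
    let mut in_section = false;
    let mut pending: Option<u32> = None;
    for token in cmap.split_whitespace() {
        match token {
            "beginbfchar" => {
                in_section = true;
                pending = None;
            }
            "endbfchar" => in_section = false,
            t if in_section && t.starts_with('<') && t.ends_with('>') => {
                let hex = &t[1..t.len() - 1];
                match pending.take() {
                    // First token of a pair: the source character code.
                    None => pending = u32::from_str_radix(hex, 16).ok(),
                    // Second token: the Unicode text it maps to.
                    Some(src) => {
                        map.insert(src, utf16be_hex_to_string(hex));
                    }
                }
            }
            _ => {}
        }
    }
    map
}

/// Maps each single-byte code through the table; unmapped codes become U+FFFD.
fn decode_single_byte(codes: &[u8], map: &HashMap<u32, String>) -> String {
    codes
        .iter()
        .map(|&c| map.get(&u32::from(c)).map(String::as_str).unwrap_or("\u{FFFD}"))
        .collect()
}
```

With the sample table, `decode_single_byte(b".0$((\"", &parse_bfchar(SAMPLE_CMAP))` yields `"shall "`, which is the kind of mapping that seems to be missing in the output above. Real fonts can also use multi-byte codes and `bfrange` entries, which this sketch leaves out.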

jymchng mentioned this issue Mar 27, 2023

.replace_text does not work as intended. #223

Open

jymchng referenced this issue Apr 15, 2023

Replace unmaitained encoding crate with encoding_rs (#222)

99cb2a4

Co-authored-by: Lukáš Tyrychtr <ltyrycht@redhat.com>

This was referenced Aug 7, 2024

extract_text inserts newlines where it shouldn't #292

Open

Implement decoding of Unicode characters #125

Closed

dkaluza mentioned this issue Sep 14, 2024

Implement ToUnicode for variadic len encodings #328

Merged

J-F-Liu closed this as completed in #328 Sep 26, 2024
