Garbled characters and missed text. #1142

Lightup1 · 2022-07-21T06:19:31Z

The output of extract_text() contains garbled characters, and some text that can be copied from a PDF viewer is missing from it.

Environment

$ python -m platform
Windows-10-10.0.22000-SP0

$ python -c "import PyPDF2;print(PyPDF2.__version__)"
2.6.0

Code + PDF

This is a minimal, complete example that shows the issue (with #1136 applied to fix the KeyError "\W"):

import PyPDF2

pdf_filename = "ST辅仁：2019年年度报告.PDF"
txt_filename = "ST辅仁：2019年年度报告.txt"
with open(pdf_filename, "rb") as pdf_file:
    reader = PyPDF2.PdfReader(pdf_file)
    with open(txt_filename, "a", encoding="utf-16") as out_file:
        # Iterate over the pages directly instead of using the deprecated numPages
        for page in reader.pages:
            # extract_text() returns one str per page, so write() is enough
            out_file.write(page.extract_text())

ST辅仁：2019年年度报告.PDF

Original PDF snippets

page 7/254:

copied from pdf file: 第二季度

TXT file finder

The text editor reports that "第二季度" does not occur anywhere in the txt file. Note that this is only one example from this PDF file.
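The same check can be scripted instead of using an editor's search. The following is only a minimal sketch: `pages_containing` is a hypothetical helper, not part of PyPDF2, and it assumes `page_texts` is a list holding the string returned by `extract_text()` for each page. Whitespace is removed before comparing, on the assumption that extraction may insert spaces or line breaks between characters.

```python
def pages_containing(page_texts, phrase):
    """Return the 1-based numbers of the pages whose text contains `phrase`.

    `page_texts` is assumed to be a list of str, one entry per page,
    e.g. [page.extract_text() for page in reader.pages].
    """
    # Drop all whitespace so that line breaks inside a phrase do not hide it.
    needle = "".join(phrase.split())
    hits = []
    for number, text in enumerate(page_texts, start=1):
        haystack = "".join(text.split())
        if needle in haystack:
            hits.append(number)
    return hits


# Made-up page texts for illustration only, not taken from the PDF above.
sample_pages = ["报告 摘要", "第二\n季度 收入", "其他"]
found = pages_containing(sample_pages, "第二季度")
```

An empty result for a phrase that is visible in the viewer points at the extraction step rather than at the editor's search.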

Example of garbled characters

�]
-Añ-��J �7 �]
-Añ�h1Ñ*6��
ˆ�J
�:�Ô�p �7 �:#§Añ�h�Ô�C�p
�\ œ
(�Ã œ
( �7 Eµ�ñ9Ÿ�JLö
��f9Ÿ6Ñ�-�9L€ œ
(

Eµ�ñ�2 �7 "ã ‡Eµ�ñ�2�f9Ÿ�9L€ œ
(
Eµ�ñLö
� �7 Eµ�ñ9Ÿ�JLö
��9L€ œ
(
�09ŸLö
� �7 �0�1�f9Ÿ�ÄLö
��Å�9L€ œ
(
�»"ãG‚�J �7 "ã ‡-1�»"ãG‚�J6Ñ�-�9L€ œ

Here is the extracted text file:
ST辅仁：2019年年度报告.txt

MartinThoma · 2022-07-21T06:22:51Z

As a side note: you're running the current main of PyPDF2 on GitHub, not the PyPI version, right?

Lightup1 · 2022-07-21T06:25:34Z

Yes, I'm running the current main of PyPDF2 from GitHub. I also tested PyPDF2 2.5.0, and the issue is reproducible there too.

pubpub-zz · 2022-07-22T21:05:33Z

The PDF is a tagged PDF. The font is not explicitly reset to Chinese: I guess it relies on the "tagging" mechanism, but I have not understood it yet.
If someone has experience with this, help would be welcome.

pubpub-zz · 2022-07-24T14:43:18Z

@MartinThoma, can you tag this issue "help wanted" / "difficulty: high"?

Graphic state shall store also the font, font size,....

pubpub-zz · 2022-07-26T21:20:26Z

@Lightup1
Got it!
I've checked the rendering in pdf.js and it was good. It was quite easy to add some debug output to pdf.js to trace how the font was restored. The problem is actually not linked to tagged PDF: the saved graphics context/state was not saving the text context (which includes the font name and size).

Thanks for pushing me to find a new way to carry out investigations using pdf.js 😀

Graphic state shall store also the font, font size, ... See #1142
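To illustrate the kind of bug described above, here is a minimal sketch of a save/restore stack that also carries the text state. It is not the PyPDF2 implementation: `fonts_at_text_ops` and the `(operands, operator)` tuples are hypothetical simplifications. Only the operator names `q`, `Q`, `Tf` and `Tj` come from the PDF content stream syntax.

```python
def fonts_at_text_ops(operations):
    """Return (text, font, size) for every 'Tj' operator in `operations`.

    `operations` is a simplified, hypothetical list of (operands, operator)
    tuples; it is not PyPDF2's internal representation.
    """
    state = {"font": None, "size": None}  # current text state
    stack = []                            # states saved by 'q'
    seen = []
    for operands, op in operations:
        if op == "q":
            # Save the graphics state, including font name and size.
            stack.append(dict(state))
        elif op == "Q":
            # Restore the most recently saved state, if there is one.
            if stack:
                state = stack.pop()
        elif op == "Tf":
            # Select a font and a size.
            state["font"], state["size"] = operands
        elif op == "Tj":
            # Show text: record which font is in effect at this point.
            seen.append((operands[0], state["font"], state["size"]))
    return seen


example_ops = [
    (("/F1", 12), "Tf"),
    ((), "q"),
    (("/F2", 9), "Tf"),
    (("inner",), "Tj"),
    ((), "Q"),
    (("outer",), "Tj"),
]
result = fonts_at_text_ops(example_ops)
```

If the font were left out of the saved state, "outer" would be decoded with /F2 instead of /F1, and using the wrong font's character mapping is one way text can come out garbled.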

pubpub-zz · 2022-07-27T19:34:46Z

@Lightup1,
do you want to run some extra tests before closing?

Lightup1 · 2022-07-28T02:45:39Z

Hi @pubpub-zz , thanks for your work. It’s okay to close I think.

MartinThoma added the help wanted and Difficulty: High labels Jul 24, 2022

pubpub-zz added a commit to pubpub-zz/pypdf that referenced this issue Jul 26, 2022

BUG : incomplete Graphic State save/restore(py-pdf#1142)

c7382e2

Graphic state shall store also the font, font size,....

pubpub-zz mentioned this issue Jul 26, 2022

BUG : incomplete Graphic State save/restore(#1142) #1172

Merged

MartinThoma pushed a commit that referenced this issue Jul 27, 2022

BUG: Incomplete Graphic State save/restore (#1172)

d8bd12f

Graphic state shall store also the font, font size, ... See #1142

Lightup1 closed this as completed Jul 28, 2022

srogmann mentioned this issue Sep 27, 2022

BUG: td matrix #1373

Merged

MartinThoma added the is-cjk-issue label (issue related to CJK: Chinese-Japanese-Korean) Nov 13, 2023
