Garbled characters and missed text. #1142

Closed
Lightup1 opened this issue Jul 21, 2022 · 7 comments
Labels
Difficulty: High, help wanted, is-cjk-issue

Comments

@Lightup1

Lightup1 commented Jul 21, 2022

The output of extract_text() contains garbled characters, and some text that can be copied from a PDF viewer is missing from it.

Environment

$ python -m platform
Windows-10-10.0.22000-SP0

$ python -c "import PyPDF2;print(PyPDF2.__version__)"
2.6.0 

Code + PDF

This is a minimal, complete example that shows the issue (with #1136 to fix keyerror "\W"):

import PyPDF2

pdf_filename = "ST辅仁:2019年年度报告.PDF"
txt_filename = "ST辅仁:2019年年度报告.txt"

with open(pdf_filename, "rb") as pdf_file:
    reader = PyPDF2.PdfReader(pdf_file)
    # Append the text of every page to the output file, encoded as UTF-16.
    with open(txt_filename, "a", encoding="utf-16") as txt_file:
        # Iterate over the pages directly; numPages is deprecated.
        for page in reader.pages:
            # extract_text() returns a single string per page.
            txt_file.write(page.extract_text())

ST辅仁:2019年年度报告.PDF

Original PDF snippets

page 7/254:
Text copied from the PDF viewer: 第二季度 ("second quarter")

TXT file finder

The text editor's search reports that "第二季度" does not occur in the txt file. Note that this is only one example from this PDF file.
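The manual search in the text editor could also be scripted. Below is a minimal sketch, assuming the per-page strings have already been collected in a list; the helper name pages_containing is hypothetical and is not part of PyPDF2.

```python
from typing import List


def pages_containing(page_texts: List[str], phrase: str) -> List[int]:
    """Return the 1-based numbers of the pages whose text contains phrase."""
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        if phrase in text
    ]


# Assumed usage with the reader from the example above:
#   page_texts = [page.extract_text() for page in reader.pages]
#   print(pages_containing(page_texts, "第二季度"))
# If page 7 is not in the result, the phrase was lost during extraction.
```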

Example of garbled characters

�]
-Añ-��J �7 �]
-Añ�h1Ñ*6��
ˆ�J
�:�Ô�p �7 �:#§Añ�h�Ô�C�p
�\ œ
(�à œ
( �7 Eµ�ñ9Ÿ�JLö
��f9Ÿ6Ñ�-�9L€ œ
(

Eµ�ñ�2 �7 "ã ‡Eµ�ñ�2�f9Ÿ�9L€ œ
(
Eµ�ñLö
� �7 Eµ�ñ9Ÿ�JLö
��9L€ œ
(
�09ŸLö
� �7 �0�1�f9Ÿ�ÄLö
��Å�9L€ œ
(
�»"ãG‚�J �7 "ã ‡-1�»"ãG‚�J6Ñ�-�9L€ œ

Here is the extracted txt file:
ST辅仁:2019年年度报告.txt

@MartinThoma
Member

As a side note: you're running the current main branch of PyPDF2 from GitHub, not the PyPI version, right?

@Lightup1
Author

Yes, it's running on the current main branch of PyPDF2 from GitHub. The issue is also reproducible on PyPDF2 2.5.0, which I tested.

@pubpub-zz
Collaborator

The PDF is a tagged PDF. The font is not explicitly reset to Chinese; I guess it relies on the "tagging" mechanism, but I have not understood it yet.
If someone has experience with this, help would be welcome.

@pubpub-zz
Collaborator

@MartinThoma, can you tag this issue "help wanted" / "Difficulty: High"?

MartinThoma added the help wanted and Difficulty: High labels Jul 24, 2022
pubpub-zz added a commit to pubpub-zz/pypdf that referenced this issue Jul 26, 2022
Graphic state shall store also the font, font size,....
@pubpub-zz
Collaborator

@Lightup1
Got it!
I've checked the rendering in pdf.js and it was good. It was quite easy to add some debug output to pdf.js to trace how the font was restored. Actually, the problem is not linked to tagged PDF: the saved graphics context/state was not saving the text context (which includes the font name and size).

Thanks for pushing me to find a new way to carry out investigations, using pdf.js 😀
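To illustrate the mechanism described above, here is a minimal sketch, not the actual PyPDF2 code. It assumes a simplified content stream given as (operator, operands) pairs; the names GraphicsState and fonts_in_effect are hypothetical. In PDF, q pushes the graphics state and Q pops it, and because the text state is part of the graphics state, the font set by Tf must be saved and restored as well.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

Operation = Tuple[str, Sequence]


@dataclass(frozen=True)
class GraphicsState:
    """Only the part of the graphics state relevant here: the text font."""
    font: str = ""
    font_size: float = 0.0


def fonts_in_effect(operations: List[Operation]) -> List[Tuple[str, str]]:
    """Return (font, text) for each Tj, honoring q/Q save and restore."""
    current = GraphicsState()
    stack: List[GraphicsState] = []
    shown: List[Tuple[str, str]] = []
    for operator, operands in operations:
        if operator == "q":
            # Save the whole state, including font name and size.
            stack.append(current)
        elif operator == "Q":
            # Restore the state saved by the matching q, if there is one.
            if stack:
                current = stack.pop()
        elif operator == "Tf":
            name, size = operands
            current = GraphicsState(font=name, font_size=size)
        elif operator == "Tj":
            shown.append((current.font, operands[0]))
    return shown


ops = [
    ("Tf", ("/F1", 12)),
    ("q", ()),
    ("Tf", ("/F2", 9)),
    ("Tj", ("inner",)),
    ("Q", ()),
    ("Tj", ("outer",)),
]
print(fonts_in_effect(ops))
```

If the font were not part of the saved state, the last Tj would still be decoded with /F2 after Q, which is the kind of mismatch that could yield garbled characters when the two fonts use different encodings.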

MartinThoma pushed a commit that referenced this issue Jul 27, 2022
Graphic state shall store also the font, font size, ...

See #1142
@pubpub-zz
Collaborator

@Lightup1,
do you want to run some extra tests before closing?

@Lightup1
Author

Hi @pubpub-zz , thanks for your work. It’s okay to close I think.

MartinThoma added the is-cjk-issue label Nov 13, 2023