Garbled characters and missed text. #1142
Comments
As a side-note: You're running the current |
Yes it's running on current |
The PDF is a tagged PDF. The font is not explicitly reset to Chinese; I guess it is using the "tagging" solution, but I have not understood it yet.
@MartinThoma, can you tag this issue "help wanted" / "difficult high"?
Graphic state shall store also the font, font size, ...
@Lightup1 thanks for pushing me to create a new way to carry out the investigation using pdf.js 😀
Graphic state shall store also the font, font size, ... See #1142
@Lightup1, |
Hi @pubpub-zz , thanks for your work. It’s okay to close I think. |
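For anyone reading this later, here is a rough sketch of what "graphic state shall store also the font, font size" means. In PDF content streams the q and Q operators save and restore the graphics state, and the text font set by Tf is part of that state. If an extractor does not restore the font on Q, the following text may be decoded with the wrong font's character map. This is a toy model only, not PyPDF2's actual implementation; the class and function names (GraphicsState, StateStack, fonts_in_effect) and the tuple-based operator format are made up for illustration.

```python
from dataclasses import dataclass, replace
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class GraphicsState:
    # Hypothetical, minimal state: only the parts relevant to this issue.
    font: Optional[str] = None
    font_size: float = 0.0


class StateStack:
    """Toy model of q/Q handling that keeps the font inside the saved state."""

    def __init__(self) -> None:
        self.current = GraphicsState()
        self._saved: List[GraphicsState] = []

    def set_font(self, name: str, size: float) -> None:  # Tf operator
        self.current = replace(self.current, font=name, font_size=size)

    def save(self) -> None:  # q operator
        self._saved.append(self.current)

    def restore(self) -> None:  # Q operator
        if self._saved:
            self.current = self._saved.pop()


def fonts_in_effect(ops: List[Tuple]) -> List[Tuple[str, Optional[str]]]:
    """Return (text, font) pairs for each Tj in a simplified operator list."""
    stack = StateStack()
    shown = []
    for op in ops:
        name = op[0]
        if name == "q":
            stack.save()
        elif name == "Q":
            stack.restore()
        elif name == "Tf":
            stack.set_font(op[1], op[2])
        elif name == "Tj":
            shown.append((op[1], stack.current.font))
    return shown


# Illustrative operator sequence: F2 is set inside q ... Q, so after Q
# the font in effect must be F1 again.
example_ops = [
    ("Tf", "F1", 12.0),
    ("q",),
    ("Tf", "F2", 10.0),
    ("Tj", "a"),
    ("Q",),
    ("Tj", "b"),
]
result = fonts_in_effect(example_ops)
```

With this model, the text shown after Q is paired with F1 rather than F2, which is the behaviour the fix is meant to guarantee.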
There are garbled characters, and some text that can be copied inside a viewer cannot be found in the extract_text() output.
Environment
$ python -m platform
Windows-10-10.0.22000-SP0
$ python -c "import PyPDF2;print(PyPDF2.__version__)"
2.6.0
Code + PDF
This is a minimal, complete example that shows the issue (with #1136 applied to fix the KeyError "\W"):
ST辅仁:2019年年度报告.PDF
Original PDF snippets
page 7/254:
Copied from the PDF file: 第二季度 ("second quarter")
TXT file finder
The text editor reports that there is no "第二季度" in the txt file. Note that this is only one example from this PDF file.
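The same check can be done without a text editor. The sketch below is not the script used for this report; it assumes the PyPDF2 2.x API (PdfReader, reader.pages, page.extract_text()), and the helper names pages_containing and extract_page_texts are hypothetical. It lists the pages whose extracted text contains a given string.

```python
from typing import List


def pages_containing(page_texts: List[str], needle: str) -> List[int]:
    """Return the 1-based page numbers whose extracted text contains needle."""
    return [i for i, text in enumerate(page_texts, start=1) if needle in text]


def extract_page_texts(pdf_path: str) -> List[str]:
    """Extract the text of every page with PyPDF2 (assumes the 2.x API)."""
    # Imported here so that pages_containing can be used without PyPDF2.
    from PyPDF2 import PdfReader

    reader = PdfReader(pdf_path)
    return [page.extract_text() or "" for page in reader.pages]


# Usage (the path is a placeholder for the attached PDF):
#     texts = extract_page_texts("report.pdf")
#     print(pages_containing(texts, "第二季度"))
```

An empty list for a string that can be copied from a viewer would reproduce what the text editor search shows.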
Example of garbled characters
�]
-Añ-��J �7 �]
-Añ�h1Ñ*6��
ˆ�J
�:�Ô�p �7 �:#§Añ�h�Ô�C�p
�\ œ
(�à œ
( �7 Eµ�ñ9Ÿ�JLö
��f9Ÿ6Ñ�-�9L€ œ
(
Eµ�ñ�2 �7 "ã ‡Eµ�ñ�2�f9Ÿ�9L€ œ
(
Eµ�ñLö
� �7 Eµ�ñ9Ÿ�JLö
��9L€ œ
(
�09ŸLö
� �7 �0�1�f9Ÿ�ÄLö
��Å�9L€ œ
(
�»"ãG‚�J �7 "ã ‡-1�»"ãG‚�J6Ñ�-�9L€ œ
here is the txt extracted:
ST辅仁:2019年年度报告.txt
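To find which pages are affected without reading the whole txt file, a simple heuristic may help. The garbled sample above contains many U+FFFD replacement characters, so the sketch below measures the fraction of such characters (plus control characters) in a piece of text. The function name garbled_ratio and the idea of thresholding the ratio are assumptions for illustration, not something PyPDF2 provides.

```python
import unicodedata


def garbled_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that look like decoding debris.

    A character counts as debris if it is U+FFFD (the replacement character)
    or belongs to the Unicode control category "Cc".
    """
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    bad = sum(
        1 for c in chars
        if c == "\ufffd" or unicodedata.category(c) == "Cc"
    )
    return bad / len(chars)


# Usage: flag pages above some arbitrary threshold, for example 0.2.
#     suspicious = [i for i, t in enumerate(texts, start=1)
#                   if garbled_ratio(t) > 0.2]
```

Correctly extracted Chinese text scores 0.0 with this measure. Text that was decoded with the wrong character map but happens to produce valid characters would not be caught, so this is only a first filter.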