Decode to wrong character #607
Comments
I believe this is some kind of issue with the PDF generating software HiQPdf 11.1. When I opened the sample PDF in Adobe Acrobat, deleted one letter (the "o" in "month" from "Ngày (Date) 10 tháng (month) 05 năm (year) 2023"), typed it back in, and saved the file, it ran through PdfParser just fine, handling every Đ like a champ. The same thing happens when editing no text at all but saving the PDF as a "Reduced size PDF" in Adobe Acrobat: all letters get interpreted correctly. I think there's something to be said for PdfParser being able to handle mis-coded text (because Adobe sure can, just by loading it), but I don't think this is a bug in PdfParser per se. It's probably more of an enhancement to handle this kind of situation.
@huynq55 can you check that again please? If it is as @GreyWyvern said, I am in favor of closing this issue.
I don't think this should be closed, because Adobe can open the file and read the characters properly, so PdfParser should be able to do that too. What we can be sure of is that Adobe and HiQPdf 11.1 are saving these bytes (or fonts?) in different ways. That's the first place to look.
Just returning to this to see if it's affected by PR 614. This particular issue is not a font issue, but an issue with the way HiQPdf 11.1 is saving the bytes. Whereas in 614 whole blocks of text were being assigned the wrong character map, in this case substrings within correctly encoded blocks are being saved in a weird way. This is one such block. Parts of it are encoded fine, but a central portion has been changed. Adobe can read this, so there is some way to correct it, but it doesn't have anything to do with incorrectly specified fonts.
The sample file 1C23TAZ_0000178321.pdf is now extracting properly in the latest release v2.7.0 and was probably fixed by #597.
Description:
In my PDF file, there are strings containing the character Đ, for example: "Địa chỉ". The characters 'ị' 'a' 'c' 'h' 'ỉ' are each encoded with 2 bytes, while the character 'Đ' is encoded with 3 bytes. I found this by inspecting the PDF file, but I don't understand why 'Đ' takes 3 bytes. PdfParser doesn't detect this and decodes 2 bytes at a time, so the pairs are misaligned and every character in the string is decoded incorrectly.
PDF input
1C23TAZ_0000178321.pdf
Expected output & actual output
Expected output: Địa chỉ
Actual output: non-readable text
Bytes sequence: 01 5c 62 04 cf 00 44 00 03 00 46 00 4b 04 cd
01 5c 62 => can't decode
04 cf => ị
00 44 => a
00 03 => space
00 46 => c
00 4b => h
04 cd => ỉ
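The misalignment described above can be reproduced in a few lines. This is only a sketch, not PdfParser's actual code, and the helper names `splitPairs` and `unescapeLiteral` are hypothetical. It also tries one possible explanation that nobody in this thread has confirmed: inside a PDF literal string, byte 5c is a backslash and `\b` is the escape for backspace (byte 08), so `01 5c 62` may simply be the 2-byte code `01 08` written in escaped form. Whether `01 08` is the code the font maps to Đ is an assumption.

```php
<?php
// Byte sequence from the report (15 bytes, so it cannot split evenly into pairs).
$raw = hex2bin(str_replace(' ', '', '01 5c 62 04 cf 00 44 00 03 00 46 00 4b 04 cd'));

// Hypothetical helper: naive 2-byte decoding, returning each code as hex.
function splitPairs(string $bytes): array
{
    return array_map('bin2hex', str_split($bytes, 2));
}

// Hypothetical helper: resolve a minimal subset of PDF literal-string escapes.
// Assumption: the bytes come from a literal string. Octal \ddd escapes are
// not handled in this sketch.
function unescapeLiteral(string $bytes): string
{
    $map = [
        'n' => "\n", 'r' => "\r", 't' => "\t", 'b' => "\x08",
        'f' => "\x0c", '(' => '(', ')' => ')', '\\' => '\\',
    ];
    $out = '';
    for ($i = 0, $len = strlen($bytes); $i < $len; $i++) {
        if ($bytes[$i] === '\\' && $i + 1 < $len && isset($map[$bytes[$i + 1]])) {
            $out .= $map[$bytes[++$i]]; // consume the escaped character too
        } else {
            $out .= $bytes[$i];
        }
    }
    return $out;
}

$naive = splitPairs($raw);
$fixed = splitPairs(unescapeLiteral($raw));

echo implode(' ', $naive), "\n"; // 015c 6204 cf00 4400 0300 4600 4b04 cd
echo implode(' ', $fixed), "\n"; // 0108 04cf 0044 0003 0046 004b 04cd
```

With the escape resolved, the string is 14 bytes long and the remaining pairs line up with the codes listed above (04 cf, 00 44, 00 03, 00 46, 00 4b, 04 cd).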
Code
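require 'vendor/autoload.php'; // assumes a Composer install of smalot/pdfparser
use Smalot\PdfParser\Parser;

$parser = new Parser();
$document = $parser->parseFile($file); // $file: path to the sample PDF
$data = $document->getText();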