dedupe_chars() method get error #842

154192 · 2023-03-21T03:33:01Z

300218_2011.pdf

jsvine · 2023-03-21T20:06:08Z

Thanks for raising this issue @154192. The immediate issue appears to be that the fontnames for some of the PDF's characters are being read as bytes (e.g., b'RGJSAP+\xcb\xce\xcc\xe5') instead of strings. I'm not yet sure whether this is an issue with the PDF itself or with pdfminer.six, the library pdfplumber uses as its PDF parser. I hope to take a closer look soon.

Came across this bit of code, which helps to solve some of the mystery in issues #461 and #842: https://git.ghostscript.com/?p=mupdf.git;a=blob;f=source/pdf/pdf-font.c;h=6322cedf2c26cfb312c0c0878d7aff97b4c7470e;hb=HEAD#l774

Now, for every char's fontname, we:

- Check whether it's a `str` or `bytes`
- If the latter, check whether it's one of the well-known codes from the link above
- If so, use that (preserving the part, if present, before the `+`)
- If not, just cast to `str`
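For readers following along, the steps above can be sketched roughly as below. This is only an illustration, not the code that shipped: `KNOWN_FONTNAMES` and `normalize_fontname` are hypothetical names, the two table entries are examples (the byte sequences are the GBK encodings of 宋体 and 黑体), and the replacement strings and the fallback decoding are assumptions.

```python
# Hypothetical lookup table: GBK-encoded font-name bytes -> readable name.
# Entries and replacement strings are illustrative assumptions.
KNOWN_FONTNAMES = {
    b"\xcb\xce\xcc\xe5": "SimSun",  # GBK bytes for 宋体
    b"\xba\xda\xcc\xe5": "SimHei",  # GBK bytes for 黑体
}


def normalize_fontname(fontname):
    """Return a str fontname, whether the parser gave us str or bytes."""
    # Step 1: strings pass through untouched.
    if isinstance(fontname, str):
        return fontname

    # Split off the subset prefix (the part before "+"), if there is one.
    if b"+" in fontname:
        prefix, suffix = fontname.split(b"+", 1)
        prefix_str = prefix.decode("ascii", errors="backslashreplace") + "+"
    else:
        prefix_str, suffix = "", fontname

    # Step 2: use the well-known name when the bytes match a table entry.
    if suffix in KNOWN_FONTNAMES:
        return prefix_str + KNOWN_FONTNAMES[suffix]

    # Step 3: otherwise fall back to a plain string conversion, keeping
    # non-ASCII bytes visible as backslash escapes.
    return prefix_str + suffix.decode("ascii", errors="backslashreplace")


print(normalize_fontname(b"RGJSAP+\xcb\xce\xcc\xe5"))
```

With the example fontname from this issue, the subset prefix `RGJSAP+` is preserved and only the part after the `+` is replaced.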

jsvine · 2023-04-13T13:14:25Z

This should now be fixed in v0.9.0, but let me know if it's still not working for you.

154192 added the bug label Mar 21, 2023

jsvine mentioned this issue Apr 13, 2023

v0.9.0 #862

Merged

jsvine closed this as completed Apr 13, 2023

dependabot bot mentioned this issue May 1, 2023

Bump pdfplumber from 0.7.6 to 0.9.0 AdamGagorik/bany#39

Merged
