dedupe_chars() method get error #842

Closed
154192 opened this issue Mar 21, 2023 · 2 comments

154192 commented Mar 21, 2023

image
300218_2011.pdf

154192 added the bug label Mar 21, 2023

jsvine (Owner) commented Mar 21, 2023

Thanks for raising this issue @154192. The immediate issue appears to be that the fontnames for some of the PDF's characters are being read as bytes (e.g., b'RGJSAP+\xcb\xce\xcc\xe5') instead of strings. I'm not yet sure whether this is an issue with the PDF itself or with pdfminer.six, the library pdfplumber uses as its PDF parser. I hope to take a closer look soon.
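
To make the diagnosis above concrete, here is a minimal Python sketch. It rests on two assumptions that are not confirmed in this thread: that the bytes after the `+` are a GBK-encoded font name, and that the failure comes from comparing `str` and `bytes` fontnames with each other. `can_sort_together` is a hypothetical helper for illustration, not part of pdfplumber or pdfminer.six.

```python
# The raw fontname quoted in the comment above.
raw_fontname = b"RGJSAP+\xcb\xce\xcc\xe5"

# Split the subset prefix from the font name proper.
prefix, _, suffix = raw_fontname.partition(b"+")

# Assumption: the suffix is GBK-encoded text.
decoded_suffix = suffix.decode("gbk")


def can_sort_together(fontnames):
    """Return True if the fontnames can be ordered against each other.

    Hypothetical helper: Python 3 refuses to compare str with bytes,
    so a mixed list raises TypeError when sorted.
    """
    try:
        sorted(fontnames)
    except TypeError:
        return False
    return True


print(type(raw_fontname).__name__, prefix, decoded_suffix)
print(can_sort_together(["Helvetica", raw_fontname]))
```

If that assumption holds, any step that orders or groups characters by fontname could fail as soon as one fontname is `bytes` and another is `str`.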

jsvine added a commit that referenced this issue Apr 13, 2023
Came across this bit of code, which helps to solve some of the mystery
in issues #461 and #842:

https://git.ghostscript.com/?p=mupdf.git;a=blob;f=source/pdf/pdf-font.c;h=6322cedf2c26cfb312c0c0878d7aff97b4c7470e;hb=HEAD#l774

Now, for every char's fontname, we:

- Check whether it's a `str` or `bytes`
    - If the latter, we check whether it's one of the well-known codes from
      the link above
        - If so, we use that (preserving the part, if present, before
          the `+`)
        - If not, we just cast to str
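
The steps in the list above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from the commit: `KNOWN_FONTNAME_CODES` and `normalize_fontname` are hypothetical names, the two table entries are illustrative and should be checked against the linked MuPDF source, and "cast to str" is interpreted here as taking the `repr` of the bytes without its `b'...'` wrapper.

```python
# Illustrative subset of byte codes mapped to readable font names.
# Assumption: verify these entries against the linked MuPDF source.
KNOWN_FONTNAME_CODES = {
    b"\xcb\xce\xcc\xe5": "SimSun,Regular",
    b"\xba\xda\xcc\xe5": "SimHei,Regular",
}


def normalize_fontname(fontname):
    """Return a str fontname whether the input is str or bytes."""
    if isinstance(fontname, str):
        return fontname

    # Preserve the part before the "+", if present.
    if b"+" in fontname:
        prefix, _, suffix = fontname.partition(b"+")
        prefix_str = prefix.decode("latin-1") + "+"
    else:
        prefix_str, suffix = "", fontname

    if suffix in KNOWN_FONTNAME_CODES:
        return prefix_str + KNOWN_FONTNAME_CODES[suffix]

    # Fallback: cast to str, dropping the leading b' and trailing '.
    return prefix_str + str(suffix)[2:-1]


print(normalize_fontname(b"RGJSAP+\xcb\xce\xcc\xe5"))
```

With a lookup like this, every fontname ends up as a `str`, so later comparisons between characters no longer mix types.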
jsvine mentioned this issue Apr 13, 2023

jsvine (Owner) commented Apr 13, 2023

This should now be fixed in v0.9.0, but let me know if it's still not working for you.
