Cannot decode contents of annotations #463

Closed
tungph opened this issue Jul 5, 2021 · 5 comments

tungph commented Jul 5, 2021

Describe the bug

While trying to get contents from an annotation, I got this error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfe in position 0: invalid start byte

The content for the first annotation is in Japanese.

Code to reproduce the problem

    import pdfplumber
    with pdfplumber.open('test.pdf') as pdf:
        print(pdf.annots)

PDF file

test.pdf

Expected behavior

{'uri': None, 'title': 'tung.phanhuy', 'contents': '日本語'}
{'uri': None, 'title': None, 'contents': None}
{'uri': None, 'title': 'tung.phanhuy', 'contents': '"well"'}
{'uri': None, 'title': None, 'contents': None}
{'uri': None, 'title': 'tung.phanhuy', 'contents': 'table'}
{'uri': None, 'title': None, 'contents': None}

Environment

  • pdfplumber version: 0.5.28
  • Python version: 3.9.5
  • OS: Mac

Additional context

The annotations contain both English and Japanese text.

@tungph tungph added the bug label Jul 5, 2021
Author

tungph commented Jul 5, 2021

@jsvine @samkit-jain
To work around the problem, I fall back to decoding with UTF-16 if UTF-8 fails:

    # page.py@125

    for k, v in extras.items():
        if v is not None:
            try:
                extras[k] = v.decode('utf-8')
            except UnicodeDecodeError:
                extras[k] = v.decode('utf-16')
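
For readers who want to try the fallback outside pdfplumber, here is a minimal standalone sketch of the same idea. The function name `decode_annotation_value` and the sample bytes are hypothetical and are not taken from pdfplumber or from test.pdf. The byte 0xfe at position 0 in the error above is consistent with a UTF-16 big-endian byte-order mark (FE FF), which Python's `utf-16` codec uses to pick the byte order.

```python
def decode_annotation_value(raw: bytes) -> str:
    """Decode raw annotation bytes, trying UTF-8 first and UTF-16 second."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # The "utf-16" codec reads a leading byte-order mark
        # (FE FF or FF FE) to choose the endianness.
        return raw.decode("utf-16")


# Hypothetical sample values (assumed, not extracted from test.pdf).
japanese = b"\xfe\xff" + "日本語".encode("utf-16-be")
print(decode_annotation_value(japanese))
print(decode_annotation_value(b"table"))
```

Plain ASCII values still take the UTF-8 path, so only values that fail UTF-8 decoding reach the fallback.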

@tungph tungph closed this as completed Jul 5, 2021
@tungph tungph reopened this Jul 5, 2021
@jsvine jsvine self-assigned this Jul 10, 2021
Owner

jsvine commented Jul 10, 2021

Thank you for flagging this @tungph. I will look into this.


devWhyqueue commented Sep 18, 2021

Got the same problem trying to extract hyperlinks.

The following code:

    def extract_urls(self):
        with pdfplumber.open(self._file) as pdf:
            return {uri_obj['uri'] for uri_obj in pdf.hyperlinks}

raises a UnicodeDecodeError with this traceback:

      File "...", line 184, in extract_urls
        return {uri_obj['uri'] for uri_obj in pdf.hyperlinks}
      File ".../pdf.py", line 98, in hyperlinks
        return list(itertools.chain(*gen))
      File ".../pdf.py", line 97, in <genexpr>
        gen = (p.hyperlinks for p in self.pages)
      File ".../page.py", line 155, in hyperlinks
        return [a for a in self.annots if a["uri"] is not None]
      File ".../page.py", line 151, in annots
        return list(map(parse, raw))
      File ".../page.py", line 127, in parse
        extras[k] = v.decode("utf-8")
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf6 in position 8: invalid start byte

Would highly appreciate a fix in upcoming versions!
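
A note on this second error: the failing byte here is 0xf6 at position 8, not a byte-order mark at position 0, so the value may be single-byte text (0xf6 is "ö" in Latin-1) rather than UTF-16. That is an assumption, since the PDF is not attached. Under that assumption, a decoder could check for the UTF-16 big-endian byte-order mark explicitly and fall back to a single-byte decoding otherwise. The sketch below is standalone; `decode_pdf_text` and the sample URI are hypothetical, and Latin-1 stands in as an approximation of the PDF single-byte encoding.

```python
def decode_pdf_text(raw: bytes) -> str:
    """Decode PDF string bytes by checking for a UTF-16BE byte-order mark first."""
    if raw.startswith(b"\xfe\xff"):
        # Byte-order mark present: the rest is UTF-16 big-endian.
        return raw[2:].decode("utf-16-be")
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Assumption: anything else is single-byte text. Latin-1 maps
        # every byte to a character, so this branch never raises.
        return raw.decode("latin-1")


# Hypothetical URI whose byte at position 8 is 0xf6 ("ö" in Latin-1).
uri_bytes = "https://öko.example/".encode("latin-1")
print(decode_pdf_text(uri_bytes))
```

Checking the byte-order mark before anything else avoids guessing UTF-16 for input that is not UTF-16, where a blind fallback could raise again or return unreadable text.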

jsvine added a commit that referenced this issue Oct 15, 2021
Thanks to @tungph for the fix proposal.
Owner

jsvine commented Oct 15, 2021

Thanks again @tungph for raising this issue and suggesting a fix, and @devWhyqueue for seconding. The commit above should fix this once merged, and the fix should be available in the next release.

jsvine added a commit that referenced this issue Oct 20, 2021
Handle utf-16-encoded annotations (#463)
Owner

jsvine commented Dec 24, 2021

This fix is now part of the latest release, v0.6.0.

@jsvine jsvine closed this as completed Dec 24, 2021