Cannot decode contents of annotations #463
Comments
@jsvine @samkit-jain
    # page.py@125
    for k, v in extras.items():
        if v is not None:
            try:
                extras[k] = v.decode('utf-8')
            except UnicodeDecodeError:
                extras[k] = v.decode('utf-16')
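The fallback proposed above can be tried in isolation, without a PDF. The sketch below is only an illustration: `decode_annotation_value` is a hypothetical helper name, not part of pdfplumber, and the sample bytes are constructed by hand on the assumption that the annotation contents arrive as UTF-16 big-endian bytes with a leading byte-order mark.

```python
import codecs


def decode_annotation_value(raw):
    """Hypothetical helper: decode raw bytes as UTF-8, falling back to UTF-16."""
    # Pass through values that are missing or already decoded.
    if raw is None or isinstance(raw, str):
        return raw
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # The 'utf-16' codec reads the byte-order mark to pick the byte order
        # and drops it from the decoded result.
        return raw.decode("utf-16")


# Hand-built sample (an assumption, not taken from test.pdf):
# a big-endian byte-order mark followed by UTF-16-BE encoded Japanese text.
sample = codecs.BOM_UTF16_BE + "日本語".encode("utf-16-be")

print(decode_annotation_value(sample))
print(decode_annotation_value(b"plain ASCII"))
```

Trying UTF-8 first keeps the existing behaviour for annotations that already decode cleanly; only values that fail take the UTF-16 path.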
Thank you for flagging this @tungph. I will look into this.
Got the same problem trying to extract hyperlinks. The following code:
    def extract_urls(self):
        with pdfplumber.open(self._file) as pdf:
            return {uri_obj['uri'] for uri_obj in pdf.hyperlinks}
raises a UnicodeDecodeError with this traceback:
Would highly appreciate a fix in upcoming versions!
Thanks to @tungph for the fix proposal.
Thanks again @tungph for raising this issue and the suggested fix, and @devWhyqueue for seconding. The commit above should fix this once merged and should be available in the next release.
Handle utf-16-encoded annotations (#463)
This fix is now part of the latest release.
Describe the bug
While trying to get contents from an annotation, I got this error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfe in position 0: invalid start byte
The content for the first annotation is in Japanese.
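The byte 0xfe at position 0 is consistent with a UTF-16 big-endian byte-order mark (0xFE 0xFF), which PDF text strings may start with when they are not in PDFDocEncoding. The following sketch reproduces the same kind of failure on hand-built bytes; the sample text is an assumption and is not taken from test.pdf.

```python
import codecs

# Hand-built stand-in for an annotation's raw contents (an assumption):
# a UTF-16 big-endian byte-order mark followed by Japanese text.
raw = codecs.BOM_UTF16_BE + "注釈".encode("utf-16-be")

failure = None
try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    # Keep a reference; the name bound by `except ... as` is cleared afterwards.
    failure = exc

if failure is not None:
    print(f"UTF-8 failed at position {failure.start}: {failure.reason}")
    print(f"offending byte: {hex(raw[failure.start])}")

# The same bytes decode once the byte-order mark is honoured.
print(raw.decode("utf-16"))
```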
Code to reproduce the problem
PDF file
test.pdf
Expected behavior
Environment
Additional context
The content for the annotations is in both English and Japanese