Merge pull request #519 from jsvine/issue/463
Handle utf-16-encoded annotations (#463)
jsvine authored Oct 20, 2021
2 parents e1d851a + df98f9c commit 0c30a53
Showing 4 changed files with 14 additions and 1 deletion.
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -12,6 +12,7 @@ All notable changes to this project will be documented in this file. The format

 ### Fixed
 - Fix slowdown in `.extract_words(...)`/`WordExtractor.iter_chars_to_words(...)` on very long words, caused by repeatedly re-calculating bounding box. ([#483](https://github.com/jsvine/pdfplumber/discussions/483))
+- Handle `UnicodeDecodeError` when trying to decode utf-16-encoded annotations ([#463](https://github.com/jsvine/pdfplumber/issues/463)) [h/t @tungph]

 ### Development Changes
 - Add `CONTRIBUTING.md` ([#428](https://github.com/jsvine/pdfplumber/pull/428))
5 changes: 4 additions & 1 deletion pdfplumber/page.py
@@ -124,7 +124,10 @@ def parse(annot):
             }
             for k, v in extras.items():
                 if v is not None:
-                    extras[k] = v.decode("utf-8")
+                    try:
+                        extras[k] = v.decode("utf-8")
+                    except UnicodeDecodeError:
+                        extras[k] = v.decode("utf-16")

             parsed = {
                 "page_number": self.page_number,
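The fallback in the hunk above can be sketched in isolation. This is a minimal illustration only: `decode_annotation_text` is a hypothetical helper name, not part of pdfplumber, and the sample bytes assume the annotation string carries a UTF-16 big-endian byte order mark (`FE FF`), which Python's `utf-16` codec uses to pick the byte order.

```python
def decode_annotation_text(raw: bytes) -> str:
    """Hypothetical helper mirroring the try/except added in page.py."""
    try:
        # Most annotation values decode cleanly as UTF-8.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Fall back to UTF-16; the codec reads and strips a leading BOM.
        return raw.decode("utf-16")


# Plain ASCII/UTF-8 input takes the first branch.
ascii_text = decode_annotation_text(b"hello")

# BOM-prefixed UTF-16 input is invalid UTF-8 (0xFE cannot start a
# UTF-8 sequence), so it takes the fallback branch.
utf16_bytes = b"\xfe\xff" + "日本語".encode("utf-16-be")
utf16_text = decode_annotation_text(utf16_bytes)
```

One limitation of this approach is worth noting: UTF-16 bytes without a BOM can happen to be valid UTF-8 (for example, ASCII text encoded as UTF-16 becomes ASCII bytes interleaved with NULs), in which case the first branch succeeds and returns the wrong text.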
Binary file added tests/pdfs/issue-463-example.pdf
Binary file not shown.
9 changes: 9 additions & 0 deletions tests/test_issues.py
@@ -187,3 +187,12 @@ def test_issue_386(self):
         with pdfplumber.open(path) as pdf:
             chars = (char for char in pdf.chars)
             pdfplumber.utils.extract_text(chars)
+
+    def test_issue_463(self):
+        """
+        Extracting annotations should not raise UnicodeDecodeError on utf-16 text
+        """
+        path = os.path.join(HERE, "pdfs/issue-463-example.pdf")
+        with pdfplumber.open(path) as pdf:
+            annots = pdf.annots
+            assert annots[0]["contents"] == "日本語"
