BUG: layout mode text extraction ZeroDivisionError (#2417)
For fonts without an explicitly defined width for the " " character, it's still possible to generate a ZeroDivisionError when compiling TextStateParams objects in _fixed_width_page.recurs_to_target_op() if the font size or the Tz parameter has been set to 0. 

Discovered during processing of a "pre-OCR'd" image PDF having `{"/BaseFont": "/GlyphLessFont"}`.
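As a rough illustration of the failure mode, the sketch below uses the glyph displacement formula from the PDF specification, tx = (w0 * Tfs + Tc + Tw) * Th, to show how a space width of 0 arises when the font size or Tz is 0. The function name `glyph_displacement` and its parameters are hypothetical and are not pypdf's actual implementation.

```python
def glyph_displacement(
    w0: float,
    font_size: float,
    tz_percent: float = 100.0,
    tc: float = 0.0,
    tw: float = 0.0,
) -> float:
    """Horizontal displacement of one glyph in text space (hypothetical helper).

    w0 is the glyph width divided by 1000, font_size is Tfs, tz_percent is
    the Tz operand (horizontal scaling in percent), tc and tw are character
    and word spacing.
    """
    return (w0 * font_size + tc + tw) * (tz_percent / 100.0)


# Assumed space glyph width of 250/1000 purely for illustration.
normal_space = glyph_displacement(0.25, font_size=12.0)
zero_size_space = glyph_displacement(0.25, font_size=0.0)
zero_tz_space = glyph_displacement(0.25, font_size=12.0, tz_percent=0.0)

# Floor-dividing by a zero-width space is what raises the error.
try:
    10.0 // zero_size_space
    raised = False
except ZeroDivisionError:
    raised = True
```

Either a zero font size (with no character or word spacing) or a zero Tz collapses the displacement to 0, so any later division by that value fails.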

DOC: Remove duplicate docstring for layout_mode_strip_rotated
shartzog authored Jan 21, 2024
1 parent facd6fd commit 9e494c6
Showing 2 changed files with 3 additions and 4 deletions.
2 changes: 0 additions & 2 deletions pypdf/_page.py
@@ -2027,8 +2027,6 @@ def extract_text(
 layout_mode_strip_rotated (bool): layout mode does not support rotated text.
 Set to False to include rotated text anyway. If rotated text is discovered,
 layout will be degraded and a warning will result. Defaults to True.
-layout_mode_strip_rotated: Removes text that is rotated w.r.t. to the page from
-layout mode output. Defaults to True.
 layout_mode_debug_path (Path | None): if supplied, must target a directory.
 creates the following files with debug information for layout mode
 functions if supplied:
5 changes: 3 additions & 2 deletions pypdf/_text_extraction/_layout_mode/_fixed_width_page.py
@@ -143,8 +143,9 @@ def recurs_to_target_op(
 # multiply by bool (_idx != bt_idx) to ensure spaces aren't double
 # applied to the first tj of a BTGroup in fixed_width_page().
 excess_tx = round(_tj.tx - last_displaced_tx, 3) * (_idx != bt_idx)
-
-new_text = f'{" " * int(excess_tx // _tj.space_tx)}{_tj.txt}'
+# space_tx could be 0 if either Tz or font_size was 0 for this _tj.
+spaces = int(excess_tx // _tj.space_tx) if _tj.space_tx else 0
+new_text = f'{" " * spaces}{_tj.txt}'

 last_ty = _tj.ty
 _text = f"{_text}{new_text}"
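The guard in this hunk can be exercised in isolation. The following is a minimal sketch, assuming plain float inputs instead of pypdf's `TextStateParams` objects; `pad_with_spaces` is a hypothetical name used only for illustration.

```python
def pad_with_spaces(txt: str, excess_tx: float, space_tx: float) -> str:
    """Prefix txt with as many spaces as fit into excess_tx.

    Mirrors the guarded expression from the diff: when space_tx is 0
    (for example because Tz or the font size was 0), no spaces are
    added instead of dividing by zero.
    """
    spaces = int(excess_tx // space_tx) if space_tx else 0
    # A negative count yields an empty string, so overlapping text
    # is simply not padded.
    return f'{" " * spaces}{txt}'


padded = pad_with_spaces("abc", excess_tx=9.0, space_tx=3.0)
unpadded = pad_with_spaces("abc", excess_tx=9.0, space_tx=0.0)
overlapping = pad_with_spaces("abc", excess_tx=-4.0, space_tx=3.0)
```

Falling back to zero spaces keeps the extracted text intact at the cost of losing horizontal positioning for that fragment, which seems a reasonable trade-off when the space width itself is undefined.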
