KeyError raised when laparams set #383

Closed
alexreg opened this issue Mar 18, 2021 · 3 comments

@alexreg
Contributor

alexreg commented Mar 18, 2021

Describe the bug

I get the following exception when calling page.objects["char"]. Note, this only occurs when I open the PDF with laparams set.

  File "foo.py", line 30
    chars = left_column.objects["char"]
  File "/usr/local/lib/python3.9/site-packages/pdfplumber/page.py", line 358, in objects
    self._objects = self.crop_fn(self.parent_page.objects, self.bbox)
  File "/usr/local/lib/python3.9/site-packages/pdfplumber/page.py", line 358, in objects
    self._objects = self.crop_fn(self.parent_page.objects, self.bbox)
  File "/usr/local/lib/python3.9/site-packages/pdfplumber/utils.py", line 477, in crop_to_bbox
    return dict((k, crop_to_bbox(v, bbox)) for k, v in objs.items())
  File "/usr/local/lib/python3.9/site-packages/pdfplumber/utils.py", line 477, in <genexpr>
    return dict((k, crop_to_bbox(v, bbox)) for k, v in objs.items())
  File "/usr/local/lib/python3.9/site-packages/pdfplumber/utils.py", line 481, in crop_to_bbox
    cropped = list(filter(None, (clip_obj(obj, bbox) for obj in objs)))
  File "/usr/local/lib/python3.9/site-packages/pdfplumber/utils.py", line 481, in <genexpr>
    cropped = list(filter(None, (clip_obj(obj, bbox) for obj in objs)))
  File "/usr/local/lib/python3.9/site-packages/pdfplumber/utils.py", line 422, in clip_obj
    overlap = get_bbox_overlap(obj_to_bbox(obj), bbox)
KeyError: 'x0'

Code to reproduce the problem

import pdfplumber
from decimal import Decimal

with pdfplumber.open("serials.pdf", laparams = {}) as pdf:
	for page in pdf.pages:
		contents = page.crop(
			(
				Decimal(100),
				Decimal(70 + 200 if page.page_number == 1 else 0),
				page.width - Decimal(100),
				page.height - Decimal(70),
			),
		)
		left_column = contents.crop(
			(
				Decimal(0),
				Decimal(0),
				contents.width * Decimal(0.5),
				contents.height,
			),
			relative = True,
		)
		right_column = contents.crop(
			(
				contents.width * Decimal(0.5),
				Decimal(0),
				contents.width,
				contents.height,
			),
			relative = True,
		)

		chars = left_column.objects["char"]

PDF file

https://mathscinet.ams.org/msnhtml/serials.pdf

Expected behavior

No error (exception) should be raised.

Actual behavior

The above exception (KeyError) is raised.

Screenshots

left_column: [screenshot]

right_column: [screenshot]

Environment

  • pdfplumber version: 0.5.27
  • Python version: 3.9.2
  • OS: macOS 11

Additional context

None

@alexreg alexreg added the bug label Mar 18, 2021
@jsvine jsvine self-assigned this Mar 18, 2021
@jsvine
Owner

jsvine commented Mar 18, 2021

Thanks for flagging. I'll take a look.

jsvine added a commit that referenced this issue Mar 19, 2021
pdfminer.six's `LTAnno` objects are not PDF annotations (which we
already provide access to via `.annots`, regardless of whether
`laparams` is set), but rather layout annotations. Per pdfminer.six
codebase:

> Note that, while a LTChar object has actual boundaries, LTAnno objects
> does not, as these are "virtual" characters, inserted by a layout
> analyzer according to the relationship between two characters (e.g. a
> space).

Because they have no boundaries, they cause problems for pdfplumber,
which expects bounding-box coordinates for all objects. See, e.g.,
issue #383, which this commit should fix.
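
The failure mode described in the commit message can be sketched with plain dictionaries. This is only an illustration under stated assumptions, not the code from the referenced commit: `get_bbox`, `has_bbox`, `drop_unbounded`, the `"anno"` object type, and the sample data are all hypothetical stand-ins, and pdfplumber's own helpers may differ.

```python
from operator import itemgetter

# Hypothetical stand-in for a bounding-box lookup; the real pdfplumber helper may differ.
BBOX_KEYS = ("x0", "top", "x1", "bottom")
get_bbox = itemgetter(*BBOX_KEYS)


def has_bbox(obj):
    """Return True if obj carries all four bounding-box coordinates."""
    return all(key in obj for key in BBOX_KEYS)


def drop_unbounded(objs):
    """Keep only objects that have a bounding box and can therefore be cropped."""
    return [obj for obj in objs if has_bbox(obj)]


# Illustrative data: one real character and one "virtual" space of the kind
# a layout analyzer might insert (assumed shape, no coordinates).
objs = [
    {"object_type": "char", "text": "A", "x0": 10, "top": 20, "x1": 16, "bottom": 32},
    {"object_type": "anno", "text": " "},
]

# Looking up a bounding box on every object fails on the coordinate-less one.
try:
    [get_bbox(obj) for obj in objs]
except KeyError as exc:
    missing_key = exc.args[0]

# Filtering first leaves only objects that are safe to crop.
bboxes = [get_bbox(obj) for obj in drop_unbounded(objs)]
```

Filtering by the presence of coordinates, rather than by object type, is one possible design choice here; it keeps the sketch independent of how such objects happen to be labelled.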
@jsvine
Owner

jsvine commented Mar 19, 2021

Thanks for flagging this, @alexreg. Commit/PR above should handle this. I'll close this issue when/if the PR is merged.

@jsvine
Owner

jsvine commented Aug 31, 2021

This was fixed by the PR above; belatedly closing this issue.

@jsvine jsvine closed this as completed Aug 31, 2021