Customizing underlying pdfminer laparams fails #168

Closed
frascuchon opened this issue Jan 8, 2020 · 1 comment

@frascuchon

I'm trying to customize the underlying pdfminer text extraction process to better tune my pipeline by setting some LAParams values, but I run into this error:

Traceback (most recent call last):
  File "/usr/local/anaconda3/envs/biome/lib/python3.7/unittest/case.py", line 59, in testPartExecutor
    yield
  File "/usr/local/anaconda3/envs/biome/lib/python3.7/unittest/case.py", line 615, in run
    testMethod()
  File "/Users/frascuchon/recognai/pdfplumber/tests/test_laparams_customization.py", line 20, in test_load_with_custom_laparams
    print(first_page.chars)
  File "/Users/frascuchon/recognai/pdfplumber/pdfplumber/container.py", line 35, in chars
    return self.objects.get("char", [])
  File "/Users/frascuchon/recognai/pdfplumber/pdfplumber/page.py", line 66, in objects
    self._objects = self.parse_objects()
  File "/Users/frascuchon/recognai/pdfplumber/pdfplumber/page.py", line 167, in parse_objects
    process_object(obj)
  File "/Users/frascuchon/recognai/pdfplumber/pdfplumber/page.py", line 140, in process_object
    for k, v in obj.__dict__.items()
  File "/Users/frascuchon/recognai/pdfplumber/pdfplumber/page.py", line 141, in <genexpr>
    if k not in IGNORE)
KeyError: 'index'

You can check on your own with this code snippet:

import pdfplumber

path = "example.pdf"  # placeholder: replace with the path to any PDF file

with pdfplumber.open(path, laparams=dict(line_margin=0.2)) as pdf:
    print(f"Found {len(pdf.pages)} pages")
    first_page = pdf.pages[0]
    print(first_page.chars)
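
For readers skimming the traceback: the KeyError is raised inside the generator expression in process_object that walks obj.__dict__. The snippet below is a minimal, self-contained sketch of that failure pattern, not pdfplumber's actual code. The names CONVERTERS, IGNORE, FakeChar, FakeTextBox and process_object_tolerant are hypothetical, and it assumes the per-attribute handling is a dictionary lookup with no entry for an extra `index` attribute of the kind layout objects carry once laparams is set.

```python
# Hypothetical stand-ins for illustration; none of these names come from pdfplumber.
IGNORE = {"_objs"}                                    # attributes skipped entirely
CONVERTERS = {"x0": float, "x1": float, "text": str}  # attributes we know how to handle


class FakeChar:
    """Object whose attributes all have a registered converter."""

    def __init__(self):
        self.x0, self.x1, self.text = "1.0", "2.5", "a"
        self._objs = []


class FakeTextBox(FakeChar):
    """Layout-style object that carries an extra, unregistered attribute."""

    def __init__(self):
        super().__init__()
        self.index = 0  # no converter registered for this key


def process_object(obj):
    # Strict lookup: any attribute missing from CONVERTERS raises KeyError.
    return dict(
        (k, CONVERTERS[k](v))
        for k, v in obj.__dict__.items()
        if k not in IGNORE
    )


def identity(value):
    return value


def process_object_tolerant(obj):
    # One possible mitigation (an assumption, not necessarily the real fix):
    # pass unknown attributes through unchanged instead of failing.
    return dict(
        (k, CONVERTERS.get(k, identity)(v))
        for k, v in obj.__dict__.items()
        if k not in IGNORE
    )
```

Calling process_object(FakeTextBox()) fails with KeyError: 'index', while process_object_tolerant(FakeTextBox()) keeps the attribute as-is. The change that actually landed in the library may take a different approach.
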
jsvine added a commit that referenced this issue Jan 13, 2020
Issue #168 / PR #169. Many thanks to @frascuchon for submitting the PR,
which is the source for the code/idea in this commit.
@jsvine
Owner

jsvine commented Jan 13, 2020

Fixed in df00787 and now available in v0.5.16.

@jsvine jsvine closed this as completed Jan 13, 2020