extract_text(extra_attrs=["size"]) raises a parsing error #1030

Closed
RitaMarques opened this issue Oct 27, 2023 · 2 comments

Describe the bug

Reading a simple PDF with the extract_text method and passing the list ["size", "fontname"] to extra_attrs raises the following error:

    404 def extract_text(self, **kwargs: Any) -> str:
--> 405     return self.get_textmap(**kwargs).as_string

TypeError: unhashable type: 'list'

Code to reproduce the problem

import pdfplumber

with pdfplumber.open("Condioes_Gerais_Abertura_Conta.pdf") as pdf:
    page = pdf.pages[0]
    print(page.extract_text(layout=True, use_text_flow=True, extra_attrs=["size", "fontname"]))

PDF file

Condioes_Gerais_Abertura_Conta.pdf

Environment

  • pdfplumber version: 0.10.2
  • Python version: 3.10.7
  • OS: Windows

jsvine commented Oct 27, 2023

Hi @RitaMarques, and thanks for flagging this, which was indeed a bug.

Although there was already a test for Page.extract_words(extra_attrs=[...]), there wasn't yet one for Page.extract_text(extra_attrs=[...]). The addition of a caching layer caused this error to be thrown, since list kwargs can't be hashed to build the cache key.
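To illustrate the failure mode described above, here is a minimal sketch that assumes a cache in the style of functools.lru_cache. The function cached_lookup is a hypothetical stand-in and is not pdfplumber code; it only shows that a list keyword argument cannot be hashed into a cache key, while a tuple can.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def cached_lookup(extra_attrs=()):
    # Hypothetical stand-in for a cached method; not part of pdfplumber.
    return ", ".join(extra_attrs)


# A tuple is hashable, so it can become part of the cache key.
tuple_result = cached_lookup(extra_attrs=("size", "fontname"))

# A list is not hashable, so building the cache key fails
# before the function body ever runs.
try:
    cached_lookup(extra_attrs=["size", "fontname"])
    list_error = None
except TypeError as exc:
    list_error = str(exc)

print(tuple_result)
print(list_error)
```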

This is now solved in 0bfffc2 by pre-processing the kwargs to convert lists into tuples.
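The general idea of converting lists to tuples before a cached call could look roughly like the sketch below. It is not the code from the commit: freeze_kwargs, describe, and _describe_cached are hypothetical names chosen for illustration, and functools.lru_cache is assumed as the caching mechanism.

```python
from functools import lru_cache
from typing import Any, Dict


def freeze_kwargs(kwargs: Dict[str, Any]) -> Dict[str, Any]:
    # Hypothetical helper: copy kwargs, turning list values into tuples
    # so that every value can be hashed for a cache key.
    return {k: tuple(v) if isinstance(v, list) else v for k, v in kwargs.items()}


@lru_cache(maxsize=None)
def _describe_cached(**kwargs: Any) -> str:
    # Hypothetical cached function; it only receives hashable values.
    return ", ".join(f"{k}={v!r}" for k, v in sorted(kwargs.items()))


def describe(**kwargs: Any) -> str:
    # Public entry point: normalize the kwargs, then delegate to the cache.
    return _describe_cached(**freeze_kwargs(kwargs))


# Passing a list no longer raises, because it is converted to a tuple first.
result = describe(layout=True, extra_attrs=["size", "fontname"])
repeat = describe(layout=True, extra_attrs=["size", "fontname"])
hits = _describe_cached.cache_info().hits
```

Normalizing at the public entry point keeps the cached function unchanged and lets callers keep passing lists.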

For now (i.e., before the next release), you can solve your problem by defining the extra_attrs as a tuple instead of a list:

page.extract_text(
  layout=True,
  use_text_flow=True,
  extra_attrs=("size", "fontname")
)

Let us know if that doesn't work for you.

@jsvine jsvine closed this as completed Oct 27, 2023
RitaMarques (author) commented

Hi @jsvine, thanks for getting back to me!
It's solved ;)
