`extract_text(extra_attrs=["size"])` raises a parsing error #1030

RitaMarques · 2023-10-27T11:13:11Z

Describe the bug

Calling extract_text on a simple PDF with the list ["size", "fontname"] passed to extra_attrs raises the following error:

    404 def extract_text(self, **kwargs: Any) -> str:
--> 405     return self.get_textmap(**kwargs).as_string

TypeError: unhashable type: 'list'

Code to reproduce the problem

import pdfplumber

with pdfplumber.open("Condioes_Gerais_Abertura_Conta.pdf") as pdf:
    page = pdf.pages[0]
    print(page.extract_text(layout=True, use_text_flow=True, extra_attrs=["size", "fontname"]))

PDF file

Condioes_Gerais_Abertura_Conta.pdf

Screenshots

Environment

pdfplumber version: 0.10.2
Python version: 3.10.7
OS: Windows


jsvine · 2023-10-27T14:57:15Z

Hi @RitaMarques, and thanks for flagging this, which was indeed a bug.

Although there had been a test for Page.extract_words(extra_attrs=[...]), there wasn't yet one for Page.extract_text(extra_attrs=[...]). The addition of a caching layer caused this error to be thrown, since list kwargs can't be hashed for the cache.

This is now solved in 0bfffc2 by pre-processing the kwargs to convert lists into tuples.

For now (i.e., before the next release), you can solve your problem by defining the extra_attrs as a tuple instead of a list:

page.extract_text(
  layout=True,
  use_text_flow=True,
  extra_attrs=("size", "fontname")
)

Let us know if that doesn't work for you.

RitaMarques · 2023-10-27T15:03:49Z

Hi @jsvine, thanks for getting back to me!
It's solved ;)

RitaMarques added the bug label Oct 27, 2023

RitaMarques changed the title ~~extract_text(extra_attrs=["size"]) raises a parsing error~~ extract_text(extra_attrs=["size"]) raises a parsing error Oct 27, 2023

jsvine closed this as completed Oct 27, 2023
