Provide access to page::text_list #111

stefan6419846 · 2023-02-16T07:45:03Z

The current wrapper implementation only provides access to the page->text method results.

There is a similar text_list method in the original Poppler code (since version 0.63.0?) that provides access to individual words and their bounding boxes. With it, functionality such as selecting a clipping region, re-ordering the text, or filtering out text that is too small can be implemented. This roughly corresponds to the -bbox option of the CLI.

It would be great if the Python wrapper could provide access to the words with their bounding boxes for further post-processing.
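
To illustrate the kind of post-processing meant here, the following is a minimal sketch of clipping to a region and dropping small text once words and their bounding boxes are available. The `Word` record, the `inside` and `clip_and_filter` helpers, and the sample coordinates are all hypothetical and not part of Poppler or of any wrapper; the sketch also assumes a top-left origin with y growing downward.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    """Hypothetical record for one extracted word; not part of any real API."""
    text: str
    x: float       # left edge of the bounding box
    y: float       # top edge of the bounding box
    width: float
    height: float


def inside(word, left, top, right, bottom):
    """Return True if the word's box lies entirely within the clipping region."""
    return (
        word.x >= left
        and word.y >= top
        and word.x + word.width <= right
        and word.y + word.height <= bottom
    )


def clip_and_filter(words, region, min_height):
    """Keep words inside `region` whose box is at least `min_height` tall."""
    return [w for w in words if inside(w, *region) and w.height >= min_height]


# Made-up sample data standing in for the words of one page.
words = [
    Word("Title", 50, 40, 120, 18),
    Word("body", 50, 80, 40, 10),
    Word("footnote", 50, 700, 60, 5),   # too small, filtered out
    Word("margin", 560, 80, 30, 10),    # outside the region, clipped
]
kept = clip_and_filter(words, region=(0, 0, 500, 800), min_height=8)
print([w.text for w in kept])
```

None of this is possible with only the flat string returned by the text method, which is why access to the per-word boxes matters.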


JCGoran · 2023-06-03T18:52:12Z

This is somewhat tangential, but you can use the python-poppler package to achieve this (though admittedly it was a bit unclear at first how to do it).
The code would be something along the lines of:

from poppler import Rectangle, load_from_file

# load the file
pdf = load_from_file("somefile.pdf")

# the argument can be either a 0-based index or a "page label" (whatever the latter is)
# note that this doesn't really "create" a page (in the sense of modifying the
# original or a copy of the PDF), it simply returns a `Page` object
page = pdf.create_page(0)

# go over the text list: each item carries its text and its bounding box
for item in page.text_list():
    print(item.bbox.as_tuple(), item.text)

# getting text from some (rectangular) region
# (example values only; replace them with the region you are interested in)
x, y, width, height = 0, 0, 200, 100
text_in_region = page.text(Rectangle(x, y, width, height))
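
As a follow-up, here is a rough sketch of how such a list of boxes could be re-ordered into lines of text. It only assumes that each item has a `text` attribute and a `bbox` with `x` and `y` attributes, and that y grows downward; both are assumptions to check against the library you use. The `reading_order` and `box` helpers and the `line_tolerance` parameter are made up for illustration, and `box` merely stands in for real `text_list()` entries so the snippet runs without a PDF.

```python
from types import SimpleNamespace


def reading_order(items, line_tolerance=3.0):
    """Group boxes into lines by their top coordinate, then sort each line by x."""
    lines = []  # each entry is (reference_y, [items on that line])
    for item in sorted(items, key=lambda i: i.bbox.y):
        if lines and abs(item.bbox.y - lines[-1][0]) <= line_tolerance:
            # close enough to the current line's reference y: same line
            lines[-1][1].append(item)
        else:
            lines.append((item.bbox.y, [item]))
    return [
        " ".join(i.text for i in sorted(line, key=lambda i: i.bbox.x))
        for _, line in lines
    ]


def box(text, x, y):
    # Stand-in for one text_list() entry; real entries come from the library.
    return SimpleNamespace(text=text, bbox=SimpleNamespace(x=x, y=y))


items = [
    box("world", 90, 101),
    box("Second", 50, 120),
    box("Hello", 50, 100),
    box("line", 100, 119),
]
print(reading_order(items))
```

The tolerance is a crude heuristic: it compares each box against the first box of the current line, so strongly slanted or multi-column text would need something more careful.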

stefan6419846 mentioned this issue Feb 16, 2023

question about how to approach bonding box problem #99

Open
