Provide access to page::text_list #111

Open
stefan6419846 opened this issue Feb 16, 2023 · 1 comment

Comments

@stefan6419846

The current wrapper implementation only provides access to the page->text method results.

The original Poppler code has a similar text_list method (since version 0.63.0?) that provides access to individual words and their bounding boxes. This makes it possible to select a clipping region, re-order the text, or filter out text that is too small. It roughly corresponds to the -bbox option of the CLI.

It would be great if the Python wrapper could provide access to the words with their bounding boxes for further post-processing.
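
To make the request more concrete, here is a minimal sketch of the kind of post-processing that per-word bounding boxes would allow. It does not call any Poppler API: the `Word` container and the `clip` and `drop_small` helpers are hypothetical names, and the coordinates are assumed to be (left, top, right, bottom) with the origin at the top-left of the page.

```python
from typing import List, NamedTuple


class Word(NamedTuple):
    """Hypothetical container for one word and its bounding box."""
    text: str
    left: float
    top: float
    right: float
    bottom: float


def clip(words: List[Word], left: float, top: float,
         right: float, bottom: float) -> List[Word]:
    """Keep only words whose box lies entirely inside the clipping region."""
    return [
        w for w in words
        if w.left >= left and w.top >= top
        and w.right <= right and w.bottom <= bottom
    ]


def drop_small(words: List[Word], min_height: float) -> List[Word]:
    """Discard words whose box height is below min_height."""
    return [w for w in words if (w.bottom - w.top) >= min_height]


# Made-up sample data, only for illustration.
words = [
    Word("Title", 50.0, 40.0, 120.0, 60.0),
    Word("footnote", 50.0, 700.0, 90.0, 705.0),
    Word("margin", 560.0, 40.0, 600.0, 52.0),
]
body = drop_small(clip(words, 0.0, 0.0, 500.0, 800.0), min_height=8.0)
print([w.text for w in body])
```

With the sample data, only the word that is both inside the region and tall enough remains.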

@JCGoran

JCGoran commented Jun 3, 2023

This is somewhat tangential, but you can already do this with the python-poppler package (though admittedly it was a bit unclear at first how).
The code would be something along these lines:

from poppler import Rectangle, load_from_file

# load the file
pdf = load_from_file("somefile.pdf")

# argument can be either 0-based index or a "page label" (whatever the latter is)
# note that this doesn't really "create" a page (in the sense of modifying the
# original or a copy of the PDF), it simply returns a `Page` object
page = pdf.create_page(0)

# go over the text list
for item in page.text_list():
    print(item.bbox.as_tuple(), item.text)

# getting text from some (rectangular) region;
# x, y, width and height are placeholder values, replace them with your own
x, y, width, height = 0, 0, 200, 100
text_in_region = page.text(Rectangle(x, y, width, height))
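
Once you have the words and their boxes, re-ordering the text is plain Python. The following is a rough sketch, not part of python-poppler: `reading_order` and `line_tolerance` are hypothetical names, and it assumes each box is a (left, top, right, bottom) tuple, which is what I would expect `as_tuple()` to return (please verify against your version).

```python
from typing import List, Tuple

# Assumed coordinate order: (left, top, right, bottom)
BBox = Tuple[float, float, float, float]


def reading_order(items: List[Tuple[BBox, str]],
                  line_tolerance: float = 3.0) -> List[str]:
    """Group words into lines by their top coordinate, then sort each line left to right."""
    lines: List[List[Tuple[BBox, str]]] = []
    for bbox, text in sorted(items, key=lambda item: item[0][1]):
        # Append to the current line if this word's top is within the tolerance
        # of that line's first word; otherwise start a new line.
        if lines and abs(bbox[1] - lines[-1][0][0][1]) <= line_tolerance:
            lines[-1].append((bbox, text))
        else:
            lines.append([(bbox, text)])
    return [
        " ".join(text for _, text in sorted(line, key=lambda item: item[0][0]))
        for line in lines
    ]


# Made-up sample data in the same shape as the (bbox tuple, text) pairs printed above.
items = [
    ((80.0, 101.0, 120.0, 112.0), "world"),
    ((50.0, 120.0, 90.0, 131.0), "Second"),
    ((50.0, 100.0, 75.0, 112.0), "Hello"),
    ((95.0, 120.5, 120.0, 131.0), "line"),
]
print(reading_order(items))
```

With a real page, the input could be built as `[(item.bbox.as_tuple(), item.text) for item in page.text_list()]`. The fixed tolerance is a simplification; multi-column layouts or mixed font sizes would need something smarter.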
