The current wrapper implementation only provides access to the results of the `page->text` method.
There is a similar `text_list` method in the original Poppler code (since version 0.63.0?) that provides access to individual words and their bounding boxes. This enables functionality such as selecting a clipping region, re-ordering the text, or filtering out text that is too small. It roughly corresponds to the `-bbox` option of the CLI.
It would be great if the Python wrapper could provide access to the words with their bounding boxes for further post-processing.
This is somewhat tangential, but you can use the python-poppler package to achieve this (though admittedly it was a bit unclear at first how to do it).
The code would be something along the lines of:

    from poppler import load_from_file

    # load the file
    pdf = load_from_file("somefile.pdf")

    # argument can be either 0-based index or a "page label" (whatever the latter is)
    # note that this doesn't really "create" a page (in the sense of modifying the
    # original or a copy of the PDF), it simply returns a `Page` object
    page = pdf.create_page(0)

    # go over the text list
    for item in page.text_list():
        print(item.bbox.as_tuple(), item.text)

    # getting text from some (rectangular) region
    from poppler import Rectangle
    text_in_region = page.text(Rectangle(x, y, width, height))
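
For the post-processing mentioned in the original request (clipping region, filtering out small text, re-ordering), something like the following might work once you have the words and their boxes. This is only a rough sketch that does not depend on python-poppler: `Word`, `post_process`, `min_height`, and `line_tolerance` are hypothetical names made up for illustration, and it assumes each box is given as left/top/right/bottom with the origin at the top-left and y growing downward. How the values from `item.bbox` map onto these fields is something you would need to check against your python-poppler version.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass
class Word:
    """Hypothetical plain record for one word and its bounding box."""
    text: str
    left: float
    top: float
    right: float
    bottom: float


def post_process(
    words: Iterable[Word],
    clip: Tuple[float, float, float, float],
    min_height: float = 0.0,
    line_tolerance: float = 2.0,
) -> List[Word]:
    """Keep words fully inside `clip`, drop words shorter than `min_height`,
    and sort the rest top-to-bottom, then left-to-right."""
    c_left, c_top, c_right, c_bottom = clip
    kept = [
        w for w in words
        if w.left >= c_left and w.right <= c_right
        and w.top >= c_top and w.bottom <= c_bottom
        and (w.bottom - w.top) >= min_height
    ]
    # Crude line grouping: words whose top coordinates fall into the same
    # bucket of size `line_tolerance` are treated as being on one line.
    kept.sort(key=lambda w: (round(w.top / line_tolerance), w.left))
    return kept


if __name__ == "__main__":
    sample = [
        Word("world", 60, 10.5, 95, 22),
        Word("Hello", 10, 10, 55, 22),
        Word("tiny", 10, 100, 40, 104),      # too small, filtered out
        Word("margin", 300, 10, 340, 22),    # outside the clipping region
        Word("Second", 10, 30, 50, 42),
    ]
    for w in post_process(sample, clip=(0, 0, 200, 200), min_height=6):
        print(w.text)
```

The bucket-based line grouping is deliberately simple and can split a line whose words straddle a bucket boundary, so a real document may need a more careful clustering step.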