When boxes_flow is passed as None, the full advanced layout analysis is skipped, and the order of text boxes depends only on their position on the page. This is intentional.
When boxes_flow is set, the text boxes are grouped and analyze is called on each group. That call propagates down, so analyze also runs on the text boxes themselves. When boxes_flow=None, analyze is never called on the text boxes, so the lines inside each box are never sorted and come out in the wrong order.
Note that boxes_flow is not used in the analyze method of text boxes; it is only used for groups of text boxes (which never exist when boxes_flow is disabled).
To fix this, we just need to make sure that analyze is always called on the text boxes, even when they are not grouped.
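The control flow described above can be sketched with a toy model. This is not pdfminer code: ToyLine, ToyTextBox, and layout are hypothetical names invented for illustration, the grouping path is reduced to "analysis reaches every box", and the always_analyze flag only exists to compare the behaviour before and after the proposed fix.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ToyLine:
    text: str
    y1: float  # top edge; a larger value means higher on the page


@dataclass
class ToyTextBox:
    lines: List[ToyLine] = field(default_factory=list)

    @property
    def y1(self) -> float:
        # Top edge of the box is the top edge of its highest line.
        return max(line.y1 for line in self.lines)

    def analyze(self) -> None:
        # Sort the lines top to bottom. Note that boxes_flow plays no role here.
        self.lines.sort(key=lambda line: -line.y1)


def layout(boxes: List[ToyTextBox], boxes_flow: Optional[float],
           always_analyze: bool) -> List[ToyTextBox]:
    """Toy stand-in for the page-level layout step (hypothetical)."""
    if boxes_flow is None:
        if always_analyze:
            # Proposed fix: analyze every box even without grouping.
            for box in boxes:
                box.analyze()
    else:
        # Grouping path, simplified: analysis filters down to every box.
        for box in boxes:
            box.analyze()
    # Order the boxes by their position on the page, top first.
    return sorted(boxes, key=lambda box: -box.y1)


def make_boxes() -> List[ToyTextBox]:
    # Lines are deliberately stored bottom-up to mimic unsorted input.
    return [
        ToyTextBox([ToyLine("second", 700.0), ToyLine("first", 720.0)]),
        ToyTextBox([ToyLine("fourth", 600.0), ToyLine("third", 620.0)]),
    ]


def texts(boxes: List[ToyTextBox]) -> List[str]:
    return [line.text for box in boxes for line in box.lines]


buggy = texts(layout(make_boxes(), boxes_flow=None, always_analyze=False))
fixed = texts(layout(make_boxes(), boxes_flow=None, always_analyze=True))
grouped = texts(layout(make_boxes(), boxes_flow=0.5, always_analyze=False))
```

In this sketch the boxes themselves are ordered correctly in all three cases; only the lines inside each box differ, which matches the symptom described in this report.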
from pdfminer.high_level import extract_pages
from pdfminer.layout import LAParams

la_params = LAParams(boxes_flow=None)
for page in extract_pages("example.pdf", laparams=la_params, caching=False):
    print("*****OUTPUT:*****")
    for element in page:
        print(element)
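To check whether a given text box is affected, a small helper along these lines might be useful. It is a minimal sketch that assumes horizontal text and PDF coordinates where a larger y1 means higher on the page; the function name is hypothetical, and the demo uses stand-in objects rather than real pdfminer layout elements.

```python
from types import SimpleNamespace
from typing import Any, Iterable


def lines_sorted_top_to_bottom(lines: Iterable[Any]) -> bool:
    """Return True if each line starts at or below the one before it.

    Works on any objects that expose a numeric `y1` attribute (top edge).
    """
    tops = [line.y1 for line in lines]
    return all(upper >= lower for upper, lower in zip(tops, tops[1:]))


# Stand-in objects for the demo; real input would be the lines of a text box.
unsorted_box = [SimpleNamespace(y1=700.0), SimpleNamespace(y1=720.0)]
sorted_box = [SimpleNamespace(y1=720.0), SimpleNamespace(y1=700.0)]

unsorted_ok = lines_sorted_top_to_bottom(unsorted_box)
sorted_ok = lines_sorted_top_to_bottom(sorted_box)
```

With pdfminer, each LTTextBoxHorizontal yielded by the loop above could be passed in directly, since iterating a text box yields its lines and each line carries a y1 attribute.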
Bug report
Example PDF
Example code:
Example output:
Expected output: