Overview
When using certain PSMs with certain inputs, the PageIterator::Baseline function produces incorrect results due to a bug when getting line bounding boxes. I noticed this when using psm 8 (single word). This impacts API users trying to get a line's baseline, and also causes incorrect results in CLI output formats that report baseline (.hocr).
Reproducible Example
While this is most noticeable using it->Baseline through the API, the phenomenon can be demonstrated using the CLI with the example image below.
The word in the image is recognized correctly, including having the same bounding box, whether psm is set to 6 (single block) or 8 (single word). However, the latter does not calculate the baseline correctly.
When setting psm to 6, the baseline attribute is set to -0.036 0, which is correct.
However, when setting psm to 8, the baseline attribute is set to -0 -2.005, which is incorrect.
Cause
I investigated, and the root cause is that the PageIterator::Baseline function assumes that the line's bounding box has already been calculated; however, this is not always the case. PageIterator::Baseline gets the line's bounding box using row->bounding_box(), which does not force these values to be calculated. It simply returns the default values (-32767 or 32767) if they were not calculated already.
tesseract/src/ccmain/pageiterator.cpp
Lines 534 to 542 in 215b023
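To make the failure mode concrete, here is a minimal, self-contained model of it. This is not Tesseract's code: Box, Word, Row, Segment, baseline_segment and bounding_box_or_compute are hypothetical stand-ins, and I am assuming the unset box holds 32767 on the left edge and -32767 on the right. It only illustrates how a passive getter can hand sentinel values to the baseline calculation, and how a guarded getter would avoid that.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a bounding box (NOT Tesseract's TBOX).
// Assumption: an unset box holds the sentinels 32767 / -32767.
struct Box {
  std::int16_t left = 32767;
  std::int16_t right = -32767;
  bool is_unset() const { return left > right; }
};

struct Word {
  std::int16_t left;
  std::int16_t right;
};

// Hypothetical stand-in for a text row with a linear baseline y = slope*x + offset.
struct Row {
  std::vector<Word> words;
  double slope = 0.0;
  double offset = 0.0;
  Box cached;  // stays at the sentinel values until recalc_bounding_box() runs

  // Computes the row's horizontal extent from its words.
  void recalc_bounding_box() {
    Box fresh;
    for (const Word& w : words) {
      fresh.left = std::min(fresh.left, w.left);
      fresh.right = std::max(fresh.right, w.right);
    }
    cached = fresh;
  }

  // Passive getter: returns whatever is cached, computed or not.
  Box bounding_box() const { return cached; }

  // Guarded getter: computes the box first if it was never computed.
  Box bounding_box_or_compute() {
    if (cached.is_unset()) recalc_bounding_box();
    return cached;
  }

  double base_line(double x) const { return slope * x + offset; }
};

struct Segment {
  int x1, y1, x2, y2;
};

// Evaluates the baseline at the left and right edges of the given box.
Segment baseline_segment(const Row& row, const Box& box) {
  return {box.left, static_cast<int>(std::lround(row.base_line(box.left))),
          box.right, static_cast<int>(std::lround(row.base_line(box.right)))};
}
```

In this model, a Row whose words span x = 10 to x = 120 yields baseline endpoints at x = 32767 and x = -32767 when baseline_segment is fed row.bounding_box() before any recalculation, and endpoints at x = 10 and x = 120 when it is fed row.bounding_box_or_compute().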
This can be confirmed by adding tprintf statements within the PageIterator::Baseline function:
When run with psm set to 8, this produces the following:
Potential Fixes
I think there are 3 potential approaches for fixing:
1. Check in PageIterator::Baseline for whether the default value is being returned, and if it is, calculate the actual bounding box.
2. Change the row->bounding_box() function to calculate the bounding box if it has never been calculated before.
3. Find where the line bounding boxes are left uncalculated for the affected psm settings, and edit so they are being calculated upon creation.
Environment
Ubuntu 22.04 Jammy