tesseract fails to read simple numbers #4285

embeh · 2024-07-14T14:28:43Z

Current Behavior

I am using pytesseract (which calls /usr/bin/tesseract) to recognize numbers of a gas meter.
Unfortunately, this very often fails to read most numbers and is very unreliable.

The actual command to get the number string from the image is
pytesseract.image_to_string(img, lang='eng', config='--dpi 70 --psm 8 -c tessedit_char_whitelist=,0123456789')

Here is an example image (after some image processing):

When running this through tesseract (as described above), I just get "2734"... :-(

Any ideas how to improve this, given that there never will be anything but numbers from 0-9 in the image...?

Expected Behavior

Correctly read the numbers. For the image example, this should be "4428734"

Suggested Fix

No response

tesseract -v

tesseract 4.1.1
leptonica-1.79.0
libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 2.0.3) : libpng 1.6.37 : libtiff 4.1.0 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.1
Found AVX2
Found AVX
Found FMA
Found SSE
Found libarchive 3.4.0 zlib/1.2.11 liblzma/5.2.4 bz2lib/1.0.8 liblz4/1.9.2 libzstd/1.4.4

Operating System

No response

Other Operating System

Ubuntu 20

uname -a

Linux myhost 4.4.0-19041-Microsoft #4355-Microsoft Thu Apr 12 17:37:00 PST 2024 x86_64 x86_64 x86_64 GNU/Linux

Compiler

No response

CPU

No response

Virtualization / Containers

Ubuntu in WSL2

Other Information

No response

embeh · 2024-07-14T15:48:36Z

OK, the page segmentation mode seems to be the issue here.

Replacing --psm 8 with --psm 7 produces much better results (so does --psm 11 but none of the others) - but I have no idea why.
PSM 8 is advertised as "single word...", isn't that what we have here?
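
A comparison like the one described above can be scripted so that every page segmentation mode sees exactly the same pixels. This is only a sketch: `build_config` and `compare_psm` are hypothetical helper names, the list of modes is an arbitrary selection, and the remaining options are copied from the command in the original report.

```python
def build_config(psm, dpi=70, whitelist=",0123456789"):
    """Build the config string from the report, varying only --psm."""
    return f"--dpi {dpi} --psm {psm} -c tessedit_char_whitelist={whitelist}"


def compare_psm(image_path, modes=(6, 7, 8, 11, 13)):
    """Run one image through several page segmentation modes.

    Returns a dict mapping each PSM number to the stripped OCR text.
    """
    # Imported lazily so build_config can be used without the OCR stack.
    from PIL import Image
    import pytesseract

    img = Image.open(image_path)
    results = {}
    for psm in modes:
        text = pytesseract.image_to_string(
            img, lang="eng", config=build_config(psm)
        )
        results[psm] = text.strip()
    return results
```

Calling `compare_psm("meter.png")` (a placeholder file name, not the image attached to this issue) and printing the resulting dict gives a side-by-side view of how each mode reads the same file.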

DominicMukilan · 2024-07-16T11:00:54Z

Why not close the issue if it's resolved?

embeh · 2024-07-16T11:05:11Z

Well, I think psm 8 should be able to handle this, too, no?

v3ss0n · 2024-07-18T19:37:54Z

It is still an issue. The Tesseract LSTM engine has a very hard time recognizing very simple numbers, while PaddlePaddleOCR recognizes them well.

Here is the result:

7% 7% 23
6 6 8

psm 8 doesn't help.

The legacy engine is better for numbers, but it's totally screwed on letters.

uttaran-das · 2024-08-05T20:18:33Z

Hi @embeh , what kind of image processing techniques did you use?

embeh · 2024-08-06T10:56:42Z

> Hi @embeh , what kind of image processing techniques did you use?

A few simple opencv filters to crop, rotate, deskew the images and to erode some small pixel islands. I can dig up the exact commands if this helps?

uttaran-das · 2024-08-06T19:12:04Z

> > Hi @embeh , what kind of image processing techniques did you use?
>
> A few simple opencv filters to crop, rotate, deskew the images and to erode some small pixel islands. I can dig up the exact commands if this helps?

Try to increase the contrast between the numbers and the background to make them more distinct. This might help. No need for the commands, I was just interested in the processing you had already done.

embeh · 2024-08-06T20:38:38Z

> Try to increase the contrast between the numbers and the background to make them more distinct. This might help. No need for the commands, I was just interested in the processing you had already done.

I don't really understand the motivation. If, for the same pixels, psm 7 works fine but psm 8 does not - why would a change in the image processing make a difference?

In addition, the contrast is as big as it can be: the background is pure white, the text is fully black, i.e. it is a binary image. Any grey you might see is only due to how github renders the image.

amitdo · 2024-08-23T18:24:56Z

> PSM 8 is advertised as "single word...", isn't that what we have here?

What we have here is a line with several digits separated by spaces. IMO, there is no good reason to consider this line as one word.

If it was 'H e l l o' then you could call it a word, but inside a text line, Tesseract will still consider any big enough horizontal white space as a word separator.

amitdo · 2024-08-23T19:24:00Z

tesseract 4.1.1 is too old and we don't support it.

You said you get a better result with psm 7, but you didn't provide the output with this psm.

embeh · 2024-08-23T20:09:15Z

> tesseract 4.1.1 is too old and we don't support it.

OK. Unfortunately that seems to be the latest offered by the default Ubuntu repository (and pytesseract?).

> You said you get a better result with psm 7, but you didn't provide the output with this psm.

--psm 7 produces the output "4428734"
--psm 8 produces the output "4L2B734"

Both were run on the identical image file.
You should be able to reproduce this by downloading the image above and running it through tesseract.

So the result is not completely wrong, and it seems not to force the result to multiple words or such. It just messes up the "4" and the "8".
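
Because a wrong reading such as "4L2B734" or "2734" still comes back as an ordinary string, a small sanity check before accepting a value can catch these cases. The sketch below is an assumption-laden illustration: `check_reading` is a hypothetical helper, and the expected length of 7 digits is taken only from the example reading in this issue.

```python
import re


def check_reading(text, expected_digits=7):
    """Return the reading if it is exactly `expected_digits` ASCII digits.

    Returns None for anything else (letters, wrong length, empty output),
    so the caller can retry with another PSM or flag the image.
    expected_digits=7 is an assumption based on the example "4428734".
    """
    cleaned = text.strip()
    if re.fullmatch(rf"[0-9]{{{expected_digits}}}", cleaned):
        return cleaned
    return None
```

With this check, the psm 7 output is accepted while the psm 8 output and the truncated "2734" are rejected instead of being silently stored.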

> What we have here is a line with several digits separated by spaces. IMO, there is no good reason to consider this line as one word.

OK. These numbers come from an analog counter (think old car's mileage counter), so they are rather "monospaced".
I certainly could use image processing to squeeze them together some more but what makes me wonder is that psm 7 simply does the job without such hacks.

Don't get me wrong - I found a solution that works for me; now all I am trying to do is provide feedback to help make this an even better piece of software...

embeh · 2024-08-23T20:21:02Z

> > PSM 8 is advertised as "single word...", isn't that what we have here?
>
> What we have here is a line with several digits separated by spaces. IMO, there is no good reason to consider this line as one word.

I just did a test and manually moved the individual digits closer to each other (without changing any of the black pixels):

...and you are correct! Now I get this:

--psm 7: "4428734"
--psm 8: "4428734"

So both report the same correct numbers only because of the spacing. Interesting!
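
The manual experiment above (moving digits closer without touching any black pixel) could be automated roughly as follows. This is a sketch, not the procedure actually used in this thread: `squeeze_gaps` and its `max_gap` parameter are hypothetical, and the image is assumed to be a binary image given as a list of equal-length pixel rows with black (0) text on a white (255) background.

```python
def squeeze_gaps(rows, max_gap=2, ink=0):
    """Shrink runs of blank columns wider than `max_gap` down to `max_gap`.

    `rows` is a binary image as a list of equal-length rows; `ink` is the
    pixel value of the text. Columns containing ink are never modified,
    only surplus all-background columns are dropped (including margins).
    """
    if not rows:
        return []
    width = len(rows[0])
    # A column is blank when no row has an ink pixel in it.
    blank = [all(row[x] != ink for row in rows) for x in range(width)]

    keep = []
    run = 0  # length of the current run of blank columns
    for x in range(width):
        run = run + 1 if blank[x] else 0
        if run <= max_gap:
            keep.append(x)
    return [[row[x] for x in keep] for row in rows]
```

For an image held in a NumPy array, `squeeze_gaps(img.tolist())` would produce the narrowed rows. A suitable `max_gap` depends on the digit size and would need to be tuned; too small a value could merge neighbouring digits.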

amitdo · 2024-08-23T21:34:20Z

For psm 8 with the first image, let's say there is room for improvement...

Tesseract is very popular open source software. We get a lot of questions, bug reports and suggestions, but the team is tiny (4 people currently) and we're all volunteers.

amitdo added the digits label Aug 23, 2024
