tesseract process never finishes with specific gif image #3369

wix-andriusb · 2021-03-29T10:21:57Z

Environment

tesseract 4.1.1

Reproduced on macOS and Linux.

uname -a
Darwin VL-C02WL1AYHTD6 19.6.0 Darwin Kernel Version 19.6.0: Tue Nov 10 00:10:30 PST 2020; root:xnu-6153.141.10~1/RELEASE_X86_64 x86_64
Linux ocr-5b7bf86f6-f6qsd 5.4.65-wix #1 SMP Thu Nov 19 15:24:12 UTC 2020 x86_64 GNU/Linux

Current Behavior:

Running tesseract on the command line on this image, https://bentkus.eu/ocr_while_true.gif, does not finish within 1 hour:

tesseract ocr_while_true.gif ocr_while_true --dpi 150

Expected Behavior:

The process should finish within 2 minutes.

Suggested Fix:

I'll try to build and see why it never stops

upd. (by @egorpugin):
test png - https://bentkus.eu/ocr_while_loop.png


stweil · 2021-03-29T10:35:28Z

That GIF file is special: it includes 125 images. How should Tesseract handle animated GIF images? Create OCR for all images, or only for the first one, or refuse to process such files?

amitdo · 2021-03-29T10:38:08Z

This is a gif animation.

Convert it to static images and give them to tesseract as input.

stweil · 2021-03-29T10:38:55Z

Other issues where OCR never finishes: #2196, #2288.

stweil · 2021-03-29T10:42:17Z

> Convert it to static images [...]

The static images work fine. Nevertheless handling of animated GIF images has to be well defined, see my question above.

wix-andriusb · 2021-03-29T10:43:47Z

OK, thanks for the advice. I should handle this on my side: check for GIF input, split it into frames, and analyze each frame.
I tried other GIFs and saw them finish, so I assumed this one would work too.

wix-andriusb · 2021-03-29T10:44:54Z

Can you maybe tell me what tool you used to create the static images?

egorpugin · 2021-03-29T10:45:10Z

> How should Tesseract handle animated GIF images? Create OCR for all images, or only for the first one, or refuse to process such files?

My answer is 'Create OCR for all images'

stweil · 2021-03-29T10:46:55Z

> Can you maybe tell me what tool you used to create the static images?

I used convert FROM.gif TO.png.

wix-andriusb · 2021-03-29T10:52:22Z

> My answer is 'Create OCR for all images'

This could be configurable through arguments, with the default being to run OCR on all images.

egorpugin · 2021-03-29T10:55:12Z

It should be something like cat 1.txt 2.txt 3.txt ...
When we pass multiple images, they all must be processed. Same for multipage images (.gif, .tiff) if such format is enabled in leptonica.

stweil · 2021-03-29T10:58:28Z

So the handling of animated GIF should be similar to multipage TIFF (which either processes all pages or a selected page as far as I remember).

Maybe in a first step throwing an "unimplemented" error is easier. I am not sure how Leptonica supports animated GIF.

stweil · 2021-03-29T11:03:41Z

> The static images work fine.

I was mistaken. Not all static images work fine. The first one which looks empty ~~does not terminate~~ requires more than 4 minutes.

amitdo · 2021-03-29T11:05:32Z

We depend on Leptonica for image IO. Can it handle gif animation? @DanBloomberg

What we need from Leptonica:

- Detect if the image contains animation.
- If it does, make it possible to iterate over the images (frames) in the file, returning one pix at each iteration.

This way we can treat it like we treat multi-page tiff.

wix-andriusb · 2021-03-29T11:12:08Z

This image is the offender: a blank page with a specific color.

amitdo · 2021-03-29T11:14:41Z

> The first one which looks empty does not terminate.

So Leptonica probably only sees the first image and returns it as pix.

amitdo · 2021-03-29T11:19:31Z

Please attach the first image.

egorpugin · 2021-03-29T11:35:19Z

I've recorded the first N GBs of debug logs in the infinite loop.

Smooothing part at:Bounding box=(-1888,1064)->(-1884,1067)
Smooothing part at:Bounding box=(-1886,1066)->(-1882,1069)
Smooothing part at:Bounding box=(-1884,1067)->(-1879,1071)
Smooothing part at:Bounding box=(-1874,1074)->(-1870,1077)
Smooothing part at:Bounding box=(-1825,1070)->(-1814,1080)
Smooothing part at:Bounding box=(-1788,1071)->(-1784,1074)

Is this a Tesseract-specific thing or a bug? There are negative numbers in the bounding boxes.

stweil · 2021-03-29T11:58:32Z

I have now run the latest Tesseract production code on the original animated GIF image. The image is processed, and Tesseract returns a "result" for the first included image. This takes 4:26 minutes, so it finishes, but that is rather long for an image which looks empty to me but obviously includes lots of small colour variations (otherwise the PNG file would be much smaller).

stweil · 2021-03-29T12:00:57Z

@wix-andriusb, how long did you wait for "never finished"? Depending on your machine, it might take at least 4 minutes, but maybe also 20 minutes. Of course this can nevertheless be considered as a bug.

stweil · 2021-03-29T12:04:15Z

> Is it tess specific thing or a bug? negative numbers in bbox

The original image is 1080 x 1920, so those box coordinates look definitely strange, not only because they are negative, but also because the absolute x values exceed the image width.
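The plausibility check described here can be sketched in a few lines of Python. This is only an illustration, not Tesseract code: `parse_box` and `box_inside_image` are hypothetical helper names, and the sketch assumes the logged values are ordered (left, bottom)->(right, top) and that the image is 1080 pixels wide and 1920 pixels high.

```python
import re

# Matches the box part of log lines such as
# "Smooothing part at:Bounding box=(-1888,1064)->(-1884,1067)"
BOX_RE = re.compile(r"Bounding box=\((-?\d+),(-?\d+)\)->\((-?\d+),(-?\d+)\)")


def parse_box(line):
    """Return (left, bottom, right, top) as ints, or None if no box is found."""
    match = BOX_RE.search(line)
    if match is None:
        return None
    return tuple(int(group) for group in match.groups())


def box_inside_image(box, width, height):
    """True if the box lies completely inside a width x height image."""
    left, bottom, right, top = box
    return 0 <= left <= right <= width and 0 <= bottom <= top <= height


line = "Smooothing part at:Bounding box=(-1888,1064)->(-1884,1067)"
box = parse_box(line)
print(box, box_inside_image(box, 1080, 1920))
```

For the logged box above, the check fails because both x values are negative.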

egorpugin · 2021-03-29T12:06:19Z

We can try to cut the image to, let's say, 50x50 and check it.

egorpugin · 2021-03-29T12:09:21Z

Is it possible to implement faster pixel counting?

wix-andriusb · 2021-03-29T12:10:52Z

Locally (macOS) I have installed 4.1.1 with brew. I'm running it and I'm past 8 minutes now.

4.0.0 on a Linux server was running for hours before it got killed.

egorpugin · 2021-03-29T12:21:24Z

For 50x100 reduced image -

Smooothing part at:Bounding box=(-100,1)->(0,5)
Smooothing part at:Bounding box=(-100,6)->(0,10)
Smooothing part at:Bounding box=(-100,11)->(0,15)
Smooothing part at:Bounding box=(-100,16)->(0,20)
Smooothing part at:Bounding box=(-100,21)->(0,25)
Smooothing part at:Bounding box=(-100,26)->(0,30)
Smooothing part at:Bounding box=(-100,31)->(0,35)
Smooothing part at:Bounding box=(-100,36)->(0,40)
Smooothing part at:Bounding box=(-100,41)->(0,45)
Smooothing part at:Bounding box=(-100,46)->(0,50)
Smooothing part at:Bounding box=(-100,1)->(0,5)
Smooothing part at:Bounding box=(-100,6)->(0,10)
Smooothing part at:Bounding box=(-100,11)->(0,15)
Smooothing part at:Bounding box=(-100,16)->(0,20)
Smooothing part at:Bounding box=(-100,21)->(0,25)
Smooothing part at:Bounding box=(-100,26)->(0,30)
Smooothing part at:Bounding box=(-100,31)->(0,35)
Smooothing part at:Bounding box=(-100,36)->(0,40)
Smooothing part at:Bounding box=(-100,41)->(0,45)
Smooothing part at:Bounding box=(-100,46)->(0,50)

Full log of that loop -
1.txt

DanBloomberg · 2021-03-29T20:14:09Z

Referring to Amit's comment: I attempted to implement writing of animated GIF about 4 years ago, but failed. I left questions for the GIF inventor/maintainer, but he did not engage. So I implemented writing of animated WebP instead.

Never tried reading animated gif into a pixa.

DanBloomberg · 2021-03-29T20:23:45Z

And if someone shows me how to tell if a gif file is an animated gif, I'll use it in the gif reader to skip ("not supported") reading.
I believe that would mostly solve this issue.
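One possible way to tell whether a GIF file holds more than one image is to walk its block structure and count the image descriptors (blocks introduced by byte 0x2C). The following is a rough Python sketch, not Leptonica code: `count_gif_frames` and `is_animated_gif` are hypothetical names, and the sketch assumes a well-formed file (truncated input raises an error instead of being handled gracefully).

```python
def _skip_sub_blocks(data, pos):
    """Skip a chain of data sub-blocks; return the position after the terminator."""
    while True:
        size = data[pos]
        pos += 1
        if size == 0:  # a zero-length sub-block ends the chain
            return pos
        pos += size


def _color_table_size(packed):
    """Size in bytes of a color table described by a packed flags byte."""
    return 3 * (1 << ((packed & 0x07) + 1))


def count_gif_frames(data):
    """Count image descriptors in the raw bytes of a GIF file."""
    if data[:6] not in (b"GIF87a", b"GIF89a"):
        raise ValueError("not a GIF file")
    packed = data[10]  # flags byte of the logical screen descriptor
    pos = 13  # 6-byte header + 7-byte logical screen descriptor
    if packed & 0x80:  # global color table present
        pos += _color_table_size(packed)
    frames = 0
    while pos < len(data):
        block = data[pos]
        pos += 1
        if block == 0x3B:  # trailer: end of the GIF data stream
            break
        if block == 0x21:  # extension: one label byte, then sub-blocks
            pos = _skip_sub_blocks(data, pos + 1)
        elif block == 0x2C:  # image descriptor: one frame
            frames += 1
            packed = data[pos + 8]  # flags follow left, top, width, height
            pos += 9
            if packed & 0x80:  # local color table present
                pos += _color_table_size(packed)
            pos += 1  # LZW minimum code size
            pos = _skip_sub_blocks(data, pos)
        else:
            raise ValueError("unexpected block type 0x%02x" % block)
    return frames


def is_animated_gif(data):
    """Treat any GIF with more than one image as animated."""
    return count_gif_frames(data) > 1
```

A caller would read the file with `open(path, "rb").read()` and pass the bytes in. Note that the sketch counts images only; it does not inspect delay times or loop extensions.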

stweil · 2021-03-29T20:50:16Z

I don't think there is a high desire to have advanced OCR support for animated GIF files. That's a very special, rare need. Obviously the first image in an animated GIF is already read and processed with the current code. Processing all images in a file can be done with a simple external conversion.

So the animated GIF issue has very low priority for me.

The huge time which is required to process an image without visible content is more important for me, as I expect that "normal" scans with text can suffer from extended processing time, too. And OCR processing time has high priority.

egorpugin · 2021-03-29T21:53:08Z

Funny, I tried to optimize the hot path using pixCountPixelsInRect instead of pixCountPixels. I thought it wouldn't create a new pix, but it does exactly the same as the commented code on the left side.

DanBloomberg · 2021-03-29T22:43:18Z

Both pixRasterop() and pixCountPixels() are optimized, so using them together -- first cropping the rectangle with rasterop and then counting the ON pixels -- is very efficient.

egorpugin · 2021-03-30T08:29:30Z

But is it possible to count pixels directly on the original pix?

DanBloomberg · 2021-03-31T00:29:09Z

Yes, of course, but it would be a bit complicated to do it efficiently.
The 1 bpp image has 32 pixels in each word.
Each raster line in general would have a partial word at the beginning, a series of complete words, and a partial word at the end.
And the first partial word might be the only one that has any pixels, so you have to worry about that case as well.
You would need to mask and shift the two partial words before running them, byte by byte, through the table that counts ON pixels.
You can see how this is done for the last partial word in pixCountPixels().

You are welcome to extend pixCountPixels() to take an arbitrary rectangle :-)
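The steps above can be sketched in Python to make the word handling concrete. This is an illustration under stated assumptions, not Leptonica code: `count_pixels_in_rect` and `popcount32` are hypothetical names, each raster line is modelled as a list of 32-bit integers, and the leftmost pixel of a word is assumed to be its most significant bit. With that bit order, masking alone isolates the partial words, so no shift is needed here.

```python
# 256-entry table: TAB8[b] is the number of 1 bits in byte value b.
TAB8 = [bin(b).count("1") for b in range(256)]


def popcount32(word, tab8=TAB8):
    """Count ON bits in a 32-bit word, byte by byte through the table."""
    return (tab8[word & 0xFF] + tab8[(word >> 8) & 0xFF]
            + tab8[(word >> 16) & 0xFF] + tab8[(word >> 24) & 0xFF])


def count_pixels_in_rect(rows, x, y, w, h, tab8=TAB8):
    """Count ON pixels in the w x h rectangle whose top-left pixel is (x, y)."""
    if w <= 0 or h <= 0:
        return 0
    first_word, first_bit = divmod(x, 32)
    last_word, last_bit = divmod(x + w - 1, 32)
    # Keep pixels first_bit..31 of the first word (pixel 0 is the MSB).
    head_mask = 0xFFFFFFFF >> first_bit
    # Keep pixels 0..last_bit of the last word.
    tail_mask = (0xFFFFFFFF << (31 - last_bit)) & 0xFFFFFFFF
    total = 0
    for row in rows[y:y + h]:
        if first_word == last_word:
            # The rectangle fits in one word: apply both masks to it.
            total += popcount32(row[first_word] & head_mask & tail_mask, tab8)
            continue
        total += popcount32(row[first_word] & head_mask, tab8)
        for word in row[first_word + 1:last_word]:  # complete words
            total += popcount32(word, tab8)
        total += popcount32(row[last_word] & tail_mask, tab8)
    return total
```

The single-word branch covers the case mentioned above, where the first partial word is the only one that has any pixels.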

egorpugin · 2021-03-31T00:38:03Z

I'm also thinking here about an 8 bpp b/w image to speed up such calculations. The question is how this would increase overall memory consumption. Do we really need 1 bpp in tess?

DanBloomberg · 2021-03-31T01:04:15Z

All this pixel counting is for 1 bpp.
With 8 bpp it is much simpler to do most calculations efficiently. For example, with 8 bpp you might be making histograms.

stweil · 2021-03-31T17:36:18Z

@egorpugin, are you sure that pixCountPixels is the bottleneck here? gprof shows that most of the time is spent in 406709317 calls of GridSearch which calls 1355965899 times std::_Hashtable, so find and insert for the std::unordered_set are the time critical operations. Obviously that set is not small, so those operations cost a lot of time.

It is possible to optimize the code and use only insert, no find, but that gives only a very small improvement.

egorpugin · 2021-03-31T17:40:42Z

On Windows I see that pixCountPixels is the slowest part.
See #3369 (comment)

DanBloomberg · 2021-03-31T18:02:43Z

Make sure that you are calling pixCountPixelsInRect() with tab8 defined as the 4th arg.
Otherwise, each time pixCountPixels() is called, it has to make the 256-entry tab8.
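The cost pattern described here can be mimicked in Python. This is only an analogy to the optional table argument, not the Leptonica API: `make_tab8`, `count_on_pixels`, and the `TABLE_BUILDS` counter are hypothetical names introduced for illustration, and the packed 1 bpp data is modelled as plain bytes.

```python
TABLE_BUILDS = 0  # counts how often the table is built, for illustration only


def make_tab8():
    """Build the 256-entry table: tab8[b] is the number of 1 bits in byte b."""
    global TABLE_BUILDS
    TABLE_BUILDS += 1
    return [bin(b).count("1") for b in range(256)]


def count_on_pixels(data, tab8=None):
    """Count ON pixels in packed 1 bpp data given as bytes."""
    if tab8 is None:
        tab8 = make_tab8()  # rebuilt on every call when no table is passed
    return sum(tab8[b] for b in data)


regions = [bytes([0xFF, 0x0F]), bytes([0x01]), bytes([0xAA, 0x55, 0x00])]

# Without a caller-supplied table, each call builds its own table.
slow = [count_on_pixels(region) for region in regions]

# With a shared table, it is built once and reused for every call.
tab8 = make_tab8()
fast = [count_on_pixels(region, tab8) for region in regions]

print(slow == fast, TABLE_BUILDS)
```

Both variants return the same counts; the difference is only in how many times the table is constructed, which matters when the function sits in a hot loop.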

amitdo · 2021-05-14T16:19:45Z

https://bentkus.eu/ocr_while_loop.png

With the code from #3418, the processing ends after less than half a second when Sauvola binarization is used.

wix-andriusb changed the title ~~Endless loop~~ tesseract process never finishes with specific gif image Mar 29, 2021

stweil added the bug label Mar 29, 2021

stweil added this to the 6.0.0 milestone Mar 29, 2021

amitdo added feature request leptonica labels Mar 29, 2021

amitdo added the process hangs label Mar 29, 2021

stweil added the performance label Apr 4, 2021

This was referenced Apr 4, 2021

Tesseract seemingly stuck #3377

Open

Optimize Gridsearch and proprietary list classes #3380

Merged

amitdo mentioned this issue Jun 14, 2022

Animated GIF DanBloomberg/leptonica#626

Closed

amitdo added the binarization label Aug 27, 2024
