tesseract process never finishes with specific gif image #3369

Open
wix-andriusb opened this issue Mar 29, 2021 · 37 comments

@wix-andriusb

wix-andriusb commented Mar 29, 2021

Environment

tesseract 4.1.1

reproduced on macosx and linux

uname -a
Darwin VL-C02WL1AYHTD6 19.6.0 Darwin Kernel Version 19.6.0: Tue Nov 10 00:10:30 PST 2020; root:xnu-6153.141.10~1/RELEASE_X86_64 x86_64
Linux ocr-5b7bf86f6-f6qsd 5.4.65-wix #1 SMP Thu Nov 19 15:24:12 UTC 2020 x86_64 GNU/Linux

Current Behavior:

Running tesseract on the command line on this image, https://bentkus.eu/ocr_while_true.gif, does not finish even after 1 hour.

tesseract ocr_while_true.gif ocr_while_true --dpi 150

Expected Behavior:

process should finish in 2 minutes

Suggested Fix:

I'll try to build and see why it never stops

Update (by @egorpugin):
Test PNG: https://bentkus.eu/ocr_while_loop.png

@wix-andriusb changed the title from "Endless loop" to "tesseract process never finishes with specific gif image" on Mar 29, 2021
@stweil
Member

stweil commented Mar 29, 2021

That GIF file is special: it includes 125 images. How should Tesseract handle animated GIF images? Create OCR for all images, or only for the first one, or refuse to process such files?

@amitdo
Collaborator

amitdo commented Mar 29, 2021

This is a gif animation.

Convert it to static images and give them to tesseract as input.

@stweil
Member

stweil commented Mar 29, 2021

Other issues where OCR never finishes: #2196, #2288.

@stweil stweil added the bug label Mar 29, 2021
@stweil
Member

stweil commented Mar 29, 2021

Convert it to static images [...]

The static images work fine. Nevertheless handling of animated GIF images has to be well defined, see my question above.

@wix-andriusb
Author

OK, thanks for the advice. I should handle this on my side: check for a GIF, slice it into frames, and analyze each one.
I tried other GIFs and saw them finish, so I assumed this one would work too.

@wix-andriusb
Author

wix-andriusb commented Mar 29, 2021

Can you maybe tell me what tool you used to create the static images?

@egorpugin
Contributor

How should Tesseract handle animated GIF images? Create OCR for all images, or only for the first one, or refuse to process such files?

My answer is 'Create OCR for all images'

@stweil
Member

stweil commented Mar 29, 2021

Can you maybe tell me what tool you used to create the static images?

I used ImageMagick's convert: convert FROM.gif TO.png.

@wix-andriusb
Author

My answer is 'Create OCR for all images'

This could be configurable through arguments, with the default being to OCR all images.

@egorpugin
Contributor

It should be something like cat 1.txt 2.txt 3.txt ...
When we pass multiple images, they all must be processed. Same for multipage images (.gif, .tiff) if such format is enabled in leptonica.
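A minimal Python sketch of the behaviour proposed here: OCR every page (or frame) in order and concatenate the text, with an optional single-page selection. The names `ocr_multipage`, `ocr_page` and `page_index` are hypothetical and are not part of Tesseract or Leptonica; the OCR step is passed in as a callable so the sketch stays independent of any real API.

```python
from typing import Callable, Optional, Sequence


def ocr_multipage(
    pages: Sequence[bytes],
    ocr_page: Callable[[bytes], str],
    page_index: Optional[int] = None,
) -> str:
    """Return OCR text for all pages, or for one selected page.

    pages      -- decoded pages/frames of one input file (hypothetical form)
    ocr_page   -- callable that turns one page into text (stand-in for OCR)
    page_index -- if given, only that page is processed
    """
    if page_index is not None:
        return ocr_page(pages[page_index])
    # Default: process every page in order, like `cat 1.txt 2.txt 3.txt`.
    return "".join(ocr_page(page) for page in pages)
```

With a real OCR backend, `ocr_page` would wrap whatever call produces text for a single image.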

@stweil
Member

stweil commented Mar 29, 2021

So the handling of animated GIF should be similar to multipage TIFF (which either processes all pages or a selected page as far as I remember).

Maybe in a first step throwing an "unimplemented" error is easier. I am not sure how Leptonica supports animated GIF.

@stweil
Member

stweil commented Mar 29, 2021

The static images work fine.

I was mistaken. Not all static images work fine. The first one which looks empty does not terminate (correction: it requires more than 4 minutes).

@amitdo
Collaborator

amitdo commented Mar 29, 2021

We depend on Leptonica for image IO. Can it handle gif animation? @DanBloomberg

What we need from Leptonica:

  • Detect if the image contains animation.
  • If it does, make it possible to iterate over the images (frames) in the file, returning one pix at each iteration.

This way we can treat it like we treat multi-page tiff.

@stweil stweil added this to the 6.0.0 milestone Mar 29, 2021
@wix-andriusb
Author

this image is the offender, a blank page with specific color

@amitdo
Collaborator

amitdo commented Mar 29, 2021

The first one which looks empty does not terminate.

So Leptonica probably only sees the first image and returns it as pix.

@amitdo
Collaborator

amitdo commented Mar 29, 2021

Please attach the first image.

@egorpugin
Contributor

I've recorded the first N GB of debug logs from the infinite loop.

Smooothing part at:Bounding box=(-1888,1064)->(-1884,1067)
Smooothing part at:Bounding box=(-1886,1066)->(-1882,1069)
Smooothing part at:Bounding box=(-1884,1067)->(-1879,1071)
Smooothing part at:Bounding box=(-1874,1074)->(-1870,1077)
Smooothing part at:Bounding box=(-1825,1070)->(-1814,1080)
Smooothing part at:Bounding box=(-1788,1071)->(-1784,1074)

Is it tess specific thing or a bug? negative numbers in bbox

@stweil
Member

stweil commented Mar 29, 2021

I have now run the latest Tesseract production code on the original animated GIF image. The image is processed, and Tesseract returns a "result" for the first included image. This takes 4:26 minutes, so it does finish, but that is rather long for an image which looks empty to me but obviously includes lots of small colour variations (otherwise the PNG file would be much smaller).

@stweil
Member

stweil commented Mar 29, 2021

@wix-andriusb, how long did you wait for "never finished"? Depending on your machine, it might take at least 4 minutes, but maybe also 20 minutes. Of course this can nevertheless be considered as a bug.

@stweil
Member

stweil commented Mar 29, 2021

Is it tess specific thing or a bug? negative numbers in bbox

The original image is 1080 x 1920, so those box coordinates definitely look strange, not only because they are negative, but also because the absolute x values exceed the image width.
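The check described here can be sketched in Python: parse the bounding boxes out of the debug lines and list those that fall outside the image. The log format is taken from the lines quoted in this thread; the helper name `boxes_outside_image` is hypothetical, and treating coordinates as plain image pixels is an assumption (Tesseract may use a different internal coordinate system at this stage).

```python
import re

# Matches e.g. "Bounding box=(-1888,1064)->(-1884,1067)"
BOX_RE = re.compile(r"Bounding box=\((-?\d+),(-?\d+)\)->\((-?\d+),(-?\d+)\)")


def boxes_outside_image(log_lines, width, height):
    """Return the (x0, y0, x1, y1) boxes that do not fit in width x height."""
    outside = []
    for line in log_lines:
        match = BOX_RE.search(line)
        if match is None:
            continue  # not a bounding-box line
        x0, y0, x1, y1 = map(int, match.groups())
        if not (0 <= x0 <= x1 <= width and 0 <= y0 <= y1 <= height):
            outside.append((x0, y0, x1, y1))
    return outside
```

Called with `width=1080, height=1920`, every box from the log excerpt above is reported, because all of their x coordinates are negative.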

@egorpugin
Contributor

We can try to cut the image to, let's say, 50x50 and check it.

@egorpugin
Contributor

Is it possible to implement faster pixel counting?

image

@wix-andriusb
Author

Locally (macOS) I have installed 4.1.1 with brew. I'm running it and I'm past 8 minutes now.

4.00 on a linux server was running for hours before it got killed
image

@egorpugin
Contributor

For a reduced 50x100 image:

Smooothing part at:Bounding box=(-100,1)->(0,5)
Smooothing part at:Bounding box=(-100,6)->(0,10)
Smooothing part at:Bounding box=(-100,11)->(0,15)
Smooothing part at:Bounding box=(-100,16)->(0,20)
Smooothing part at:Bounding box=(-100,21)->(0,25)
Smooothing part at:Bounding box=(-100,26)->(0,30)
Smooothing part at:Bounding box=(-100,31)->(0,35)
Smooothing part at:Bounding box=(-100,36)->(0,40)
Smooothing part at:Bounding box=(-100,41)->(0,45)
Smooothing part at:Bounding box=(-100,46)->(0,50)
Smooothing part at:Bounding box=(-100,1)->(0,5)
Smooothing part at:Bounding box=(-100,6)->(0,10)
Smooothing part at:Bounding box=(-100,11)->(0,15)
Smooothing part at:Bounding box=(-100,16)->(0,20)
Smooothing part at:Bounding box=(-100,21)->(0,25)
Smooothing part at:Bounding box=(-100,26)->(0,30)
Smooothing part at:Bounding box=(-100,31)->(0,35)
Smooothing part at:Bounding box=(-100,36)->(0,40)
Smooothing part at:Bounding box=(-100,41)->(0,45)
Smooothing part at:Bounding box=(-100,46)->(0,50)

Full log of that loop -
1.txt

@DanBloomberg

Referring to Amit's comment: I attempted to implement writing of GIF animation about 4 years ago, but failed.
I left questions for the GIF inventor/maintainer, but he did not engage. So I implemented writing of WebP animation instead.

Never tried reading animated gif into a pixa.

@DanBloomberg

And if someone shows me how to tell if a gif file is an animated gif, I'll use it in the gif reader to skip ("not supported") reading.
I believe that would mostly solve this issue.
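One possible way to answer this, sketched in Python rather than in Leptonica's C: walk the block structure of the GIF stream and count image descriptors (block type 0x2C). Treating "more than one image descriptor" as "animated" is an assumption; the function names are hypothetical, and a truncated file would raise IndexError in this sketch.

```python
def _skip_sub_blocks(data: bytes, pos: int) -> int:
    """Skip a chain of data sub-blocks; return the position after the 0 terminator."""
    while True:
        size = data[pos]
        pos += 1
        if size == 0:
            return pos
        pos += size


def count_gif_frames(data: bytes) -> int:
    """Count image descriptors in a GIF87a/GIF89a byte stream."""
    if data[:6] not in (b"GIF87a", b"GIF89a"):
        raise ValueError("not a GIF stream")
    packed = data[10]          # packed field of the logical screen descriptor
    pos = 13                   # 6-byte header + 7-byte screen descriptor
    if packed & 0x80:          # global color table present
        pos += 3 * (2 << (packed & 0x07))
    frames = 0
    while pos < len(data):
        block = data[pos]
        pos += 1
        if block == 0x3B:      # trailer: end of stream
            break
        if block == 0x21:      # extension: label byte, then sub-blocks
            pos = _skip_sub_blocks(data, pos + 1)
        elif block == 0x2C:    # image descriptor: one frame
            frames += 1
            packed = data[pos + 8]
            pos += 9
            if packed & 0x80:  # local color table present
                pos += 3 * (2 << (packed & 0x07))
            pos = _skip_sub_blocks(data, pos + 1)  # +1 skips LZW min code size
        else:
            raise ValueError(f"unexpected block type 0x{block:02X}")
    return frames


def is_animated_gif(data: bytes) -> bool:
    return count_gif_frames(data) > 1
```

A reader could use the same walk to stop at the second image descriptor and report "not supported" without decoding any pixel data.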

@stweil
Member

stweil commented Mar 29, 2021

I don't think there is a high demand for advanced OCR support for animated GIF files. That's a very special, rare need. Obviously the first image in an animated GIF is already read and processed with the current code. Processing all images in a file can be done with a simple external conversion.

So the animated GIF issue has very low priority for me.

The huge time which is required to process an image without visible content is more important for me, as I expect that "normal" scans with text can suffer from extended processing time, too. And OCR processing time has high priority.

@egorpugin
Contributor

Funny, I tried to optimize the hot path using pixCountPixelsInRect instead of pixCountPixels. I thought it wouldn't create a new pix, but it does exactly the same as the commented code on the left side.

image

@DanBloomberg

Both pixRasterop() and pixCountPixels() are optimized, so using them together -- first cropping the rectangle with rasterop and then counting the ON pixels -- is very efficient.

@egorpugin
Contributor

But is it possible to count pixels directly on the original pix?

@DanBloomberg

Yes, of course, but it would be a bit complicated to do it efficiently.
The 1 bpp image has 32 pixels in each word.
Each raster line in general would have a partial word at the beginning, a series of complete words, and a partial word at the end.
And the first partial word might be the only one that has any pixels, so you have to worry about that case as well.
You would need to mask and shift the two partial words before running them, byte by byte through the table that counts ON pixels.
You can see how this is done for the last partial word in pixCountPixels().

You are welcome to extend pixCountPixels() to take an arbitrary rectangle :-)
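The approach described above can be sketched in Python as a model of the idea, not as Leptonica code. It assumes each raster line is a list of 32-bit words with pixel 0 in the most significant bit, and it only masks the partial words (no shift is needed for counting). All function names are hypothetical.

```python
def make_tab8():
    """256-entry table: number of ON bits in each byte value."""
    return [bin(b).count("1") for b in range(256)]


def count_word(word, tab8):
    """Count ON bits in one 32-bit word, byte by byte through the table."""
    return (tab8[word & 0xFF] + tab8[(word >> 8) & 0xFF]
            + tab8[(word >> 16) & 0xFF] + tab8[(word >> 24) & 0xFF])


def count_on_pixels_in_rect(rows, x, y, w, h, tab8):
    """Count ON pixels in the w x h rectangle at (x, y) of a 1 bpp image."""
    if w <= 0 or h <= 0:
        return 0
    first, last = x // 32, (x + w - 1) // 32
    # Keep pixels from offset x % 32 to the end of the first word.
    left_mask = 0xFFFFFFFF >> (x % 32)
    # Keep pixels from the start of the last word up to the last column.
    right_mask = (0xFFFFFFFF << (31 - (x + w - 1) % 32)) & 0xFFFFFFFF
    total = 0
    for row in rows[y:y + h]:
        if first == last:
            # The rectangle lies inside a single word: apply both masks.
            total += count_word(row[first] & left_mask & right_mask, tab8)
            continue
        total += count_word(row[first] & left_mask, tab8)   # partial first word
        for word in row[first + 1:last]:                    # complete words
            total += count_word(word, tab8)
        total += count_word(row[last] & right_mask, tab8)   # partial last word
    return total
```

The single-word branch is the special case mentioned above, where the first partial word is the only one containing any pixels of the rectangle.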

@egorpugin
Contributor

I'm also thinking here about an 8 bpp b/w image to speed up such calculations. The question is how this would increase overall memory consumption. Do we really need 1 bpp in tess?

@DanBloomberg

All this pixel counting is for 1 bpp.
With 8 bpp it is much simpler to do most calculations efficiently. For example, with 8 bpp you might be making histograms.

@stweil
Member

stweil commented Mar 31, 2021

@egorpugin, are you sure that pixCountPixels is the bottleneck here? gprof shows that most of the time is spent in 406709317 calls of GridSearch, which calls std::_Hashtable 1355965899 times, so find and insert on the std::unordered_set are the time-critical operations. Obviously that set is not small, so those operations cost a lot of time.

It is possible to optimize the code and use only insert, no find, but that gives only a very small improvement.

@egorpugin
Contributor

On Windows I see that pixCountPixels is the slowest part.
See #3369 (comment)

@DanBloomberg

Make sure that you are calling pixCountPixelsInRect() with tab8 defined as the 4th arg.
Otherwise, each time pixCountPixels() is called, it has to make the 256-entry tab8.
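The cost pattern behind this advice can be shown with a small Python model; it is not the Leptonica implementation, and the names are hypothetical. The lookup table is an optional argument: if the caller omits it, the function rebuilds all 256 entries on every call, which adds up in a hot loop.

```python
def make_tab8():
    """256-entry table: number of ON bits in each byte value."""
    return [bin(b).count("1") for b in range(256)]


def count_on_pixels(words, tab8=None):
    """Count ON bits in a sequence of 32-bit words."""
    if tab8 is None:
        tab8 = make_tab8()  # rebuilt on every call when no table is passed
    total = 0
    for word in words:
        total += (tab8[word & 0xFF] + tab8[(word >> 8) & 0xFF]
                  + tab8[(word >> 16) & 0xFF] + tab8[(word >> 24) & 0xFF])
    return total


# Build the table once, outside the loop, and pass it to every call.
tab8 = make_tab8()
regions = [[0xFFFFFFFF, 0x0000000F], [0x80000001]]
counts = [count_on_pixels(words, tab8) for words in regions]
```

Both call styles return the same counts; only the amount of repeated setup work differs.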

@amitdo
Collaborator

amitdo commented May 14, 2021

https://bentkus.eu/ocr_while_loop.png

With the code from #3418, processing finishes in less than half a second when Sauvola binarization is used.
