Tesseract crashes when processing certain documents #1181

Closed
TerryZH opened this issue Oct 22, 2017 · 11 comments

Comments

@TerryZH

TerryZH commented Oct 22, 2017

Environment

  • Tesseract Version:
    Tesseract Open Source OCR Engine v4.00.00dev-692-gad5ee184 with Leptonica

  • Platform:
    Linux 4.9.43-17.39.amzn1.x86_64 #1 SMP x86_64 GNU/Linux

Current Behavior:

The following command will crash in the above stated environment:
tesseract /tmp/tr_tmp.jpg /tmp/tr_tmp --tessdata-dir /var/task/tessdata --psm 12 --oem 2 -l eng hocr

tessdata is legacy data from tesseract-ocr/tessdata

Crash error: Assert failed:in file ../ccutil/unicharset.h, line 513

related jpg file:
c767234e7e51a92ee5a9c211f5892ad66b990e75-2

related binaries:
Archive.zip

Expected Behavior:

When Tesseract v4 is run with --oem 2 and legacy trained data, it should print an error message about the missing LSTM data instead of crashing.
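
For anyone scripting around this, here is a minimal Python sketch of how a wrapper might tell a crash apart from an ordinary error exit. It assumes a POSIX system, where `subprocess` reports a process killed by a signal as a negative return code, and it assumes the failed assertion ends the process that way. `classify_run` and `report` are hypothetical helper names, not part of Tesseract.

```python
import subprocess
import sys


def classify_run(cmd, timeout=120):
    """Run cmd and return (status, stderr_text).

    status is "ok" for exit code 0, "crash" when the process was
    killed by a signal (negative return code on POSIX), and "error"
    for any other non-zero exit code.
    """
    proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    if proc.returncode == 0:
        return "ok", proc.stderr
    if proc.returncode < 0:
        return "crash", proc.stderr
    return "error", proc.stderr


def report(cmd):
    """Print the status and any captured stderr output, then return the status."""
    status, stderr_text = classify_run(cmd)
    print(f"{status}: {' '.join(cmd)}", file=sys.stderr)
    if stderr_text:
        print(stderr_text.strip(), file=sys.stderr)
    return status
```

With the command from this report, the call would be `report(["tesseract", "/tmp/tr_tmp.jpg", "/tmp/tr_tmp", "--tessdata-dir", "/var/task/tessdata", "--psm", "12", "--oem", "2", "-l", "eng", "hocr"])`.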

Suggested Fix:

n/a

@Shreeshrii
Collaborator

Shreeshrii commented Oct 23, 2017

The latest traineddata files (tessdata_best and tessdata_fast) do not support the legacy Tesseract engine, so --oem 0 and --oem 2 are not supported.

However, the program should give an error message rather than crash.

@TerryZH
Author

TerryZH commented Oct 23, 2017

Thanks for the update. Where can I get the 'latest traineddata' please? I got my data from https://github.com/tesseract-ocr/tessdata/

@Shreeshrii
Collaborator

Shreeshrii commented Oct 23, 2017 via email

@TerryZH
Author

TerryZH commented Oct 23, 2017

Thanks for your reply. Just to be precise: I was using tessdata, which supports OEM mode 2 according to the wiki.

@amitdo
Collaborator

amitdo commented Oct 23, 2017

The latest traineddata files (tessdata_best and tessdata_fast) do not support the legacy Tesseract engine, so --oem 0 and --oem 2 are not supported.

Although I wrote something similar to the above remark in the wiki, since commits [1] and [2], --oem 0 -l osd works with 'best' and 'fast'. I am not sure about --oem 2.

[1] tesseract-ocr/tessdata_best@f1d1268
[2] tesseract-ocr/tessdata_fast@139ff12

@TerryZH
Author

TerryZH commented Oct 29, 2017

Since the discussion has gotten a bit side-tracked, I'm restating the problem. The original comment has also been updated.

Tesseract v4.00.00dev-692-gad5ee184 crashes when using --oem 2 and tesseract-ocr/tessdata.

@PaniniGelato

The master branch (commit ad5ee18) runs fine with your image, @TerryZH.

However, Tesseract should not crash.

@TerryZH
Author

TerryZH commented Nov 4, 2017

@PaniniGelato Are you sure? As reported in the "Environment" section, this crash happens on the build from master commit ad5ee18. And I can still reproduce the crash. Did you use the same tessdata as described in my previous comment?

@zdenop
Contributor

zdenop commented Oct 1, 2018

It does not crash on Windows (with the latest code):

>tesseract.exe i1181.jpg i1181 --psm 12 --oem 2 -l eng hocr
Tesseract Open Source OCR Engine v4.0.0-beta.4 with Leptonica

I will try to test it on Linux later. In the meantime, please check whether you are using the latest traineddata...

@stweil
Member

stweil commented Oct 1, 2018

This is the well-known assertion ASSERT_HOST(contains_unichar_id(unichar_id)); which fails with certain images, but only when both the legacy and LSTM OCR engines are used.
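
As a toy illustration of the kind of check that assertion performs, here is a short Python sketch. It is not Tesseract's code: `ToyCharset` and its list-backed storage are hypothetical, and the assumption is only that an id is valid when it falls inside the character set's range. The sketch raises a descriptive error for an out-of-range id instead of aborting, which is the behavior requested in this issue.

```python
class ToyCharset:
    """Hypothetical stand-in for a character set: ids are list indices."""

    def __init__(self, unichars):
        self._unichars = list(unichars)

    def contains_unichar_id(self, unichar_id):
        # An id is valid only if it indexes an existing entry.
        return 0 <= unichar_id < len(self._unichars)

    def id_to_unichar(self, unichar_id):
        # Report a descriptive error rather than asserting and aborting.
        if not self.contains_unichar_id(unichar_id):
            raise LookupError(
                f"unichar id {unichar_id} is outside this charset "
                f"(size {len(self._unichars)})"
            )
        return self._unichars[unichar_id]
```

A caller can then catch `LookupError` and print a message, where a failed assertion would end the whole process.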

@zdenop
Contributor

zdenop commented Oct 9, 2018

Duplicate of #1205.
Anyway, it should be fixed by 9efedc1.
