Unable to detect text in few images when using multi-language hints #1222
Comments
I found the same problem when running one of the UNLV dataset images. As a workaround, I modified the code in ../ccutil/unicharset.h, line 513. After that I was able to OCR the image and get valid text for it. |
I had a similar issue. |
Please retest it with 4.0.0-rc3 |
Retest with
Command as originally given does not work -
No assert when multiple language hints given
Recognition is different when only one language is given
The suggested approach could be to use
|
Also, using --psm 6 gives better recognition.
|
@t6nand, is this issue solved for you? |
@stweil Hi, I can't confirm whether it's solved. This was a long time ago, and back then I used a workaround to get the best approximate results. I will have to set up the latest tesseract-ocr and try the suggestions discussed in this thread. I will post my findings in a few days. Thanks. |
The original issue seems to have been resolved as there's no |
I tested just now with current master and do not get the segfault. Please post full details of your setup.
|
I am using tesseract-OCR to extract text from images that can contain text in many popular Indian languages (a single image can also be bilingual or multilingual).
When I pass multi-language hints in my command, I often encounter this error:
contains_unichar_id(unichar_id):Error:Assert failed:in file ../ccutil/unicharset.h, line 513
If I check the image manually and pass only its actual language as the hint, the command works correctly.
Command resulting in error:
tesseract --tessdata-dir /usr/local/share/tessdata/ tamil.jpg -l hin+eng+tam+guj+pan+kan+mal+ben+tel+mar stdout
Correctly working command:
tesseract --tessdata-dir /usr/local/share/tessdata/ tamil.jpg -l tam stdout
Affected Sample Image:
Environment
tesseract 4.00.00alpha
leptonica-1.74.1
libjpeg 8d : libpng 1.6.32 : libtiff 4.0.8 : zlib 1.2.8
Found AVX2
Found AVX
Found SSE
Current Behavior:
Using multi-language hints in the command results in this error:
contains_unichar_id(unichar_id):Error:Assert failed:in file ../ccutil/unicharset.h, line 513
Expected Behavior:
It should extract text matching the languages given in the command, without error.
Suggested Fix:
Since passing only the text's actual language works correctly, the problem is probably in how multiple language hints are handled. The assert on multi-language input is unexpected and may need correction.
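Until the assert is fixed, the single-language behavior described above suggests a possible stopgap: try the combined hint first and, if tesseract exits with an error, rerun once per language. The sketch below is only an illustration of that idea, not a fix for the underlying bug. The function names (`build_command`, `run_tesseract`, `ocr_with_fallback`) are hypothetical, the tessdata path is the one from this report, and the optional `psm` argument maps to the `--psm` flag mentioned in the comments.

```python
import subprocess
from typing import Callable, Dict, List, Optional, Sequence

# Path copied from the commands in this report; adjust for your install.
TESSDATA_DIR = "/usr/local/share/tessdata/"


def build_command(image: str, langs: Sequence[str],
                  psm: Optional[int] = None,
                  tessdata_dir: str = TESSDATA_DIR) -> List[str]:
    """Build a tesseract CLI call in the same argument order as the report."""
    cmd = ["tesseract", "--tessdata-dir", tessdata_dir, image,
           "-l", "+".join(langs)]
    if psm is not None:
        cmd += ["--psm", str(psm)]
    cmd.append("stdout")
    return cmd


def run_tesseract(cmd: List[str]) -> Optional[str]:
    """Run tesseract; return its stdout, or None on a non-zero exit
    (which is what a failed assert is assumed to produce)."""
    result = subprocess.run(cmd, capture_output=True, text=True,
                            encoding="utf-8")
    if result.returncode != 0:
        return None
    return result.stdout


def ocr_with_fallback(image: str, langs: Sequence[str],
                      runner: Callable[[List[str]], Optional[str]] = run_tesseract,
                      psm: Optional[int] = None) -> Dict[str, str]:
    """Try all languages together; if that fails, try each one alone.

    Returns a dict mapping the language hint that was used to the text
    recognized with it. Languages whose run fails are skipped.
    """
    combined = runner(build_command(image, langs, psm))
    if combined is not None:
        return {"+".join(langs): combined}

    results: Dict[str, str] = {}
    for lang in langs:
        text = runner(build_command(image, [lang], psm))
        if text is not None:
            results[lang] = text
    return results
```

A call such as `ocr_with_fallback("tamil.jpg", ["hin", "eng", "tam"])` would then return the per-language outputs when the combined run fails. Choosing the best result among them (for example, the longest non-empty text) is left to the caller, since the thread does not say how the output should be ranked.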