Unable to detect text in few images when using multi-language hints #1222

t6nand · 2017-11-26T09:18:04Z

I am using Tesseract OCR to extract text from images that can contain text in many popular Indian languages (a single image can also be bilingual or multilingual).
When I pass multi-language hints in my command, I often encounter this error:
contains_unichar_id(unichar_id):Error:Assert failed:in file ../ccutil/unicharset.h, line 513

If I manually check the image and provide only the matching language hint, it works correctly.

Command resulting in error:
tesseract --tessdata-dir /usr/local/share/tessdata/ tamil.jpg -l hin+eng+tam+guj+pan+kan+mal+ben+tel+mar stdout

Correctly working command:
tesseract --tessdata-dir /usr/local/share/tessdata/ tamil.jpg -l tam stdout

Affected Sample Image:

Environment

Tesseract Version:
tesseract 4.00.00alpha
leptonica-1.74.1
libjpeg 8d : libpng 1.6.32 : libtiff 4.0.8 : zlib 1.2.8
Found AVX2
Found AVX
Found SSE
Commit Number: ebbfc3a
Platform: Darwin - Kernel Version 16.7.0 (MacOS)

Current Behavior:

Using multi-language hints in the command results in this error:
contains_unichar_id(unichar_id):Error:Assert failed:in file ../ccutil/unicharset.h, line 513

Expected Behavior:

It should extract the text matching the language profiles provided in the command, without error.

Suggested Fix:

Since using only the text's own language profile works correctly, the problem appears to lie in how multiple language hints are handled. This behavior is unexpected when several languages are provided and may need correction.


lloiodice · 2017-12-15T14:53:35Z

I found the same problem when running one of the UNLV dataset images:
tesseract 9444_011.2B.tif 9444_011.2B.tif

As a workaround, I modified the code in ../ccutil/unicharset.h, line 513.
From
ASSERT_HOST(contains_unichar_id(unichar_id));
To
if (!contains_unichar_id(unichar_id)) return false;

After that I was able to OCR the image and get valid text for it.


IamDixit · 2018-07-31T06:43:15Z

I had a similar issue.
I solved it by switching to the updated data files from https://github.com/tesseract-ocr/tesseract/wiki/Data-Files#updated-data-files-for-version-400-september-15-2017

amitdo · 2018-10-15T08:58:33Z

@t6nand,

Please retest it with 4.0.0-rc3

Shreeshrii · 2019-04-13T12:01:54Z

Retested with:

tesseract 4.1.0-rc1-255-g332a1
 leptonica-1.76.0
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.54 : libtiff 4.0.6 : zlib 1.2.8 : libwebp 0.4.4 : libopenjp2 2.3.0

The command as originally given does not work: stdout needs to be given right after the image name.

$ tesseract indic.jpg -l hin+eng+tam+guj+pan+kan+mal+ben+tel+mar stdout
read_params_file: Can't open hin+eng+tam+guj+pan+kan+mal+ben+tel+mar
read_params_file: Can't open stdout
Tesseract Open Source OCR Engine v4.1.0-rc1-255-g332a1 with Leptonica
Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 350
Detected 38 diacritics
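The argument-order problem above can be avoided by building the command programmatically. The following is a minimal sketch, not part of Tesseract itself: the helper name `build_tesseract_command` and its parameters are hypothetical, and it only assumes the `tesseract imagename outputbase [options]` ordering together with the `-l`, `--psm`, and `--tessdata-dir` options already used in this thread.

```python
from typing import List, Optional, Sequence


def build_tesseract_command(
    image: str,
    languages: Sequence[str],
    output_base: str = "stdout",
    psm: Optional[int] = None,
    tessdata_dir: Optional[str] = None,
) -> List[str]:
    """Return an argv list with the image and output base first, options after.

    Hypothetical helper: it only assembles arguments, it does not run Tesseract.
    """
    cmd = ["tesseract", image, output_base]
    if tessdata_dir:
        cmd += ["--tessdata-dir", tessdata_dir]
    if languages:
        # Multiple language hints are joined with '+', e.g. hin+eng+tam
        cmd += ["-l", "+".join(languages)]
    if psm is not None:
        cmd += ["--psm", str(psm)]
    return cmd


cmd = build_tesseract_command("indic.jpg", ["hin", "eng", "tam"])
print(" ".join(cmd))  # tesseract indic.jpg stdout -l hin+eng+tam
```

The resulting list can be passed to `subprocess.run(cmd, capture_output=True, text=True)` on a machine where Tesseract and the listed language data are installed.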

No assert occurs when multiple language hints are given:

 tesseract indic.jpg stdout -l hin+eng+tam+guj+pan+kan+mal+ben+tel+mar
Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 334
Detected 44 diacritics
0000390 ಪ್ರೇ) நீங்கள்‌ நினைவில்‌
ம வேண்டிய ஒரு உண்மை
_ இந்த நிமிடம்‌கூட நிரந்தரமில்லை

3 ந

ഛീ

Recognition is different when only one language is given

tesseract indic.jpg stdout -l tam
Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 350
Detected 38 diacritics
றன்‌ இமான்‌ நீங்கள்‌ நினைவில்‌
ம வேண்டிய ஒரு உண்மை

இந்த நிமிடம்‌?கூட நிரந்தரமில்லை
5) நத

க நட
த.

A suggested approach is to use --psm 0 to identify the script first and then use the appropriate language data.

 tesseract indic.jpg stdout --psm 0
Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 350
Page number: 0
Orientation in degrees: 0
Rotate: 0
Orientation confidence: 2.51
Script: Tamil
Script confidence: 3.89
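The two-step approach (detect the script, then pick the language data) could be sketched as follows. This is only an illustration under stated assumptions: `parse_osd`, `pick_language`, and the `SCRIPT_TO_LANG` table are hypothetical names, the script-to-language mapping is a guess that should be checked against the installed data files, and the confidence threshold is arbitrary. The parsing relies only on the `Script:` and `Script confidence:` lines shown in the output above.

```python
from typing import Optional, Tuple

# Hypothetical mapping from detected script names to language codes.
# Several languages share a script (e.g. Hindi and Marathi both use
# Devanagari), so this table is an assumption, not a definitive choice.
SCRIPT_TO_LANG = {
    "Tamil": "tam",
    "Devanagari": "hin",
    "Gujarati": "guj",
    "Gurmukhi": "pan",
    "Kannada": "kan",
    "Malayalam": "mal",
    "Bengali": "ben",
    "Telugu": "tel",
    "Latin": "eng",
}


def parse_osd(osd_text: str) -> Tuple[Optional[str], Optional[float]]:
    """Extract (script, script confidence) from '--psm 0' output text."""
    script = None
    confidence = None
    for line in osd_text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # line has no "key: value" shape
        key, value = key.strip(), value.strip()
        if key == "Script":
            script = value
        elif key == "Script confidence":
            try:
                confidence = float(value)
            except ValueError:
                pass  # leave confidence as None if it is not a number
    return script, confidence


def pick_language(osd_text: str, min_confidence: float = 1.0,
                  fallback: Optional[str] = None) -> Optional[str]:
    """Map the detected script to a language code, or return the fallback."""
    script, confidence = parse_osd(osd_text)
    if script is None or confidence is None or confidence < min_confidence:
        return fallback
    return SCRIPT_TO_LANG.get(script, fallback)


sample = """Page number: 0
Orientation in degrees: 0
Rotate: 0
Orientation confidence: 2.51
Script: Tamil
Script confidence: 3.89
"""
print(pick_language(sample))  # tam
```

The returned code can then be passed to `-l` in a second run. When the script is not in the table or the confidence is low, falling back to the full multi-language list is one option.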

Shreeshrii · 2019-04-13T12:07:54Z

Also, using --psm 6 gives better recognition.
The image has 3 lines of text but a very colorful and busy background.

 tesseract indic.jpg stdout -l tam --psm 6
Warning: Invalid resolution 0 dpi. Using 70 instead.
க்‌ க்‌
இன்பத்திலும்‌" துன்பத்திலும்‌ மட்டி)
வைத்துக்கொள்ளவேண்டிய ஒரு உண்மை
இந்த நிமிடம்‌ கூட நிரந்தரமில்லை
டு வு த
ஜு
ந்‌ கு]
த்‌ வ்‌
8 ்‌
ubuntu@tesseract-ocr:~/TEST$ tesseract indic.jpg stdout -l tam --psm 6 --dpi 300
க்‌ க்‌
இன்பத்திலும்‌" துன்பத்திலும்‌ மட்டி)
வைத்துக்கொள்ளவேண்டிய ஒரு உண்மை
இந்த நிமிடம்‌ கூட நிரந்தரமில்லை
டு வு த
ஜு
ந்‌ கு]
த்‌ வ்‌
8 ்‌

stweil · 2019-06-22T18:18:37Z

@t6nand, is this issue solved for you?

t6nand · 2019-06-26T17:50:16Z

@stweil Hi, I can't say whether it's solved, as this was a long time ago and back then I used a workaround to get the best approximate results. For now, I will have to set up the latest tesseract-ocr and try out the suggestions discussed in this thread. I will post my findings in a few days. Thanks.

t6nand · 2019-07-22T21:28:22Z

@stweil, @Shreeshrii .

The original issue seems to have been resolved, as there is no contains_unichar_id error anymore. However, the intelligibility of the OCRed text differs considerably between multi-language hints and a single hint, as @Shreeshrii also discussed in his suggestion. In addition, I get a segmentation fault when running the command tesseract indic.jpg stdout --psm 0.

Shreeshrii · 2019-07-23T03:40:02Z

"I receive a segmentation fault issue when using the command tesseract indic.jpg stdout --psm 0."

I tested just now with current master and do not get the segfault. Please post full details of your setup.

tesseract indic.jpg stdout --psm 0

Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 350
Warning. Invalid resolution 0 dpi. Using 70 instead.
Page number: 0
Orientation in degrees: 0
Rotate: 0
Orientation confidence: 2.51
Script: Tamil
Script confidence: 3.89

tesseract -v

tesseract 5.0.0-alpha-322-g74ac
 leptonica-1.78.0
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.54 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.4.4 : libopenjp2 2.3.0

syzer added a commit to syzer/tesseract that referenced this issue Jan 23, 2018: Fix Unable to detect text in few images when using muti-language hints (tesseract-ocr#1222) a90e499

amitdo mentioned this issue Feb 8, 2018: some images translated to text using Tesseract 4 throw an error regarding "contains_unichar_id" #1205 (Closed)

stweil mentioned this issue Mar 10, 2019: Issue 13590: tesseract-ocr/fuzzer-api: Heap-buffer-overflow in GenericVector<int>::size #2298 (Closed)

stweil changed the title from "Unable to detect text in few images when using muti-language hints" to "Unable to detect text in few images when using multi-language hints" Jun 22, 2019

amitdo mentioned this issue Feb 7, 2021: multilingual ocr ara+eng #2626 (Open)

amitdo closed this as completed Feb 7, 2021

amitdo added the multilingual ocr label Feb 7, 2021
