Unable to detect text in few images when using multi-language hints #1222

Closed
t6nand opened this issue Nov 26, 2017 · 9 comments

@t6nand

t6nand commented Nov 26, 2017

I am using Tesseract OCR to extract text from images that can contain text in many popular Indian languages (a single image can also be bilingual or multilingual).
When I pass multi-language hints in my command, I often encounter this error:
contains_unichar_id(unichar_id):Error:Assert failed:in file ../ccutil/unicharset.h, line 513

When I manually check the image and provide only its language as the hint, recognition works correctly.

Command resulting in error:
tesseract --tessdata-dir /usr/local/share/tessdata/ tamil.jpg -l hin+eng+tam+guj+pan+kan+mal+ben+tel+mar stdout

Correctly working command:
tesseract --tessdata-dir /usr/local/share/tessdata/ tamil.jpg -l tam stdout

Affected Sample Image:
[image: tamil]

Environment

  • Tesseract Version:
    tesseract 4.00.00alpha
    leptonica-1.74.1
    libjpeg 8d : libpng 1.6.32 : libtiff 4.0.8 : zlib 1.2.8
    Found AVX2
    Found AVX
    Found SSE
  • Commit Number: ebbfc3a
  • Platform: Darwin - Kernel Version 16.7.0 (MacOS)

Current Behavior:

Using multi-language hints in command results in error:
contains_unichar_id(unichar_id):Error:Assert failed:in file ../ccutil/unicharset.h, line 513

Expected Behavior:

It should extract text matching the language profiles provided in command without error.

Suggested Fix:

Since passing only the text's own language works correctly, the problem likely lies in how multiple language hints are handled. This behavior is unexpected when several languages are provided and may need correction.

@lloiodice

I found the same problem when running one of the UNLV dataset images:
tesseract 9444_011.2B.tif 9444_011.2B.tif

As a workaround, I modified ../ccutil/unicharset.h, line 513, from
ASSERT_HOST(contains_unichar_id(unichar_id));
to
if (!contains_unichar_id(unichar_id)) return false;

After that, I was able to OCR the image and get valid text for it.

@IamDixit

I had a similar issue.
I solved it by switching to the updated data files from https://github.com/tesseract-ocr/tesseract/wiki/Data-Files#updated-data-files-for-version-400-september-15-2017

@amitdo
Collaborator

amitdo commented Oct 15, 2018

@t6nand,

Please retest it with 4.0.0-rc3

@Shreeshrii
Collaborator

Shreeshrii commented Apr 13, 2019

Retest with

tesseract 4.1.0-rc1-255-g332a1
 leptonica-1.76.0
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.54 : libtiff 4.0.6 : zlib 1.2.8 : libwebp 0.4.4 : libopenjp2 2.3.0

The command as originally given does not work: stdout needs to be given directly after the image name.

$ tesseract indic.jpg -l hin+eng+tam+guj+pan+kan+mal+ben+tel+mar stdout
read_params_file: Can't open hin+eng+tam+guj+pan+kan+mal+ben+tel+mar
read_params_file: Can't open stdout
Tesseract Open Source OCR Engine v4.1.0-rc1-255-g332a1 with Leptonica
Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 350
Detected 38 diacritics

With stdout placed after the image name, there is no assert when multiple language hints are given:

 tesseract indic.jpg stdout -l hin+eng+tam+guj+pan+kan+mal+ben+tel+mar
Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 334
Detected 44 diacritics
0000390 ಪ್ರೇ) நீங்கள்‌ நினைவில்‌
ம வேண்டிய ஒரு உண்மை
_ இந்த நிமிடம்‌கூட நிரந்தரமில்லை

3 ந

ഛീ

Recognition is different when only one language is given

tesseract indic.jpg stdout -l tam
Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 350
Detected 38 diacritics
றன்‌ இமான்‌ நீங்கள்‌ நினைவில்‌
ம வேண்டிய ஒரு உண்மை

இந்த நிமிடம்‌?கூட நிரந்தரமில்லை
5) நத

க நட
த.

A suggested approach is to use --psm 0 to identify the script and then run recognition with the appropriate language data.

 tesseract indic.jpg stdout --psm 0
Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 350
Page number: 0
Orientation in degrees: 0
Rotate: 0
Orientation confidence: 2.51
Script: Tamil
Script confidence: 3.89
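
The two-step approach suggested here can be sketched in Python. This is a minimal sketch, not part of Tesseract: `parse_osd`, `language_for_script`, and the `SCRIPT_TO_LANG` table are hypothetical names, and the script-to-language mapping is an assumption that should be adjusted to the traineddata files actually installed. It only parses text shaped like the `--psm 0` output above and does not invoke tesseract itself.

```python
# Hypothetical mapping from an OSD script name to a language code for -l.
# Assumption: one language per script; Devanagari could equally mean "mar".
SCRIPT_TO_LANG = {
    "Tamil": "tam",
    "Devanagari": "hin",
    "Gujarati": "guj",
    "Gurmukhi": "pan",
    "Kannada": "kan",
    "Malayalam": "mal",
    "Bengali": "ben",
    "Telugu": "tel",
    "Latin": "eng",
}


def parse_osd(osd_text):
    """Turn 'Key: value' lines of --psm 0 output into a dict of strings."""
    fields = {}
    for line in osd_text.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip lines that have no colon at all
            fields[key.strip()] = value.strip()
    return fields


def language_for_script(osd_text, default="eng"):
    """Pick a language code for the detected script, or fall back to default."""
    script = parse_osd(osd_text).get("Script")
    return SCRIPT_TO_LANG.get(script, default)


sample = """Page number: 0
Orientation in degrees: 0
Rotate: 0
Orientation confidence: 2.51
Script: Tamil
Script confidence: 3.89
"""
print(language_for_script(sample))
```

The returned code can then be passed as the single `-l` value in a second tesseract run. The low script confidence in the sample suggests checking that value before trusting the result.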

@Shreeshrii
Collaborator

Shreeshrii commented Apr 13, 2019

Also, using --psm 6 gives better recognition.
The image has 3 lines of text but a very colorful and busy background.

 tesseract indic.jpg stdout -l tam --psm 6
Warning: Invalid resolution 0 dpi. Using 70 instead.
க்‌ க்‌
இன்பத்திலும்‌" துன்பத்திலும்‌ மட்டி)
வைத்துக்கொள்ளவேண்டிய ஒரு உண்மை
இந்த நிமிடம்‌ கூட நிரந்தரமில்லை
டு வு த
ஜு
ந்‌ கு]
த்‌ வ்‌
8 ்‌
ubuntu@tesseract-ocr:~/TEST$ tesseract indic.jpg stdout -l tam --psm 6 --dpi 300
க்‌ க்‌
இன்பத்திலும்‌" துன்பத்திலும்‌ மட்டி)
வைத்துக்கொள்ளவேண்டிய ஒரு உண்மை
இந்த நிமிடம்‌ கூட நிரந்தரமில்லை
டு வு த
ஜு
ந்‌ கு]
த்‌ வ்‌
8 ்‌
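
Since argument order mattered earlier in this thread (stdout has to follow the image name), a small helper can keep the order fixed when scripting these runs. This is a sketch only: `build_tesseract_cmd` is a hypothetical helper name, and it uses just the options that appear in this thread (`-l`, `--psm`, `--dpi`). It builds the argument list and does not run tesseract.

```python
def build_tesseract_cmd(image, langs=None, psm=None, dpi=None, output_base="stdout"):
    """Build an argument list with the output base right after the image name."""
    cmd = ["tesseract", image, output_base]
    if langs:
        # Multiple language hints are joined with '+', as in hin+eng+tam.
        cmd += ["-l", "+".join(langs)]
    if psm is not None:
        cmd += ["--psm", str(psm)]
    if dpi is not None:
        cmd += ["--dpi", str(dpi)]
    return cmd


print(" ".join(build_tesseract_cmd("indic.jpg", ["tam"], psm=6, dpi=300)))
```

The list can be handed to `subprocess.run` as-is on a machine where tesseract is installed.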

stweil changed the title from "Unable to detect text in few images when using muti-language hints" to "Unable to detect text in few images when using multi-language hints" on Jun 22, 2019
@stweil
Member

stweil commented Jun 22, 2019

@t6nand, is this issue solved for you?

@t6nand
Author

t6nand commented Jun 26, 2019

@stweil Hi, I can't say whether it's solved: this was a long time ago, and back then I used a workaround to get the best approximate results. I will need to set up the latest tesseract-ocr and try the suggestions discussed in this thread. I will post my findings in a few days. Thanks.

@t6nand
Author

t6nand commented Jul 22, 2019

@stweil, @Shreeshrii .

The original issue seems to have been resolved, as the contains_unichar_id error no longer occurs. However, the intelligibility of the OCRed text differs considerably between multi-language hints and a single hint, as @Shreeshrii also discussed in his suggestion. In addition, I get a segmentation fault when running the command tesseract indic.jpg stdout --psm 0.
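
One rough way to quantify that difference is to count which Unicode scripts show up in each OCR result, since the multi-language output earlier in the thread contains stray characters from scripts other than Tamil. The sketch below is an illustration under assumptions, not a Tesseract feature: `script_counts` and `dominant_script` are hypothetical helpers, and taking the first word of a character's Unicode name is only a heuristic for its script.

```python
import unicodedata
from collections import Counter


def script_counts(text):
    """Count letters and combining marks per script (first word of the Unicode name)."""
    counts = Counter()
    for ch in text:
        # Only letters (L*) and marks (M*); skips digits, punctuation, ZWNJ, spaces.
        if unicodedata.category(ch)[0] not in ("L", "M"):
            continue
        name = unicodedata.name(ch, "")
        if name:
            counts[name.split()[0]] += 1
    return counts


def dominant_script(text):
    """Return the most frequent script label, or None if there are no letters."""
    counts = script_counts(text)
    return counts.most_common(1)[0][0] if counts else None


# Kannada PA, Tamil NA + vowel sign II, Malayalam CHA + vowel sign II
mixed = "\u0caa \u0ba8\u0bc0 \u0d1b\u0d40"
print(script_counts(mixed))
```

Running this over the multi-language and single-language outputs would show how many characters fall outside the expected script in each case.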

@Shreeshrii
Collaborator

Shreeshrii commented Jul 23, 2019

> I receive a segmentation fault issue when using the command tesseract indic.jpg stdout --psm 0.

I tested just now with current master and do not get the segfault. Please post full details of your setup.

tesseract indic.jpg stdout --psm 0

Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 350
Warning. Invalid resolution 0 dpi. Using 70 instead.
Page number: 0
Orientation in degrees: 0
Rotate: 0
Orientation confidence: 2.51
Script: Tamil
Script confidence: 3.89

tesseract -v

tesseract 5.0.0-alpha-322-g74ac
 leptonica-1.78.0
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.54 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.4.4 : libopenjp2 2.3.0
