OCR is not able to detect text other than English #302

tagliala · 2024-09-18T08:29:42Z

Tika requires a specific configuration for its Tesseract parser in order to OCR languages other than English:

<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser"></parser>
    <parser class="org.apache.tika.parser.ocr.TesseractOCRParser">
      <params>
        <param name="language" type="string">eng+ara</param>
      </params>
    </parser>
  </parsers>
</properties>

The generic OS implementation may not require other languages, but Colore should be able to specify the language to pass to Tesseract.
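To illustrate how the language could be made configurable, here is a minimal sketch that generates the Tika configuration shown above for an arbitrary Tesseract language string. The function name `build_tika_config` is hypothetical and is not part of Colore or Tika; only the XML structure comes from this issue.

```python
import xml.etree.ElementTree as ET


def build_tika_config(language: str) -> str:
    """Return a Tika config that passes `language` (e.g. "eng+ara") to Tesseract.

    Hypothetical helper: it only mirrors the XML structure quoted in this issue.
    """
    properties = ET.Element("properties")
    parsers = ET.SubElement(properties, "parsers")

    # Keep the default parser so non-OCR formats are still handled.
    ET.SubElement(parsers, "parser", {"class": "org.apache.tika.parser.DefaultParser"})

    # Override the OCR parser with an explicit language parameter.
    ocr = ET.SubElement(
        parsers, "parser", {"class": "org.apache.tika.parser.ocr.TesseractOCRParser"}
    )
    params = ET.SubElement(ocr, "params")
    param = ET.SubElement(params, "param", {"name": "language", "type": "string"})
    param.text = language

    body = ET.tostring(properties, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


if __name__ == "__main__":
    print(build_tika_config("eng+ara"))
```

The resulting string could be written to a file and handed to Tika as its configuration; how Colore actually wires this in is not shown here.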

Close #302

For uniformity with Tesseract execution, Tika extraction will raise if the language extension is not available. Close #302
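The "raise if the language is not available" behaviour can be sketched as follows. This is an assumption about one possible approach, not the code from the commit: it parses text in the format printed by `tesseract --list-langs` (a header line ending in a colon, then one language code per line) and raises when any requested language is missing. The function names are hypothetical.

```python
def available_languages(list_langs_output: str) -> set[str]:
    """Parse the text printed by `tesseract --list-langs` into a set of codes."""
    langs = set()
    for line in list_langs_output.splitlines():
        line = line.strip()
        # Skip blank lines and the "List of available languages ...:" header.
        if not line or line.endswith(":"):
            continue
        langs.add(line)
    return langs


def ensure_languages(requested: str, list_langs_output: str) -> None:
    """Raise ValueError if any language in a "eng+ara"-style string is missing."""
    available = available_languages(list_langs_output)
    missing = [lang for lang in requested.split("+") if lang not in available]
    if missing:
        raise ValueError(f"Missing Tesseract language data: {', '.join(missing)}")


if __name__ == "__main__":
    sample = 'List of available languages in "/usr/share/tessdata/" (2):\neng\nosd\n'
    ensure_languages("eng", sample)  # passes silently
    try:
        ensure_languages("eng+ara", sample)
    except ValueError as error:
        print(error)
```

In practice the listing would come from running the `tesseract` binary, for example via `subprocess.run(["tesseract", "--list-langs"], capture_output=True, text=True)`; depending on the Tesseract version the list may appear on stdout or stderr, so checking both is a reasonable precaution.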

This commit addresses issue #302: Tika needed a way to specify an OCR language other than English. The changes let the application set the language configuration for Tesseract OCR. Close #302

This change ensures that Tika can specify the language configuration for its internal Tesseract OCR parser. Close #302

tagliala changed the title ~~OCR is not able to detect text other than english~~ OCR is not able to detect text other than English Sep 18, 2024

tagliala added a commit that referenced this issue Sep 23, 2024

Specify language for text extraction

ca88ab6

Close #302

tagliala added a commit that referenced this issue Sep 23, 2024

Specify language for text extraction

156ac3f

Close #302

tagliala mentioned this issue Sep 23, 2024

Specify language for text extraction #314

Merged

tagliala added a commit that referenced this issue Sep 23, 2024

Specify language for text extraction

ac1fac8

For uniformity with Tesseract execution, Tika extraction will raise if the language extension is not available. Close #302

tagliala added a commit that referenced this issue Sep 24, 2024

Specify language for text extraction

714e56e

For uniformity with Tesseract execution, Tika extraction will raise if the language extension is not available. Close #302

tagliala added a commit that referenced this issue Sep 24, 2024

Specify language for text extraction

96d9b53

For uniformity with Tesseract execution, Tika extraction will raise if the language extension is not available. Close #302

tagliala added a commit that referenced this issue Sep 24, 2024

Specify language for text extraction

3635299

For uniformity with Tesseract execution, Tika extraction will raise if the language extension is not available. Close #302

tagliala added a commit that referenced this issue Sep 24, 2024

Specify language for text extraction

ab75267

For uniformity with Tesseract execution, Tika extraction will raise if the language extension is not available. Close #302

tagliala added a commit that referenced this issue Sep 27, 2024

Specify language for text extraction

f8473fb

For uniformity with Tesseract execution, Tika extraction will raise if the language extension is not available. Close #302

tagliala added a commit that referenced this issue Sep 27, 2024

Add support for specifying OCR language in Tika

8a82b10

This change ensures that Tika can specify the language configuration for its internal Tesseract OCR parser. Close #302

tagliala closed this as completed in #314 Sep 27, 2024
