OCR is not able to detect text other than English #302

Closed
tagliala opened this issue Sep 18, 2024 · 0 comments · Fixed by #314


Tika requires a specific configuration for its Tesseract parser in order to OCR languages other than English, for example:

<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser"></parser>
    <parser class="org.apache.tika.parser.ocr.TesseractOCRParser">
      <params>
        <param name="language" type="string">eng+ara</param>
      </params>
    </parser>
  </parsers>
</properties>

The generic OS implementation may not require other languages, but Colore should be able to specify the language to pass to Tesseract.
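As a hedged variant of the configuration above, the sketch below follows the pattern commonly shown in the Apache Tika OCR documentation, where `TesseractOCRParser` is excluded from `DefaultParser` before being declared again with its own parameters. The exclusion is an assumption and is not part of the original report. The language value `ita` is purely illustrative, and the matching Tesseract language data is assumed to be installed on the host.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <!-- Assumption: exclude the default OCR parser so it is only configured once -->
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>
    </parser>
    <!-- Declare the OCR parser again with an explicit language -->
    <parser class="org.apache.tika.parser.ocr.TesseractOCRParser">
      <params>
        <!-- Illustrative value: join several Tesseract language codes with "+", e.g. eng+ita -->
        <param name="language" type="string">ita</param>
      </params>
    </parser>
  </parsers>
</properties>
```

If Colore generates this file, the only value that needs to change per request is the content of the `language` param.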

tagliala changed the title from "OCR is not able to detect text other than english" to "OCR is not able to detect text other than English" on Sep 18, 2024
tagliala added a commit that referenced this issue Sep 23, 2024
For uniformity with Tesseract execution, Tika extraction will raise
if the language extension is not available

Close #302
tagliala added a commit that referenced this issue Sep 24, 2024
For uniformity with Tesseract execution, Tika extraction will raise
if the language extension is not available

Close #302
tagliala added a commit that referenced this issue Sep 27, 2024
For uniformity with Tesseract execution, Tika extraction will raise
if the language extension is not available

Close #302
tagliala added a commit that referenced this issue Sep 27, 2024
This commit addresses issue #302, which required Tika to specify a
language for OCR other than English.

The changes ensure that the application can specify the language
configuration for Tesseract OCR.

Close #302
tagliala added a commit that referenced this issue Sep 27, 2024
This change ensures that Tika can specify the language configuration
for its internal Tesseract OCR Parser.

Close #302