Add support for Hebrew Language and Alphabet #13797

I haven't trained a model for Hebrew on PaddleOCR yet; I got tied up working with YOLOv8/PyLaia and kraken for Hebrew. By Hebrew I mean classical and medieval Hebrew, not Ivrit (Modern Hebrew).

luotao1 · 2024-10-15T06:26:09Z

@johnlockejrr Thanks for your contribution! You will receive a beautiful PaddlePaddle gift. Please provide your mailing address by filling out the following questionnaire before October 18th.

Looking forward to the future, we will walk further together in the world of open source!
Click here: https://paddle.wjx.cn/vm/h4On9gJ.aspx#

ephraimm · 2024-10-15T07:07:37Z

Is there any way I can help? What are the next steps?

johnlockejrr · 2024-10-15T10:43:57Z

@ephraimm, sure. Right now I'm collecting as many high-quality ground-truth datasets for Hebrew as I can. I mostly work with ancient/medieval Hebrew (biblical, rabbinical, responsa, etc.), but also with Ivrit. We can collaborate if you want.

ephraimm · 2024-10-15T11:46:15Z

@johnlockejrr we're after the same thing.
My end game is to digitise printed (chassidic) texts that usually have complex section breaks and columns. I have had some success combining Tesseract with Google Vision, but this is not ideal and I would like to find an open-source solution.

Tesseract is good at recognising blocks, but not at distinguishing actual text from subscript references to footnotes and punctuation. Google, on the other hand, is great at accurately recognising the characters but doesn't recognise column breaks etc.

My current hypothesis is that if I give Tesseract a good-quality sample of text from Google and fine-tune the existing Hebrew model, we'll get the results we're after.

What makes you think Paddle will work? Have you tried fine-tuning Tesseract?

johnlockejrr · 2024-10-15T12:20:13Z

Then take a look at eScriptorium, which uses kraken but can also use Tesseract (which is no longer actively developed).
Printed texts are a piece of cake for kraken, though you have to train a good segmentation (text and layers detection) model. I mostly work with manuscripts; HTR is a little more difficult than OCR of printed texts.

Drop me an email [at] gmail.com

ephraimm · 2024-10-15T20:15:18Z

@johnlockejrr
Sent to your Gmail address. Feel free to do likewise.

luotao1 · 2024-11-01T06:50:08Z

@johnlockejrr Could you provide your email? We need this information when mailing gifts. Please send your email to ext_paddle_oss@baidu.com or here. Thanks very much!

johnlockejrr · 2024-11-01T08:37:01Z

@luotao1 it is: johnlockejrr [at] gmail.com

johnlockejrr added 2 commits August 31, 2024 01:07

Add Hebrew language support for training

422733f

https://en.wikipedia.org/wiki/Unicode_and_HTML_for_the_Hebrew_alphabet

Add Hebrew language dictionary

e8fffa0

https://en.wikipedia.org/wiki/Unicode_and_HTML_for_the_Hebrew_alphabet
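
For readers unfamiliar with what a language dictionary commit like this contains, here is a minimal sketch. It assumes the recognition dictionary is a plain UTF-8 text file with one character per line, and it covers only the 27 Hebrew letters of the Unicode Hebrew block (U+05D0 to U+05EA, final forms included). The function names and the output filename are hypothetical, and the actual `hebrew_dict.txt` merged in this PR may well contain more characters (points, punctuation, digits) than this sketch produces.

```python
# Hypothetical sketch, not the contents of the PR's hebrew_dict.txt.
# Builds a one-character-per-line dictionary of the basic Hebrew letters.
from pathlib import Path

HEBREW_LETTERS_START = 0x05D0  # HEBREW LETTER ALEF
HEBREW_LETTERS_END = 0x05EA    # HEBREW LETTER TAV


def hebrew_letters() -> list[str]:
    """Return the Hebrew letters alef..tav, including the five final forms."""
    return [chr(cp) for cp in range(HEBREW_LETTERS_START, HEBREW_LETTERS_END + 1)]


def write_dict(path: str, chars: list[str]) -> None:
    """Write one character per line, UTF-8 encoded."""
    Path(path).write_text("\n".join(chars) + "\n", encoding="utf-8")


if __name__ == "__main__":
    # Assumed output name, chosen so it does not clash with the real file.
    write_dict("hebrew_dict_sketch.txt", hebrew_letters())
```

A dictionary for pointed (vocalised) classical texts would presumably also need the niqqud code points from the same Unicode block; those are left out here to keep the sketch small.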

johnlockejrr added 2 commits August 31, 2024 01:35

johnlockejrr commented Aug 31, 2024

GreatV reviewed Aug 31, 2024

Update hebrew_dict.txt

4be0394

GreatV approved these changes Sep 1, 2024

GreatV merged commit 6225a90 into PaddlePaddle:main Sep 1, 2024
3 checks passed

github-actions bot locked as resolved and limited conversation to collaborators Nov 11, 2024

paddle-bot bot added the contributor label Nov 13, 2024


johnlockejrr commented Aug 31, 2024

CLAassistant commented Aug 31, 2024 (edited)

johnlockejrr left a comment

GreatV left a comment

johnlockejrr commented Aug 31, 2024

GreatV left a comment

ephraimm commented Oct 10, 2024

johnlockejrr commented Oct 10, 2024 (edited)

luotao1 commented Oct 15, 2024

ephraimm commented Oct 15, 2024 via email

johnlockejrr commented Oct 15, 2024 (edited)

ephraimm commented Oct 15, 2024

johnlockejrr commented Oct 15, 2024 (edited)

ephraimm commented Oct 15, 2024

luotao1 commented Nov 1, 2024

johnlockejrr commented Nov 1, 2024
