Add support for Hebrew Language and Alphabet #13797

Merged
merged 5 commits into from
Sep 1, 2024

johnlockejrr
Contributor

@CLAassistant

CLAassistant commented Aug 31, 2024

CLA assistant check
All committers have signed the CLA.

Samaritan script is written right-to-left (RTL), like Arabic and Hebrew. It is used for Samaritan Hebrew and Samaritan Aramaic, and some texts also contain Arabic letters.

https://en.wikipedia.org/wiki/Samaritan_(Unicode_block)
https://en.wikipedia.org/wiki/Samaritan_Hebrew
https://en.wikipedia.org/wiki/Samaritan_Aramaic_language
Samaritan Script is RTL like Arabic and Hebrew, used for Samaritan Hebrew and Aramaic, sometimes has Arabic letters in some texts.

https://en.wikipedia.org/wiki/Samaritan_(Unicode_block)
https://en.wikipedia.org/wiki/Samaritan_Hebrew
https://en.wikipedia.org/wiki/Samaritan_Aramaic_language
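
For reference, a minimal sketch (not part of this PR's changes) of how the properties described above could be checked with Python's standard `unicodedata` module. The helper names `is_samaritan` and `is_rtl_char` are hypothetical, and the range U+0800..U+083F is the Samaritan block as listed on the Unicode block page linked above.

```python
import unicodedata

# Samaritan Unicode block (assumed range: U+0800..U+083F).
SAMARITAN_START = 0x0800
SAMARITAN_END = 0x083F


def is_samaritan(ch: str) -> bool:
    """Return True if the single character lies in the Samaritan block."""
    return SAMARITAN_START <= ord(ch) <= SAMARITAN_END


def is_rtl_char(ch: str) -> bool:
    """Return True if the character has a strong right-to-left bidi class.

    "R" covers Hebrew and Samaritan letters; "AL" covers Arabic letters.
    """
    return unicodedata.bidirectional(ch) in {"R", "AL"}


def count_scripts(text: str) -> dict:
    """Count Samaritan characters versus other strong RTL characters.

    Useful for spotting texts that mix Samaritan with Arabic or Hebrew.
    """
    counts = {"samaritan": 0, "other_rtl": 0}
    for ch in text:
        if is_samaritan(ch):
            counts["samaritan"] += 1
        elif is_rtl_char(ch):
            counts["other_rtl"] += 1
    return counts
```

A check like `count_scripts` might help when assembling a character dictionary for a recognition model, since Arabic letters embedded in Samaritan text would otherwise be easy to miss.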
Contributor Author

@johnlockejrr johnlockejrr left a comment

It seems the script failed because it was killed.

Collaborator

@GreatV GreatV left a comment

Please fix the code style.

@johnlockejrr
Contributor Author

Fixed

Collaborator

@GreatV GreatV left a comment

LGTM

@GreatV GreatV merged commit 6225a90 into PaddlePaddle:main Sep 1, 2024
3 checks passed
@ephraimm

Hi @johnlockejrr

I'd like to test the Hebrew support, but I'm having trouble getting a compiled version working to test. Could you point me in the right direction?

many thanks

@johnlockejrr
Contributor Author

johnlockejrr commented Oct 10, 2024

I haven't trained a Hebrew model on PaddleOCR yet; I got stuck working with YOLOv8/PyLaia and kraken for Hebrew. By Hebrew I mean classical and medieval Hebrew, not Ivrit (Modern Hebrew).

@luotao1
Collaborator

luotao1 commented Oct 15, 2024

@johnlockejrr Thanks for your contribution! You will receive a beautiful PaddlePaddle gift. Please provide your mailing address by filling out the following questionnaire before October 18th.

Looking forward to the future, we will walk further together in the world of open source!
Click here: https://paddle.wjx.cn/vm/h4On9gJ.aspx#

@ephraimm

ephraimm commented Oct 15, 2024 via email

@johnlockejrr
Contributor Author

johnlockejrr commented Oct 15, 2024

@ephraimm, sure. Right now I'm collecting as much high-quality ground truth and as many datasets for Hebrew as I can. I mostly work with ancient and medieval Hebrew (biblical, rabbinical, responsa, etc.), but also with Ivrit. We can collaborate if you want.

@ephraimm

@johnlockejrr we're after the same thing.
My end game is to digitise printed (Chassidic) texts that usually have complex section breaks and columns. I have had some success combining Tesseract with Google Vision, but this is not ideal and I would like to find an open-source solution.

Tesseract is good at recognising blocks, but not good at distinguishing between actual text and subscript references to footnotes and punctuation. Google, on the other hand, is great at accurately recognising the characters but doesn't recognise column breaks etc.

My current hypothesis is that if I fine-tune the existing Tesseract Hebrew model on a good-quality sample of text from Google, we'll get the results we want.

What makes you think Paddle will work? Have you tried fine-tuning Tesseract?

@johnlockejrr
Contributor Author

johnlockejrr commented Oct 15, 2024

Then take a look at eScriptorium, which uses kraken but can also use Tesseract (which is not actively developed anymore).
Printed texts are a piece of cake for kraken, though you have to train a good segmentation (text and layers detection) model. I mostly work with manuscripts; HTR (handwritten text recognition) is a little more difficult than printed texts.

Drop me an email [at] gmail.com

@ephraimm

@johnlockejrr
sent to gee male. you can do likewise

@luotao1
Collaborator

luotao1 commented Nov 1, 2024

@johnlockejrr Could you provide your email? We need this information when mailing gifts. Please send your email to ext_paddle_oss@baidu.com or here. Thanks very much!

@johnlockejrr
Contributor Author

@luotao1 is: johnlockejrr [at] gmail.com

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 11, 2024
5 participants