Add support for Hebrew Language and Alphabet #13797
Samaritan script is RTL, like Arabic and Hebrew. It is used for Samaritan Hebrew and Samaritan Aramaic, and some texts also contain Arabic letters. https://en.wikipedia.org/wiki/Samaritan_(Unicode_block) https://en.wikipedia.org/wiki/Samaritan_Hebrew https://en.wikipedia.org/wiki/Samaritan_Aramaic_language
Seems like the script failed, being killed.
Please fix the code style.
Fixed.
LGTM
I'd like to test the Hebrew support, but I'm having trouble getting a compiled version working. Could you point me in the right direction? Many thanks.
I haven't trained a Hebrew model on PaddleOCR yet; I got stuck on work with YOLOv8/PyLaia and kraken for Hebrew. By Hebrew I mean classical and medieval Hebrew, not Ivrit.
@johnlockejrr Thanks for your contribution! You will receive a beautiful PaddlePaddle gift. Please provide your mailing address by filling out the following questionnaire before October 18th. Looking forward to the future, we will walk further together in the world of open source!
Is there any way I can help? What are the next steps?
@ephraimm, sure. Right now I'm collecting as much high-quality ground truth and as many datasets for Hebrew as I can. I mostly work with ancient and medieval Hebrew (biblical, rabbinical, responsa, etc.), but also with Ivrit. We can collaborate if you want.
@johnlockejrr We're after the same thing. Tesseract is good at recognising blocks, but not good at distinguishing actual text from superscript footnote references and punctuation. Google, on the other hand, is great at accurately recognising the characters but doesn't recognise column breaks etc. My current hypothesis is that if I give Tesseract a good-quality sample of text from Google and fine-tune the existing Hebrew model, we'll get the results. What makes you think Paddle will work? Have you tried fine-tuning Tesseract?
Then take a look at Drop me an email [at] gmail.com
@johnlockejrr
@johnlockejrr Could you provide your email? We need this information when mailing gifts. Please send your email to ext_paddle_oss@baidu.com or here. Thanks very much!
@luotao1 is: johnlockejrr [at] gmail.com
https://en.wikipedia.org/wiki/Unicode_and_HTML_for_the_Hebrew_alphabet
Hebrew, like Arabic, is an RTL language.
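For anyone who wants to check the RTL claim programmatically, here is a minimal sketch using only Python's standard `unicodedata` module. It is not part of this PR and does not use any PaddleOCR API; the helper names `is_rtl_char` and `is_rtl_text` are hypothetical, and the "first strong character" rule is a simplification of the full Unicode bidirectional algorithm.

```python
import unicodedata

# Bidi classes that mark strongly right-to-left characters:
# "R" covers Hebrew and Samaritan letters, "AL" covers Arabic letters.
RTL_BIDI_CLASSES = {"R", "AL"}


def is_rtl_char(ch: str) -> bool:
    """Return True if a single character has a strong RTL bidi class."""
    return unicodedata.bidirectional(ch) in RTL_BIDI_CLASSES


def is_rtl_text(text: str) -> bool:
    """Guess the base direction from the first strong character.

    Simplified heuristic (an assumption of this sketch): digits, spaces,
    and punctuation are skipped; text with no strong character is
    treated as left-to-right.
    """
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class in RTL_BIDI_CLASSES:
            return True
        if bidi_class == "L":
            return False
    return False


if __name__ == "__main__":
    samples = {
        "Hebrew alef": "\u05D0",
        "Arabic alef": "\u0627",
        "Samaritan alaf": "\u0800",
        "Latin A": "A",
    }
    for name, ch in samples.items():
        print(name, unicodedata.bidirectional(ch), is_rtl_char(ch))
```

A check like this could help decide whether a recognised line needs to be reversed for display, but whether the OCR pipeline needs it at all depends on how the dictionary and labels are ordered.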