[vocabs] Extend the list of predefined vocabularies #1883

Open
6 tasks
felixdittrich92 opened this issue Mar 5, 2025 · 3 comments
Labels: ext: docs, good first issue, module: datasets, topic: documentation, type: enhancement

Comments

felixdittrich92 commented Mar 5, 2025

🚀 The feature

If we want to train or provide multilingual recognition models, we need to extend our predefined vocabularies.

reference PR: https://github.com/mindee/doctr/pull/1355/files

  • russian
  • bulgarian
  • japanese
  • chinese (simplified)
  • korean
  • ...

For example, https://github.com/eymenefealtun/all-words-in-all-languages could be used to extract language-specific charsets.
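A minimal sketch of what such an extraction could look like, assuming the word list for one language is available as an iterable of strings. The function name `extract_charset` and the sample words are hypothetical and not part of docTR or the linked repository; NFC normalization is an assumption made here so that composed and decomposed forms of the same character are not counted twice.

```python
import unicodedata
from typing import Iterable


def extract_charset(words: Iterable[str]) -> str:
    """Return the sorted, deduplicated characters found in `words`.

    Hypothetical helper for illustration only (not a docTR API).
    """
    chars = set()
    for word in words:
        # NFC-normalize so "e" + combining accent and precomposed "é"
        # end up as the same single character where Unicode allows it.
        normalized = unicodedata.normalize("NFC", word)
        # Whitespace is not part of a recognition vocabulary here.
        chars.update(ch for ch in normalized if not ch.isspace())
    # Sort by code point to get a stable, reviewable string.
    return "".join(sorted(chars))


# Illustrative sample; a real run would read a full word list per language.
sample_words = ["привет", "мир", "Москва"]
print(extract_charset(sample_words))
```

The resulting string would still need a manual review (digits, punctuation, and currency symbols usually do not appear in plain word lists) before it is proposed as a new vocab entry.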

The current multilingual vocabs entry can be extended with the newly created language entries to provide a deduplicated character list that covers as many languages as possible.
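The deduplicated merge could be sketched as follows, assuming each vocabulary is a plain string of characters keyed by name. The dict `sample_vocabs` and the helper `merge_vocabs` are hypothetical stand-ins with shortened placeholder charsets, not the real docTR entries.

```python
def merge_vocabs(*charsets: str) -> str:
    """Concatenate charsets, keeping only the first occurrence of each character.

    Hypothetical helper for illustration only (not a docTR API).
    """
    # dict.fromkeys preserves insertion order, so the existing ordering of
    # the first charset is kept and later charsets only append new characters.
    return "".join(dict.fromkeys("".join(charsets)))


# Placeholder entries; the real vocabularies are much longer.
sample_vocabs = {
    "latin_sample": "abc",
    "german_sample": "abcäöü",
    "russian_sample": "абв",
}

multilingual = merge_vocabs(*sample_vocabs.values())
print(multilingual)
```

Keeping insertion order rather than sorting means that an already published multilingual string keeps its existing character positions when new languages are appended, which may matter if trained models depend on the character index.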

latin_extended (German, Spanish, Czech, and so on), cyrillic, and hebrew should be low-hanging fruit to include for training.

Resources:

https://sites.google.com/site/worldfactsinc/Non-Latin-Script-Languages-Of-The-World

https://www.omniglot.com/writing/langalph.htm

felixdittrich92 added the type: enhancement label Mar 5, 2025
felixdittrich92 (Contributor, Author) commented:

Already done:

  • latin
  • english
  • french
  • portuguese
  • spanish
  • italian
  • german
  • arabic
  • czech
  • polish
  • dutch
  • norwegian
  • danish
  • finnish
  • swedish
  • vietnamese
  • hebrew
  • hindi
  • gujarati
  • bangla
  • ukrainian

felixdittrich92 added the topic: documentation, module: datasets, ext: docs, and good first issue labels Mar 5, 2025
felixdittrich92 added this to the 0.12.0 milestone Mar 5, 2025
sarjil77 commented Mar 8, 2025

hey @felixdittrich92 ,

i am working on this.

felixdittrich92 (Contributor, Author) replied, quoting sarjil77:

> hey @felixdittrich92 ,
>
> i am working on this.

@sarjil77 Thanks, sounds great 👍
One PR per language, please :)
