[vocabs] Extend the list of predefined vocabularies #1883
Labels
ext: docs
good first issue
module: datasets
topic: documentation
type: enhancement
🚀 The feature
If we want to train or provide multilingual recognition models, we need to extend our predefined vocabularies.
Reference PR: https://github.com/mindee/doctr/pull/1355/files
For example, https://github.com/eymenefealtun/all-words-in-all-languages could be used to extract language-specific charsets.
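As a rough illustration of what "extracting a language-specific charset" could look like, here is a minimal sketch. The function name `extract_charset` is hypothetical and not part of docTR, and the word list is assumed to be available as an iterable of strings (how the linked repository stores its files is not assumed here).

```python
import unicodedata
from typing import Iterable


def extract_charset(words: Iterable[str]) -> str:
    """Collect the unique characters used by a word list (hypothetical helper)."""
    chars = set()
    for word in words:
        # NFC normalization folds "u" + combining diaeresis into a single "ü",
        # so the same letter is not counted twice under different encodings
        for ch in unicodedata.normalize("NFC", word):
            # Skip whitespace and control/format/unassigned characters (category C*)
            if ch.isspace() or unicodedata.category(ch).startswith("C"):
                continue
            chars.add(ch)
    # Sort by code point so the result is reproducible across runs
    return "".join(sorted(chars))


german_sample = ["Straße", "über", "Äpfel\n"]
german_charset = extract_charset(german_sample)
```

A real charset would be built from the full word list of a language and would likely still need manual review, for example to add digits, punctuation, and currency symbols that do not appear in dictionary words.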
The current `multilingual` vocabs entry can be extended with the newly created language entries to provide a deduplicated list that is as complete a multilingual character representation as possible.
`latin_extended` (German, Spanish, Czech, and so on), `cyrillic`, and `hebrew` should be low-hanging fruit to include for training.
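The deduplication step could be sketched as follows. This is only an illustration under stated assumptions: `merge_vocabs` is a hypothetical helper, and the per-language strings are heavily truncated placeholders, not the actual docTR vocabularies.

```python
def merge_vocabs(*vocabs: str) -> str:
    """Concatenate vocab strings, keeping only the first occurrence of each character."""
    # dict.fromkeys preserves first-seen order while dropping repeats
    return "".join(dict.fromkeys("".join(vocabs)))


# Hypothetical, truncated per-language entries for illustration only
example_vocabs = {
    "german": "abcäöüß",
    "czech": "abcčřž",
    "hebrew": "אבג",
}

multilingual = merge_vocabs(*example_vocabs.values())
```

An order-preserving merge is used here rather than a plain `set`, on the assumption that a character's position in the vocab string matters (for instance when it is used as a class index), so the result should be stable from run to run.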
Resources:
https://sites.google.com/site/worldfactsinc/Non-Latin-Script-Languages-Of-The-World
https://www.omniglot.com/writing/langalph.htm