Training data generator for text detection and text recognition. The training data is generated in the format required by each supported OCR system. The supported OCR systems are doctr, mmocr, and paddleocr.
At the moment, the following datasets can be used to generate the training data:
FUNSD: https://guillaumejaume.github.io/FUNSD/
IAM: https://fki.tic.heia-fr.ch/databases/iam-handwriting-database
SROIE: https://paperswithcode.com/paper/icdar2019-competition-on-scanned-receipt-ocr
XFUND: https://github.com/doc-analysis/XFUND (de, es, fr, it, ja, pt, zh)
Install the requirements:
pip3 install -r requirements.txt
To generate the training data, first check ./config/config.json. This JSON file specifies:
output: where the generated training data is stored (./output/)
ocr-system: the OCR system that will be trained; the choices are doctr, mmocr, and paddleocr
tasks: specify whether the training data is for detection, recognition, or both:
    "tasks": ["det"]          # only det
    "tasks": ["rec"]          # only rec
    "tasks": ["det", "rec"]   # both
datasets: specify which datasets are used to generate the training data. To select a dataset, set it to "y"; otherwise set it to "n". Example:
    "dataset1": "y",      # selected
    "dataset2": {
        "sub1": "n",      # not selected
        "sub2": "y"       # selected
    }
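The options above can be sanity-checked before running the generator. The following is a minimal sketch, not part of this repository: the helper names `selected_datasets` and `check_config` are hypothetical, the dataset keys in the example are placeholders, and it assumes the four fields sit at the top level of the JSON file with `output` given as a path string.

```python
import json

# Allowed values, taken from the option descriptions above.
OCR_SYSTEMS = {"doctr", "mmocr", "paddleocr"}
TASKS = {"det", "rec"}


def selected_datasets(datasets, prefix=""):
    """Return the names of all entries set to "y", descending into sub-datasets."""
    selected = []
    for name, value in datasets.items():
        full_name = f"{prefix}{name}"
        if isinstance(value, dict):
            # Nested entry: each sub-dataset carries its own y/n flag.
            selected.extend(selected_datasets(value, prefix=f"{full_name}/"))
        elif value == "y":
            selected.append(full_name)
        elif value != "n":
            raise ValueError(f"{full_name}: expected 'y' or 'n', got {value!r}")
    return selected


def check_config(config):
    """Raise ValueError on an unlisted value; return the selected dataset names."""
    if config["ocr-system"] not in OCR_SYSTEMS:
        raise ValueError(f"unknown ocr-system: {config['ocr-system']!r}")
    tasks = config["tasks"]
    if not tasks or not set(tasks) <= TASKS:
        raise ValueError(f"tasks must be a non-empty subset of {sorted(TASKS)}")
    return selected_datasets(config["datasets"])


# Hypothetical config; the dataset keys are placeholders, not the real ones.
example = json.loads("""
{
    "output": "./output/",
    "ocr-system": "doctr",
    "tasks": ["det", "rec"],
    "datasets": {
        "dataset1": "y",
        "dataset2": {"sub1": "n", "sub2": "y"}
    }
}
""")

print(check_config(example))  # ['dataset1', 'dataset2/sub2']
```

To check the real file instead, the same function could be applied to `json.load(open("./config/config.json"))`, provided the file follows the layout assumed here.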
When everything is set up, just run:
python3 generate.py