How to do incremental-training on tesseract-ocr? #391

zaryabRiasat · 2024-05-31T07:32:57Z

I'm working with tesseract-4.1.1 and trying to do training (fine-tuning). For this I followed these steps:

I downloaded eng.traineddata from tessdata_best and pasted it into /usr/share/tesseract-ocr/4.00/tessdata.
Then I created image crops using craft-text-detector in Python and made a ground-truth file (.gt.txt) for each image crop.
Then I ran git clone https://github.com/tesseract-ocr/ocrd-train.git and then cd ocrd-train.
Inside the ocrd-train/data folder, I created a my-model-ground-truth folder and pasted the .png and .gt.txt files into it.
Then I ran the command make tesseract-langdata in the terminal.
Finally I ran the command make training MODEL_NAME=my-model MAX_ITERATIONS=20000 PSM=7 FINETUNE_TYPE=Impact DEBUG_INTERVAL=-1 START_MODEL=eng TESSDATA=/usr/share/tesseract-ocr/4.00/tessdata/
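
Before running make training, it can help to confirm that every image crop has a matching transcription. The following is a minimal sketch in Python, assuming the ground-truth layout described in the steps above (one .gt.txt per .png, sharing the same base name). The function name unpaired_ground_truth is hypothetical and is not part of ocrd-train.

```python
from pathlib import Path


def unpaired_ground_truth(folder):
    """Return (images_without_text, texts_without_image) as sorted base names.

    Hypothetical helper, not part of ocrd-train. It assumes each line image
    "NAME.png" should have a transcription "NAME.gt.txt" in the same folder.
    """
    folder = Path(folder)
    # Base names of all line images, e.g. "crop_001" for "crop_001.png".
    images = {p.name[: -len(".png")] for p in folder.glob("*.png")}
    # Base names of all transcriptions, e.g. "crop_001" for "crop_001.gt.txt".
    texts = {p.name[: -len(".gt.txt")] for p in folder.glob("*.gt.txt")}
    return sorted(images - texts), sorted(texts - images)
```

Calling unpaired_ground_truth("data/my-model-ground-truth") from inside the ocrd-train folder would return two lists; both should be empty if every crop is paired.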

The above procedure took some time and produced a my-model.traineddata file in ocrd-train/data/. I pasted that file into /usr/share/tesseract-ocr/4.00/tessdata, and it gives better results than eng.traineddata.

For the above training I used 20 images. Now I want to do incremental training: I want to train the previously trained my-model.traineddata on 30 more images. I'm confused because, after the previous training completed, ocrd-train/data/ contains the following:

my-model (folder)
my-model-ground-truth (folder)
eng (folder)
langdata (folder)
my-model.traineddata (file)

Now what should I do for incremental-training?

Do I only need to remove the files in my-model-ground-truth, paste in the new .png and .gt.txt files for the 30 images, and use my-model as START_MODEL?

Or do I need to remove the other folders as well?
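
To make the first option concrete, here is a minimal Python sketch that assembles the same make training command used in the first run, with only MODEL_NAME and START_MODEL changed. The name my-model-v2 is hypothetical, and whether this version of the repository accepts a previously fine-tuned model as START_MODEL is an assumption, not something confirmed in this thread. It also assumes my-model.traineddata is present in the TESSDATA folder, as described above.

```python
import shlex


def training_command(model_name, start_model, tessdata, max_iterations=20000):
    """Build the `make training` argument list with the variables from the first run."""
    return [
        "make", "training",
        f"MODEL_NAME={model_name}",
        f"MAX_ITERATIONS={max_iterations}",
        "PSM=7",
        "FINETUNE_TYPE=Impact",
        "DEBUG_INTERVAL=-1",
        f"START_MODEL={start_model}",
        f"TESSDATA={tessdata}",
    ]


# Hypothetical second round: "my-model-v2" is an illustrative name, and
# START_MODEL=my-model being accepted here is an assumption (see lead-in).
cmd = training_command(
    "my-model-v2", "my-model", "/usr/share/tesseract-ocr/4.00/tessdata/"
)
print(shlex.join(cmd))
```

Using a new MODEL_NAME keeps the output of the second round separate from the first. Judging by the first run, the new images would then be expected in a folder named after that model (data/my-model-v2-ground-truth), which is also an inference rather than documented behaviour.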


stweil · 2024-05-31T07:39:08Z

Are you using very old instructions (old Tesseract release, old repository URL, ...)?

zaryabRiasat · 2024-05-31T09:30:34Z

@stweil Thank you for your response.

Yes, I'm using tesseract-4.1.1 and the old repository.

First-time training works fine with START_MODEL=eng, but I am unable to do incremental training as described above.

zaryabRiasat · 2024-05-31T09:42:34Z

@stweil I just want to know: how can I do incremental training on my existing trained model?

What steps should I follow?

zdenop · 2024-05-31T14:09:19Z

What about reading the Tesseract documentation and the README of this repository?

stweil · 2024-05-31T16:28:03Z

@zaryabRiasat, the first step is using a recent software release instead of an old one and also reading the current documentation.
