Tesseract trained on handwriting data - when being tested it only outputs the letter "e" #395

KPLinux · 2024-07-15T01:16:52Z

Trained using this dataset: https://www.kaggle.com/datasets/nibinv23/iam-handwriting-word-database/data

I am creating an OCR model that is meant to recognize human handwriting. I extracted the image files, created a separate ground-truth text file for each image, and followed the training process described in the README. The Makefile ran without errors apart from one corrupted file that could not be read; I deleted that file and continued, and training then completed successfully.
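
A setup like the one described above (one transcription file per image) can be sanity-checked before training. The sketch below is only an illustration, not part of tesstrain: it assumes images are plain `.png` or `.tif` files and that each transcription sits next to its image as `<name>.gt.txt`, and the function name `check_ground_truth` is hypothetical. It reports images without a transcription, transcriptions that are empty, and how often each character occurs, which makes a badly skewed or broken ground truth easier to spot.

```python
from collections import Counter
from pathlib import Path

# Assumption: only plain .png/.tif images are used, each paired with
# a transcription file named <image stem>.gt.txt in the same directory.
IMAGE_SUFFIXES = (".png", ".tif")


def check_ground_truth(gt_dir):
    """Return (missing, empty, char_counts) for a ground-truth directory.

    missing     -- image file names that have no .gt.txt transcription
    empty       -- image file names whose transcription is blank
    char_counts -- Counter of every character found in the transcriptions
    """
    gt_dir = Path(gt_dir)
    missing, empty = [], []
    char_counts = Counter()

    for image in sorted(gt_dir.iterdir()):
        if image.suffix.lower() not in IMAGE_SUFFIXES:
            continue  # skip transcriptions and any other files

        gt_file = image.with_name(image.stem + ".gt.txt")
        if not gt_file.exists():
            missing.append(image.name)
            continue

        text = gt_file.read_text(encoding="utf-8").strip()
        if not text:
            empty.append(image.name)
        else:
            char_counts.update(text)

    return missing, empty, char_counts
```

Calling `check_ground_truth` on the ground-truth directory and printing `char_counts.most_common(10)` shows which characters dominate the transcriptions. Non-empty `missing` or `empty` lists point at pairs worth fixing before retraining.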

However, when I tested the trained model on some handwriting samples, Tesseract always returned "e". To rule out a problem with the engine itself, I ran the same images with the English traineddata: the output was gibberish, but each image produced a different result. My trained model, by contrast, outputs only "e" for every input image.

Is there a way to fix this? I am quite new to AI, ML, and programming in general, so any tips or suggestions would be appreciated.

stweil · 2024-07-15T06:35:53Z

Tesseract's layout detection is not able to separate the text lines in handwritten text. It was only designed for printed text. Therefore your newly trained model would work for single lines (with the right --psm parameter), but not for typical handwritten text.

Use kraken or other software which supports text recognition for handwriting.

KPLinux · 2024-07-15T13:48:33Z

Thanks for the info. The model I am training is only meant to detect single lines of text, not multi-line sentences or paragraphs, so I thought tesstrain would help me train a custom Tesseract model for this purpose.

stweil · 2024-07-15T15:42:53Z

Then try --psm 7 or --psm 13 and pass the line image to Tesseract.
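
As a rough sketch of what that looks like in practice, the snippet below builds and runs a Tesseract command for a single line image. It assumes the `tesseract` executable is on the PATH; the helper names, the model name `handwriting`, and the directory `data` in the usage note are placeholders, not values from this thread. `--psm 7` treats the image as a single text line, and `--psm 13` treats it as a raw line without Tesseract-specific preprocessing.

```python
import subprocess


def build_tesseract_command(image_path, model_name, tessdata_dir, psm=7):
    """Build the argument list for recognizing one line image.

    model_name is the traineddata name without the .traineddata suffix;
    tessdata_dir is the directory that contains that file.
    """
    return [
        "tesseract",
        str(image_path),
        "stdout",  # print the recognized text instead of writing a file
        "--tessdata-dir", str(tessdata_dir),
        "-l", model_name,
        "--psm", str(psm),
    ]


def recognize_line(image_path, model_name, tessdata_dir, psm=7):
    """Run Tesseract on a single line image and return the stripped text."""
    command = build_tesseract_command(image_path, model_name, tessdata_dir, psm)
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    return result.stdout.strip()
```

For example, `recognize_line("line.png", "handwriting", "data", psm=13)` would run the hypothetical `handwriting` model from the `data` directory on `line.png` in raw-line mode.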

KPLinux · 2024-07-15T15:58:39Z

I tried both, and they still return "e" and nothing else. I guess I'll have to learn how to use kraken, unless there's some other way to go about this. Either way, thanks for your help.
