Many bugs in training the legacy engine #3925
Thank you for this detailed report. So you are training a legacy model? That is indeed rarely done, as most people (including myself) typically train LSTM models. It would help if you could describe the individual steps needed to reproduce the failures. Ideally we should then create unit tests to avoid future regressions. |
Unit testing is not enough; we should do real-world testing on thousands of pages to test the layout analysis and the two OCR engines. BTW, there was a report by @tfmorris about a huge drop in speed and accuracy that occurred between version 3.02 and version 3.03 (and some later versions). Nobody did anything to find out the source of the regression. I also read a report on a drop in accuracy of the layout analysis that occurred between 3.04 and 4.0. I don't have a reference to that report. There were also some general reports (without much detail) about a drop in accuracy that occurred between 4.x and 5.0. |
I'm not sure which method is the legacy one and which is not. I'm working on an industrial application, so I MUST use C++ only. I don't use tesstrain because it seems to work only in a Python environment and cannot be deployed in C++. I also can't find any real step-by-step training guide for the latest versions.
|
(1)
(2) tesseract/src/classify/intproto.cpp Lines 507 to 508 in 74e226b
tesseract/src/ccstruct/fontinfo.cpp Lines 222 to 224 in 5a36943
(3) tesseract/src/ccutil/unicity_table.h Lines 76 to 77 in 839f528
|
Can you please see if the suggested changes can be applied? |
I still try to reproduce the bugs locally. |
Were you able to reproduce the reported bugs? |
No, not up to now. |
I made a minimal reproduction of part of the problem (the crash of shapeclustering) for those who want to dig into it. I also found an old version, tesseract 3.05.02, which is able to create the shapetable from this example. The steps to reproduce are quite simple:
|
If the training tools for the legacy engine are broken and nobody will fix them in time for the 5.3.0 release, I suggest modifying cmake, sw and autotools so that they do not compile and install the legacy training tools. |
I already pointed to that commit in #3925 (comment) |
Fixes: cac116d ("Replace more PointerVector by std::vector [...]") Signed-off-by: Stefan Weil <sw@weilnetz.de>
@SpaceView, pull request #3970 fixes the issue in my test. Perhaps you can try it and confirm whether it works for you, too. |
Fixed in #3970. |
I am afraid this issue is not fully solved. This set of commands works for me with tesseract 3.05.02 (to show what the process should look like):
tesseract num.ocra.exp0.png num.ocra.exp0 nobatch box.train
unicharset_extractor num.ocra.exp0.box
set_unicharset_properties -U unicharset -O num.unicharset --script_dir=langdata/
shapeclustering -F font_properties -U num.unicharset num.ocra.exp0.tr
mftraining -F font_properties -U num.unicharset -O num.unicharset num.ocra.exp0.tr
cntraining num.ocra.exp0.tr
mv inttemp num.inttemp
mv pffmtable num.pffmtable
mv normproto num.normproto
mv shapetable num.shapetable
combine_tessdata num.
mkdir tessdata
mv num.traineddata tessdata
tesseract num.ocra.exp0.png - --psm 7 -l num --tessdata-dir .
However
Unfortunately, I do not have time to test the other version mentioned by the reporter. |
I found some spare time for testing, and here are some observations:
IMO it would be good to create a small test case for LSTM training as well, to check whether the output is similar to that of 5.0.0-alpha. |
I'm afraid that the changes 51909d5...36f9131 at least contribute to the regression. Extract from old
Extract from new
The old code used Related functions: |
Fixes: 3b07599 ("Replace more STRING by std::string") Signed-off-by: Stefan Weil <sw@weilnetz.de>
mftraining crashed because the returned value was 1 instead of 0 for the first call of UnicityTable::push_back. Signed-off-by: Stefan Weil <sw@weilnetz.de>
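For readers unfamiliar with this class of bug, here is a minimal sketch of the index contract the commit message above describes: the first insertion must yield index 0. `UniqueIndexTable` is a hypothetical stand-in written for illustration, not the real `UnicityTable`, and the deduplicating behaviour is an assumption based on the class name.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a table of unique values addressed by index.
// This is NOT the real tesseract UnicityTable, only an illustration.
template <typename T>
class UniqueIndexTable {
 public:
  // Returns the index of `value`, inserting it first if it is not present.
  // The first inserted value gets index 0.
  int push_back(const T& value) {
    const auto it = std::find(items_.begin(), items_.end(), value);
    if (it != items_.end()) {
      return static_cast<int>(it - items_.begin());
    }
    items_.push_back(value);
    // size() is now one past the new element, hence the "- 1".
    // Returning size() here would yield 1 for the first insertion.
    return static_cast<int>(items_.size()) - 1;
  }

  // Bounds-checked access by index.
  const T& at(int index) const {
    assert(index >= 0 && static_cast<std::size_t>(index) < items_.size());
    return items_[static_cast<std::size_t>(index)];
  }

  std::size_t size() const { return items_.size(); }

 private:
  std::vector<T> items_;
};
```

A caller that stores the returned value and later uses it as an index reads one element past the intended one if the function reports the size after insertion, which is the kind of failure a crash in a training tool could stem from.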
It crashed when running mftraining with fs.size() == 0. Signed-off-by: Stefan Weil <sw@weilnetz.de>
It crashed when running mftraining because unicharset_size in file "inttemp" was written with 8 bytes instead of 4 bytes. Signed-off-by: Stefan Weil <sw@weilnetz.de>
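The 8-bytes-versus-4-bytes problem in the commit message above can be illustrated with a small sketch. It assumes the value is held in a `std::size_t` in memory and that the reader expects a 4-byte field; the helper names are hypothetical and the byte buffer stands in for the real file I/O.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Appends the raw bytes of `value` to `out`, similar in spirit to
// fwrite(&value, sizeof(value), 1, fp).
template <typename T>
void append_raw(std::vector<unsigned char>& out, const T& value) {
  const auto* bytes = reinterpret_cast<const unsigned char*>(&value);
  out.insert(out.end(), bytes, bytes + sizeof(T));
}

// Problematic: the field width follows the in-memory type
// (std::size_t is 8 bytes on typical 64-bit platforms).
std::vector<unsigned char> write_size_native(std::size_t unicharset_size) {
  std::vector<unsigned char> out;
  append_raw(out, unicharset_size);
  return out;
}

// Safer: convert to the fixed-width type the reader expects.
std::vector<unsigned char> write_size_fixed(std::size_t unicharset_size) {
  std::vector<unsigned char> out;
  append_raw(out, static_cast<std::uint32_t>(unicharset_size));
  return out;
}

// Reader that accepts exactly one 4-byte field and rejects any other width.
bool read_size_fixed(const std::vector<unsigned char>& in,
                     std::uint32_t& result) {
  if (in.size() != sizeof(std::uint32_t)) return false;
  std::memcpy(&result, in.data(), sizeof(result));
  return true;
}
```

When a container type is swapped for one whose size type is wider, any code that serializes with `sizeof` of the variable silently changes the file layout, so an explicit fixed-width cast at the write site keeps the format stable.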
This fixes duplicate delete when running cntraining. Signed-off-by: Stefan Weil <sw@weilnetz.de>
…ract-ocr#3925) It is required for mftraining which otherwise writes a wrong shapetable. Signed-off-by: Stefan Weil <sw@weilnetz.de>
The old code did not work correctly if FClass->font_set.size() was 0. It created the FontSet fs with size 1 instead of 0. Signed-off-by: Stefan Weil <sw@weilnetz.de>
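One well-known C++ pitfall produces exactly the symptom in the commit message above (a container of size 1 where size 0 was intended): brace initialization of a `std::vector` selects the initializer-list constructor. Whether this was the actual cause is not stated in this thread, so treat the following as an illustrative sketch with hypothetical helper names.

```cpp
#include <cstddef>
#include <vector>

// Braces pick the initializer_list constructor: the result holds ONE
// element whose value is n, even when n == 0.
std::vector<int> make_with_braces(int n) {
  return std::vector<int>{n};
}

// Parentheses pick the size constructor: the result holds n
// value-initialized elements, so n == 0 gives an empty vector.
std::vector<int> make_with_parens(std::size_t n) {
  return std::vector<int>(n);
}

// If only capacity is wanted, reserve() leaves the size at 0.
std::vector<int> make_reserved(std::size_t n) {
  std::vector<int> v;
  v.reserve(n);
  return v;
}
```

Code that later loops up to `size()` or treats a non-empty container as "has fonts" behaves differently in the two cases, which is why the empty input is the one that exposes the problem.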
It was triggered by mftraining. Signed-off-by: Stefan Weil <sw@weilnetz.de>
mftraining crashed if the search did not find anything. Signed-off-by: Stefan Weil <sw@weilnetz.de>
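A crash on an unsuccessful search usually means the "not found" result was used as if it were valid. The sketch below shows the defensive pattern in general terms; `find_index` and `value_or_default` are hypothetical helpers, not functions from the tesseract code base.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Returns the position of `wanted` in `ids`, or -1 when it is absent.
int find_index(const std::vector<int>& ids, int wanted) {
  const auto it = std::find(ids.begin(), ids.end(), wanted);
  if (it == ids.end()) {
    return -1;  // Never dereference or subtract from end() blindly.
  }
  return static_cast<int>(it - ids.begin());
}

// Example caller: handles the "not found" case explicitly instead of
// indexing `values` with -1.
int value_or_default(const std::vector<int>& ids,
                     const std::vector<int>& values,
                     int wanted, int fallback) {
  const int index = find_index(ids, wanted);
  if (index < 0 || static_cast<std::size_t>(index) >= values.size()) {
    return fallback;
  }
  return values[static_cast<std::size_t>(index)];
}
```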
@SpaceView, hopefully the many bugs that you found and reported are fixed by the commits in pull request #3977. Some of those commits are nearly identical to your proposed code changes. |
I doubt anybody has successfully trained custom data with tesseract 5.2.0 or 5.1.0; the latest version I can succeed with is 5.0.0-alpha-20201224.
Below are some bugs I hit when running tesseract 5.2.0 for custom data training. There are TOO MANY BUGS, so I was not able to finish the whole training in the limited time I have at the moment. The following are just a few of the bugs I found, for reference.
I changed the above code and can now get "shapeclustering.exe" and "mftraining.exe" to run smoothly; all training material such as "inttemp" and "pffmtable" is generated correctly.
Currently cntraining.exe still crashes, but I don't have any more time to test.