Failed to read data/Assert failed #394

Open
T0biasCZe opened this issue Jun 6, 2024 · 4 comments

Comments

@T0biasCZe

When trying to fine-tune a model, I get "Failed to read data" errors followed by an "Assert failed" error:

C:\Users\tobik\source\repos\tesstrain>make training MODEL_NAME=ocrd-testset START_MODEL=ces TESSDATA=C:\tessdata
You are using make version: 4.4.1
combine_tessdata -u C:\tessdata/ces.traineddata data/ces/ocrd-testset
Extracting tessdata components from C:\tessdata/ces.traineddata
Wrote data/ces/ocrd-testset.lstm
Wrote data/ces/ocrd-testset.lstm-punc-dawg
Wrote data/ces/ocrd-testset.lstm-word-dawg
Wrote data/ces/ocrd-testset.lstm-number-dawg
Wrote data/ces/ocrd-testset.lstm-unicharset
Wrote data/ces/ocrd-testset.lstm-recoder
Wrote data/ces/ocrd-testset.version
Version:4.00.00alpha:ces:synth20170629:[1,48,0,1Ct3,3,16Mp3,3Lfys64Lfx96Lrx96Lfx384O1c1]
17:lstm:size=7541987, offset=192
18:lstm-punc-dawg:size=322, offset=7542179
19:lstm-word-dawg:size=3366074, offset=7542501
20:lstm-number-dawg:size=2114, offset=10908575
21:lstm-unicharset:size=7028, offset=10910689
22:lstm-recoder:size=1111, offset=10917717
23:version:size=80, offset=10918828
unicharset_extractor --output_unicharset "data/ocrd-testset/my.unicharset" --norm_mode 2 "data/ocrd-testset/all-gt"
Extracting unicharset from plain text file data/ocrd-testset/all-gt
Other case W of w is not in unicharset
Other case O of o is not in unicharset
Other case R of r is not in unicharset
Other case I of i is not in unicharset
Other case U of u is not in unicharset
Other case E of e is not in unicharset
Other case G of g is not in unicharset
Other case k of K is not in unicharset
Other case V of v is not in unicharset
Other case Y of y is not in unicharset
Other case Z of z is not in unicharset
Other case J of j is not in unicharset
Wrote unicharset file data/ocrd-testset/my.unicharset
merge_unicharsets data/ces/ocrd-testset.lstm-unicharset data/ocrd-testset/my.unicharset "data/ocrd-testset/unicharset"
Loaded unicharset of size 123 from file data/ces/ocrd-testset.lstm-unicharset
Loaded unicharset of size 45 from file data/ocrd-testset/my.unicharset
Wrote unicharset file data/ocrd-testset/unicharset.
PYTHONIOENCODING=utf-8 python generate_line_box.py -i "data/ocrd-testset-ground-truth/wienbarg_feldzuege_1834_0105_008.tif" -t "data/ocrd-testset-ground-truth/wienbarg_feldzuege_1834_0105_008.gt.txt" > "data/ocrd-testset-ground-truth/wienbarg_feldzuege_1834_0105_008.box"
tesseract "data/ocrd-testset-ground-truth/wienbarg_feldzuege_1834_0105_008.tif" data/ocrd-testset-ground-truth/wienbarg_feldzuege_1834_0105_008 --psm 13 lstm.train
PYTHONIOENCODING=utf-8 python generate_line_box.py -i "data/ocrd-testset-ground-truth/wienbarg_feldzuege_1834_0117_023.tif" -t "data/ocrd-testset-ground-truth/wienbarg_feldzuege_1834_0117_023.gt.txt" > "data/ocrd-testset-ground-truth/wienbarg_feldzuege_1834_0117_023.box"
tesseract "data/ocrd-testset-ground-truth/wienbarg_feldzuege_1834_0117_023.tif" data/ocrd-testset-ground-truth/wienbarg_feldzuege_1834_0117_023 --psm 13 lstm.train
PYTHONIOENCODING=utf-8 python generate_line_box.py -i "data/ocrd-testset-ground-truth/wienbarg_feldzuege_1834_0127_011.tif" -t "data/ocrd-testset-ground-truth/wienbarg_feldzuege_1834_0127_011.gt.txt" > "data/ocrd-testset-ground-truth/wienbarg_feldzuege_1834_0127_011.box"
tesseract "data/ocrd-testset-ground-truth/wienbarg_feldzuege_1834_0127_011.tif" data/ocrd-testset-ground-truth/wienbarg_feldzuege_1834_0127_011 --psm 13 lstm.train
PYTHONIOENCODING=utf-8 python generate_line_box.py -i "data/ocrd-testset-ground-truth/wienbarg_feldzuege_1834_0155_024.tif" -t "data/ocrd-testset-ground-truth/wienbarg_feldzuege_1834_0155_024.gt.txt" > "data/ocrd-testset-ground-truth/wienbarg_feldzuege_1834_0155_024.box"
tesseract "data/ocrd-testset-ground-truth/wienbarg_feldzuege_1834_0155_024.tif" data/ocrd-testset-ground-truth/wienbarg_feldzuege_1834_0155_024 --psm 13 lstm.train
PYTHONIOENCODING=utf-8 python generate_line_box.py -i "data/ocrd-testset-ground-truth/wienbarg_feldzuege_1834_0175_017.tif" -t "data/ocrd-testset-ground-truth/wienbarg_feldzuege_1834_0175_017.gt.txt" > "data/ocrd-testset-ground-truth/wienbarg_feldzuege_1834_0175_017.box"
tesseract "data/ocrd-testset-ground-truth/wienbarg_feldzuege_1834_0175_017.tif" data/ocrd-testset-ground-truth/wienbarg_feldzuege_1834_0175_017 --psm 13 lstm.train
PYTHONIOENCODING=utf-8 python generate_line_box.py -i "data/ocrd-testset-ground-truth/wienbarg_feldzuege_1834_0188_011.tif" -t "data/ocrd-testset-ground-truth/wienbarg_feldzuege_1834_0188_011.gt.txt" > "data/ocrd-testset-ground-truth/wienbarg_feldzuege_1834_0188_011.box"
tesseract "data/ocrd-testset-ground-truth/wienbarg_feldzuege_1834_0188_011.tif" data/ocrd-testset-ground-truth/wienbarg_feldzuege_1834_0188_011 --psm 13 lstm.train
PYTHONIOENCODING=utf-8 python generate_line_box.py -i "data/ocrd-testset-ground-truth/wienbarg_feldzuege_1834_0223_018.tif" -t "data/ocrd-testset-ground-truth/wienbarg_feldzuege_1834_0223_018.gt.txt" > "data/ocrd-testset-ground-truth/wienbarg_feldzuege_1834_0223_018.box"
tesseract "data/ocrd-testset-ground-truth/wienbarg_feldzuege_1834_0223_018.tif" data/ocrd-testset-ground-truth/wienbarg_feldzuege_1834_0223_018 --psm 13 lstm.train
PYTHONIOENCODING=utf-8 python generate_line_box.py -i "data/ocrd-testset-ground-truth/wienbarg_feldzuege_1834_0245_023.tif" -t "data/ocrd-testset-ground-truth/wienbarg_feldzuege_1834_0245_023.gt.txt" > "data/ocrd-testset-ground-truth/wienbarg_feldzuege_1834_0245_023.box"
tesseract "data/ocrd-testset-ground-truth/wienbarg_feldzuege_1834_0245_023.tif" data/ocrd-testset-ground-truth/wienbarg_feldzuege_1834_0245_023 --psm 13 lstm.train
PYTHONIOENCODING=utf-8 python generate_line_box.py -i "data/ocrd-testset-ground-truth/wienbarg_feldzuege_1834_0287_011.tif" -t "data/ocrd-testset-ground-truth/wienbarg_feldzuege_1834_0287_011.gt.txt" > "data/ocrd-testset-ground-truth/wienbarg_feldzuege_1834_0287_011.box"
tesseract "data/ocrd-testset-ground-truth/wienbarg_feldzuege_1834_0287_011.tif" data/ocrd-testset-ground-truth/wienbarg_feldzuege_1834_0287_011 --psm 13 lstm.train
PYTHONIOENCODING=utf-8 python generate_line_box.py -i "data/ocrd-testset-ground-truth/wienbarg_feldzuege_1834_0318_006.tif" -t "data/ocrd-testset-ground-truth/wienbarg_feldzuege_1834_0318_006.gt.txt" > "data/ocrd-testset-ground-truth/wienbarg_feldzuege_1834_0318_006.box"
tesseract "data/ocrd-testset-ground-truth/wienbarg_feldzuege_1834_0318_006.tif" data/ocrd-testset-ground-truth/wienbarg_feldzuege_1834_0318_006 --psm 13 lstm.train
python shuffle.py 0 "data/ocrd-testset/all-lstmf"
python generate_eval_train.py data/ocrd-testset/all-lstmf 0.90


dos2unix "data/ocrd-testset/ocrd-testset.numbers"
dos2unix: data/ocrd-testset/ocrd-testset.numbers: No such file or directory
dos2unix: Skipping data/ocrd-testset/ocrd-testset.numbers, not a regular file.
make: [Makefile:290: data/ocrd-testset/ocrd-testset.traineddata] Error 2 (ignored)
dos2unix "data/ocrd-testset/ocrd-testset.punc"
dos2unix: data/ocrd-testset/ocrd-testset.punc: No such file or directory
dos2unix: Skipping data/ocrd-testset/ocrd-testset.punc, not a regular file.
make: [Makefile:291: data/ocrd-testset/ocrd-testset.traineddata] Error 2 (ignored)
dos2unix "data/ocrd-testset/ocrd-testset.wordlist"
dos2unix: data/ocrd-testset/ocrd-testset.wordlist: No such file or directory
dos2unix: Skipping data/ocrd-testset/ocrd-testset.wordlist, not a regular file.
make: [Makefile:292: data/ocrd-testset/ocrd-testset.traineddata] Error 2 (ignored)
dos2unix "data/langdata/ocrd-testset/ocrd-testset.config"
dos2unix: data/langdata/ocrd-testset/ocrd-testset.config: No such file or directory
dos2unix: Skipping data/langdata/ocrd-testset/ocrd-testset.config, not a regular file.
make: [Makefile:293: data/ocrd-testset/ocrd-testset.traineddata] Error 2 (ignored)
combine_lang_model \
  --input_unicharset data/ocrd-testset/unicharset \
  --script_dir data/langdata \
  --numbers data/ocrd-testset/ocrd-testset.numbers \
  --puncs data/ocrd-testset/ocrd-testset.punc \
  --words data/ocrd-testset/ocrd-testset.wordlist \
  --output_dir data \
   \
  --lang ocrd-testset
Failed to read data from: data/ocrd-testset/ocrd-testset.wordlist
Failed to read data from: data/ocrd-testset/ocrd-testset.punc
Failed to read data from: data/ocrd-testset/ocrd-testset.numbers
Loaded unicharset of size 126 from file data/ocrd-testset/unicharset
Setting unichar properties
Setting script properties
Failed to load script unicharset from:data/langdata/Inherited.unicharset
Config file is optional, continuing...
Failed to read data from: data/langdata/ocrd-testset/ocrd-testset.config
Null char=2
Created data/ocrd-testset/ocrd-testset.traineddata
lstmtraining \
  --debug_interval 0 \
  --traineddata data/ocrd-testset/ocrd-testset.traineddata \
  --old_traineddata C:\tessdata/ces.traineddata \
  --continue_from data/ces/ocrd-testset.lstm \
  --learning_rate 0.0001 \
  --model_output data/ocrd-testset/checkpoints/ocrd-testset \
  --train_listfile data/ocrd-testset/list.train \
  --eval_listfile data/ocrd-testset/list.eval \
  --max_iterations 10000 \
  --target_error_rate 0.01 \
2>&1 | tee -a data/ocrd-testset/training.log
Loaded file data/ces/ocrd-testset.lstm, unpacking...
Warning: LSTMTrainer deserialized an LSTMRecognizer!
Code range changed from 122 to 125!
old_mgr.Init(old_traineddata):Error:Assert failed:in file ../../../src/training/unicharset/lstmtrainer.cpp, line 132

lstmtraining \
--stop_training \
--continue_from data/ocrd-testset/checkpoints/ocrd-testset_checkpoint \
--traineddata data/ocrd-testset/ocrd-testset.traineddata \
--model_output data/ocrd-testset.traineddata
Failed to read continue from: data/ocrd-testset/checkpoints/ocrd-testset_checkpoint
make: *** [Makefile:325: data/ocrd-testset.traineddata] Error 1
@zdenop
Contributor

zdenop commented Jun 7, 2024

Which version of Tesseract are you using?

@stweil
Collaborator

stweil commented Jun 11, 2024

I get slightly different output and no crash when I try this on Debian GNU/Linux:

$ lstmtraining \
  --debug_interval 0 \
  --traineddata data/ocrd-testset/ocrd-testset.traineddata \
  --old_traineddata ../tessdata_best/ces.traineddata \
  --continue_from data/ces/ocrd-testset.lstm \
  --learning_rate 0.0001 \
  --model_output data/ocrd-testset/checkpoints/ocrd-testset \
  --train_listfile data/ocrd-testset/list.train \
  --eval_listfile data/ocrd-testset/list.eval \
  --max_iterations 10000 \
  --target_error_rate 0.01 \
2>&1 | tee -a data/ocrd-testset/training.log
Loaded file data/ces/ocrd-testset.lstm, unpacking...
Warning: LSTMTrainer deserialized an LSTMRecognizer!
Code range changed from 122 to 131!
Num (Extended) outputs,weights in Series:
  1,48,0,1:1, 0
Num (Extended) outputs,weights in Series:
  C3,3:9, 0
  Ft16:16, 160
Total weights = 160
  [C3,3Ft16]:16, 160
  Mp3,3:16, 0
  TxyLfys64:64, 20736
  Lfx96:96, 61824
  RxLrx96:96, 74112
  Lfx384:384, 738816
  Fc131:131, 50435
Total weights = 946083
Previous null char=121 mapped to 130
Continuing from data/ces/ocrd-testset.lstm
2 Percent improvement time=100, best error was 100 @ 0
At iteration 100/100/100, mean rms=2.136%, delta=7.610%, BCER train=27.051%, BWER train=59.946%, skip ratio=0.000%, New best BCER = 27.051 wrote best model:data/ocrd-testset/checkpoints/ocrd-testset_27.051_100_100.checkpoint wrote checkpoint.
2 Percent improvement time=100, best error was 27.051 @ 100
At iteration 200/200/200, mean rms=1.956%, delta=6.367%, BCER train=24.516%, BWER train=54.783%, skip ratio=0.000%, New best BCER = 24.516 wrote best model:data/ocrd-testset/checkpoints/ocrd-testset_24.516_200_200.checkpoint wrote checkpoint.

@zdenop
Contributor

zdenop commented Jun 11, 2024

I tried both the latest code and 5.4.0, and I am not able to reproduce it.

tesseract -v
tesseract 5.4.0
 leptonica-1.84.2 (May 13 2024, 19:39:23) [MSC v.1929 LIB Release x64]
  libgif 5.1.2 : libjpeg 6b (libjpeg-turbo 2.1.90) : libpng 1.6.40 : libtiff 4.6.0 : zlib 1.2.13.zlib-ng : libwebp 1.3.2 : libopenjp2 2.5.0
 Found AVX2
 Found AVX
 Found FMA
 Found SSE4.1
 Found OpenMP 200203

I have ICU version 74.2.

@briannicholas

I had the same problem.

It's a Windows issue. You need to specify the TESSDATA path using forward slashes.

So for the OP:

C:\Users\tobik\source\repos\tesstrain>make training MODEL_NAME=ocrd-testset START_MODEL=ces TESSDATA=C:/tessdata

rather than

TESSDATA=C:\tessdata
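
To illustrate why the slash direction might matter, here is a minimal Python sketch. It rests on an assumption that is not confirmed anywhere in this thread: that the failing `lstmtraining` recipe line is passed through a POSIX-like shell, which treats an unquoted backslash as an escape character and drops it. The helper names `to_forward_slashes` and `posix_shell_view` are made up for this example and are not part of tesstrain or Tesseract. `shlex.split` is used only as a rough stand-in for POSIX shell tokenization.

```python
import shlex
from pathlib import PureWindowsPath


def to_forward_slashes(path: str) -> str:
    """Rewrite a Windows-style path with forward slashes (hypothetical helper)."""
    return PureWindowsPath(path).as_posix()


def posix_shell_view(command: str) -> list:
    """Tokenize a command line using POSIX shell quoting rules.

    Illustration only: this approximates what a POSIX-like shell would hand
    to the program; it is an assumption that such a shell runs the recipe.
    """
    return shlex.split(command)


if __name__ == "__main__":
    # The value the OP passed, converted to the form suggested above.
    print(to_forward_slashes(r"C:\tessdata"))

    # With a backslash, the "\t" pair collapses to "t" under POSIX rules,
    # so the path separator after the drive letter disappears.
    print(posix_shell_view(r"--old_traineddata C:\tessdata/ces.traineddata"))

    # With forward slashes, the path passes through unchanged.
    print(posix_shell_view("--old_traineddata C:/tessdata/ces.traineddata"))
```

If the assumption holds, the second call shows the path arriving as `C:tessdata/ces.traineddata`, which would no longer point at the intended file. Passing `TESSDATA=C:/tessdata` avoids the question entirely, since forward slashes need no escaping.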
