OCR Engine modes:
- 0: Original Tesseract only.
- 1: Neural nets LSTM only.
- 2: Tesseract + LSTM.
- 3: Default, based on what is available.
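The engine mode is chosen on the command line. The sketch below is a minimal illustration, assuming the modes are numbered 0 to 3 in the order listed above and that the executable accepts the `--oem` flag, as Tesseract 4 releases do. Whether this particular alpha build accepts it is worth checking with `tesseract.exe --help`. The names `OcrEngineMode` and `oem_args` are hypothetical helpers, not part of Tesseract.

```python
from enum import IntEnum


class OcrEngineMode(IntEnum):
    """OCR engine modes, numbered in the order listed above (assumed 0-3)."""
    TESSERACT_ONLY = 0           # Original Tesseract only
    LSTM_ONLY = 1                # Neural nets LSTM only
    TESSERACT_LSTM_COMBINED = 2  # Tesseract + LSTM
    DEFAULT = 3                  # Default, based on what is available


def oem_args(mode):
    """Return the command-line arguments that select the given engine mode.

    Raises ValueError if `mode` is not one of the four known modes.
    """
    return ["--oem", str(int(OcrEngineMode(mode)))]
```

For example, `["tesseract.exe", "test_image2.png", "out"] + oem_args(OcrEngineMode.LSTM_ONLY)` gives an argument list that can be passed to `subprocess.run`.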
- Download the 4.0.0-alpha zip for Windows.
- Unzip it to the directory tesseract-4.0.0-alpha.
- Download the language data (chi_sim.traineddata for Simplified Chinese, eng.traineddata for English).
- osd.traineddata is also required.
- mkdir tesseract-4.0.0-alpha/tessdata
- Set the TESSDATA_PREFIX environment variable to the parent directory of your "tessdata" directory.
- Copy *.traineddata to tesseract-4.0.0-alpha/tessdata
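The directory setup above can also be scripted. This is a minimal sketch, assuming the language files have already been downloaded into some local folder. The function name `prepare_tessdata` and its parameters are hypothetical. Setting `os.environ` only affects the current Python process and the programs it launches, so a permanent setting still has to be made in the Windows environment settings.

```python
import os
import shutil
from pathlib import Path


def prepare_tessdata(install_dir, downloaded_dir):
    """Create <install_dir>/tessdata, copy *.traineddata into it, and point
    TESSDATA_PREFIX at the parent directory of tessdata.

    Returns the names of the copied files, sorted alphabetically.
    """
    install_dir = Path(install_dir)
    tessdata = install_dir / "tessdata"
    tessdata.mkdir(parents=True, exist_ok=True)

    copied = []
    for src in sorted(Path(downloaded_dir).glob("*.traineddata")):
        shutil.copy2(src, tessdata / src.name)
        copied.append(src.name)

    # The parent of "tessdata", as described in the step above.
    os.environ["TESSDATA_PREFIX"] = str(install_dir)
    return copied
```

A call such as `prepare_tessdata("tesseract-4.0.0-alpha", "downloads")` would cover the mkdir, environment variable, and copy steps in one go, where `downloads` is a placeholder folder name.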
- English text recognition from test_image2.png
\tesseract-4.0.0-alpha>tesseract.exe test_image2.png 1
Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica
Then we get the recognized English text in 1.txt
- Chinese text recognition from test_image3.png
tesseract-4.0.0-alpha>tesseract.exe test_image3.png out -l chi_sim
Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica
Then we get the recognized Chinese text in out.txt (the output base name given on the command line, plus .txt)
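The two recognition runs above follow the same pattern: image, output base name, and an optional `-l` language. A small wrapper could look like the sketch below. It assumes the executable is reachable under the name passed as `exe` and that the result is written to `<output base>.txt` encoded as UTF-8. `ocr_command` and `recognize` are hypothetical helper names.

```python
import subprocess
from pathlib import Path


def ocr_command(image, output_base, lang=None, exe="tesseract"):
    """Build the argument list for one recognition run."""
    cmd = [exe, str(image), str(output_base)]
    if lang:
        cmd += ["-l", lang]  # e.g. "chi_sim" for Simplified Chinese
    return cmd


def recognize(image, output_base, lang=None, exe="tesseract"):
    """Run the OCR executable and return the recognized text.

    Raises subprocess.CalledProcessError if the executable exits with an error.
    """
    subprocess.run(ocr_command(image, output_base, lang, exe), check=True)
    # The text output is written next to the given base name, with a .txt suffix.
    return Path(f"{output_base}.txt").read_text(encoding="utf-8")
```

With this, `recognize("test_image3.png", "out", lang="chi_sim", exe="tesseract.exe")` would correspond to the Chinese example.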
- Copy configs/hocr to tesseract-4.0.0-alpha/tessdata/configs/hocr
- Run the command below to get the hOCR file:
\tesseract-4.0.0-alpha>tesseract.exe test_image2.png out hocr
Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica
OSD: Weak margin (1.71) for 78 blob text block, but using orientation anyway: 0
- Then we get a file out.hocr, which contains the location of each word.
- Rename out.hocr to out.xml. hOCR differs from the PAGE XML format, so this conversion loses some information, but we can still visualize the text regions directly.
- Download PAGEViewer from http://www.prima.cse.salford.ac.uk/tools/PAGEViewer
- Open out.xml and test_image2.png in PAGEViewer to see the recognized text regions drawn over the image.
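The word locations in out.hocr can also be read without a viewer. The sketch below assumes the usual hOCR markup, in which each word is a `span` with class `ocrx_word` whose `title` attribute contains `bbox x0 y0 x1 y1` in pixels. `HocrWordParser` and `parse_hocr_words` are hypothetical names. The standard-library HTML parser is used here because it tolerates the DOCTYPE and entities in the file.

```python
import re
from html.parser import HTMLParser


class HocrWordParser(HTMLParser):
    """Collect (text, (x0, y0, x1, y1)) for every ocrx_word span."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None   # bounding box of the word currently being read
        self._text = []     # text fragments of that word

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "span" and "ocrx_word" in classes:
            m = re.search(r"bbox (\d+) (\d+) (\d+) (\d+)", attrs.get("title") or "")
            if m:
                self._bbox = tuple(int(v) for v in m.groups())
                self._text = []

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._bbox is not None:
            self.words.append(("".join(self._text).strip(), self._bbox))
            self._bbox = None


def parse_hocr_words(hocr_text):
    """Return a list of (word, bbox) pairs found in an hOCR document."""
    parser = HocrWordParser()
    parser.feed(hocr_text)
    parser.close()
    return parser.words
```

Reading the file with `open("out.hocr", encoding="utf-8").read()` and passing the result to `parse_hocr_words` gives the word boxes, which can then be drawn over test_image2.png with any imaging library.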