
tesseract 4.0.0 alpha

OCR Engine modes:

0. Original Tesseract only.
1. Neural nets LSTM only.
2. Tesseract + LSTM.
3. Default, based on what is available.
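The four modes above correspond, in order, to the values 0 to 3 of the `--oem` command-line option. As a minimal sketch, the Python helper below assembles a tesseract argument list with an explicit engine mode. The names `OEM_MODES` and `build_tesseract_command` are hypothetical and only serve this illustration; they are not part of Tesseract.

```python
# Engine modes keyed by the number passed to --oem.
OEM_MODES = {
    0: "Original Tesseract only",
    1: "Neural nets LSTM only",
    2: "Tesseract + LSTM",
    3: "Default, based on what is available",
}


def build_tesseract_command(image, output_base, oem=3, lang=None, exe="tesseract"):
    """Return the argument list for one tesseract run (hypothetical helper)."""
    if oem not in OEM_MODES:
        raise ValueError(f"unknown OCR engine mode: {oem}")
    cmd = [exe, image, output_base, "--oem", str(oem)]
    if lang:
        # -l selects the language, e.g. "eng" or "chi_sim".
        cmd += ["-l", lang]
    return cmd


if __name__ == "__main__":
    # Build (but do not run) a command that forces the LSTM engine.
    print(build_tesseract_command("test_image2.png", "out", oem=1))
```

The list can be passed to `subprocess.run` as is. On Windows, `exe` would typically be the path to `tesseract.exe`.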

Install on Windows

Download the zip from 4.0.0-alpha for Windows.
Unzip it to the directory tesseract-4.0.0-alpha.
Download the language data from here (chi_sim.traineddata for Simplified Chinese, eng.traineddata for English).
osd.traineddata is required.
mkdir tesseract-4.0.0-alpha/tessdata
Set the TESSDATA_PREFIX environment variable to the parent directory of your "tessdata" directory.
Copy the *.traineddata files to tesseract-4.0.0-alpha/tessdata.
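A quick way to confirm the setup is to check that TESSDATA_PREFIX is set and that the expected files sit in its `tessdata` subdirectory. The following is a sketch under that assumption; the function name `missing_traineddata` is hypothetical, and the list of languages is just the three files mentioned in the steps above.

```python
import os
from pathlib import Path


def missing_traineddata(prefix, langs=("eng", "osd")):
    """List the <lang>.traineddata files absent from <prefix>/tessdata."""
    tessdata = Path(prefix) / "tessdata"
    return [
        f"{lang}.traineddata"
        for lang in langs
        if not (tessdata / f"{lang}.traineddata").is_file()
    ]


if __name__ == "__main__":
    # TESSDATA_PREFIX is expected to be the parent directory of "tessdata".
    prefix = os.environ.get("TESSDATA_PREFIX")
    if prefix is None:
        print("TESSDATA_PREFIX is not set")
    else:
        missing = missing_traineddata(prefix, ("eng", "chi_sim", "osd"))
        print("missing:", missing or "nothing")
```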

Run tesseract

Simple OCR

English text recognition from test_image2.png

\tesseract-4.0.0-alpha>tesseract.exe test_image2.png 1
Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica

The recognized English text is written to 1.txt.

Simple OCR for Chinese

Chinese text recognition from test_image3.png

tesseract-4.0.0-alpha>tesseract.exe test_image3.png out -l chi_sim
Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica

The recognized Chinese text is written to out.txt.
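The same two runs can be scripted. The sketch below calls tesseract through `subprocess` and reads back the text file it writes, which is the output base with `.txt` appended. The function name `ocr_to_text` and its `run` parameter (there only so the call can be stubbed out in tests) are assumptions of this example, and it presumes that `tesseract` is on the PATH or that `exe` points to `tesseract.exe`.

```python
import subprocess
from pathlib import Path


def ocr_to_text(image, output_base, lang=None, exe="tesseract", run=subprocess.run):
    """Run tesseract on `image` and return the contents of <output_base>.txt."""
    cmd = [exe, str(image), str(output_base)]
    if lang:
        cmd += ["-l", lang]
    # check=True makes subprocess.run raise CalledProcessError on failure.
    run(cmd, check=True)
    # Tesseract appends ".txt" to the output base and writes UTF-8 text.
    return Path(f"{output_base}.txt").read_text(encoding="utf-8")
```

For example, `ocr_to_text("test_image3.png", "out", lang="chi_sim")` mirrors the Chinese command above and returns the contents of out.txt.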

Get character location

Copy configs/hocr to tesseract-4.0.0-alpha/tessdata/configs/hocr.
Run the command below to get the hOCR file:

\tesseract-4.0.0-alpha>tesseract.exe test_image2.png out hocr
Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica
OSD: Weak margin (1.71) for 78 blob text block, but using orientation anyway: 0

This produces the file out.hocr, which contains the location of each word.
Rename out.hocr to out.xml. (hOCR differs from the PAGE XML format, so this conversion loses some information, but the text regions can still be visualized directly.)
Download PAGEViewer from http://www.prima.cse.salford.ac.uk/tools/PAGEViewer
Open out.xml and test_image2.png in PAGEViewer to get the visualization shown in hocr.png.
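The word locations can also be read programmatically. In hOCR, each word is an element with class `ocrx_word` whose `title` attribute carries `bbox x0 y0 x1 y1` in pixels. The sketch below extracts these with Python's standard `html.parser`; the names `WordBoxParser` and `word_boxes` are hypothetical, and the parser assumes that word elements are `span` tags that do not contain further `span` elements.

```python
import re
from html.parser import HTMLParser

# Matches the bounding box inside a title such as "bbox 36 92 96 116; x_wconf 90".
BBOX = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class WordBoxParser(HTMLParser):
    """Collect (text, (x0, y0, x1, y1)) for every ocrx_word span."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None  # bbox of the word currently being read, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "span" and attrs.get("class") == "ocrx_word":
            match = BBOX.search(attrs.get("title") or "")
            if match:
                self._bbox = tuple(int(v) for v in match.groups())
                self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside a word span.
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._bbox is not None:
            self.words.append(("".join(self._text).strip(), self._bbox))
            self._bbox = None


def word_boxes(hocr_text):
    """Return a list of (word, bbox) pairs found in an hOCR document."""
    parser = WordBoxParser()
    parser.feed(hocr_text)
    parser.close()
    return parser.words
```

A typical call would be `word_boxes(open("out.hocr", encoding="utf-8").read())`, which returns one `(word, (x0, y0, x1, y1))` pair per recognized word.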
