<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<title>VietOCR - Java GUI Frontend for Tesseract OCR</title>
</head>
<body>
<div class="Section1">
<h2 align="center">VietOCR</h2>
<h3>DESCRIPTION</h3>
<p><a href="http://vietocr.sourceforge.net">VietOCR</a> is a Java GUI frontend for the
<a href="https://github.com/tesseract-ocr">Tesseract OCR engine</a>, providing
character recognition support for common image formats and multi-page images. The
program includes postprocessing that helps correct errors regularly encountered in the
OCR process, improving the accuracy of the result. The program can also run
as a console application, executed from the command line.</p>
<p>Batch processing of multiple files is now supported. The program watches
the source folder for new files and automatically processes them through the OCR engine.
After recognition, the results are placed in the destination folder.</p>
<h3>SYSTEM REQUIREMENTS</h3>
<p><a href="https://www.oracle.com/java/technologies/downloads/">Java Runtime
Environment 8</a> or later. On Windows, the <a href="https://docs.microsoft.com/en-US/cpp/windows/latest-supported-vc-redist">Microsoft Visual C++ 2022 Redistributable Package</a> is also required.</p>
<h3>INSTALLATION</h3>
<p>The Tesseract Windows executable is bundled with the program. Additional <a href="https://github.com/tesseract-ocr/tessdata">
language data packs</a> for Tesseract, whose names start with ISO 639-3 codes,
should be placed into the <code>tessdata</code> subdirectory.</p>
<p>On Linux, Tesseract and its language data packages are also available in the Graphics (universe) repository. You can install them via Synaptic or with the following command:</p>
<blockquote>
<p><code>sudo apt-get install tesseract-ocr tesseract-ocr-eng tesseract-ocr-vie</code></p>
</blockquote>
<p>The files will be placed in <code>/usr/bin</code> and <code>/usr/share/tesseract-ocr/tessdata</code>,
respectively. On the other hand, if Tesseract is built and installed from the <a href="https://github.com/tesseract-ocr/tesseract/wiki">source</a>,
they will be placed in <code>/usr/local/bin</code> and <code>/usr/local/share/tessdata</code>.
You can also let VietOCR know the location
of <code>tessdata</code> via the environment variable <code>TESSDATA_PREFIX</code>:</p>
<blockquote>
<p><code>export TESSDATA_PREFIX=/usr/local/share/</code></p>
</blockquote>
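<p>To check that Tesseract can actually see the installed language packs, one option is to ask it to list them. This is a minimal sketch that assumes the <code>tesseract</code> executable is on the <code>PATH</code> and that <code>TESSDATA_PREFIX</code>, if set, points where your Tesseract version expects; the codes printed depend on what is installed:</p>
<blockquote>
<p><code>tesseract --list-langs</code></p>
</blockquote>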
<p>For other platforms, please consult the <a href="https://github.com/tesseract-ocr/tesseract/wiki">
Tesseract Wiki</a> page.</p>
<p>VietOCR also provides support for downloading and installing selected language packs
via the <em>Download Language Data</em> menu item. Depending on the location of the
<code>tessdata</code> folder, you may be required to run the program as root or
admin to be able to install the downloaded data into the folder if it is inside
a system folder, such as in <code>/usr</code> on Linux or <code>C:\Program Files</code>
on Windows.</p>
<p>Scanning support on Windows is provided via the Windows Image
Acquisition Library v2.0.</p>
<p>To use the scanning functionality on Linux, the following SANE packages are required:</p>
<blockquote>
<p><code>sudo apt-get install libsane sane sane-utils libsane-extras xsane</code></p>
</blockquote>
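<p>Once the packages are installed, it can help to confirm that SANE detects the scanner before trying to scan from VietOCR. A quick check, assuming the <code>scanimage</code> tool from <code>sane-utils</code> is available and the scanner is connected and powered on:</p>
<blockquote>
<p><code>scanimage -L</code></p>
</blockquote>
<p>If no device is listed, the problem most likely lies with the SANE backend or device permissions rather than with VietOCR.</p>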
<p>PDF support is possible via PDFBox.</p>
<p>Spellcheck functionality is available through Hunspell, whose <a href="http://wiki.services.openoffice.org/wiki/Dictionaries">
dictionary</a> files (<code>.aff</code>, <code>.dic</code>) should be placed
in the <code>dict</code> folder of VietOCR. <code>user.dic</code> is a UTF-8-encoded
file that contains a list of custom words, one word per line.</p>
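<p>As an illustration of the one-word-per-line layout, custom words could be appended to <code>user.dic</code> from a shell. This sketch assumes the command is run from the VietOCR installation folder in a UTF-8 locale; the two words are placeholders only:</p>
<blockquote>
<p><code>printf '%s\n' 'VietOCR' 'Tesseract' &gt;&gt; dict/user.dic</code></p>
</blockquote>
<p>The command appends (<code>&gt;&gt;</code>) rather than overwrites, so existing entries are kept.</p>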
<p>On Linux, Hunspell and its dictionaries can be installed via Synaptic or <code>apt</code> as follows:</p>
<blockquote><code>sudo apt-get install hunspell hunspell-en-us</code></blockquote>
<h3>INSTRUCTIONS</h3>
<p>Run the following command to start the program:</p>
<blockquote>
<p><code>java -jar VietOCR.jar</code></p>
</blockquote>
<p><b><u>Note</u></b>: If you experience memory problems, use the <code>ocr</code> script instead of launching the JAR file directly.</p>
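<p>If the memory problems stem from the JVM's default maximum heap size, another option is to raise the limit when launching the JAR directly, using the standard <code>-Xmx</code> JVM option. The contents of the <code>ocr</code> script are not shown here, and the <code>1g</code> value below is an illustrative assumption, not a project recommendation:</p>
<blockquote>
<p><code>java -Xmx1g -jar VietOCR.jar</code></p>
</blockquote>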
<p>The Vietnamese language data were generated for the Times New Roman, Arial, Verdana,
and Courier New fonts. Recognition is therefore more successful
for images whose glyphs resemble those fonts. OCRing images whose glyphs look
different from the supported fonts will generally require <a href="https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract">
training</a> Tesseract to create another language data pack specifically for
those typefaces. Language data for some VNI and TCVN3 (ABC) fonts have also been
bundled in the latest versions.</p>
<p>Images to be OCRed should be scanned at a resolution between 200 and 400 DPI (dots per
inch) in monochrome (black-and-white) or grayscale. Scanning at higher
resolutions will not necessarily improve recognition accuracy, which currently
can exceed 97% for Vietnamese, and the next release of Tesseract may improve
it further. Even so, the actual rates still depend greatly on the quality of
the scanned image. Typical scan settings are 300 DPI at 1 bpp (bit
per pixel) black-and-white or 8 bpp grayscale, saved as uncompressed TIFF or PNG.</p>
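<p>On Linux, a scan with these typical settings can also be produced outside VietOCR with SANE's <code>scanimage</code> tool. This is only a sketch: the <code>--resolution</code> and <code>--mode</code> options and the accepted mode names (<code>Gray</code> here) are backend-specific, so they may differ for your scanner, and <code>page.tiff</code> is a placeholder file name:</p>
<blockquote>
<p><code>scanimage --resolution 300 --mode Gray --format=tiff &gt; page.tiff</code></p>
</blockquote>
<p>Running <code>scanimage --help</code> with the scanner attached lists the options that its backend actually supports.</p>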
<p>Screenshot mode gives better recognition for low-resolution images such as screenshots. In Screenshot mode, images are rescaled to 300 DPI.</p>
<p>In addition to the built-in text postprocessing algorithm, you can add your own
custom text replacement scheme via a UTF-8-encoded tab-delimited text file named <code>x.DangAmbigs.txt</code>,
where <code>x</code> is the ISO 639-3 language code. Both plain-text and regular-expression replacements are supported.</p>
<p>You can put init-only and non-init control parameters in <code>tessdata/configs/tess_configs</code>
and <code>tess_configvars</code> files, respectively, to modify Tesseract's
behaviour.</p>
<p>Some built-in tools are provided to merge several images or PDF files into a single
one for convenient OCR operations, or to split a TIFF or PDF file into smaller ones
if it contains too many pages, which can cause out-of-memory exceptions.</p>
<h3>POSTPROCESSING</h3>
<p>Recognition errors can generally be classified into three categories. Many of
the errors involve letter case (for example: hOa, nhắC) and can
be easily corrected in popular Unicode text editors. Many other errors result
from the OCR process itself, such as missing diacritical marks or wrong letters of similar
shape: huu – hưu, mang – marg, h0a – hoa, la – 1a, uhìu – nhìn. These can
also be easily fixed by spell checker programs. The built-in Postprocessing function
can help correct many of the aforementioned errors.</p>
<p>The last category of errors is the most difficult to detect because they are semantic
errors: the words are valid dictionary entries but are wrong
in context (e.g., tinh – tình, vân – vấn). These errors require the editor to
read through the text and manually correct them against the original image.</p>
<p>The following instructions describe how to correct OCR errors in the first two categories
using the built-in functionality:</p>
<ol style="margin-top: 0in" start="1" type="1">
<li>Group lines. After OCR, each line becomes a separate one-line paragraph, so the lines
need to be regrouped into the paragraphs they belong to. Use the <i>Remove Line Breaks</i>
function under the <i>Format</i> menu. Note that this operation may not be needed for
poems.</li>
<li>Select <i>Change Case</i>, also under the <i>Format</i> menu, and choose <i>Sentence
case</i> to correct most of the letter case errors. Locate and fix any remaining
letter case errors.</li>
<li>Correct the misspelled words using the integrated <i>Spell Check</i>.</li>
</ol>
<p>Through the above process, most common errors can be eliminated. The remaining
semantic errors are few, but they require a human editor to read through the text and make
the necessary edits so that the document matches the original scanned document and is
error-free, if desired.</p>
<p>If you have questions, you can post a message on the <a href="http://sourceforge.net/projects/vietocr/forums">
VietOCR Forum</a>.</p>
<hr>
</div>
</body>
</html>