Comparison of OCR formats
Philipp Zumstein edited this page Sep 28, 2019 · 1 revision
hOCR
Version | Released | Specs | Schema | Samples |
---|---|---|---|---|
1.0 | December 2007 | - | - | |
1.1 | March 2010 | - | | |
Smallest unit: word
<span
class="ocrx_word"
id="word_1_33"
title="bbox 1584 1199 1997 1284; x_wconf 87"
lang="deu-frak"
dir="ltr"
>Verhältnisse.</span>
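
The `title` attribute in the sample above packs several properties into one string, separated by semicolons. A minimal sketch of how it could be split, assuming that no property value contains a semicolon or a quoted string with spaces (the helper name `parse_hocr_title` is hypothetical, not part of any hOCR library):

```python
def parse_hocr_title(title: str) -> dict:
    """Split an hOCR title attribute into a {property: [values]} mapping.

    Assumption: values are whitespace-separated tokens without embedded
    semicolons or quoted spaces (true for bbox and x_wconf).
    """
    props = {}
    for chunk in title.split(";"):
        parts = chunk.split()
        if parts:  # skip empty chunks, e.g. after a trailing semicolon
            props[parts[0]] = parts[1:]
    return props


title = "bbox 1584 1199 1997 1284; x_wconf 87"
props = parse_hocr_title(title)

# bbox holds four integers: x0 y0 x1 y1 (left, top, right, bottom)
x0, y0, x1, y1 = (int(v) for v in props["bbox"])
# x_wconf holds the word confidence
confidence = int(props["x_wconf"][0])

print((x0, y0, x1, y1), confidence)
```

Properties whose values may contain spaces, such as a quoted image file name, would need a more careful tokenizer than `str.split`.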
ALTO
Version | Released | Specs | Schema | Samples |
---|---|---|---|---|
1.0 | December 02, 2004 | - | XSD | - |
2.0 | January 11, 2010 | - | XSD | |
2.1 | February 20, 2014 | - | XSD | |
3.0 | August, 2014 | - | XSD | - |
3.1 | January, 2016 | - | XSD | - |
ABBYY
Version | Released | Specs | Schema | Samples | |
---|---|---|---|---|---|
6v1 | 2002? | - | XSD | - | - |
8v2 | 2006? | - | XSD | - | - |
9v1 | 2007? | - | XSD | - | - |
10v1 | 2011? | | XSD | | |
Structure elements
Level | hOCR | ALTO | ABBYY |
---|---|---|---|
Page | <div class="ocr_page"> | <Page> | <page> |
Text Area / Column | <div class="ocr_carea"> <div class="ocrx_block"> | <PrintSpace> | |
Paragraph | <div class="ocr_par"> | <TextBlock STYLEREFS="..."> | |
Text Line | <div class="ocr_line"> | <TextLine> | <line> <formatting>...</formatting> </line> |
Word | <span class="ocrx_word"> | <String> | <line> <formatting>...</formatting> </line> |
Bounding boxes
hOCR | ALTO | ABBYY |
---|---|---|
<div title="bbox 100 200 150 250"/> | <String HEIGHT="250" WIDTH="150" VPOS="100" HPOS="200"/> | <line l="200" t="100" r="1200" b="130"> |
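
The three formats store the same rectangle differently: hOCR as `bbox x0 y0 x1 y1` (two corners), ALTO as origin plus size (`HPOS`, `VPOS`, `WIDTH`, `HEIGHT`), and ABBYY as the four edges `l`, `t`, `r`, `b`. A rough sketch of converting between them, assuming all values are in pixels (ALTO files may declare a different measurement unit, which this ignores). The `Box` type and the function names are hypothetical:

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Axis-aligned rectangle given by its edges."""
    left: int
    top: int
    right: int
    bottom: int


def from_hocr(bbox: str) -> Box:
    # hOCR: "bbox x0 y0 x1 y1" = left, top, right, bottom
    _, x0, y0, x1, y1 = bbox.split()
    return Box(int(x0), int(y0), int(x1), int(y1))


def to_alto(box: Box) -> dict:
    # ALTO: top-left corner (HPOS, VPOS) plus WIDTH and HEIGHT
    return {
        "HPOS": box.left,
        "VPOS": box.top,
        "WIDTH": box.right - box.left,
        "HEIGHT": box.bottom - box.top,
    }


def from_alto(hpos: int, vpos: int, width: int, height: int) -> Box:
    return Box(hpos, vpos, hpos + width, vpos + height)


def to_abbyy(box: Box) -> dict:
    # ABBYY: the four edges as l, t, r, b attributes
    return {"l": box.left, "t": box.top, "r": box.right, "b": box.bottom}


box = from_hocr("bbox 100 200 150 250")
print(to_alto(box))   # {'HPOS': 100, 'VPOS': 200, 'WIDTH': 50, 'HEIGHT': 50}
print(to_abbyy(box))  # {'l': 100, 't': 200, 'r': 150, 'b': 250}
```

Note that the example values in the table are illustrative per format and do not describe one shared rectangle.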
Hyphenation
hOCR | ALTO | ABBYY |
---|---|---|
Soft hyphens must be represented using the HTML entity &shy;.[1] Regular hyphenation characters are just dashes. | <HYP/> | |
Confidence
Level | hOCR | ALTO | ABBYY |
---|---|---|---|
Page | - | <Page PC="0.743"> | - |
Word | <span class="ocrx_word" title="x_wconf 71">foo</span> | <String WC="0.422"> | - |
Character | <span class="ocrx_word" title="x_wconf 71">foo</span> "if possible, convert word confidences to values between 0 and 100 and have them approximate posterior probabilities (expressed in %)"[1] Not implemented in common engines? | <String CC="0 0 4 0" CONTENT="luft"/> "Confidence level of each character in that string. A list of numbers, one number between 0 (sure) and 9 (unsure) for each character."[1] | |
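
The confidence values use different scales: hOCR `x_wconf` runs from 0 to 100, ALTO `WC` and `PC` from 0 to 1, and ALTO `CC` from 0 (sure) to 9 (unsure) per character. A small sketch of bringing them onto one 0..1 scale where higher means more confident. The function names are hypothetical, and the linear mapping for `CC` is an assumption; the quoted definition only fixes the endpoints:

```python
def hocr_wconf(value: str) -> float:
    """hOCR x_wconf: 0..100, higher is better."""
    return float(value) / 100.0


def alto_wc(value: str) -> float:
    """ALTO WC (and PC): already 0..1, higher is better."""
    return float(value)


def alto_cc(value: str) -> list:
    """ALTO CC: one digit per character, 0 = sure, 9 = unsure.

    Assumption: digits are mapped linearly, so 0 -> 1.0 and 9 -> 0.0.
    """
    return [1.0 - int(digit) / 9.0 for digit in value.split()]


print(hocr_wconf("71"))
print(alto_wc("0.422"))

# Pair each character of CONTENT="luft" with its normalized confidence
for char, conf in zip("luft", alto_cc("0 0 4 0")):
    print(char, round(conf, 3))
```

Values normalized this way are only comparable within one engine; different engines calibrate their confidences differently.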