Merge google code branch https://code.google.com/r/email-hocr-tsv #18

jimregan · 2015-05-13T21:31:15Z

Requested in https://code.google.com/p/tesseract-ocr/issues/detail?id=1378

…ext(int).

tfmorris · 2016-02-01T18:59:02Z

The original issue tracker is gone, but there's an archived version here:
https://web.archive.org/web/20150413012229/https://code.google.com/p/tesseract-ocr/issues/detail?id=1378

Basically the request is to output the information contained in a hOCR file in tabular TSV format.

Shreeshrii · 2016-03-01T08:43:27Z

Can this be merged to provide support for tables?

Thanks!

tfmorris · 2016-03-01T16:58:06Z

What is the use case for this? I can't find any earlier discussion. As far as I can tell, all the information is already included in the hOCR output (more, actually, since it has LTR/RTL, italic/bold, etc.) -- and, of course, even more info is available programmatically through the API.

Here's some example output: http://teksty.klf.uw.edu.pl/12/1/alice_1.png.hocr.tsv
archive: https://web.archive.org/web/20160201190446/http://teksty.klf.uw.edu.pl/12/1/alice_1.png.hocr.tsv

tfmorris · 2016-03-01T18:50:55Z

I've created a cleaned-up version of this code in #245. I'm not really happy about adding even more crap to baseapi.cpp, but I've got a separate branch to refactor the hOCR renderer out of it, so I can add the TSV renderer to that if it's decided to include it in Tess.

stweil · 2016-03-01T18:59:19Z

Wouldn't it be easier to keep the TSV code out of the Tesseract code and to provide a standalone script which does a transformation from hOCR to TSV? Such a script could also be used with hOCR generated by other tools.
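For illustration, a minimal sketch of what such a standalone hOCR-to-TSV script could look like, using only the Python standard library. This is not the code from the branch or from #245. The names `HocrWordParser` and `hocr_to_tsv` and the column layout are hypothetical, and the sketch assumes word spans carry the `ocrx_word` class with `bbox` and integer `x_wconf` values in their `title` attribute.

```python
import re
from html.parser import HTMLParser

# Assumed hOCR title layout: "bbox x0 y0 x1 y1; x_wconf NN"
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")
CONF_RE = re.compile(r"x_wconf (\d+)")


class HocrWordParser(HTMLParser):
    """Collects (left, top, width, height, conf, text) per ocrx_word span."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._current = None  # fields of the word span being read, if any
        self._depth = 0       # nesting depth of tags inside that span

    def handle_starttag(self, tag, attrs):
        if self._current is not None:
            self._depth += 1  # e.g. <em> or <strong> inside a word
            return
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag != "span" or "ocrx_word" not in classes:
            return
        title = attrs.get("title") or ""
        bbox = BBOX_RE.search(title)
        if bbox is None:
            return
        x0, y0, x1, y1 = map(int, bbox.groups())
        conf = CONF_RE.search(title)
        conf_value = int(conf.group(1)) if conf else -1
        self._current = [x0, y0, x1 - x0, y1 - y0, conf_value, []]
        self._depth = 0

    def handle_data(self, data):
        if self._current is not None:
            self._current[5].append(data)

    def handle_endtag(self, tag):
        if self._current is None:
            return
        if self._depth > 0:
            self._depth -= 1
            return
        *numbers, parts = self._current
        self.rows.append((*numbers, "".join(parts).strip()))
        self._current = None


def hocr_to_tsv(hocr):
    """Return one tab-separated row per recognized word, with a header row."""
    parser = HocrWordParser()
    parser.feed(hocr)
    parser.close()
    lines = ["left\ttop\twidth\theight\tconf\ttext"]
    for row in parser.rows:
        lines.append("\t".join(str(value) for value in row))
    return "\n".join(lines) + "\n"
```

Because the parser decodes character references, the text column comes out unescaped. A word containing a literal tab would break the row layout, which this sketch does not handle.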

Shreeshrii · 2016-03-02T06:55:18Z

Link for one of the earlier requests

https://groups.google.com/forum/m/#!topic/tesseract-issues/-QOvWLrsjfI


tfmorris · 2016-03-02T15:51:44Z

The earlier issue mentioned is at: https://web.archive.org/web/20151128094905/http://code.google.com/p/tesseract-ocr/issues/detail?id=918

Basically it posits TSV output as a (partial?) solution to table layout analysis. I think it's a bit more involved than that, but I have no strong feelings one way or the other on adding this.

Pros:

- provides a simpler format for consumers than parsing HTML
- not really that big: 1 API call, 1 config variable, <200 lines of code
- having it directly supported eliminates the need for external helper scripts

Cons:

- largely duplicates functionality available in hOCR output
- one more place to update if new information gets added to the output
- downstream consumers are going to be custom programs, so they could integrate HTML parsing instead of TSV parsing (with a small increase in complexity)

Like I said, I'm neutral. I'll let others argue yea or nay.
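To make the "simpler format for consumers" point concrete, here is a rough sketch of a downstream program reading word rows from TSV with Python's standard `csv` module. The column names (`left`, `top`, `conf`, `text`) and the helper `read_words` are assumptions made for illustration; the thread itself does not specify the column layout.

```python
import csv
import io


def read_words(tsv_text, min_conf=0):
    """Return (left, top, text) for each non-empty word at or above min_conf."""
    reader = csv.DictReader(
        io.StringIO(tsv_text),
        delimiter="\t",
        quoting=csv.QUOTE_NONE,  # OCR text may contain stray quote characters
    )
    words = []
    for row in reader:
        text = (row.get("text") or "").strip()
        if not text:
            continue
        if float(row["conf"]) < min_conf:
            continue
        words.append((int(row["left"]), int(row["top"]), text))
    return words
```

Quoting is disabled so that a recognized word such as `"Co` is read literally rather than treated as the start of a quoted field.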

Shreeshrii · 2016-03-02T16:02:49Z

Thanks Tom, for listing out the pros and cons for tsv.

As a user, I support having a simpler format of output without external
scripts :-)

Regarding the duplication of functionality, is it not possible to use a
common routine and then branch off based on the required output format?

ShreeDevi
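One way to picture the "common routine, then branch" idea is sketched below. It is written in Python for brevity rather than in Tesseract's C++, and every name in it (`Word`, `format_hocr`, `format_tsv`, `render`) is hypothetical. It only illustrates sharing one loop over the recognized words while the per-word formatting is chosen by output format; it is not how baseapi.cpp is organized.

```python
from dataclasses import dataclass
from html import escape


@dataclass
class Word:
    left: int
    top: int
    right: int
    bottom: int
    conf: int
    text: str


def format_hocr(word):
    """Format one word as an hOCR-style span; text is HTML-escaped."""
    title = (
        f"bbox {word.left} {word.top} {word.right} {word.bottom}; "
        f"x_wconf {word.conf}"
    )
    return f"<span class='ocrx_word' title='{title}'>{escape(word.text)}</span>"


def format_tsv(word):
    """Format one word as a tab-separated row; text is left unescaped."""
    fields = (
        word.left,
        word.top,
        word.right - word.left,
        word.bottom - word.top,
        word.conf,
        word.text,
    )
    return "\t".join(str(field) for field in fields)


FORMATTERS = {"hocr": format_hocr, "tsv": format_tsv}


def render(words, output_format):
    """Shared routine: iterate once, delegate formatting to the chosen branch."""
    try:
        formatter = FORMATTERS[output_format]
    except KeyError:
        raise ValueError(f"unknown output format: {output_format!r}") from None
    return "\n".join(formatter(word) for word in words)
```

With this shape, new per-word information is added to `Word` once, although each formatter still has to decide how to emit it.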



Add TSV result renderer. Fixes tesseract-ocr#18

sundarcf and others added 10 commits August 19, 2014 11:33

Adds char* GetHOCRTSVText(int) as placeholder. Copy of char* GetHOCRT…

43b9320

…ext(int).

Adds TessHOcrTsvRenderer class for rendering HOCR info in tsv format.

fa17737

Calls TessHOcrTsvRenderer if tessedit_create_hocrtsv is true.

ac5798b

Adds hocrtsv file to configs folder.

84c3a5d

Adds hocrtsv to tessdata/configs/Makefile.am

13e11b5

Adds BoolParam tessedit_create_hocrtsv in class Tesseract.

f40c06f

Render output in TSV format.

099651b

Merge remote-tracking branch 'upstream/master'

d60635e

Avoids HTML escaping.

b7c1d81

merge

5ac1cf5

jimregan added the feature request label May 18, 2015

tfmorris mentioned this pull request Mar 1, 2016

Add TSV result renderer #245

Merged

zdenop closed this in d55f5fb Mar 3, 2016

jimregan deleted the email-hocr-tsv branch October 17, 2016 17:50

Shreeshrii mentioned this pull request May 25, 2018

Segmentation fault OCRing a washed out image #1601

Closed

zvezdochiot pushed a commit to ImageProcessing-ElectronicPublications/tesseract that referenced this pull request Mar 28, 2021

Merge pull request tesseract-ocr#245 from tfmorris/result_renderer_tsv

52ff1ee

Add TSV result renderer. Fixes tesseract-ocr#18

zvezdochiot pushed a commit to ImageProcessing-ElectronicPublications/tesseract that referenced this pull request Mar 28, 2021

Merge pull request tesseract-ocr#245 from tfmorris/result_renderer_tsv

110f18c

Add TSV result renderer. Fixes tesseract-ocr#18

zvezdochiot pushed a commit to ImageProcessing-ElectronicPublications/tesseract that referenced this pull request Mar 28, 2021

Merge pull request tesseract-ocr#245 from tfmorris/result_renderer_tsv

e095b75

Add TSV result renderer. Fixes tesseract-ocr#18

zvezdochiot pushed a commit to ImageProcessing-ElectronicPublications/tesseract that referenced this pull request Mar 28, 2021

Merge pull request tesseract-ocr#245 from tfmorris/result_renderer_tsv

b7eef67

Add TSV result renderer. Fixes tesseract-ocr#18

ChristianOsta mentioned this pull request May 16, 2024

Floating-point exception (SIGFPE) due to out-of-range input to asinf in Wordrec::angle_change #4242

Closed

yeezy69 mentioned this pull request Jun 8, 2024

Floating point exception with tessdata models since version 5.4.0 #4257

Closed
