Merge google code branch https://code.google.com/r/email-hocr-tsv #18

Closed
wants to merge 10 commits into from

jimregan
Contributor

@tfmorris
Contributor

tfmorris commented Feb 1, 2016

The original issue tracker is gone, but there's an archived version here:
https://web.archive.org/web/20150413012229/https://code.google.com/p/tesseract-ocr/issues/detail?id=1378

Basically the request is to output the information contained in a hOCR file in tabular TSV format.

@Shreeshrii
Collaborator

Can this be merged to provide support for tables?

Thanks!

@tfmorris
Contributor

tfmorris commented Mar 1, 2016

What is the use case for this? I can't find any earlier discussion. As far as I can tell, all the information is already included in the hOCR output (more, actually, since it has LTR/RTL, italic/bold, etc.), and, of course, even more info is available programmatically through the API.

Here's some example output: http://teksty.klf.uw.edu.pl/12/1/alice_1.png.hocr.tsv
archive: https://web.archive.org/web/20160201190446/http://teksty.klf.uw.edu.pl/12/1/alice_1.png.hocr.tsv

@tfmorris tfmorris mentioned this pull request Mar 1, 2016
@tfmorris
Contributor

tfmorris commented Mar 1, 2016

I've created a cleaned-up version of this code in #245. I'm not really happy about adding even more crap to baseapi.cpp, but I've got a separate branch that refactors the hOCR renderer out of it, so I can add the TSV renderer to that if it's decided to include it in Tess.

@stweil
Member

stweil commented Mar 1, 2016

Wouldn't it be easier to keep the TSV code out of the Tesseract code and to provide a standalone script which does a transformation from hOCR to TSV? Such a script could also be used with hOCR generated by other tools.
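
A minimal sketch of what such a standalone script might look like, using only the Python standard library. It is not code from this pull request: the names `HocrWordParser` and `hocr_to_tsv` are hypothetical, and the column set (`left`, `top`, `width`, `height`, `conf`, `text`) is an assumption made for illustration rather than the layout produced by the branch under review. It relies on the common hOCR convention that each word is a `span` of class `ocrx_word` whose `title` attribute carries `bbox x0 y0 x1 y1` and, optionally, `x_wconf`.

```python
import re
from html.parser import HTMLParser

# Assumed column layout, chosen for illustration only.
COLUMNS = ["left", "top", "width", "height", "conf", "text"]


class HocrWordParser(HTMLParser):
    """Collect bounding box, confidence and text for every ocrx_word span."""

    def __init__(self):
        super().__init__()
        self.words = []       # finished word records
        self._current = None  # word currently being read, or None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "span" and "ocrx_word" in classes:
            title = attrs.get("title") or ""
            bbox = re.search(r"bbox (\d+) (\d+) (\d+) (\d+)", title)
            conf = re.search(r"x_wconf (\d+)", title)
            if bbox:
                x0, y0, x1, y1 = map(int, bbox.groups())
                self._current = {
                    "left": x0, "top": y0,
                    "width": x1 - x0, "height": y1 - y0,
                    # -1 marks a word without a confidence value.
                    "conf": int(conf.group(1)) if conf else -1,
                    "text": "",
                }

    def handle_data(self, data):
        # Text inside nested tags such as <strong> or <em> is kept too.
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        if tag == "span" and self._current is not None:
            # Collapse whitespace so the text cannot contain a tab or newline.
            self._current["text"] = " ".join(self._current["text"].split())
            self.words.append(self._current)
            self._current = None


def hocr_to_tsv(hocr_text):
    """Return a TSV string with a header row and one row per hOCR word."""
    parser = HocrWordParser()
    parser.feed(hocr_text)
    parser.close()
    lines = ["\t".join(COLUMNS)]
    for word in parser.words:
        lines.append("\t".join(str(word[c]) for c in COLUMNS))
    return "\n".join(lines) + "\n"
```

Calling `hocr_to_tsv(open("page.hocr", encoding="utf-8").read())` would convert one file (the file name is a placeholder). Because the sketch only looks at `ocrx_word` spans, it should work on hOCR from other tools that follow the same convention, but it discards the page, block, paragraph and line hierarchy that a table-oriented consumer would probably want.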

@Shreeshrii
Collaborator

Link to one of the earlier requests:

https://groups.google.com/forum/m/#!topic/tesseract-issues/-QOvWLrsjfI

@tfmorris
Contributor

tfmorris commented Mar 2, 2016

The earlier issue mentioned is at: https://web.archive.org/web/20151128094905/http://code.google.com/p/tesseract-ocr/issues/detail?id=918

Basically it posits TSV output as a (partial?) solution to table layout analysis. I think it's a bit more involved than that, but I have no strong feelings one way or the other on adding this.

Pros:

  • provides a simpler format for consumers than parsing HTML
  • not really that big: 1 API call, 1 config variable, <200 lines of code
  • having it directly supported eliminates the need for external helper scripts

Cons:

  • largely duplicates functionality available in hOCR output
  • one more place to update if new information gets added to the output
  • downstream consumers are going to be custom programs, so they could integrate HTML parsing instead of TSV parsing (with a small increase in complexity)

Like I said, I'm neutral. I'll let others argue yea or nay.

@Shreeshrii
Collaborator

Thanks Tom, for listing out the pros and cons for tsv.

As a user, I support having a simpler format of output without external
scripts :-)

Regarding the duplication of functionality, is it not possible to use a
common routine and then branch off based on the required output format?

ShreeDevi
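
One way to picture the "common routine, then branch" idea is sketched below. This is an illustration under stated assumptions, not Tesseract's actual renderer code (which is C++): the `Word` record, the two renderer functions, and the `render` dispatcher are all hypothetical names, and the TSV columns are made up for the example. The point is only that both formats can be produced from one shared list of word records, so new information is added to the record once and each renderer decides whether to emit it.

```python
from dataclasses import dataclass
from html import escape


@dataclass
class Word:
    """Hypothetical format-neutral record produced by the shared routine."""
    text: str
    left: int
    top: int
    right: int
    bottom: int
    conf: int


def render_hocr(words):
    """Emit one ocrx_word span per word (fragment only, no page wrapper)."""
    spans = [
        f"<span class='ocrx_word' title='bbox {w.left} {w.top} {w.right} {w.bottom}; "
        f"x_wconf {w.conf}'>{escape(w.text)}</span>"
        for w in words
    ]
    return "\n".join(spans) + "\n"


def render_tsv(words):
    """Emit a header row plus one tab-separated row per word."""
    rows = ["left\ttop\twidth\theight\tconf\ttext"]
    for w in words:
        rows.append(
            f"{w.left}\t{w.top}\t{w.right - w.left}\t{w.bottom - w.top}\t{w.conf}\t{w.text}"
        )
    return "\n".join(rows) + "\n"


# The branch point: the output format selects a renderer over the same data.
RENDERERS = {"hocr": render_hocr, "tsv": render_tsv}


def render(words, fmt):
    try:
        renderer = RENDERERS[fmt]
    except KeyError:
        raise ValueError(f"unknown output format: {fmt!r}") from None
    return renderer(words)
```

For example, `render([Word("Alice", 36, 92, 96, 116, 90)], "tsv")` and the same call with `"hocr"` describe the same word in the two formats. This does not remove the "one more place to update" cost entirely, since each renderer still has to be taught about new fields, but it keeps the data gathering in a single place.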

@zdenop zdenop closed this in d55f5fb Mar 3, 2016
@jimregan jimregan deleted the email-hocr-tsv branch October 17, 2016 17:50
zvezdochiot pushed a commit to ImageProcessing-ElectronicPublications/tesseract that referenced this pull request Mar 28, 2021