"ocr-transform page alto ... ...": loosing text #123

Closed
jbarth-ubhd opened this issue Feb 28, 2020 · 13 comments

@jbarth-ubhd

Example page generated with OCR-D ocrd-calamari-recognize
OCR_0007.zip

Going via hOCR instead ("ocr-transform page hocr ... ... && ocr-transform hocr alto2.0 ... ...") keeps the text but loses the page size.

@jbarth-ubhd jbarth-ubhd changed the title ocr-transform page alto ... ... loosing text ocr-transform page alto ... ... loosing text Feb 28, 2020
@jbarth-ubhd jbarth-ubhd changed the title ocr-transform page alto ... ... loosing text "ocr-transform page alto ... ...": loosing text Feb 28, 2020
@jbarth-ubhd
Author

strace -f shows no open() syscall on any file under /usr/local/share/ocr-fileformat/xslt/*, so no XSLT stylesheet is used for this conversion.

Instead, the Java converter is called: execve("/usr/bin/java", ["java", "-jar", "/usr/local/share/ocr-fileformat/vendor/JPageConverter/PageConverter.jar", "-neg-coords", "toZero", "-source-xml", "OCR_0007.xml", "-target-xml", "xxx", "-convert-to", "ALTO"], 0x5614283d4a10 /* 24 vars */) = 0

@jbarth-ubhd
Author

jbarth-ubhd commented Feb 28, 2020

I've checked the docs of the most recent JPageConverter. The values listed there for -convert-to are:

  • LATEST
  • 2013-07-15
  • 2010-03-19
  • but not: ALTO ???

@jbarth-ubhd
Author

Perhaps duplicate of PRImA-Research-Lab/prima-page-converter#13

@kba
Collaborator

kba commented Feb 28, 2020

> Perhaps duplicate of PRImA-Research-Lab/prima-page-converter#13

Indeed, PAGE-ALTO conversion requires word segmentation. @maxnth Can you think of any sensible workaround?

@jbarth-ubhd
Author

Did a quick-and-dirty script: https://gist.github.com/jbarth-ubhd/0e867c20008639145386a7978fdb27a4

@kba
Collaborator

kba commented Feb 28, 2020

Great but maybe we can integrate pseudo-word creation on-the-fly directly into the converter, with a cmdline flag.

@maxnth

maxnth commented Feb 28, 2020

Word-level PAGE XML output for Calamari has been planned for some time, but sadly we haven't gotten around to implementing it yet.
It's one of my next tasks, though, and will hopefully be included in Calamari within the coming month.
I don't know whether that's too late for this specific case, but knowing that the feature is being worked on might help anyway.

@jbarth-ubhd
Author

seems not to be fixed in v0.4.0.

@kba
Collaborator

kba commented Dec 21, 2020

> seems not to be fixed in v0.4.0.

ocrd_calamari is at 1.0.0 and Calamari at 1.0.5, but word-level PAGE output is indeed not yet implemented in Calamari, AFAICT.

@mikegerber
Contributor

mikegerber commented Feb 5, 2021

ocrd_calamari (but AFAIK not Calamari yet) has been able to produce word- and glyph-level segmentation for a year now; it just does not do so by default. Sorry I didn't speak up earlier, I just didn't know about this issue.

@jbarth-ubhd You need to set ocrd_calamari's parameter -P textequiv_level word.

Quoting ocrd_calamari's README:

In addition to the line text it may also output word and glyph segmentation including per-glyph confidence values and per-glyph alternative predictions as provided by the Calamari OCR engine, using a textequiv_level of word or glyph. Note that while Calamari does not provide word segmentation, this processor produces word segmentation inferred from text segmentation and the glyph positions. The provided glyph and word segmentation can be used for text extraction and highlighting, but is probably not useful for further image-based processing.

ocrd_calamari does more than Calamari here because we wanted to include Calamari's glyph-level information, i.e. character positions and alternative (less probable) character predictions. As PAGE XML has a strict line > word > glyph hierarchy, we also needed to include a word segmentation. This word segmentation is inferred from the text, e.g. "Lorem ipsum dolor sit amet" becomes "Lorem| |ipsum| |dolor| |sit| |amet", splitting strictly on spaces as expected by OCR-D's validation.
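
The splitting rule in that example can be sketched in a few lines of Python. This only illustrates the "strictly on spaces" behaviour and is not ocrd_calamari's actual code; the function name split_on_spaces is made up for the illustration.

```python
import re
from typing import List


def split_on_spaces(line_text: str) -> List[str]:
    """Split a line into words, keeping each single space as its own token."""
    # The capturing group makes re.split return the separators as well.
    return [token for token in re.split(r"( )", line_text) if token]


print("|".join(split_on_spaces("Lorem ipsum dolor sit amet")))
# prints: Lorem| |ipsum| |dolor| |sit| |amet
```

Only the plain space character counts as a separator here; tabs or other whitespace would stay inside a token.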

@mikegerber
Contributor

> Indeed, PAGE-ALTO conversion requires word segmentation.

I wasn't aware of that until now, good to know! And good it's already in ocrd_calamari, albeit originally for an entirely different reason. 😀

@mikegerber
Contributor

What prima-page-converter/ocr-fileformat could do, as far as I can tell from this issue: Give a user-friendly warning that there are no words in the PAGE document, so that ALTO conversion is not possible.
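
A minimal Python sketch of such a pre-conversion check follows. The function names and the warning text are hypothetical, and neither prima-page-converter nor ocr-fileformat is known to ship anything like it; it only shows how text lines without Word children could be detected before calling the converter.

```python
import sys
import xml.etree.ElementTree as ET


def lines_without_words(root):
    """Return the ids of TextLines that carry text but have no Word children."""
    ns = root.tag[1:root.tag.index("}")] if root.tag.startswith("{") else ""
    prefix = "{%s}" % ns if ns else ""
    missing = []
    for line in root.iter(prefix + "TextLine"):
        equiv = line.find(prefix + "TextEquiv")
        text = equiv.findtext(prefix + "Unicode") if equiv is not None else None
        if (text or "").strip() and line.find(prefix + "Word") is None:
            missing.append(line.get("id"))
    return missing


def warn_if_unconvertible(root, stream=sys.stderr):
    """Print a warning if line text would be lost; return True if all is fine."""
    missing = lines_without_words(root)
    if missing:
        print(
            "warning: %d text line(s) have no Word elements (first: %s); "
            "their text would be missing from the ALTO output"
            % (len(missing), missing[0]),
            file=stream,
        )
    return not missing
```

A wrapper script could parse the input with ET.parse(path).getroot(), call warn_if_unconvertible, and then decide whether to continue with the conversion.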

@bertsky
Contributor

bertsky commented Jun 6, 2023

None of this is needed anymore, since we have been using https://github.com/kba/page-to-alto for this purpose since #134.

I suggest closing (cannot do it myself).
