hocr import / export #453
Comments
What a coincidence. I wanted to request the same feature today. This feature request has some overlap with #177. |
I think this would be great but there's a lot to do to make it work, especially to support after the fact editing. |
All I would need is an option to merge an hOCR HTML file with a PDF file. On the Internet, I read about I also found Finally, gImageReader does output a PDF with hOCR data, but it does not work with 800+ pages (200+ MiB); it creates a corrupted/damaged PDF (at least that’s what Evince and Adobe Acrobat Reader say). |
Have you reported this to the gImageReader repo? |
ocrmypdf already has the ability to merge hOCR HTML into PDF through its public APIs. What it does not have is a convenient way to run its post-processing on a set of edited hOCR files. |
Thanks, @jbarlow83, for the reply. I haven’t ever used For what I want is to merge an hOCR HTML file (generated by gImageReader and @aalmir, not exactly. I reported it in this comment and marginally in manisandro/gImageReader#480. |
@tukusejssirs The relevant code is in hocrtransform.py. See |
@jbarlow83, thanks! Is there any way to either input multiple images or pre-created PDF file? @aalmir, the issue is here → manisandro/gImageReader#486. |
No, it doesn't have that ability, but you could split the hOCR and run a loop. |
@jbarlow83, I have just tried to merge hOCR data and an image into a PDF, but it failed (see below). Am I missing some kind of module or something? Here are the test files. Note that although I want to output the interword spaces, I don’t want them where I have commented out the whitespace between
|
It looks like the XML ( |
@jbarlow83, do I get it right that you say the soft hyphen (
|
ocrmypdf.hocrtransform is only capable of parsing the subset of hOCR generated by Tesseract. For this specific case, you'll need to add a string like the following to the top of the hOCR file:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd" [
<!ENTITY shy '#x00AD'>
]>
U+00AD is the Unicode code point for the soft hyphen. |
(Note that doctype signature may actually be incorrect for hOCR; whatever the hOCR spec says is correct should be used.) |
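As an alternative to declaring entities in the doctype, named HTML entities could be rewritten as numeric character references before the file reaches the XML parser. The sketch below is not what ocrmypdf does; `numeric_entities` is a hypothetical helper, and it relies on the standard library's `html.entities.name2codepoint` table.

```python
import re
import xml.etree.ElementTree as ET
from html.entities import name2codepoint

# The five entities that XML itself predefines must be left untouched.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}
ENTITY_RE = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def numeric_entities(text):
    """Replace named HTML entities (e.g. &shy;) with hex character references."""

    def replace(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)  # leave unknown or predefined names as-is
        return "&#x{:04X};".format(name2codepoint[name])

    return ENTITY_RE.sub(replace, text)


# Hypothetical fragment; a real hOCR file would be read from disk instead.
fragment = "<p>hy&shy;phen&thinsp;space &amp; more</p>"
parsed = ET.fromstring(numeric_entities(fragment))
```

Because the rewrite is purely textual, it would also touch entity-like strings inside comments or CDATA sections, which may or may not matter for a given file.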
Thanks, @jbarlow83, that sort of worked.
<!DOCTYPE html [
<!ENTITY shy '#x00AD'>
<!ENTITY thinsp '#x2009'>
]>
|
Official definition is:
<!ENTITY shy CDATA "&#173;" -- soft hyphen = discretionary hyphen, U+00AD ISOnum --> |
Thanks again, @jbarlow83, but CDATA causes an error using ocrmypdf.hocrtransform (same command as above).
<!DOCTYPE html [
<!ENTITY shy CDATA "&#173;" -- soft hyphen = discretionary hyphen, U+00AD ISOnum -->
<!ENTITY thinsp "&#2009;">
]>
|
Remove the -- and the comments after it.
On Fri., Jan. 15, 2021, 13:29 Tukusej’s Sirs wrote:
Thanks again, @jbarlow83, but CDATA causes an error using ocrmypdf.hocrtransform (same command as above).
1. Change the doctype to the following (the doctype itself does not matter):
<!DOCTYPE html [
<!ENTITY shy CDATA "&#173;" -- soft hyphen = discretionary hyphen, U+00AD ISOnum -->
<!ENTITY thinsp "&#2009;">
]>
2. Run the ocrmypdf.hocrtransform command (same as above). This produces an error at line 2, column 17 (the second space before CDATA in the shy entity definition).
python -m ocrmypdf.hocrtransform -i 024_mr2004.tif --interword-spaces 024_hocr.html merged.pdf
/usr/lib64/python3.8/runpy.py:127: RuntimeWarning: 'ocrmypdf.hocrtransform' found in sys.modules after import of package 'ocrmypdf', but prior to execution of 'ocrmypdf.hocrtransform'; this may result in unpredictable behaviour
warn(RuntimeWarning(msg))
Traceback (most recent call last):
File "/usr/lib64/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib64/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/usr/lib/python3.8/site-packages/ocrmypdf/hocrtransform.py", line 382, in <module>
hocr = HocrTransform(args.hocrfile, args.resolution)
File "/usr/lib/python3.8/site-packages/ocrmypdf/hocrtransform.py", line 68, in __init__
self.hocr = ElementTree.parse(hocrFileName)
File "/usr/lib64/python3.8/xml/etree/ElementTree.py", line 1202, in parse
tree.parse(source, parser)
File "/usr/lib64/python3.8/xml/etree/ElementTree.py", line 595, in parse
self._root = parser._parse_whole(source)
xml.etree.ElementTree.ParseError: syntax error: line 2, column 17
|
<!DOCTYPE html [
<!ENTITY shy CDATA "&#173;">
<!ENTITY thinsp "&#2009;">
]> |
@jbarlow83, I’ve already tried that; same result as I posted in my last comment. It is quite interesting to me that your parser (or whatever reports that error) says the error occurs on a space (the second one before |
The parser is just Python's widely used XML parser from the standard library:
import xml.etree.ElementTree as ET

# The CDATA keyword is omitted: it is not valid in an XML entity declaration.
x = ET.fromstring(
    """
<!DOCTYPE html [
<!ENTITY shy "&#173;">
<!ENTITY thinsp "&#2009;">
]>
<test>&shy;</test>"""
)
assert x.text == '\xad' |
@jbarlow83, thanks, that helped for soft hyphens, but not for thin spaces. Thin spaces are replaced with
|
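One possible cause of the thin-space problem, assuming the thinsp entity was defined with the decimal reference `&#2009;`: decimal 2009 is U+07D9 (the N'Ko letter ߙ), whereas the thin space is U+2009, which is `&#x2009;` in hexadecimal or `&#8201;` in decimal. A small sketch with a hypothetical helper, `expand_thinsp`, shows the difference:

```python
import xml.etree.ElementTree as ET


def expand_thinsp(entity_value):
    """Parse a tiny document that defines thinsp as entity_value and uses it."""
    document = (
        '<!DOCTYPE html [<!ENTITY thinsp "%s">]>'
        "<test>&thinsp;</test>" % entity_value
    )
    return ET.fromstring(document).text


decimal_2009 = expand_thinsp("&#2009;")   # decimal 2009 is U+07D9, not a space
hex_2009 = expand_thinsp("&#x2009;")      # hexadecimal 2009 is the thin space
decimal_8201 = expand_thinsp("&#8201;")   # decimal form of the same thin space
```

If that assumption holds, changing the definition to `<!ENTITY thinsp "&#x2009;">` should yield real thin spaces in the parsed text; whether the PDF output then renders them as wanted is a separate question.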
What you're trying to do likely requires software development beyond this one issue. You'll need to find someone who can do that coding for you. |
Okay, thanks anyway for the help. 😃 |
I don't know whether gscan2pdf uses gImageReader for it, but in gscan2pdf I get a possibility to visually alter the OCR'ed text, based on the HOCR, and visually showing the confidence with colors. |
Apologies if this is already answered, but I've been digging around for a few days and haven't found a definitive answer for this: What's the best/easiest/recommended way to
In other words, use OCRmyPDF in a workflow where a human can come in and hand-correct (and quite possibly version control) the OCR data before performing the merge? I've been playing around with running something like I'm fine with digging around in the code if necessary... in fact, that's probably where I'm going to be going after I finish writing this. I just figured I'd ask first in case someone has the answers close at hand. Thanks
|
There are no great options for hand-correction because I tend to see it as a problem that a command-line program is not well equipped to solve. I suppose I (or you) could insert a plugin hook that allows a custom renderer at src/ocrmypdf/_sync.py:216. Then you could intercept the .hocr file and make edits to it before proceeding with the standard hocrtransform. The pages would come at you out of order, because ocrmypdf processes pages in parallel. |
You might be interested in the flow of GScan2pdf, which offers a scanned text-edit feature:
https://images.pling.com/img/00/00/49/49/34/1230285/43c303619853d3cf57804eb721a0abac19d8.png
However, I don’t believe it supports PDF/A or MRC compression (plugin).
|
@rmast Checking it out now. It's going to be chewing on the 214 MiB 324-page PDF I just fed it for a while, but what it's showing while it processes each page is making me optimistic. Well, besides looking a little bit like a ransom letter, but I can understand that considering how each word seems to be in its own bounding box, and that's just kind of how things are. |
GScan2pdf almost works. It gets up to saving page 35 of 324, and then it just sits there. The UI is still responsive, it claims to be saving, but it's just... sitting there.. |
@jbarlow83 I just tossed you $200USD via OpenCollective in hopes that I can shamelessly bribe you to add some method for hocr export/import. I just spent the last couple of hours learning about postfix operators, and it's a rabbit hole I really don't want to have to go down. |
See? I just said "postfix" when I meant "PDF". |
@jaysonlarose I do appreciate the generous contribution and I'll try to think if there's something clever/efficient/reasonable for a CLI app. |
I’m even surprised it reads an existing PDF. Usually I cut PDFs into loose pictures to process them, for example with pdfimages -tiff.
As GScan2PDF will not apply JBIG2, there’s no need to put all of them in one PDF at the end to get one big dictionary; you’ll need more steps to get it compressed.
PDFsam will split or merge PDFs for you.
|
@jaysonlarose, I needed something exactly like what you are describing, so I forked the repo. In my version, ocrmypdf can be used in two additional workflows:
However, I don't think I'll be opening a pull request because of the way option 2 works. It needs to perform OCR again in case optimizations were done in the middle (--deskew or --rotation), so that the hOCR being passed matches the optimizations and the OCR image being generated at that moment. In theory, you need to pass the same set of options (--rotate-pages, --clean, --deskew) if you plan to perform these flows separately... Additionally, option 2 loses the sidecar because it is created only if the hOCR is generated in the regular flow or the --ocr-only flow, so in theory you should still have the original sidecar. I'll be waiting for my 200 bucks :) |
I had the same problem as you. I wanted to edit the OCR before merging it into a PDF. This is what I did.
|
Now implemented as an experimental API |
Why do we need a separate tool when ocrmypdf already does this step internally? |
Describe the issue
If you want to create perfect OCR with 100% correct text, you need some editing function.
For example, gImageReader provides some basic editing functions (but is missing some other features).
Expected behavior
An option to export and import hOCR.
After export, I can make changes to the hOCR and import it for PDF creation.