Add HOCR output as a sidecar option #177
Comments
ocrmypdf has three PDF renderers. One of them is called the So
which will output a temporary folder with all working files, including the hocr files for each page. The main drawback of the hocr renderer is that its support for non-Latin scripts is poor. If you'd prefer to force generation of hocr files using the new (and default)
I will think about adding an option for hocr sidecars that involves less hackery, but this should do it for now.
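Until such an option exists, the per-page hocr files can be copied out of the kept working folder by hand. The following is a minimal sketch, not part of ocrmypdf: the helper name `collect_hocr_sidecars` is hypothetical, and it assumes the working folder was preserved and that the page files end in `.hocr` with zero-padded page numbers in their names.

```python
import shutil
from pathlib import Path


def collect_hocr_sidecars(work_dir, sidecar_dir):
    """Copy every *.hocr file found under work_dir into sidecar_dir.

    Returns the copied paths sorted by file name, which matches page
    order only if names start with zero-padded page numbers (assumed).
    """
    src = Path(work_dir)
    dst = Path(sidecar_dir)
    dst.mkdir(parents=True, exist_ok=True)

    copied = []
    # rglob also searches subfolders, in case the layout is nested.
    for hocr in sorted(src.rglob("*.hocr"), key=lambda p: p.name):
        target = dst / hocr.name
        shutil.copy2(hocr, target)  # copy2 preserves timestamps
        copied.append(target)
    return copied
```

Calling `collect_hocr_sidecars("/path/to/kept/workdir", "sidecars")` would leave one hocr file per page in `sidecars/`; both paths here are placeholders.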
Any further thoughts on adding additional sidecar features?
Your suggestion to use
This probably is related, but I was wondering if there's a good way to store the hOCR on the side and then "apply" it to the PDF when needed. I want to retain the original PDF files without having to basically duplicate them, more than doubling storage costs.
You can use the API functions in ocrmypdf.api to save a hocr and apply it later.
@jbarlow83 Which API functions? Are there command-line flag(s) that could do the trick just yet?
Line 383 in 3a75b20
I have many applications where the physical location of text on a page is significant, and an existing codebase built around the HOCR html format.
What would make this library completely killer is an option to produce a sidecar file of the hocr data from Tesseract. I know that Tesseract natively can produce HOCR data, so the change shouldn't be difficult. The only question is how to integrate that into the existing command line interface.
Maybe a new option for --sidecar-hocr?
Flipping through the codebase now to see if there's an easy option.
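For context on why word positions matter here: hOCR stores each recognized word as an element with class `ocrx_word` whose `title` attribute carries a `bbox x0 y0 x1 y1` entry. A minimal sketch of reading those boxes with only the Python standard library follows; the names `HocrWordParser` and `parse_hocr_words` are hypothetical, and the sketch assumes words are `span` elements that contain no nested `span` (character-level boxes would need more care).

```python
import re
from html.parser import HTMLParser

# Matches the "bbox x0 y0 x1 y1" entry inside an hOCR title attribute.
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class HocrWordParser(HTMLParser):
    """Collect (text, (x0, y0, x1, y1)) for every ocrx_word span."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None  # bbox of the word currently being read
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if "ocrx_word" in classes:
            match = BBOX_RE.search(attrs.get("title") or "")
            if match:
                self._bbox = tuple(int(g) for g in match.groups())
                self._text = []

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        # Assumes no nested <span> inside a word span.
        if tag == "span" and self._bbox is not None:
            self.words.append(("".join(self._text).strip(), self._bbox))
            self._bbox = None


def parse_hocr_words(hocr_text):
    parser = HocrWordParser()
    parser.feed(hocr_text)
    parser.close()
    return parser.words
```

A sidecar file produced per page could be fed through `parse_hocr_words` to recover each word together with its pixel coordinates on that page.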