[Bug]: File generated by OCRmyPDF doesn't open in all PDF editors #1400

Open
sklart opened this issue Oct 1, 2024 · 4 comments

sklart commented Oct 1, 2024

Describe the bug

Hello!
I ran into an interesting problem with the file Сертификат качества №3490 (опора 1У110-5+10, 2 шт.).pdf.

It opens only in PDF-XChange Editor 10.4.1.389; it does not open in Foxit PDF Editor Pro 12.1.1.15289 or in Acrobat Pro 2023.006.20360 (64-bit).
I suspect that pikepdf saves the file in a PDF variant that the other editors do not support.

The PDF is created as follows:

  1. Scan on the MFP.
  2. Manual processing in PDF-XChange Editor: rotating pages and changing page sizes to A4 or A3 (the MFP sometimes produces PDFs with non-standard dimensions of several hundred mm).
  3. Run OCR on every PDF in the folder with ocrmypdf (which uses pikepdf to process the PDF), using this command:
    FOR /r %F IN (*.pdf) DO ocrmypdf -l eng+rus --rotate-pages --skip-text --optimize 1 --output-type pdf "%F" "%~fF"
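
For comparison, here is a minimal Python sketch of the same batch run that writes results to a separate folder instead of overwriting the inputs (in the command above, `"%F"` and `"%~fF"` appear to refer to the same file). Keeping the originals makes it possible to share the pre-OCR input later. The function names and the `out_dir` layout are hypothetical; the ocrmypdf flags are copied from the command above.

```python
import subprocess
from pathlib import Path


def build_ocr_command(src: Path, root: Path, out_dir: Path) -> list[str]:
    """Build the ocrmypdf argument list for one file.

    The output path mirrors the input's position under `root`, but inside
    `out_dir`, so the original file is never overwritten.
    """
    dst = out_dir / src.relative_to(root)
    return [
        "ocrmypdf",
        "-l", "eng+rus",
        "--rotate-pages",
        "--skip-text",
        "--optimize", "1",
        "--output-type", "pdf",
        str(src),
        str(dst),
    ]


def run_batch(root: Path, out_dir: Path) -> None:
    """Run ocrmypdf on every *.pdf under `root` (assumes ocrmypdf is on PATH)."""
    for src in sorted(root.rglob("*.pdf")):
        if out_dir in src.parents:
            continue  # do not re-process files that are already outputs
        cmd = build_ocr_command(src, root, out_dir)
        Path(cmd[-1]).parent.mkdir(parents=True, exist_ok=True)
        subprocess.run(cmd, check=True)
```

Calling `run_batch(Path("scans"), Path("scans_ocr"))` would process a hypothetical `scans` folder and leave every input untouched.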

If I convert the document's page to an image in PDF-XChange Editor and then save it back to PDF, all editors can read the file.

Still, I would like to know what causes this behavior.
Maybe someone can tell me?

P.S. I initially reported this in the pikepdf issues, but I was told that pikepdf is not the cause and that this is more likely an OCRmyPDF issue.

Steps to reproduce

No response

Files

No response

How did you download and install the software?

PyPI (pip, poetry, pipx, etc.)

OCRmyPDF version

ocrmypdf 16.4.3

Relevant log output

No response

@sklart sklart added the triage Issue needs triage label Oct 1, 2024
@jbarlow83
Collaborator

Please provide the input file, before processing with ocrmypdf.

@jbarlow83 jbarlow83 added need test file and removed triage Issue needs triage labels Oct 1, 2024
@jbarlow83
Collaborator

Between 10.02.30 and 10.31.21, the page size changed to 595 x 0 pts (invalid). The command line pdfinfo program can confirm this.

Whatever changed the page size to this invalid value seems to have happened after ocrmypdf processed the file, and the same version of ocrmypdf and pikepdf was used for this processing.

It appears that ocrmypdf is being repeatedly applied to files even when they already have OCR. This shouldn't be necessary - by default ocrmypdf exits with an error if existing text is detected.

That said, there does seem to be an issue with the MediaBox and CropBox here that causes the page size to change in this way.
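
Besides pdfinfo, the page boxes can also be inspected from Python. The following is only a sketch, not part of ocrmypdf: the helper names and the 1 pt threshold are assumptions, and `report_page_boxes` relies on pikepdf being installed. It flags any page whose MediaBox or CropBox has a near-zero width or height, such as the 595 x 0 pts case described above.

```python
def box_size(box):
    """Return (width, height) in points for a PDF rectangle [x0, y0, x1, y1]."""
    x0, y0, x1, y1 = (float(v) for v in box)
    return abs(x1 - x0), abs(y1 - y0)


def is_degenerate(box, min_pts=1.0):
    """True if either dimension is below min_pts (threshold is an assumption)."""
    width, height = box_size(box)
    return width < min_pts or height < min_pts


def report_page_boxes(path):
    """List (page number, box name, size) for every degenerate page box."""
    import pikepdf  # third-party; imported here so the helpers above work without it

    problems = []
    with pikepdf.open(path) as pdf:
        for number, page in enumerate(pdf.pages, start=1):
            boxes = (("MediaBox", page.mediabox), ("CropBox", page.cropbox))
            for name, box in boxes:
                if is_degenerate(box):
                    problems.append((number, name, box_size(box)))
    return problems
```

Running `report_page_boxes` on a file before and after each processing step would show at which step the box becomes invalid.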
