[Bug]: File generated by OCRmyPDF doesn't open in all PDF editors #1400

sklart · 2024-10-01T19:13:07Z

Describe the bug

Hello!
I ran into an interesting problem with the file Сертификат качества №3490 (опора 1У110-5+10, 2 шт.).pdf.

It opens only in PDF-XChange Editor 10.4.1.389; it does not open in Foxit PDF Editor Pro 12.1.1.15289 or in Acrobat Pro 2023.006.20360 (64-Bit).
My guess is that pikepdf saves the file in some PDF variant that the other editors do not support.

The PDF is created as follows:

1. Scan on the MFP.
2. Process manually in PDF-XChange Editor: rotate pages and change page sizes to A4 or A3 (the MFP sometimes produces PDFs with non-standard dimensions of several hundred mm).
3. Run OCR on every PDF in the folder with ocrmypdf (which uses pikepdf to process the PDF), using the command
FOR /r %F IN (*.pdf) DO ocrmypdf -l eng+rus --rotate-pages --skip-text --optimize 1 --output-type pdf "%F" "%~fF"

If I convert the document's page to an image in PDF-XChange Editor and then save it back to PDF, the file opens in all editors.

But I still wonder what could be the reason for this behavior?
Maybe someone can tell me?

P.S. I first reported this in the pikepdf issues, but was told that pikepdf is not the cause and that this is more likely an OCRmyPDF issue.

Steps to reproduce

No response

Files

No response

How did you download and install the software?

PyPI (pip, poetry, pipx, etc.)

OCRmyPDF version

ocrmypdf 16.4.3

Relevant log output

No response


jbarlow83 · 2024-10-01T21:22:54Z

Please provide the input file, before processing with ocrmypdf.

sklart · 2024-10-02T08:24:05Z

jbarlow83 · 2024-10-12T16:47:02Z

Between 10.02.30 and 10.31.21, the page size changed to 595 x 0 pts (invalid). The command line pdfinfo program can confirm this.
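As a rough sketch of that check, the snippet below parses the "Page size" line from pdfinfo's text output and flags a page whose width or height is not positive. The function names and the sample text are hypothetical, and it assumes the default pdfinfo output, which reports the size of the first page only.

```python
import re

# Matches the "Page size" line that pdfinfo prints, e.g.
# "Page size:      595 x 842 pts (A4)".
PAGE_SIZE_RE = re.compile(r"^Page size:\s+([\d.]+) x ([\d.]+) pts", re.MULTILINE)


def page_size_from_pdfinfo(text):
    """Return (width, height) in points from pdfinfo output, or None if absent."""
    match = PAGE_SIZE_RE.search(text)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))


def is_valid_page_size(size):
    """A usable page needs a positive width and a positive height."""
    width, height = size
    return width > 0 and height > 0


# Hypothetical pdfinfo output resembling the broken file described above.
sample = "Pages:          1\nPage size:      595 x 0 pts\n"
size = page_size_from_pdfinfo(sample)
print(size, is_valid_page_size(size))
```

Running pdfinfo on both the input and the output of each processing step and comparing the results would narrow down which step introduced the zero height.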

Whatever changed the page size to this invalid value seems to have happened after ocrmypdf processed the file, and the same version of ocrmypdf and pikepdf was used for this processing.

It appears that ocrmypdf is being repeatedly applied to files even when they already have OCR. This shouldn't be necessary - by default ocrmypdf exits with an error if existing text is detected.

That said, there seems to be an issue with the MediaBox and CropBox here that causes the page size to change in this way.
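To illustrate how the two boxes can interact, here is a minimal sketch rather than a diagnosis of this particular file. The PDF specification says a CropBox that extends beyond the MediaBox is effectively reduced to its intersection with the MediaBox, so a CropBox that only touches the MediaBox along an edge leaves a page with zero height. The box values and helper names below are hypothetical, and both boxes are assumed to be normalized as (llx, lly, urx, ury).

```python
def intersect_boxes(media_box, crop_box):
    """Clip crop_box to media_box; both are (llx, lly, urx, ury) in points."""
    llx = max(media_box[0], crop_box[0])
    lly = max(media_box[1], crop_box[1])
    urx = min(media_box[2], crop_box[2])
    ury = min(media_box[3], crop_box[3])
    # If the boxes do not overlap, clamp so the size is 0 instead of negative.
    return (llx, lly, max(llx, urx), max(lly, ury))


def box_size(box):
    """Return (width, height) of a normalized box."""
    return (box[2] - box[0], box[3] - box[1])


# Hypothetical values: an A4-sized MediaBox and a CropBox shifted up by one page height.
media_box = (0, 0, 595, 842)
crop_box = (0, 842, 595, 1684)

effective = intersect_boxes(media_box, crop_box)
print(effective, box_size(effective))
```

With these made-up numbers the effective page is 595 points wide and 0 points high, which is the kind of degenerate size that viewers may refuse to render.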

sklart added the triage label Oct 1, 2024

sklart assigned jbarlow83 Oct 1, 2024

jbarlow83 added the need test file label and removed the triage label Oct 1, 2024
