Option to skip files which are already processed by some OCR scanner. #113

SynIV · 2022-04-24T18:39:05Z

Sometimes when I scan a document, e.g. on my phone, OCR is already done there in pretty good quality. On the other hand, when I scan files with my printer or get files from somewhere else that have not been processed by OCR yet, I like the option to automatically scan every file that is newly created on the server.

Therefore it would be absolutely great to automatically skip the OCR scan if the file was already processed and contains printable text.
I would love an option to remove "--redo-ocr" so that these documents are skipped without activating "--remove-background", because that has some other disadvantages according to the ocrmypdf documentation.

So I would like to ask very nicely whether that would be possible. Unfortunately, I am not experienced enough to contribute myself.

R0Wi · 2022-04-25T04:57:34Z

Hi @SynIV and thanks for your feature request. First of all: of course this is possible with little effort, but since we try to keep the app as simple as possible, I think we have to discuss the default behaviour a little bit.

According to the docs there are basically 3 flags for OCRmyPDF to make it skip some pages inside the PDF. The general goal was to support "born digital" documents as well as scanned documents and mixed content. So rethinking this might lead to the conclusion that the default flag should be --skip-text instead of --redo-ocr, so that pages that already contain text (regardless of whether it is visible text or an invisible OCR text layer) are skipped.

Could you please try to reproduce both of your use cases with the ocrmypdf command directly on the CLI and give us some feedback on whether that fits your needs? So basically something like

ocrmypdf --skip-text input.pdf output.pdf
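
To compare the modes side by side, the same call can be scripted. The following is only a minimal sketch in Python that assembles the argument list for each mode and executes nothing. The helper name `build_ocrmypdf_command` and the mode labels are made up for illustration. `--skip-text` and `--redo-ocr` are the flags discussed in this issue; `--force-ocr` is assumed to be the third flag the docs refer to, since it is not named in this thread.

```python
import shlex

# Mode labels are hypothetical; the flag strings are OCRmyPDF CLI options.
MODE_FLAGS = {
    "skip": "--skip-text",   # leave pages that already contain text untouched
    "redo": "--redo-ocr",    # the current default discussed in this issue
    "force": "--force-ocr",  # assumption: the third flag meant by "3 flags"
}


def build_ocrmypdf_command(mode, input_pdf, output_pdf):
    """Return the argument list for one ocrmypdf run (nothing is executed)."""
    if mode not in MODE_FLAGS:
        raise ValueError(f"unknown mode: {mode!r}")
    return ["ocrmypdf", MODE_FLAGS[mode], input_pdf, output_pdf]


if __name__ == "__main__":
    # Print one command line per mode so they can be pasted into a shell.
    for mode in MODE_FLAGS:
        cmd = build_ocrmypdf_command(mode, "input.pdf", f"output-{mode}.pdf")
        print(shlex.join(cmd))
```

To actually run a command, the returned list could be passed to `subprocess.run(cmd, check=True)`, assuming ocrmypdf is installed and on the PATH.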

SynIV · 2022-04-25T07:08:56Z

Hi @R0Wi,

Thank you for your quick answer.

I have tested the --skip-text option on a file that was only half scanned and it works great. As described in the documentation, the already scanned pages are skipped.

I think this would be a nice default behavior.

I understand that you try to keep the app as simple as possible, but in my opinion it would make the app more flexible and customizable if users could set options to rescan or skip pages with existing printable text.

Personally, I would be happy with --skip-text as the default behavior 😄

R0Wi · 2022-04-25T07:12:07Z

Thanks for your fast feedback. I will discuss this with @bahnwaerter and I think we can deliver a suitable solution in the next few days. We will track our progress here 👍

doppelgrau · 2022-06-07T09:25:22Z

Out of curiosity (I'd also like to avoid double OCR), is there a decision to change the default?

R0Wi · 2022-06-07T09:42:13Z

I think the advantage of using --redo-ocr is that it also handles pages with mixed content. For example, in a Word document exported as PDF with some text and an image (containing text) on the same page, the visible text would be left untouched while the image is processed, adding a text layer just over that image. In that situation --skip-text would skip the whole page because it notices that there is already text on that page.
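
The difference can be illustrated with a deliberately simplified model. This is a sketch of the behaviour as described in this comment, not OCRmyPDF's actual implementation; the `Page` class, its fields, and `gets_ocr` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Page:
    has_text: bool   # page already carries text (visible or invisible OCR layer)
    has_image: bool  # page contains an image, e.g. a scan or an embedded picture


def gets_ocr(page, mode):
    """Toy model: does this page receive new OCR under the given flag?"""
    if mode == "--skip-text":
        # Any existing text causes the whole page to be skipped.
        return not page.has_text
    if mode == "--redo-ocr":
        # Visible text is left alone, but images on the page are still processed.
        return page.has_image
    raise ValueError(f"unsupported mode: {mode!r}")


pages = {
    "scanned only": Page(has_text=False, has_image=True),
    "born digital": Page(has_text=True, has_image=False),
    "mixed content": Page(has_text=True, has_image=True),
}

if __name__ == "__main__":
    for name, page in pages.items():
        print(name, gets_ocr(page, "--skip-text"), gets_ocr(page, "--redo-ocr"))
```

In this model the two flags only disagree on the "mixed content" page, which is also how a phone scan that already has an OCR layer looks: --redo-ocr processes it again, --skip-text leaves it alone.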

I think we can go that way:

Change the default behaviour to use --skip-text instead of --redo-ocr
New feature: add an option to the config UI to choose which flag is used. Make the options mutually exclusive: when --redo-ocr is selected, disable the "remove background" option (see https://github.com/ocrmypdf/OCRmyPDF/blob/776ada671391a6282cdf397c78a3487fb1607059/src/ocrmypdf/_validation.py#L102)
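
The exclusivity rule could be checked roughly like this. It is a sketch in Python only, not the app's code: `OcrSettings`, `validate`, and `to_cli_args` are hypothetical names, and only the --redo-ocr / --remove-background conflict mentioned above is modelled.

```python
from dataclasses import dataclass

# Hypothetical set of choices offered in the config UI.
ALLOWED_MODES = ("--skip-text", "--redo-ocr")


@dataclass
class OcrSettings:  # hypothetical settings object, not the app's real model
    mode: str = "--skip-text"        # proposed new default
    remove_background: bool = False


def validate(settings):
    """Return a list of problems; an empty list means the combination is usable."""
    problems = []
    if settings.mode not in ALLOWED_MODES:
        problems.append(f"unknown OCR mode: {settings.mode}")
    if settings.mode == "--redo-ocr" and settings.remove_background:
        problems.append("--redo-ocr cannot be combined with --remove-background")
    return problems


def to_cli_args(settings):
    """Translate validated settings into ocrmypdf arguments."""
    problems = validate(settings)
    if problems:
        raise ValueError("; ".join(problems))
    args = [settings.mode]
    if settings.remove_background:
        args.append("--remove-background")
    return args
```

For example, `to_cli_args(OcrSettings(remove_background=True))` yields the --skip-text flag followed by --remove-background, while combining --redo-ocr with remove background raises an error. In the UI the same rule would simply disable the checkbox instead of raising.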

@bahnwaerter any thoughts?

bahnwaerter · 2022-06-14T20:09:04Z

Thanks @SynIV for reporting this unfavorable behavior in your desired use case.

As @R0Wi already said, the --skip-text option skips all pages that contain text, even pages with mixed content (text and images). This is problematic if OCR has to be performed on the images on such mixed-content pages. That is why we originally decided to use the --redo-ocr option as the default.

To cover the use cases described by @SynIV, we have to change the default option from --redo-ocr to --skip-text. Therefore, I agree with the changes proposed by @R0Wi. Please make sure, @R0Wi, that this fundamental change is documented accordingly to prevent further issues and bug reports. From a performance perspective, changing the default option has the benefit of processing PDF files with a lot of mixed content much faster. I think most people will benefit from this effect; otherwise they can use the new configuration option in the UI.

Closing #113

SynIV · 2022-06-18T09:37:08Z

Thank you so much! 😊

Closing #113

R0Wi · 2022-06-18T10:02:44Z

Thank you so much! 😊

Please let me know if you encounter any errors. Just pushed to the appstore for NC23 and NC24 🚀

SynIV changed the title ~~Option to remove skip files which are already processes by some ORC scanner.~~ Option to skip files which are already processes by some ORC scanner. Apr 24, 2022

bahnwaerter assigned R0Wi Jun 14, 2022

bahnwaerter added the enhancement New feature or request label Jun 14, 2022

R0Wi added a commit that referenced this issue Jun 18, 2022

Implement --skip-text as default setting

8320034

Closing #113

R0Wi mentioned this issue Jun 18, 2022

Implement --skip-text as default setting #126

Merged

R0Wi linked a pull request Jun 18, 2022 that will close this issue

Implement --skip-text as default setting #126

Merged

R0Wi closed this as completed in #126 Jun 18, 2022

R0Wi added a commit that referenced this issue Jun 18, 2022

Implement --skip-text as default setting (#126)

e90c57a

Closing #113

R0Wi added a commit that referenced this issue Jun 18, 2022

Implement --skip-text as default setting (#126)

588337d

Closing #113

R0Wi added a commit that referenced this issue Jun 18, 2022

Implement --skip-text as default setting (#126)

b51ff21

Closing #113

R0Wi added a commit that referenced this issue Jun 18, 2022

Implement --skip-text as default setting (#126) (#127)

3a8cef9

Closing #113

R0Wi added a commit that referenced this issue Jun 18, 2022

Implement --skip-text as default setting (#126) (#128)

5491b40

Closing #113

R0Wi mentioned this issue Jun 18, 2022

Make OCR skip options configurable #129

Closed

3 tasks
