Option to skip files which are already processed by some OCR scanner. #113

Closed
SynIV opened this issue Apr 24, 2022 · 8 comments · Fixed by #126
Labels: enhancement (New feature or request)

@SynIV

SynIV commented Apr 24, 2022

Sometimes when I scan a document, e.g. on my phone, OCR has already been done there in pretty good quality. On the other hand, when I scan files with my printer, or get files from somewhere else that have not been processed by OCR yet, I like having the option to automatically OCR every newly created file on the server.

Therefore, it would be absolutely great to automatically skip the OCR scan if the file has already been processed and contains printable text.
I would love an option to drop "--redo-ocr" so that these documents are skipped, without having to activate "--remove-background", because that has some other disadvantages according to the ocrmypdf documentation.

So I would like to ask very nicely whether that would be possible. Unfortunately, I am not experienced enough to contribute this myself.

SynIV changed the title from "Option to remove skip files which are already processes by some ORC scanner." to "Option to skip files which are already processes by some ORC scanner." on Apr 24, 2022
@R0Wi
Contributor

R0Wi commented Apr 25, 2022

Hi @SynIV, and thanks for your feature request. First of all: of course this is possible with little effort, but since we try to keep the app as simple as possible, I think we have to discuss the default behaviour a little bit.

According to the docs, there are basically 3 flags for OCRmyPDF that make it skip some pages inside the PDF. The general goal was to support "born digital" documents as well as scanned documents and mixed content. Rethinking this might lead to the conclusion that the default flag should be --skip-text instead of --redo-ocr, so that pages that already contain text (regardless of whether it is visible text or an invisible OCR text layer) are skipped.

Could you please try to reproduce both of your use cases with the ocrmypdf command directly on the CLI and give us some feedback on whether that fits your needs? So basically something like

ocrmypdf --skip-text input.pdf output.pdf
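
To try both modes on the same file, the command lines can be generated with a few lines of Python. This is only a sketch: the helper name `build_ocrmypdf_command` and the output file names are made up for illustration, and only the flags `--skip-text`, `--redo-ocr` and `--force-ocr` are taken from the OCRmyPDF documentation.

```python
import shlex

# Mode flags named in the OCRmyPDF docs; only one of them is used per run.
MODE_FLAGS = {
    "skip-text": "--skip-text",
    "redo-ocr": "--redo-ocr",
    "force-ocr": "--force-ocr",
}


def build_ocrmypdf_command(mode, input_pdf, output_pdf):
    """Return the argument list for one ocrmypdf run (hypothetical helper)."""
    if mode not in MODE_FLAGS:
        raise ValueError(f"unknown mode: {mode!r}")
    return ["ocrmypdf", MODE_FLAGS[mode], input_pdf, output_pdf]


if __name__ == "__main__":
    # Print one command per mode so the two results can be compared by hand.
    for mode in ("skip-text", "redo-ocr"):
        cmd = build_ocrmypdf_command(mode, "input.pdf", f"output-{mode}.pdf")
        print(shlex.join(cmd))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` on a machine where ocrmypdf is installed.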

@SynIV
Author

SynIV commented Apr 25, 2022

Hi @R0Wi,

Thank you for your quick answer.

I have tested the --skip-text option on a file that was only half scanned, and it works great. As described in the documentation, the already scanned pages are skipped.

I think this would be a nice default behavior.

I understand that you try to keep the app as simple as possible, but in my opinion it would make the app more individual and customizable if users could set options to rescan or skip pages with existing printable text.

Personally, I would be happy with --skip-text as the default behavior 😄

@R0Wi
Contributor

R0Wi commented Apr 25, 2022

Thanks for your fast feedback. I will discuss this with @bahnwaerter, and I think we can deliver a suitable solution in the next few days. We will track our progress here 👍

@doppelgrau

Out of curiosity (I'd also like to avoid double OCR), is there a decision to change the default?

@R0Wi
Contributor

R0Wi commented Jun 7, 2022

I think the advantage of using --redo-ocr is that it also handles pages with mixed content. For example, a Word document exported as PDF, with some text and an image (containing text) on the same page, would be processed without touching the visible text: only the image on the page is OCRed, and a text layer is added just over that image. In that situation, --skip-text would skip the whole page because it notices that there is already text on it.
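
The difference can be illustrated with a toy model of a page. This is a deliberately simplified sketch of the behaviour described in this thread, not how OCRmyPDF is implemented; the `Page` class and the function `images_get_ocr` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Page:
    has_visible_text: bool  # "born digital" text, e.g. from a Word export
    has_ocr_layer: bool     # invisible text layer from an earlier OCR run
    has_image: bool         # raster image that may contain text


def images_get_ocr(page, mode):
    """Simplified model: are the images on this page OCRed in this run?"""
    if mode == "skip-text":
        # Any existing text, visible or invisible, skips the whole page.
        has_any_text = page.has_visible_text or page.has_ocr_layer
        return page.has_image and not has_any_text
    if mode == "redo-ocr":
        # Visible text is left alone; images are OCRed (again).
        return page.has_image
    raise ValueError(f"unknown mode: {mode!r}")


# Word export with text plus an embedded image on the same page.
mixed_page = Page(has_visible_text=True, has_ocr_layer=False, has_image=True)
# Phone scan that already carries an OCR text layer.
phone_scan = Page(has_visible_text=False, has_ocr_layer=True, has_image=True)

for name, page in (("mixed page", mixed_page), ("phone scan", phone_scan)):
    for mode in ("skip-text", "redo-ocr"):
        print(f"{name}, {mode}: images OCRed = {images_get_ocr(page, mode)}")
```

In this model, --skip-text avoids the double OCR on the phone scan but also leaves the image on the mixed page untouched, while --redo-ocr processes both.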

I think we can go that way:

@bahnwaerter any thoughts?

@bahnwaerter
Collaborator

Thanks @SynIV for reporting this unfavorable behavior in your desired use case.

As @R0Wi already said, the --skip-text option skips all pages that contain text, even pages with mixed content (text and images). This is problematic if OCR has to be performed on images on such mixed-content pages. That is why we originally decided to use the --redo-ocr option as the default.

To cover the use cases described by @SynIV, we have to change the default option from --redo-ocr to --skip-text. Therefore, I agree with the changes proposed by @R0Wi. Please make sure, @R0Wi, that this fundamental change is documented accordingly to prevent further issues and bug reports. From a performance perspective, changing the default has the benefit of processing PDF files with a lot of mixed content much faster. I think most people will benefit from this; those who do not can use the new configuration option in the UI.
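
A default with an optional override could look roughly like the following. This is a minimal sketch under assumptions: the settings key `ocr_mode` and the function `resolve_ocr_flag` are invented for illustration and are not the app's actual configuration names.

```python
DEFAULT_MODE = "skip-text"  # default proposed in this thread
ALLOWED_MODES = ("skip-text", "redo-ocr")


def resolve_ocr_flag(settings):
    """Return the ocrmypdf flag for the given settings (hypothetical helper).

    `settings` is a plain dict; a missing or empty "ocr_mode" entry
    falls back to the default.
    """
    mode = settings.get("ocr_mode") or DEFAULT_MODE
    if mode not in ALLOWED_MODES:
        raise ValueError(f"unsupported OCR mode: {mode!r}")
    return "--" + mode


print(resolve_ocr_flag({}))                        # no override set
print(resolve_ocr_flag({"ocr_mode": "redo-ocr"}))  # user opted back in
```

Rejecting unknown values instead of silently falling back keeps a mistyped setting from changing the OCR behaviour unnoticed.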

bahnwaerter added the enhancement (New feature or request) label on Jun 14, 2022
R0Wi added a commit that referenced this issue on Jun 18, 2022
R0Wi linked a pull request on Jun 18, 2022 that will close this issue
R0Wi closed this as completed in #126 on Jun 18, 2022
@SynIV
Author

SynIV commented Jun 18, 2022

Thank you so much! 😊

@R0Wi
Contributor

R0Wi commented Jun 18, 2022

Thank you so much! 😊

Please let me know if you encounter any errors. Just pushed to the appstore for NC23 and NC24 🚀
