Support for Mixed type PDF's (Saeco Example) & OCR before template matching #409

Merged
merged 7 commits into from
Mar 30, 2023

bosd
Collaborator

@bosd bosd commented Sep 25, 2022

The Saeco invoice is an example of a PDF with some vital data encapsulated in an image, as mentioned in #393.

(All the keywords to identify the issuer of the invoice are encapsulated in the image.)
We need to OCR the PDF first, before a template can be reliably matched.

The current strategy to parse this kind of invoice is to pass the whole thing to the Tesseract input reader.
However, this is not the preferred method. The Tesseract implementation is quite good, but it still has its flaws.
With the tesseract input module, the whole document gets converted into an image before the OCR is applied.
So even the original text elements get parsed and replaced with the results from Tesseract, along with the accompanying errors.
It can misread a 0 (zero digit) as an O (letter).
It can miss commas, dots and decimal separators.

Real world example:

E103184          Onderhoudsset CA6707/10                                        49,99     21 %    1 PCS            49,99

becomes

E103184           Onderhoudsset CA6707/10                                            4999         21%             1 PCS          49,99

Hint: The unit price changed from 49,99 to 4999

With a lot of creativity, there is a way to work around this:

  • Fix up the OCR mistakes and artefacts
  • Generate some keyword matches from specific patterns

I tried it. It gets the job done, but with a lot of challenges.
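
To illustrate what such a fix-up could look like, here is a minimal sketch that repairs a unit price whose decimal comma was dropped by OCR. The function names and the heuristic (re-insert the comma two digits from the right, and accept the result only if unit price times quantity equals the line total) are assumptions for illustration, not code from this PR.

```python
from decimal import Decimal
from typing import Optional


def parse_amount(text: str) -> Decimal:
    """Parse an amount written with a decimal comma, e.g. '49,99'."""
    return Decimal(text.replace(".", "").replace(",", "."))


def repair_unit_price(unit_price: str, quantity: str, line_total: str) -> Optional[str]:
    """Return a consistent unit price, or None if no consistent fix is found.

    Hypothetical heuristic: when the separator is missing, re-insert a comma
    two digits from the right and keep the result only if
    unit price x quantity equals the line total.
    """
    qty = Decimal(quantity)
    total = parse_amount(line_total)
    if "," in unit_price:
        # A separator is present; accept it only if the arithmetic checks out.
        return unit_price if parse_amount(unit_price) * qty == total else None
    if not unit_price.isdigit() or len(unit_price) < 3:
        return None
    candidate = f"{unit_price[:-2]},{unit_price[-2:]}"
    return candidate if parse_amount(candidate) * qty == total else None


# Values taken from the Saeco line above: OCR turned 49,99 into 4999.
print(repair_unit_price("4999", "1", "49,99"))
```

A check like this only helps when the line total itself was read correctly, which is one of the reasons this route stays fragile.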

But it would be better if we had a solution like:

  • Pass only the images to OCR
  • Parse the texts from the PDF directly.

@bosd bosd marked this pull request as draft September 25, 2022 14:22
@bosd bosd force-pushed the saeco branch 7 times, most recently from 93d6577 to 5e6a85d Compare February 26, 2023 10:14
@bosd
Collaborator Author

bosd commented Feb 26, 2023

The solution is here 😄 ✨

Added ocrmypdf as an input module. (It is actually more of a pre-processor.)
Now we are able to parse this kind of mixed-type invoice.

ocrmypdf has a redo-ocr function. This is the function we call by default for invoice2data. It leaves the existing text unchanged and only processes images which have not been OCR'ed yet.

Still, there is the problem that the PDF needs to be OCR'ed before keywords can be matched.
To solve this issue, ocrmypdf is now a fallback module.

It works like this:
If the default pdftotext input parser fails to match a template, it checks whether ocrmypdf is installed.
If it is installed, ocrmypdf is called with the redo-ocr parameters. The result of ocrmypdf is sent through pdftotext again, and template matching is retried. For the Saeco invoice, the template is now found and extraction of the PDF starts.
See the Saeco example.
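
The fallback flow described here can be sketched roughly as follows. This is not the code merged in this PR: `find_template_with_fallback`, `match` and the other helper names are hypothetical, and the sketch shells out to the `pdftotext` and `ocrmypdf` command-line tools rather than reproducing invoice2data's internals.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path
from typing import Callable, Optional


def pdftotext(path: str) -> str:
    """Extract the text layer with the pdftotext CLI (poppler-utils)."""
    result = subprocess.run(
        ["pdftotext", "-layout", path, "-"], check=True, capture_output=True
    )
    return result.stdout.decode("utf-8", errors="replace")


def redo_ocr(path: str) -> str:
    """Run `ocrmypdf --redo-ocr` and return the path of the resulting PDF."""
    out_path = str(Path(tempfile.mkdtemp()) / "ocr.pdf")
    subprocess.run(["ocrmypdf", "--redo-ocr", path, out_path], check=True)
    return out_path


def ocrmypdf_installed() -> bool:
    """True if the ocrmypdf executable is on PATH."""
    return shutil.which("ocrmypdf") is not None


def find_template_with_fallback(
    path: str,
    match: Callable[[str], Optional[str]],  # hypothetical: text -> template name
    extract: Callable[[str], str] = pdftotext,
    ocr: Callable[[str], str] = redo_ocr,
    ocr_available: Callable[[], bool] = ocrmypdf_installed,
) -> Optional[str]:
    """Try the plain text layer first; OCR the images only when nothing matches."""
    template = match(extract(path))
    if template is not None or not ocr_available():
        return template
    # Second pass: OCR the image-only parts, then extract and match again.
    return match(extract(ocr(path)))
```

Passing the steps in as callables keeps the control flow testable without the external tools installed.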

ocrmypdf also brings a lot of additional functionality to the table:

  • Generates a PDF/A file.
  • Optimizes the PDF images to reduce file size.
  • Deskews and/or cleans the image before performing Tesseract OCR (very handy for scanned tickets and receipts).
    Under the hood, it uses unpaper for the cleaning.

Personally I was using unpaper in my pipeline to process scanned receipts before sending them to my invoice2data server. Having the possibility to let invoice2data do it makes it much easier.

You can fully control the parameters of ocrmypdf and its underlying modules (unpaper, tesseract) by passing your parameters in the input_reader_config dict.
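
As a rough sketch of how such a dict could be merged with defaults and forwarded, assuming the keys are passed straight through as keyword arguments to ocrmypdf's Python API (`ocrmypdf.ocr`): the names `DEFAULT_OCR_CONFIG`, `build_ocr_kwargs` and `preprocess` are hypothetical, and the exact way invoice2data consumes input_reader_config may differ.

```python
from typing import Any, Dict, Optional

# Assumed default for illustration; the PR only states that redo-ocr is the default mode.
DEFAULT_OCR_CONFIG: Dict[str, Any] = {"redo_ocr": True}


def build_ocr_kwargs(input_reader_config: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Merge user settings over the defaults; user values win."""
    kwargs = dict(DEFAULT_OCR_CONFIG)
    kwargs.update(input_reader_config or {})
    return kwargs


def preprocess(
    input_pdf: str,
    output_pdf: str,
    input_reader_config: Optional[Dict[str, Any]] = None,
) -> None:
    """Forward the merged settings to ocrmypdf's Python API."""
    import ocrmypdf  # imported lazily so build_ocr_kwargs works without it installed

    ocrmypdf.ocr(input_pdf, output_pdf, **build_ocr_kwargs(input_reader_config))


# Example: keep redo-ocr, but OCR in Dutch and write a plain PDF instead of PDF/A.
print(build_ocr_kwargs({"language": ["nld"], "output_type": "pdf"}))
```

Note that ocrmypdf's documentation lists some options, such as deskew, as incompatible with redo-ocr, so a scanned-receipt setup would likely need to override the mode as well, for example `{"redo_ocr": False, "force_ocr": True, "deskew": True}`.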

@bosd bosd marked this pull request as ready for review February 26, 2023 10:55
@bosd bosd requested review from m3nu, alexis-via and rmilecki February 26, 2023 10:56
@bosd bosd changed the title [WIP] Saeco Example text & OCR to data Support for Mixed type PDF's (Saeco Example) & OCR before template matching Feb 26, 2023
@bosd bosd added type:feature New feature or request type:enhancement New feature or request good first issue Good for newcomers labels Feb 26, 2023
@bosd
Collaborator Author

bosd commented Feb 27, 2023

Test coverage is dropping because of this one. So I will add some more tests.

@bosd bosd force-pushed the saeco branch 3 times, most recently from 8f5ad4b to 06114df Compare February 27, 2023 22:26
@bosd
Collaborator Author

bosd commented Feb 27, 2023

Coverage is 80%+. This should do.

@bosd bosd force-pushed the saeco branch 3 times, most recently from e6a3900 to 19fe96a Compare March 16, 2023 06:50
@bosd bosd added priority:medium priority medium and removed good first issue Good for newcomers labels Mar 16, 2023
@bosd
Collaborator Author

bosd commented Mar 16, 2023

@rmilecki Can you review this PR? 🙏

@bosd
Collaborator Author

bosd commented Mar 19, 2023

This PR introduces some important functionality. It closes the gap for mixed-type PDFs, which previously could not be parsed.
I'm keen to get it into production (which pulls invoice2data from PyPI).
Can either @rmilecki or @m3nu approve, so we can merge this?

@bosd bosd force-pushed the saeco branch 2 times, most recently from c259a9a to f7734b5 Compare March 30, 2023 20:40
@bosd bosd merged commit f663826 into invoice-x:master Mar 30, 2023
@bosd bosd deleted the saeco branch March 30, 2023 20:46
@bosd bosd mentioned this pull request Jun 19, 2023
Labels
priority:medium priority medium type:enhancement New feature or request type:feature New feature or request

1 participant