Support for Mixed type PDF's (Saeco Example) & OCR before template matching #409

bosd · 2022-09-25T14:17:34Z

The Saeco invoice is an example of a PDF with some vital data embedded in an image,
as mentioned in #393.

(All the keywords that identify the issuer of the invoice are embedded in the image.)
We need to OCR the PDF first, before a template can be reliably matched.

The current strategy for parsing this kind of invoice is to pass the whole thing to the Tesseract input reader.
However, this is not the preferred method. The Tesseract implementation is quite good, but it still has its flaws.
With the tesseract input module the whole document gets converted into an image before OCR is applied.
So even the original text elements get parsed and replaced with the results from Tesseract,
along with the accompanying errors: it can misread a 0 (zero digit) as an O (letter),
and it can miss commas, dots, and decimal separators.

Real world example:

E103184          Onderhoudsset CA6707/10                                        49,99     21 %    1 PCS            49,99

becomes

E103184           Onderhoudsset CA6707/10                                            4999         21%             1 PCS          49,99

Hint: The unit price changed from 49,99 to 4999

With a lot of creativity there is a way to work around this:

Fix up the OCR artefacts
Generate some keyword matches from specific patterns
I tried it. It gets the job done, but with a lot of challenges.
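To give an idea of what such a fix-up looks like, here is a minimal sketch. The helper name `fix_letter_o_in_numbers` is made up for illustration and is not part of invoice2data. It only repairs a single letter O sitting directly between two digits.

```python
import re

# Hypothetical helper, not part of invoice2data: repairs one common OCR
# artefact, a letter O read in place of a zero inside a run of digits.
_O_BETWEEN_DIGITS = re.compile(r"(?<=\d)[Oo](?=\d)")


def fix_letter_o_in_numbers(text: str) -> str:
    """Replace a letter O that sits directly between two digits with 0."""
    return _O_BETWEEN_DIGITS.sub("0", text)


if __name__ == "__main__":
    # The leading O of "Onderhoudsset" is untouched: it has no digit neighbours.
    print(fix_letter_o_in_numbers("E1O3184 Onderhoudsset CA6707/10"))
```

A rule like this cannot bring back a dropped decimal comma (49,99 read as 4999), which is one of the reasons the workaround stays fragile.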

But it would be better if we had a solution like:

Pass only the images to OCR
Parse the texts from the PDF directly.

bosd · 2023-02-26T10:55:14Z

The solution is here 😄 ✨

Added ocrmypdf as an input module (it is actually more of a pre-processor).
Now we are able to parse this kind of mixed-type invoice.

ocrmypdf has an option to redo OCR (redo-ocr). This is the mode invoice2data calls by default. It leaves the existing text unchanged; only images that have not been OCR'ed yet are processed.
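For reference, a rough sketch of the same pre-processing step run by hand through the ocrmypdf command line with its `--redo-ocr` flag. The function names `build_redo_ocr_command` and `run_redo_ocr` are made up for this example, and this is not necessarily how invoice2data invokes ocrmypdf internally.

```python
import shutil
import subprocess
from typing import List


def build_redo_ocr_command(src: str, dst: str) -> List[str]:
    """Build the ocrmypdf call that keeps existing text and OCRs only images."""
    return ["ocrmypdf", "--redo-ocr", src, dst]


def run_redo_ocr(src: str, dst: str) -> bool:
    """Run ocrmypdf if it is on PATH; return False when it is not installed."""
    if shutil.which("ocrmypdf") is None:
        return False
    # check=True raises CalledProcessError if ocrmypdf exits with an error.
    subprocess.run(build_redo_ocr_command(src, dst), check=True)
    return True
```

The output file can then be fed to pdftotext like any other text PDF.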

Still, there is the problem that the PDF needs to be OCR'ed before keywords can be matched.
To solve this issue, ocrmypdf is now a fallback module.

It works like this:
If the default pdftotext input parser fails to match a template, it checks whether ocrmypdf is installed.
If it is installed, ocrmypdf is called with the redo-ocr parameters. The output of ocrmypdf is sent through pdftotext again, and template matching is retried. Once the template is found, extraction from the PDF starts.
See the Saeco example.
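The control flow described above can be sketched roughly as follows. This is an illustration under assumptions, not the code from this PR: the function `extract_with_ocr_fallback` and its callable parameters are hypothetical stand-ins for the pdftotext reader, the template matcher, the "is ocrmypdf installed" check, and the redo-ocr step.

```python
from typing import Callable, Optional


def extract_with_ocr_fallback(
    pdf_path: str,
    to_text: Callable[[str], str],
    find_template: Callable[[str], Optional[str]],
    ocr_available: Callable[[], bool],
    redo_ocr: Callable[[str], str],
) -> Optional[str]:
    """Return the matched template name, retrying once after OCR if needed."""
    # First attempt: plain text extraction, no OCR.
    template = find_template(to_text(pdf_path))
    if template is not None:
        return template
    # No match: fall back to OCR only when the tool is installed.
    if not ocr_available():
        return None
    # redo_ocr returns the path of the OCR'ed copy; extract and match again.
    ocr_path = redo_ocr(pdf_path)
    return find_template(to_text(ocr_path))
```

Passing the steps in as callables keeps the sketch testable without pdftotext or ocrmypdf installed.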

ocrmypdf also brings a lot of other functionality to the table:

Generating a PDF/A file
Optimizing the PDF images to reduce file size
Deskewing and/or cleaning the image before performing Tesseract OCR (very handy for scanned tickets and receipts)
Under the hood, the cleaning step uses unpaper

Personally, I was using unpaper in my pipeline to process scanned receipts before sending them to my invoice2data server. Having invoice2data do it makes things much easier.

You can fully control the parameters of ocrmypdf and its underlying modules (unpaper, tesseract) by passing your parameters in the input_reader_config dict.
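As an illustration of the idea, user-supplied options can be layered over a default like this. It is only a sketch: the names `DEFAULT_OCR_OPTIONS` and `merge_reader_config` are hypothetical, and the key names are assumed to mirror keyword arguments of ocrmypdf's Python API (`redo_ocr`, `language`). How invoice2data actually forwards input_reader_config is not shown in this thread.

```python
from typing import Any, Dict, Optional

# Assumed default: redo-ocr mode, as described above. Hypothetical constant.
DEFAULT_OCR_OPTIONS: Dict[str, Any] = {"redo_ocr": True}


def merge_reader_config(
    input_reader_config: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Return the defaults overlaid with the user's options (user wins)."""
    options = dict(DEFAULT_OCR_OPTIONS)  # copy so the defaults stay untouched
    options.update(input_reader_config or {})
    return options


if __name__ == "__main__":
    # Example: ask for Dutch OCR while keeping the redo-ocr default.
    print(merge_reader_config({"language": ["nld"]}))
```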

bosd · 2023-02-27T12:13:23Z

Test coverage is dropping because of this one, so I will add some more tests.

bosd · 2023-02-27T22:31:47Z

Coverage is 80%+. This should do.

bosd · 2023-03-16T06:51:23Z

@rmilecki Can you review this PR? 🙏

bosd · 2023-03-19T09:33:26Z

This PR introduces some important functionality. It closes the gap for mixed-type PDFs, which previously could not be parsed.
I'm keen on getting it into production (which pulls invoice2data from PyPI).
Can either @rmilecki or @m3nu approve so we can merge this?

bosd marked this pull request as draft September 25, 2022 14:22

bosd force-pushed the saeco branch 7 times, most recently from 93d6577 to 5e6a85d Compare February 26, 2023 10:14

bosd marked this pull request as ready for review February 26, 2023 10:55

bosd requested review from m3nu, alexis-via and rmilecki February 26, 2023 10:56

bosd changed the title ~~[WIP] Saeco Example text & OCR to data~~ Support for Mixed type PDF's (Saeco Example) & OCR before template matching Feb 26, 2023

bosd added the type:feature, type:enhancement, and good first issue labels Feb 26, 2023

bosd force-pushed the saeco branch 3 times, most recently from 8f5ad4b to 06114df Compare February 27, 2023 22:26

bosd force-pushed the saeco branch from 06114df to e4de866 Compare February 28, 2023 18:57

bosd force-pushed the saeco branch 3 times, most recently from e6a3900 to 19fe96a Compare March 16, 2023 06:50

bosd added the priority:medium label and removed the good first issue label Mar 16, 2023

bosd force-pushed the saeco branch from 19fe96a to 4cff8ec Compare March 19, 2023 09:27

bosd added 7 commits March 30, 2023 11:04

Saeco Example text & OCR to data

e03ba89

add ocrmypdf input module

bb199dd

return empty string

[IMP]Tests: add option to exclude files for input reader specific tests

67dee6e

add unit test for ocrmypdf input module using saeco

0986ef2

use ocrmypdf as fallback

762a0b3

add test fallback to ocrmypdf with module installed

Install ocrmypdf and dependencies

bf1f07e

Test if ocrmypdf is (un)/available

f7734b5

bosd force-pushed the saeco branch 2 times, most recently from c259a9a to f7734b5 Compare March 30, 2023 20:40

bosd merged commit f663826 into invoice-x:master Mar 30, 2023

bosd deleted the saeco branch March 30, 2023 20:46

bosd mentioned this pull request Jun 19, 2023

image to data #393

Closed
