Support for Mixed type PDF's (Saeco Example) & OCR before template matching #409
The solution is here 😄 ✨ Added ocrmypdf as an input module (it is actually more of a pre-processor). ocrmypdf has a redo-ocr function, which is what we call by default for invoice2data. It leaves the existing text unchanged; images which have not been OCR'ed yet will be processed. Still, there is the problem that the PDF needs to be OCR'ed before keywords can be matched. It works like this:
ocrmypdf also has a lot of functionality to bring to the table. Personally, I was using unpaper in my pipeline to process scanned receipts before sending them to my invoice2data server. Letting invoice2data do it makes this much easier. You can fully control the parameters of ocrmypdf and its underlying modules (unpaper, tesseract) by passing your parameters in the
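As a rough illustration of the pre-processing step described above, here is a minimal sketch that builds an ocrmypdf command line with `--redo-ocr` and runs it only when the `ocrmypdf` executable is on the PATH. The function names `build_ocrmypdf_command` and `preprocess` are hypothetical and are not taken from this PR; how invoice2data actually forwards extra parameters is not shown here.

```python
import shutil
import subprocess
from typing import List, Optional, Sequence


def build_ocrmypdf_command(
    input_pdf: str, output_pdf: str, extra_args: Sequence[str] = ()
) -> List[str]:
    """Build an ocrmypdf command line (hypothetical helper, not the PR's code).

    --redo-ocr keeps the existing text layer and only OCRs image content.
    """
    return ["ocrmypdf", "--redo-ocr", *extra_args, input_pdf, output_pdf]


def preprocess(
    input_pdf: str, output_pdf: str, extra_args: Sequence[str] = ()
) -> Optional[str]:
    """Run ocrmypdf if it is installed; return the output path, else None."""
    if shutil.which("ocrmypdf") is None:
        # Assumption: callers treat None as "pre-processing unavailable".
        return None
    subprocess.run(
        build_ocrmypdf_command(input_pdf, output_pdf, extra_args), check=True
    )
    return output_pdf


if __name__ == "__main__":
    # "-l eng" selects the OCR language; the file names are placeholders.
    print(build_ocrmypdf_command("invoice.pdf", "invoice_ocr.pdf", ["-l", "eng"]))
```

Keeping the command construction separate from the execution makes it possible to test the argument handling without having ocrmypdf installed.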
Test coverage is dropping because of this one, so I will add some more tests.
Coverage is 80%+. This should do.
@rmilecki Can you review this PR? 🙏
return empty string
add test fallback to ocrmypdf with module installed
The Saeco invoice is an example of a PDF with some vital data encapsulated in an image, as mentioned in #393.
(All the keywords to identify the issuer of the invoice are encapsulated in the image.)
We need to OCR the PDF first, before a template can be reliably matched.
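To make the matching problem concrete, here is a small sketch assuming that a template matches only when every one of its keyword patterns is found in the extracted text. The keyword list and the sample strings are invented for illustration, and `matches_template` is a hypothetical helper, not invoice2data's own function.

```python
import re
from typing import Iterable


def matches_template(text: str, keywords: Iterable[str]) -> bool:
    """Return True only if every keyword pattern is found in the text."""
    return all(re.search(keyword, text) for keyword in keywords)


# Hypothetical issuer keywords; in the Saeco case they live inside an image.
keywords = ["Saeco", "Invoice"]

# Text layer as extracted without OCR: the issuer name is missing.
text_layer_only = "Invoice\nUnit price 49,99"
# Text after the image has been OCR'ed and merged with the text layer.
text_after_ocr = "Saeco\nInvoice\nUnit price 49,99"

print(matches_template(text_layer_only, keywords))  # False
print(matches_template(text_after_ocr, keywords))   # True
```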
The current strategy for parsing this kind of invoice is to pass the whole thing to the Tesseract input reader.
However, this is not the preferred method. The Tesseract implementation is quite good, but it still has its flaws.
With the tesseract input module, the whole document gets converted into an image before OCR is applied.
So even the original text elements get parsed and replaced with the results from Tesseract, along with the accompanying errors: it can misread a `0` (zero digit) as an `O` (letter), and it can miss commas, dots and decimal separators.
Real-world example: between the original text and the Tesseract output, the unit price changed from 49,99 to 4999.
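The effect of a dropped decimal separator can be shown with a short sketch. It assumes amounts use a comma as the decimal separator, as in the example above; `parse_amount` is a hypothetical helper written for this illustration.

```python
from decimal import Decimal


def parse_amount(raw: str, decimal_separator: str = ",") -> Decimal:
    """Parse an amount string, treating the other symbol as a thousands mark."""
    thousands_separator = "." if decimal_separator == "," else ","
    cleaned = raw.replace(thousands_separator, "")
    cleaned = cleaned.replace(decimal_separator, ".")
    return Decimal(cleaned)


original = parse_amount("49,99")  # value from the embedded text layer
misread = parse_amount("4999")    # same field after OCR dropped the comma

# Both strings parse without error, so the mistake passes silently.
print(original)                    # 49.99
print(misread)                     # 4999
print(misread == original * 100)   # True
```

Because the misread value is still a valid number, nothing in the parsing step flags it, which is why keeping the original text layer intact is preferable.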
With a lot of creativity, there is a way to work around this.
I tried it, and it gets the job done, though with a lot of challenges.
But it would be better if we had a solution like: