Lines: support multiple RegEx-es in lines entries #417
rmilecki commented Oct 1, 2022
This is my alternative to #378 that should really work this time. @bosd: I believe you can parse your
that template gives me:
which I believe is what you expected. |
@rmilecki Thanks for your efforts to produce an alternative with cleaner code. I ran the first test on the quality hosting example file. This is the result:
Note: it concatenates correctly, but something odd is going on in the last line of the first page.
The output from pdftotext for this part is:
note the line It does not contain a white space character at the beginning of the line.
But it does |
@bosd: can you paste or send me full |
Here it is:
|
@bosd: OK, I just found it's about That problem you reported - about parsing
So while I agree |
I agree with you on this one.
I think that is something that needs to be fixed, as in #378, of which this PR is an alternative implementation. I'll respectfully disagree with you on changing the invoice template. In this case, it is possible to add a last line rule, as in your proposal in #422. It may be better to leave suboptimal tests and examples in this library, just as a showcase. (The same goes for the OCR examples in this repo.) They are definitely helping us find these corner cases. The template aside, here is my analysis of what's happening. The regex is as follows:
When looking at the line:
It does not contain a whitespace character at the beginning of the line. However, if we feed the following into the parser:
A line break \n has been added in the line above our footer. Now the regex This should not happen, as the line break is clearly on another line. |
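The general behaviour under discussion can be reproduced in isolation. The following is only a sketch in plain Python `re`, not the library's parsing code: the pattern is borrowed from the `end` example in this PR's commit message rather than from the template being debugged, and the sample strings are made up. It shows that `\s` also matches `\n`, so a pattern starting with `\s+` can match across a line break when it is applied to multi-line text, but not when each line is tested on its own.

```python
import re

# Pattern borrowed from the commit message example; it is NOT the regex
# from the template discussed above (assumption for illustration only).
pattern = r"\s+Total"

# Hypothetical sample text: "Total" has no leading whitespace on its own line.
single_line = "Total 10.00"
multi_line = "last item\nTotal 10.00"

# On the single line there is no whitespace before "Total": no match.
match_single = re.search(pattern, single_line)

# On the multi-line string, \s+ consumes the preceding "\n": match.
match_multi = re.search(pattern, multi_line)

# Testing each line separately avoids the cross-line match.
per_line = [re.search(pattern, line) for line in multi_line.splitlines()]

print(match_single)          # None
print(repr(match_multi[0]))  # '\nTotal'
print(per_line)              # [None, None]
```

Whether the match is a problem therefore depends on whether the pattern is applied to the whole text or to individual lines.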
@bosd: in your analysis of
I agree with you. I believe you are correct. It should not be used in the output and it isn't used in the output. So things work just like you expect them to. Please check this
So I think everything works correctly, just as you described you expect it to. |
@rmilecki Thanks for the very clear information. |
@bosd: so could we have this one merged now, please? It's a clear implementation, solves the actual problem you reported, and doesn't seem to regress anything. I find it a nice feature. It's not meant to solve all cases our templates can't handle now. But it does solve one and I believe it's worth having. We can work on handling more cases in further changes (e.g. #407, #423) but we need to start moving forward with something. I'm happy to discuss and work together on other cases later. At the same time I'd like to start merging proposed features. |
@rmilecki I want to move this one forward as well, as I really want to have this functionality. |
@rmilecki Thanks for collaborating on this and all your efforts! Let's merge!! 🎉 ✨ Tested this code against a bunch of PDFs and templates locally with great success!!!! Some notes:
I tried a variant with the new syntax, but could not make it work.
(I will take care of no 2, as it is in my pipeline to provide a new real invoice) |
So far "first_line", "line" and "last_line" could contain a single RegEx only. Some invoices have lines that use more than one format. To simplify parsing them, allow all 3 entries to contain a list of RegEx-es.

Example:

    fields:
      lines:
        parser: lines
        start: Item\s+Discount\s+Price$
        end: \s+Total
        line:
          - Items group:\s+(?P<group>.+)
          - (?P<description>.+)\s+(?P<discount>\d+.\d+)\s+(?P<price>\d+\d+)

Signed-off-by: Rafał Miłecki <rafal@milecki.pl>
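To make the idea concrete, here is a minimal sketch of how a list of RegEx-es could be tried against each line. It is not the code from this PR: the helper `match_line`, the sample lines, and the "first pattern that matches wins" rule are all assumptions for illustration. Only the two patterns come from the commit message.

```python
import re

# The two "line" patterns from the commit message example.
LINE_PATTERNS = [
    r"Items group:\s+(?P<group>.+)",
    r"(?P<description>.+)\s+(?P<discount>\d+.\d+)\s+(?P<price>\d+\d+)",
]


def match_line(line, patterns):
    """Hypothetical helper: return the named groups of the first
    pattern that matches `line`, or None if no pattern matches."""
    # Accept a single RegEx as well as a list, mirroring the old syntax.
    if isinstance(patterns, str):
        patterns = [patterns]
    for pattern in patterns:
        match = re.search(pattern, line)
        if match:
            return match.groupdict()
    return None


# Made-up invoice lines using two different formats.
sample_lines = [
    "Items group: Hosting",
    "Domain renewal 0.00 1500",
]

results = [match_line(line, LINE_PATTERNS) for line in sample_lines]
for result in results:
    print(result)
# {'group': 'Hosting'}
# {'description': 'Domain renewal', 'discount': '0.00', 'price': '1500'}
```

Accepting either a string or a list keeps existing single-RegEx templates working unchanged, which seems to be the intent of the change.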
a5558a4 to 81d4cfd
Tested extensively! LGTM!! 🎉 🥇