Lines: support multiple RegEx-es in lines entries #417

Merged
merged 1 commit into from
Oct 21, 2022

Conversation

rmilecki
Collaborator

@rmilecki rmilecki commented Oct 1, 2022

So far "first_line", "line" and "last_line" could contain a single
RegEx only. Some invoices have lines that use more than one format. To
simplify parsing them, allow all 3 entries to contain a list of RegEx-es.

Example:
fields:
  lines:
    parser: lines
    start: Item\s+Discount\s+Price$
    end: \s+Total
    line:
      - Items group:\s+(?P<group>.+)
      - (?P<description>.+)\s+(?P<discount>\d+.\d+)\s+(?P<price>\d+\d+)
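
A minimal sketch of the idea, using only Python's standard re module and the two "line" patterns from the example above. The helper name match_line and the sample input rows are made up for illustration; this is not the code from this pull request, only a picture of "try each RegEx in order, keep the first that matches".

```python
import re

# The two "line" patterns from the example template above.
LINE_PATTERNS = [
    r"Items group:\s+(?P<group>.+)",
    r"(?P<description>.+)\s+(?P<discount>\d+.\d+)\s+(?P<price>\d+\d+)",
]


def match_line(text, patterns):
    """Return the named groups of the first pattern that matches, or None.

    Hypothetical helper: the name and signature are assumptions.
    """
    for pattern in patterns:
        match = re.search(pattern, text)
        if match:
            return match.groupdict()
    return None


# Hypothetical invoice rows, invented for this sketch.
group_row = match_line("Items group: Food", LINE_PATTERNS)
item_row = match_line("Apple   0.00   125", LINE_PATTERNS)
other_row = match_line("-----", LINE_PATTERNS)

print(group_row)
print(item_row)
print(other_row)
```

Each row ends up with only the keys of the pattern that matched it, which is why a "group" row and an item row can sit side by side in the same lines list.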

@rmilecki rmilecki requested review from m3nu and bosd October 1, 2022 21:32
@rmilecki
Collaborator Author

rmilecki commented Oct 1, 2022

This is my alternative to #378, and it should really work this time.

@bosd: I believe you can parse your Mekro.pdf with:

# -*- coding: utf-8 -*-
issuer: Mekro
fields:
  amount: To Pay\s+(\d+.\d{2})
  amount_untaxed: Netto totaal[:]\s+(\d+[,]\d{2})
  date: Invoicedate\s.?\s+(\d{2}-\d{2}-\d{4})\s+\d{2}[:]\d{2}
  invoice_number: Invoicenumber[:]\s+(\S+)
  iban:
    parser: static
    value: NL44INGB0702593702
  partner_coc:
    parser: regex
    regex: '33166113'
  partner_website:
    parser: regex
    regex: mekro.nl
  lines:
    parser: lines
    start: Barcode
    line:
      - (?P<barcode>(\d{13}))\s+(?P<name>(\w+(?:\s\S+)*))\s+(?P<qty>(\d))\s+(?P<uom>\w+)\s+(?P<price_unit>(\d+.\d+))\s+(?P<discount>\d+.\d+)
      - ---(?P<line_note>.*ITEMS)---
    end: Netto totaal
keywords:
  - Mekro
  - NL001799434B01
options:
  date_formats:
    - '%d %m %Y'
  currency: EUR
  languages:
    - en
  decimal_separator: ','

that template gives me:

"lines": [
    {
        "line_note": "FOOD ITEMS"
    },
    {
        "barcode": "2231012001992",
        "name": "KROKETBROODJES",
        "qty": "2",
        "uom": "KG",
        "price_unit": "1,00",
        "discount": "0,0"
    },
    {
        "barcode": "8713009019455",
        "name": "Oil",
        "qty": "3",
        "uom": "L",
        "price_unit": "0,50",
        "discount": "0,0"
    },
    {
        "barcode": "8713009019475",
        "name": "Apple",
        "qty": "1",
        "uom": "KG",
        "price_unit": "50,0",
        "discount": "0,0"
    },
    {
        "line_note": "OTHER ITEMS"
    },
    {
        "barcode": "8713009019375",
        "name": "programmer",
        "qty": "1",
        "uom": "Hour",
        "price_unit": "50,0",
        "discount": "100,0"
    },
    {
        "barcode": "0013009019475",
        "name": "Sticker",
        "qty": "1",
        "uom": "pce",
        "price_unit": "0,0",
        "discount": "0,0"
    }
]

which I believe is what you expected.
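
For readers without the Mekro PDF at hand, here is a rough sketch of how the start, end and line entries of that template interact. It uses only Python's re module. The function parse_lines and the sample text are assumptions written for this illustration (the text is reconstructed from the output above, not taken from the real pdftotext result), and the loop is not the parser code from this pull request.

```python
import re

# start / end / line entries copied from the template above.
START = r"Barcode"
END = r"Netto totaal"
LINE_PATTERNS = [
    r"(?P<barcode>(\d{13}))\s+(?P<name>(\w+(?:\s\S+)*))\s+(?P<qty>(\d))\s+"
    r"(?P<uom>\w+)\s+(?P<price_unit>(\d+.\d+))\s+(?P<discount>\d+.\d+)",
    r"---(?P<line_note>.*ITEMS)---",
]

# Hypothetical input, reconstructed from the JSON output for illustration only.
SAMPLE_TEXT = """\
Barcode Name Qty UoM Price Discount
---FOOD ITEMS---
2231012001992 KROKETBROODJES 2 KG 1,00 0,0
---OTHER ITEMS---
8713009019375 programmer 1 Hour 50,0 100,0
Netto totaal
"""


def parse_lines(text, start, end, patterns):
    """Collect one dict per row between start and end (hypothetical helper)."""
    start_match = re.search(start, text)
    if not start_match:
        return []
    body = text[start_match.end():]
    end_match = re.search(end, body)
    if end_match:
        body = body[:end_match.start()]
    rows = []
    for raw_line in body.splitlines():
        # Try every pattern in order; the first match wins for this row.
        for pattern in patterns:
            match = re.search(pattern, raw_line)
            if match:
                rows.append(match.groupdict())
                break
    return rows


rows = parse_lines(SAMPLE_TEXT, START, END, LINE_PATTERNS)
for row in rows:
    print(row)
```

Rows that match neither pattern (here the rest of the header row) are simply skipped in this sketch.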

@bosd bosd added this to the 0.4.0 release milestone Oct 7, 2022
@bosd
Collaborator

bosd commented Oct 7, 2022

@rmilecki Thanks for your efforts to produce an alternative with cleaner code.

I ran the first test on the QualityHosting example file.
The purpose was to check whether the concatenation function works.

This is the result:

[
    {
        "issuer": "QualityHosting AG",
        "amount": 34.73,
        "amount_untaxed": 34.73,
        "date": "2014-05-07",
        "invoice_number": "30064443",
        "vat": "DE 232 446 240",
        "currency": "EUR",
        "lines": [
            {
                "pos": "1",
                "qty": 1.0,
                "desc": "Small Business StandardExchange 2010\nGrundgebühr pro Einheit\nDienst: OUDJQ_strukan\n01.05.14-31.05.14",
                "price": 3.89
            },
            {
                "pos": "2",
                "qty": 1.0,
                "desc": "Small Business QualityExchange 2010\nGrundgebühr pro Einheit\nDienst: OUDJQ_schneider\n01.05.14-31.05.14",
                "price": 5.39
            },
            {
                "pos": "3",
                "qty": 1.0,
                "desc": "Small Business QualityExchange 2010\nGrundgebühr pro Einheit\nDienst: OUDJQ_minar\n01.05.14-31.05.14",
                "price": 5.39
            },
            {
                "pos": "4",
                "qty": 1.0,
                "desc": "Small Business QualityExchange 2010\nGrundgebühr pro Einheit\nDienst: OUDJQ_mayr\n01.05.14-31.05.14",
                "price": 5.39
            },
            {
                "pos": "5",
                "qty": 1.0,
                "desc": "Small Business QualityExchange 2010\nGrundgebühr pro Einheit\nDienst: OUDJQ_jenewein\n01.05.14-31.05.14",
                "price": 5.39
            },
            {
                "pos": "6",
                "qty": 1.0,
                "desc": "Small Business QualityExchange 2010\nGrundgebühr pro Einheit\nDienst: OUDJQ_jauernik\n01.05.14-31.05.14\nQualityHosting AG - Uferweg 40-42 - D-63571 Gelnhausen\niViveLabs Ltd.\n93B Sai Yu Chung\nYuen Long, N.T.\nHong Kong\nPos.            Menge      Beschreibung                                                            Rabatt %     VK-Preis     Zeilenbetrag\nOhne      Ohne MwSt.\nMwSt.",
                "price": 5.39
            },
            {
                "pos": "7",
                "qty": 1.0,
                "desc": "Small Business StandardExchange 2010\nGrundgebühr pro Einheit\nDienst: OUDJQ_office\n01.05.14-31.05.14\n",
                "price": 3.89
            }
        ],
        "desc": "Invoice from QualityHosting AG"
    }
]

Note: it concatenates correctly.

But something odd is going on in the last line of the first page.
It (incorrectly) adds the footer content to the line:

                "pos": "6",
                "qty": 1.0,
                "desc": "Small Business QualityExchange 2010\nGrundgebühr pro Einheit\nDienst: OUDJQ_jauernik\n01.05.14-31.05.14\nQualityHosting AG - Uferweg 40-42 - D-63571 Gelnhausen\niViveLabs Ltd.\n93B Sai Yu Chung\nYuen Long, N.T.\nHong Kong\nPos.            Menge      Beschreibung                                                            Rabatt %     VK-Preis     Zeilenbetrag\nOhne      Ohne MwSt.\nMwSt.",
                "price": 5.39
            },

The output from pdftotext for this part is:

                             Grundgebühr pro Einheit
                              Dienst: OUDJQ_jauernik
                              01.05.14-31.05.14

QualityHosting AG                                 Vorstand: Christian Heit (Vorsitz),         Bankverbindung
Uferweg 40-42                                     Markus Oestreicher                          Kreissparkasse Gelnhausen


Note the line:
QualityHosting AG Vorstand: Christian Heit (Vorsitz),

It does not contain a white space character at the beginning of the line.
According to the invoice template it should not match.

line: '^\s+(?P<desc>.+)$'

But it does.

@rmilecki
Collaborator Author

rmilecki commented Oct 7, 2022

@bosd: can you paste or send me full pdftotext output, please? Feel free to replace your private data there with random text. That will make it much much easier for me to debug that problem.

@bosd
Collaborator

bosd commented Oct 7, 2022

Here it is:
(No personal info there, this is one of the example files included in this library)

DEBUG:invoice2data.main:START pdftotext result ===========================
DEBUG:invoice2data.main:
      QualityHosting AG - Uferweg 40-42 - D-63571 Gelnhausen




  iViveLabs Ltd.
  93B Sai Yu Chung
  Yuen Long, N.T.
  Hong Kong




Rechnung                                                                                Seite 1

Rechnungsnr.                 30064443                                                   Kundennr.              47774
Rechnungsdatum               7. Mai 2014



   Pos.            Menge      Beschreibung                                                          Rabatt %     VK-Preis     Zeilenbetrag
                                                                                                                    Ohne      Ohne MwSt.
                                                                                                                   MwSt.
                              Contract No. CON02858


       1                 1    Small Business StandardExchange 2010                                                     3,89           3,89
                              Grundgebühr pro Einheit
                              Dienst: OUDJQ_strukan
                              01.05.14-31.05.14
       2                 1    Small Business QualityExchange 2010                                                      5,39           5,39
                              Grundgebühr pro Einheit
                              Dienst: OUDJQ_schneider
                              01.05.14-31.05.14
       3                 1    Small Business QualityExchange 2010                                                      5,39           5,39
                              Grundgebühr pro Einheit
                              Dienst: OUDJQ_minar
                              01.05.14-31.05.14
       4                 1    Small Business QualityExchange 2010                                                      5,39           5,39
                              Grundgebühr pro Einheit
                              Dienst: OUDJQ_mayr
                              01.05.14-31.05.14
       5                 1    Small Business QualityExchange 2010                                                      5,39           5,39
                              Grundgebühr pro Einheit
                              Dienst: OUDJQ_jenewein
                              01.05.14-31.05.14
       6                 1    Small Business QualityExchange 2010                                                      5,39           5,39
                              Grundgebühr pro Einheit
                              Dienst: OUDJQ_jauernik
                              01.05.14-31.05.14

QualityHosting AG                                 Vorstand: Christian Heit (Vorsitz),         Bankverbindung
Uferweg 40-42                                     Markus Oestreicher                          Kreissparkasse Gelnhausen
D-63571 Gelnhausen                                Aufsichtsrat: Hans Jürgen                   Kto-Nr. 48567
Tel. +49 6051 916 44 10                           Habermann (Vorsitz)                         Blz: 507 500 94
Fax +49 6051 916 44 29                            Registergericht Hanau | HRB 13302           IBAN DE30507500940000048567
Im Internet: www.qualityhosting.de                UStId DE 232 446 240                        SWIFT HELADEF1GEL
eMail: info@qualityhosting.de                     Steuer-Nr. 044 241 601 03

  QualityHosting AG - Uferweg 40-42 - D-63571 Gelnhausen




  iViveLabs Ltd.
  93B Sai Yu Chung
  Yuen Long, N.T.
  Hong Kong




Rechnung                                                                                Seite 2

Rechnungsnr.                 30064443                                                   Kundennr.                47774
Rechnungsdatum               7. Mai 2014

   Pos.            Menge      Beschreibung                                                            Rabatt %     VK-Preis     Zeilenbetrag
                                                                                                                      Ohne      Ohne MwSt.
                                                                                                                     MwSt.
       7                 1    Small Business StandardExchange 2010                                                       3,89           3,89
                              Grundgebühr pro Einheit
                              Dienst: OUDJQ_office
                              01.05.14-31.05.14


                                                                                                    Total EUR                         34,73



Zahlungsform                                 Banküberweisung
Zahlungsbedingungen                          14 Tage netto
Zahlungsziel                                 21.05.14

Für Rückfragen bzgl. dieser Rechnung wenden SIe sich bitte per E-Mail unter der Angabe Ihrer Kunden- und Rechnungsnummer an
buchhaltung@qualityhosting.de.

Einwände gegen die Ihnen berechneten Lieferungen und Leistungen sind schriftlich innerhalb 4 Wochen ab Rechnungsdatum unserer
Buchhaltung anzuzeigen. Nach Ablauf dieser Frist gelten die Beträge als genehmigt. Im Falle einer Rücklastschrift der Beträge ohne
Verschulden der QualityHosting AG berechnen wir für die uns entstandenen Kosten ein Entgeld von 15,00 EUR. Unabhängig davon
behalten wir uns die Einstellung unserer Leistungen bis zum Ausgleich unserer Forderungen ausdrücklich vor.




QualityHosting AG                                 Vorstand: Christian Heit (Vorsitz),         Bankverbindung
Uferweg 40-42                                     Markus Oestreicher                          Kreissparkasse Gelnhausen
D-63571 Gelnhausen                                Aufsichtsrat: Hans Jürgen                   Kto-Nr. 48567
Tel. +49 6051 916 44 10                           Habermann (Vorsitz)                         Blz: 507 500 94
Fax +49 6051 916 44 29                            Registergericht Hanau | HRB 13302           IBAN DE30507500940000048567
Im Internet: www.qualityhosting.de                UStId DE 232 446 240                        SWIFT HELADEF1GEL
eMail: info@qualityhosting.de                     Steuer-Nr. 044 241 601 03


DEBUG:invoice2data.main:END pdftotext result =============================


@rmilecki
Collaborator Author

rmilecki commented Oct 8, 2022

@bosd: OK, I just found it's about QualityHosting.pdf.

The problem you reported, about parsing lines in QualityHosting.pdf, is caused by the way src/invoice2data/extract/templates/de/de.qualityhosting.yml is constructed. It has nothing to do with the changes from this pull request.

  1. de.qualityhosting.yml parses lines incorrectly without this pull request changes
  2. de.qualityhosting.yml parses lines incorrectly with those changes

So while I agree de.qualityhosting.yml needs to be fixed, it really has nothing to do with this pull request. I'm happy to help you fix de.qualityhosting.yml if you need to parse QualityHosting invoices. It shouldn't be considered a blocker for this pull request, however.

@bosd
Collaborator

bosd commented Oct 9, 2022

@rmilecki

  1. de.qualityhosting.yml parses lines incorrectly without this pull request changes

I agree with you on this one.

de.qualityhosting.yml parses lines incorrectly with those changes

I think that is something that needs to be fixed, as in #378, of which this PR is an alternative implementation.

I'll respectfully disagree with you on changing the invoice template.
Yes, the invoice template is not optimal.
(It's a different topic, but performance-wise the . metacharacter should be avoided in Python regexes.)

In this case, it is possible to add a last_line rule, as you proposed in #422.
However, I found many practical cases where it is impossible to define a last_line in the template.
In that case, all non-matching lines (like footers) should be discarded until a new first_line match is found.
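
To make that proposed behaviour concrete, here is a small sketch in plain Python. It is only an illustration of the rule described above (a non-matching line closes the current item, and everything is ignored until the next first_line match); it is not how invoice2data currently works. The FIRST_LINE pattern and the function name parse_records are assumptions, since the template's real first_line is not quoted in this thread, and the sample rows are shortened from the pdftotext output.

```python
import re

# Hypothetical patterns for this sketch.
FIRST_LINE = re.compile(r"^\s+(?P<pos>\d+)\s+(?P<qty>\d+)\s+(?P<desc>.+)$")
LINE = re.compile(r"^\s+(?P<desc>.+)$")


def parse_records(lines):
    """Group rows into items; a non-matching row closes the current item."""
    records = []
    current = None
    for text in lines:
        first = FIRST_LINE.match(text)
        if first:
            current = first.groupdict()
            records.append(current)
            continue
        cont = LINE.match(text) if current is not None else None
        if cont:
            # Continuation row: append to the description of the open item.
            current["desc"] += "\n" + cont.group("desc")
        else:
            # Footer or other noise: discard until the next first_line match.
            current = None
    return records


sample = [
    "       6                 1    Small Business QualityExchange 2010",
    "                              Dienst: OUDJQ_jauernik",
    "QualityHosting AG      Vorstand: Christian Heit (Vorsitz),",
    "  iViveLabs Ltd.",
    "       7                 1    Small Business StandardExchange 2010",
    "                              Dienst: OUDJQ_office",
]
records = parse_records(sample)
for record in records:
    print(record)
```

In this sketch the indented address row after the footer is dropped because no item is open at that point, even though it would match the line pattern on its own.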

Maybe it is better to leave suboptimal tests and examples in this library, just as a showcase. (The same goes for the OCR examples in this repo.) They are definitely helping us find these corner cases.

However, setting the template aside.

Here is my analysis of what's happening.
Instead of passing one line to the regex pattern match, it looks like it is trying to match the regex across the whole content.
(I did not dive into the code yet to verify this.)

The regex is as follows:
\s+
\s matches any whitespace character (equivalent to [\r\n\t\f\v ])
+ matches the previous token between one and unlimited times, as many times as possible.

When looking at the line:

QualityHosting AG                                 Vorstand: Christian Heit (Vorsitz),         Bankverbindung

It does not contain a whitespace character at the beginning of the line.
So it should not be used in the output.
(This is the implementation of #378).

However, if we feed the following into the parser:


QualityHosting AG                                 Vorstand: Christian Heit (Vorsitz),         Bankverbindung

A line break \n has been added on the line above our footer. Now the regex
\s+(?P<desc>.+)
matches the first line of the footer. It matches because it uses the line break from the previous line.

This should not happen, as the linebreak is clearly on another line.
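
The effect described here can be reproduced with Python's re module alone. The snippet assumes the pattern is applied with re.match (anchored at the start of the string) to a string that still contains the preceding line break; whether invoice2data ever feeds the parser such a string is exactly the unverified part of this analysis. The footer text is shortened.

```python
import re

PATTERN = re.compile(r"\s+(?P<desc>.+)")

footer = "QualityHosting AG      Vorstand: Christian Heit (Vorsitz),      Bankverbindung"

# The footer row on its own does not start with whitespace: no match.
alone = PATTERN.match(footer)

# With the previous line break still attached, \s+ consumes the "\n"
# and the whole footer row ends up in the "desc" group.
with_newline = PATTERN.match("\n" + footer)

print(alone)
print(with_newline.group("desc"))
```

This works because \s includes \n, while . does not match a line break by default.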

@rmilecki
Collaborator Author

rmilecki commented Oct 16, 2022

@bosd: in your analysis of how \s+ is treated, I think you misread the input invoice. That line: '^\s+(?P<desc>.+)$' in de.qualityhosting.yml works as you expected. It matches only those lines that start with whitespace.


When looking at the line:

QualityHosting AG Vorstand: Christian Heit (Vorsitz), Bankverbindung

It does not contain a whitespace character at the beginning of the line.
So it should not be used in the output.

I agree with you. I believe you are correct. It should not be used in the output and it isn't used in the output.

So things work just like you expect them.


Please check this pdftotext output with my comments (scroll RIGHT please!):

       6                 1    Small Business QualityExchange 2010                                                      5,39           5,39
                              Grundgebühr pro Einheit
                              Dienst: OUDJQ_jauernik
                              01.05.14-31.05.14

QualityHosting AG                                 Vorstand: Christian Heit (Vorsitz),         Bankverbindung                    ← doesn't start with whitespace = doesn't appear in the output
Uferweg 40-42                                     Markus Oestreicher                          Kreissparkasse Gelnhausen         ← doesn't start with whitespace = doesn't appear in the output
D-63571 Gelnhausen                                Aufsichtsrat: Hans Jürgen                   Kto-Nr. 48567                     ← doesn't start with whitespace = doesn't appear in the output
Tel. +49 6051 916 44 10                           Habermann (Vorsitz)                         Blz: 507 500 94                   ← doesn't start with whitespace = doesn't appear in the output
Fax +49 6051 916 44 29                            Registergericht Hanau | HRB 13302           IBAN DE30507500940000048567       ← doesn't start with whitespace = doesn't appear in the output
Im Internet: www.qualityhosting.de                UStId DE 232 446 240                        SWIFT HELADEF1GEL                 ← doesn't start with whitespace = doesn't appear in the output
eMail: info@qualityhosting.de                     Steuer-Nr. 044 241 601 03

  QualityHosting AG - Uferweg 40-42 - D-63571 Gelnhausen                                                                        ← starts with whitespace = appears in the output = expected




  iViveLabs Ltd.                                                                                                                ← starts with whitespace = appears in the output = expected
  93B Sai Yu Chung                                                                                                              ← starts with whitespace = appears in the output = expected
  Yuen Long, N.T.                                                                                                               ← starts with whitespace = appears in the output = expected
  Hong Kong                                                                                                                     ← starts with whitespace = appears in the output = expected




Rechnung                                                                                Seite 2                                 ← doesn't start with whitespace = doesn't appear in the output

Rechnungsnr.                 30064443                                                   Kundennr.                47774          ← doesn't start with whitespace = doesn't appear in the output
Rechnungsdatum               7. Mai 2014

   Pos.            Menge      Beschreibung                                                            Rabatt %     VK-Preis     Zeilenbetrag
                                                                                                                      Ohne      Ohne MwSt.
                                                                                                                     MwSt.
       7                 1    Small Business StandardExchange 2010                                                       3,89           3,89
                              Grundgebühr pro Einheit
                              Dienst: OUDJQ_office
                              01.05.14-31.05.14

So I think everything works correctly, just as you described you expect it to.
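
The same check can be run with Python's re module on a few rows from the annotated output above (spacing shortened). This is a stand-alone sketch that applies the template's line regex to each row separately; it does not call invoice2data itself.

```python
import re

# The "line" regex from de.qualityhosting.yml, as quoted in this thread.
LINE = re.compile(r"^\s+(?P<desc>.+)$")

page_break_rows = [
    "                              01.05.14-31.05.14",
    "",
    "QualityHosting AG      Vorstand: Christian Heit (Vorsitz),      Bankverbindung",
    "  QualityHosting AG - Uferweg 40-42 - D-63571 Gelnhausen",
    "  iViveLabs Ltd.",
    "Rechnung                                    Seite 2",
]

# Keep only the rows that start with whitespace, as the regex demands.
kept = [m.group("desc") for m in map(LINE.match, page_break_rows) if m]

for desc in kept:
    print(desc)
```

The footer row is rejected, while the indented page-header rows of page 2 are accepted, which is how they end up appended to the description of item 6.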

@bosd
Collaborator

bosd commented Oct 19, 2022

@rmilecki Thanks for the very clear information.
It makes sense to change the template so that it only includes a line when there are at least X spaces at the beginning of the line.
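
A possible shape for such a rule, sketched with Python's re module. The threshold of 20 spaces (MIN_INDENT) is a hypothetical value picked for this illustration and is not taken from any template in the repository; the sample rows are shortened from the pdftotext output in this thread.

```python
import re

MIN_INDENT = 20  # hypothetical threshold, chosen only for this sketch
LINE = re.compile(r"^ {%d,}(?P<desc>.+)$" % MIN_INDENT)

description_row = " " * 30 + "Grundgebühr pro Einheit"
address_row = "  iViveLabs Ltd."
column_header_row = " " * 116 + "Ohne      Ohne MwSt."

matches = {
    "description": LINE.match(description_row),
    "address": LINE.match(address_row),
    "column_header": LINE.match(column_header_row),
}

for name, match in matches.items():
    print(name, bool(match))
```

Note that deeply indented column-header rows would still pass a pure indentation test, so a rule like this may need to be combined with something else.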

@rmilecki
Collaborator Author

@bosd: so could we have this one merged now, please?

It's a clear implementation, solves the actual problem you reported, and doesn't seem to regress anything. I find it a nice feature.

It's not meant to solve all the cases our templates can't handle now. But it does solve one and I believe it's worth having.

We can work on handling more cases in further changes (e.g. #407, #423) but we need to start moving forward with something. I'm happy to discuss and work together on other cases later. At the same time I'd like to start merging proposed features.

@bosd
Collaborator

bosd commented Oct 21, 2022

@rmilecki I want to move this one forward as well, as I really want to have this functionality.
Yet first I would like to test it a bit more thoroughly.
It is quite a big change to test (I will use real use-case templates and invoices).
I hope I can do that this weekend and give the approval then.

@bosd
Collaborator

bosd commented Oct 21, 2022

@rmilecki Thanks for collaborating on this and all your efforts! Let's Merge!! 🎉 ✨

Tested this code against a bunch of PDFs and templates locally with great success!!!!

Some notes:

  1. The parsing of different "blocks" is not working with the following syntax.
lines:
 - start: Efficient Invoice Handling
   end:  5555 NH TommyCity
   line: (?P<test>(Sessamestreet 46))
 - start: Barcode
   end:  Netto totaal
   line: (?P<barcode>(\d{13}))\s+(?P<name>(\w+(?:\s\S+)*))\s+(?P<qty>(\d))\s+(?P<uom>\w+)\s+(?P<price_unit>(\d+.\d+))\s+(?P<discount>\d+.\d+)

I tried a variant with the new syntax, but could not make it work.
What would be the right syntax to parse multiple (different) blocks of lines?
Since this is related to #423, which is about the parsing of multiple similar blocks, I would not consider it to block this PR.

  2. The documentation / examples need updating to show the correct syntax for parsing multiple line formats,
    e.g.
    line:
      - (?P<barcode>(\d{13}))\s+(?P<name>(\w+(?:\s\S+)*))\s+(?P<qty>(\d))\s+(?P<uom>\w+)\s+(?P<price_unit>(\d+.\d+))\s+(?P<discount>\d+.\d+)
      - ---(?P<line_note>.*ITEMS)---

(I will take care of no. 2, as it is in my pipeline to provide a new real invoice.)

So far "first_line", "line" and "last_line" could contain a single
RegEx only. Some invoices have lines that use more than one format. To
simplify parsing them, allow all 3 entries to contain a list of RegEx-es.

Example:
fields:
  lines:
    parser: lines
    start: Item\s+Discount\s+Price$
    end: \s+Total
    line:
      - Items group:\s+(?P<group>.+)
      - (?P<description>.+)\s+(?P<discount>\d+.\d+)\s+(?P<price>\d+\d+)

Signed-off-by: Rafał Miłecki <rafal@milecki.pl>
@bosd bosd force-pushed the parsers-lines-multiple-regexes branch from a5558a4 to 81d4cfd Compare October 21, 2022 19:52
Collaborator

@bosd bosd left a comment

Tested extensively! LGTM!! 🎉 🥇
