Wider numeric field format support #44

flywire · 2021-06-20T06:16:25Z

Use regular expression operations on numeric fields to discard all characters except digits and the decimal separator, and identify negative values by a leading or trailing '-' or case-insensitive 'DR'. See #33.
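As a rough illustration of the rules described above (not the code in this PR), a minimal sketch might look like the following. The signature, the `decimal_separator` parameter, and the choice to return NaN when no digits remain are all assumptions made for the example.

```python
import re


def format_currency_number(s, decimal_separator="."):
    """Parse a loosely formatted currency string into a float.

    Hypothetical signature for illustration; the PR's function may differ.
    """
    text = str(s).strip()
    # Negative if the value starts or ends with '-' or 'DR' (any case).
    negative = bool(re.search(r"^(-|DR)|(-|DR)$", text, flags=re.IGNORECASE))
    # Discard everything except ASCII digits and the decimal separator.
    kept = re.sub(r"[^0-9" + re.escape(decimal_separator) + r"]", "", text)
    # Normalise the separator so float() can parse the result.
    kept = kept.replace(decimal_separator, ".")
    # Assumption: a field with no digits at all becomes NaN.
    value = float(kept) if kept else float("nan")
    return -value if negative else value
```

With these assumptions, `format_currency_number("1,234.56 dr")` yields `-1234.56`, and `format_currency_number("1.234,56-", decimal_separator=",")` yields the same value. A '-' that is neither leading nor trailing (for example `"$-5.00"`) is not treated as a negative marker in this sketch.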


marlanperumal · 2021-06-20T10:24:42Z

pdf_statement_reader/parse.py

@@ -37,22 +38,24 @@ def get_raw_df(filename, num_pages, config):
 def clean_numeric(df, config):
     numeric_cols = [config["columns"][col] for col in config["cleaning"]["numeric"]]

-    def format_negatives(s):


I know I haven't been good about writing tests for the project yet, but could you please add some tests to show that the format_currency_number method actually does what it's supposed to?

I have a natural aversion to regex, mostly because it tends to produce the opposite of what Python generally gives us: clean, human-readable code where it's easy to see at a glance what's going on. Python's built-in string methods usually do a better job at this. In cases like this, however, with multiple complex rules, I'll concede that regex may result in cleaner code. I'd only include it if there were tests against it, though. Just reading it now, I'm not certain what all the side effects and edge cases might be.
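To make the trade-off concrete, here is a hedged sketch of the same rules written with only built-in string methods. It is not taken from the project; the function name `format_currency_number_plain`, its parameters, and the NaN result for fields without digits are hypothetical.

```python
def format_currency_number_plain(s, decimal_separator="."):
    """String-method version of the numeric cleaning rules (illustrative only)."""
    text = str(s).strip()
    upper = text.upper()
    # Negative if the value starts or ends with '-' or 'DR' (any case).
    markers = ("-", "DR")
    negative = upper.startswith(markers) or upper.endswith(markers)
    # Keep only ASCII digits and the decimal separator.
    kept = "".join(
        ch for ch in text if ch in "0123456789" or ch == decimal_separator
    )
    # Normalise the separator so float() can parse the result.
    kept = kept.replace(decimal_separator, ".")
    # Assumption: a field with no digits at all becomes NaN.
    value = float(kept) if kept else float("nan")
    return -value if negative else value
```

Either style leaves the same edge cases open, such as several decimal separators in one field, which is one reason tests matter more here than the choice between regex and string methods.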

I'll update with a sample test on the original format_negatives method that you can use as a basis.

I moved the format_negatives method outside of the clean_numeric method so that it could be tested, but that has now caused a merge conflict with your branch. Should be easy enough to fix though.

Looking at that old code I wrote in parse.py, it certainly leaves a lot to be desired in terms of being robust to the different content that might come out of a PDF statement file.

flywire added 2 commits June 20, 2021 16:16

Wider numeric field format support

6655d43

Use regular expression operations on numeric fields to discard all characters except numbers and decimal separator, and identify negative values by a leading or trailing '-' or case insensitive 'DR'.

Remove duplicate decimal_separator from regex

896c87b

marlanperumal reviewed Jun 20, 2021


flywire added 3 commits June 21, 2021 09:04

Layout - format_currency_number

b3301f9

test_format_currency_number

0d11332

Merge branch 'develop' into patch-3

566515d
