Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Wider numeric field format support #44

Open
wants to merge 5 commits into
base: develop
Choose a base branch
from

Conversation

flywire
Copy link
Contributor

@flywire flywire commented Jun 20, 2021

Use regular expression operations on numeric fields to discard all characters except numbers and decimal separator, and identify negative values by a leading or trailing '-' or case insensitive 'DR'. See #33.

Use regular expression operations on numeric fields to discard all characters except numbers and decimal separator, and identify negative values by a leading or trailing '-' or case insensitive 'DR'.
@@ -37,22 +38,24 @@ def get_raw_df(filename, num_pages, config):
def clean_numeric(df, config):
numeric_cols = [config["columns"][col] for col in config["cleaning"]["numeric"]]

def format_negatives(s):
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know I haven't been good yet about writing tests for the project, but can you please add some tests to show that the format_currency_number method actually does what it's supposed to.

I have a natural aversion to regex, mostly because it generally results in the opposite of what python generally gives us - clean human readable code that's easy to easily visually grep what's going on. Python's built in string methods usually do a better job at this. In cases like this however with multiple complex rules, I'll concede that using regex may result in cleaner code. I'd only include it if there were tests against it though. Just reading it now, I'm not certain what all the side effects and edge cases might be.

I'll update with a sample test on the original format_negatives method that you can use as a basis.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I moved the format_negatives method outside of the clean_numeric method so that it could be tested but that's now caused a merge conflict with your branch. Should be easy enough to fix though.

Looking at that old code I wrote in parse.py there's certainly a lot to be desired in terms of it being robust to the different content that might come out of a pdf statement file

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants