Handling empty string in DateFormatValidation #40

Open
Natalie-Caruana opened this issue Oct 3, 2020 · 5 comments
@Natalie-Caruana

Hi,
I am experimenting with the DateFormatValidation validator on the following test example:

import pandas as pd
from pandas_schema import Column, Schema
from pandas_schema.validation import DateFormatValidation

schema = Schema([Column('Date', [DateFormatValidation('%Y-%m-%d')],allow_empty=True), Column('Datetime', [DateFormatValidation('%Y-%m-%d %H:%M')],allow_empty=True)])

test_data = pd.DataFrame([['','2010-09-10 11:10'],['2009-01-04','']],columns=['Date','Datetime'])
errors = schema.validate(test_data)
for error in errors:
    print(error)

It seems to me that there is a bug in how the empty string '' is handled, as I'm not getting back any errors. I do get an error, however, if I use a double whitespace, for example. It looks like the validator treats the empty string as NULL.
Other than that, thanks a lot for this package, it's really handy!

@multimeric
Owner

Okay so the current allow_empty behaviour is this:
https://github.com/TMiguelT/PandasSchema/blob/eae711871e350519d6df48547d3000a7b6fedcb3/pandas_schema/validation.py#L86-L92

So basically allow_empty only checks for null in numeric and categorical dtypes. For the others we check for a string length of 0. But in hindsight, I'm not really sure why I did this, because it is possible for a dtype=str Series to have null values, even when the data is loaded via read_csv.
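To make the description above concrete, here is a rough sketch in plain pandas of the masking it describes. This is not the library's actual code: `failing_rows` and its parameters are hypothetical names used only for illustration, and the validator result is faked.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def failing_rows(series: pd.Series, is_valid: pd.Series, allow_empty: bool) -> pd.Series:
    """Return a boolean mask of the rows that would be reported as errors."""
    invalid = ~is_valid
    if not allow_empty:
        return invalid
    if isinstance(series.dtype, pd.CategoricalDtype) or is_numeric_dtype(series):
        # Numeric and categorical columns: "empty" means null.
        non_empty = ~series.isnull()
    else:
        # Other columns: "empty" means a zero-length string.
        non_empty = series.str.len() > 0
    # Only rows that are non-empty AND invalid are reported.
    return non_empty & invalid


dates = pd.Series(['', '  ', '2009-01-04'])
# Pretend a date validator accepted only the last value.
is_valid = pd.Series([False, False, True])
mask = failing_rows(dates, is_valid, allow_empty=True)
print(mask.tolist())
```

Under these assumptions the empty string is skipped while the double-whitespace value is still flagged, which matches what was reported above.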

Unfortunately this would be a breaking change so I can't fix it without a major version, but I'll try to work out something more logical for 1.0.0.

In the meantime, you can use some kind of boolean logic to work around this. Maybe DateFormatValidation('%Y-%m-%d') & CustomSeriesValidation(lambda s: s.str.len() == 0, 'had an empty string')

@multimeric added the bug label Oct 3, 2020
@Natalie-Caruana
Author

I tried adding the CustomSeriesValidation as shown below:

import pandas as pd
from pandas_schema import Column, Schema
from pandas_schema.validation import DateFormatValidation,CustomSeriesValidation
schema = Schema([Column('Date', [DateFormatValidation('%Y-%m-%d'),CustomSeriesValidation(lambda s: len(str(s))==0,'has an empty string')],allow_empty=True), Column('Datetime', [DateFormatValidation('%Y-%m-%d %H:%M')],allow_empty=True)])
test_data = pd.DataFrame([['','2010-09-10 11:10'],['2009-01-04','']],columns=['Date','Datetime'])
errors = schema.validate(test_data)
for error in errors:
    print(error)

However, I'm getting the following output:

{row: 1, column: "Date"}: "2009-01-04" has an empty string

The error does not seem to be pointing to the correct location.

@multimeric
Owner

Oh, it should be CustomSeriesValidation(lambda s: s.str.len() != 0, 'had an empty string') (not equals, rather than equals)

@Natalie-Caruana
Author

Replacing "==" with "!=" does not return any errors 😕

import pandas as pd
from pandas_schema import Column, Schema
from pandas_schema.validation import DateFormatValidation,CustomSeriesValidation
schema = Schema([Column('Date', [DateFormatValidation('%Y-%m-%d'),CustomSeriesValidation(lambda s: len(str(s))!=0,'has an empty string')],allow_empty=True), Column('Datetime', [DateFormatValidation('%Y-%m-%d %H:%M'),CustomSeriesValidation(lambda s: len(str(s))!=0,'has an empty string')],allow_empty=True)])
test_data = pd.DataFrame([['','2010-09-10 11:10'],['2009-01-04','']],columns=['Date','Datetime'])
errors = schema.validate(test_data)
for error in errors:
    print(error)

@multimeric
Owner

Oh right. I think allow_empty is overriding the empty string behaviour. At this point I think it's easiest to remove allow_empty and "or" the two validations together, indicating that the series is valid if it's null or if it's a valid date, but not otherwise.

Also note that in your example you're not using the CustomSeriesValidation correctly, as the argument s is a pandas Series, and not a Python string, so you can't use len in that way (or at least I don't think you can?)
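A quick way to check the point about len, using plain pandas only (the variable names here are just for illustration):

```python
import pandas as pd

s = pd.Series(['', '2009-01-04'])

# str(s) is the printed representation of the whole Series, so its length
# is a single non-zero number, whatever the individual values are.
whole_series_check = len(str(s)) == 0

# .str.len() measures each element and returns one result per row.
per_row_check = s.str.len() != 0

print(whole_series_check)
print(per_row_check.tolist())
```

The first check collapses to a single boolean for the entire column, while the second gives one boolean per row, which is the shape a series-level validation function is expected to return.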

Column(
    'Datetime',
    [
        CustomSeriesValidation(lambda s: s.isnull(), 'is not null') | 
        DateFormatValidation('%Y-%m-%d %H:%M')
    ], allow_empty=False
)
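To see which rows the "null or valid date" rule is meant to accept, here is a small sketch of the same logic in plain pandas, without PandasSchema. The helper `is_valid_datetime` and the sample values are assumptions made for illustration, not part of the library.

```python
from datetime import datetime

import pandas as pd


def is_valid_datetime(value, fmt='%Y-%m-%d %H:%M'):
    """Return True if value parses with the given strptime format."""
    try:
        datetime.strptime(str(value), fmt)
        return True
    except ValueError:
        return False


column = pd.Series(['2010-09-10 11:10', '', None, '  '], dtype=object)

is_null = column.isnull()
is_date = column.apply(is_valid_datetime)

# A row passes if it is null OR a valid datetime; everything else fails.
passes = is_null | is_date
print(passes.tolist())
```

With this rule a genuine null is accepted, but the empty string and the whitespace-only string are both rejected, since neither is null nor parses as a datetime.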
