Empty (all-null) or missing fields with Python (non-Pandas) types fail validation despite coerce=True and nullable=True #1866

Open
2 of 3 tasks
stainbank opened this issue Nov 29, 2024 · 0 comments
Labels
bug Something isn't working

stainbank commented Nov 29, 2024

Describe the bug
Pandera raises a SchemaError when a column annotated with a plain Python (non-pandas) type is entirely null or missing, despite coerce=True, nullable=True and add_missing_columns=True. An equivalent column annotated with a pandas type is filled with null values as expected.

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of pandera.
  • (optional) I have confirmed this bug exists on the main branch of pandera.

Note: Please read this guide detailing how to provide the necessary information for us to reproduce your bug.

Code Sample, a copy-pastable example

import datetime as dt

import pandas as pd
import pandera as pa


class Schema(pa.DataFrameModel):
    class Config:
        coerce = True
        add_missing_columns = True

    timestamp: pa.Timestamp = pa.Field(nullable=True)
    date: dt.date = pa.Field(nullable=True)


"""Columns present but null."""

complete = dict(timestamp=dt.datetime(1964, 8, 24), date=dt.datetime(1964, 8, 24))
null_timestamp = dict(complete, timestamp=None)
null_date = dict(complete, date=None)

# ✅ everything coerced as expected:
print(Schema.validate(pd.DataFrame([complete])).dtypes)
# ->
# timestamp    datetime64[ns]
# date                 object
# dtype: object

# ✅ null-only `timestamp` column still coerced as expected
print(Schema.validate(pd.DataFrame([null_timestamp])))
# ->
#   timestamp        date
# 0       NaT  1964-08-24

# ✅ with a mix of null and non-null, `date` column still coerced as expected
print(Schema.validate(pd.DataFrame([complete, null_date])))
# ->
#    timestamp        date
# 0 1964-08-24  1964-08-24
# 1 1964-08-24         NaT

# ❗️ with all nulls, `date` fails
print(Schema.validate(pd.DataFrame([null_date])))
# ->
# pandera.errors.SchemaError: expected series 'date' to have type date:
# failure cases:
#    index failure_case
# 0      0          NaT


"""Columns missing."""

missing_timestamp, missing_date = complete.copy(), complete.copy()
missing_timestamp.pop("timestamp")
missing_date.pop("date")

# ✅ missing `timestamp` column created as expected
print(Schema.validate(pd.DataFrame([missing_timestamp])))
# ->
#   timestamp        date
# 0       NaT  1964-08-24

# ❗️ missing `date` column fails
print(Schema.validate(pd.DataFrame([missing_date])))
# ->
# pandera.errors.SchemaError: expected series 'date' to have type date:
# failure cases:
#    index failure_case
# 0      0          NaT

This was run with:

  • pandas==2.2.3
  • pandera==0.21.0

Expected behavior

In the example above, I would expect the non-pandas type column (date) to behave identically to the pandas type column (timestamp), i.e. to be filled with a null value (NaT) when:

  • all values are null
  • the column is missing entirely

This is not specific to date types; they are used here only for illustration. You can swap in int/pa.Int etc.
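For reference, the expected result for the date example can be sketched in plain pandas, without pandera. `expected_result` is a hypothetical helper written only for this illustration. It is not a pandera API, and it makes no claim about how pandera coerces columns internally.

```python
import datetime as dt

import pandas as pd


def expected_result(records, columns=("timestamp", "date")):
    """Hypothetical helper: build the frame the report expects back."""
    frame = pd.DataFrame(records)
    # Add any column that is missing entirely, filled with nulls.
    frame = frame.reindex(columns=list(columns))
    # `timestamp` becomes a datetime64 column; nulls become NaT.
    frame["timestamp"] = pd.to_datetime(frame["timestamp"])
    # `date` becomes an object column of datetime.date values; nulls stay NaT.
    frame["date"] = pd.to_datetime(frame["date"]).dt.date
    return frame


stamp = dt.datetime(1964, 8, 24)

# Column present but entirely null.
all_null = expected_result([dict(timestamp=stamp, date=None)])

# Column missing entirely.
missing = expected_result([dict(timestamp=stamp)])

print(all_null)
print(missing)
```

In both cases the `date` column ends up holding a single null value, which is the outcome I would expect `Schema.validate` to produce as well.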

Desktop (please complete the following information):

  • OS: Mac OS X
@stainbank stainbank added the bug Something isn't working label Nov 29, 2024