check_output : option to turn down checks if output is empty #332

Open
ClaireGouze opened this issue Nov 24, 2020 · 10 comments
Labels
enhancement New feature or request help wanted Extra attention is needed

@ClaireGouze

I'm using the check_output function to check column & datatypes of the DataFrameSchema. My function output can sometimes be an empty dataframe, which then raises a SchemaError, though I would want no error in that case.

Would it be possible to have an option in the check_output function, or in the DataFrameSchema, so that no error is raised if the output is empty?

Thank you !

@ClaireGouze ClaireGouze added the enhancement New feature or request label Nov 24, 2020
@cosmicBboy
Collaborator

thanks for submitting this feature request @ClaireGouze!

I think this use case should be supported, and here are a few potential solutions:

  1. add an allow_empty property to the DataFrameSchema and SeriesSchema initializers, such that empty dataframes can
    pass through without raising a SchemaError. This is nice because it would then cover the check_input case as well.
  2. add an optional argument to the check_* decorators, resulting in the same behavior.

I'm leaning toward (1), mainly because (2) sort of conflicts with the semantics of Optional[<TYPE>] in the typing module, which implies that the value can be either None or the <TYPE> specified. allow_empty, on the other hand, would hold a pandas-specific meaning, which is conceptually cleaner than overloading the "optional" terminology.

Let me know what you think!

@jeffzi
Collaborator

jeffzi commented Nov 25, 2020

> I'm using the check_output function to check column & datatypes of the DataFrameSchema

If you don't have explicit checks, i.e. just checking column names and types, you could set coerce=True. Obviously, whether that's acceptable depends on your project.

import pandera as pa
import pandas as pd


schema = pa.DataFrameSchema({"A": pa.Column(int)})


@pa.check_output(schema)
def make_empty() -> pd.DataFrame:
    return pd.DataFrame({"A": []})


try:
    make_empty()  # fails
except pa.errors.SchemaError as ex:
    print(ex)
#> error in check_output decorator of function 'make_empty': expected series 'A' to have type int64, got float64

schema_coerced = pa.DataFrameSchema({"A": pa.Column(int)}, coerce=True)


@pa.check_output(schema_coerced)
def make_empty_coerced() -> pd.DataFrame:
    return pd.DataFrame({"A": []})


make_empty_coerced()  # ok
#> Empty DataFrame
#> Columns: [A]
#> Index: []

Created on 2020-11-25 by the reprexpy package

If the DataFrame is empty, we can only validate names and types. I think an argument allow_empty should still validate types. Pandera could offer a helper method DataFrameSchema.coerce_dtypes() to let the user coerce locally when the DataFrame is empty. That way coerce can be kept to False globally if that's desirable.
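
The "coerce only when empty" idea can be sketched with plain pandas. This is a minimal illustration, not a pandera feature: the helper name cast_if_empty and the expected-dtype mapping are assumptions made up for this example, and the proposed DataFrameSchema.coerce_dtypes() helper is not used.

```python
import pandas as pd


def cast_if_empty(df: pd.DataFrame, dtypes: dict) -> pd.DataFrame:
    """Hypothetical helper (not part of pandera).

    Cast the columns of a row-less frame to the dtypes a schema would
    expect, and leave frames that contain rows untouched.
    """
    if len(df) > 0:
        return df
    # Only cast columns that are present; astype raises on unknown keys.
    present = {col: dtype for col, dtype in dtypes.items() if col in df.columns}
    return df.astype(present) if present else df


expected = {"A": "int64"}
empty = cast_if_empty(pd.DataFrame({"A": []}), expected)
full = cast_if_empty(pd.DataFrame({"A": [1.5]}), expected)

print(empty.dtypes)  # column A is cast because the frame has no rows
print(full.dtypes)   # column A keeps its original dtype
```

Calling such a helper just before validation would keep coerce=False for every frame that actually contains data, so genuine type errors in non-empty frames would still be reported.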

Regarding solution 2., one problem is that you would allow empty DataFrames locally but later validations could fail if optional=True was not set down the pipeline. Moreover, DataFrameSchema.validate() would also need an optional argument if we want to keep a 1:1 mapping with decorator functionalities.

@cosmicBboy
Collaborator

> I think an argument allow_empty should still validate types.

👍

@ClaireGouze
Author

Thanks for your reply, I think solution 1 you mentioned would be suitable.

> If you don't have explicit checks, i.e. just checking column names and types, you could set coerce=True. Obviously, whether that's acceptable depends on your project.

This would be a good solution but if the output is just an empty dataframe with no column name, it will still fail.

@jeffzi
Collaborator

jeffzi commented Nov 26, 2020

What you are asking for is actually to completely disable validation.

I propose to introduce both arguments:

  1. Argument allow_empty for DataFrameSchema/SeriesSchema, which still checks names and types on empty DataFrames. Example use cases are dry runs or reading from a source that can be empty. The semantics are that we processed the data successfully but the output is empty.

  2. Argument optional for all check decorators, which disables validation when passed a None object. That behavior would be aligned with typing.Optional. The semantics are slightly different from 1. It would signal that we could not process the DataFrame but that this is within expectations, so we do not want to raise an error.

SchemaModel coupled with the decorator check_types already implements 2.

import pandera as pa
from pandera.typing import Series, DataFrame
import pandas as pd
from typing import Optional


class Schema(pa.SchemaModel):
    A: Series[int]


@pa.check_types()
def make_empty() -> Optional[DataFrame[Schema]]:
    return pd.DataFrame()


try:
    make_empty()  # fails
except pa.errors.SchemaError as ex:
    print(ex)
#> error in check_types decorator of function 'make_empty': column 'A' not in dataframe
#> Empty DataFrame
#> Columns: []
#> Index: []


@pa.check_types()
def maybe_df() -> Optional[DataFrame[Schema]]:
    return None


maybe_df() # ok

Created on 2020-11-26 by the reprexpy package

@cosmicBboy
Collaborator

I think the allow_empty option at the schema level and the optional option for the object-based API check_* decorators make sense.

For the latter, I'm thinking something like this:

import pandas as pd
import pandera as pa

from typing import Optional


schema = pa.DataFrameSchema({
    "col": pa.Column(int)
})


@pa.check_input(schema, optional=True)
def check_input_transform(df):  # or None
    return df


@pa.check_output(schema, optional=True)
def check_output_transform(df):
    return df  # or None


@pa.check_io(df=schema, out=schema, optional={"df": True, "out": True})
def check_io_transform(df):
    return df  # or None


@pa.check_io(
    df=schema, out=(1, schema), optional={"df": True, "out": {1: True}}
)
def check_tuple_output_transform(df):  # or None
    return "foo", df  # or None


@pa.check_io(
    df=schema, out=("bar", schema), optional={"df": True, "out": {"bar": True}}
)
def check_mapping_output_transform(df):  # or None
    return {
        "foo": 1,
        "bar": df,  # or None
    }
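
The optional argument in the snippet above is a proposal, not an existing pandera parameter. Its intended semantics (skip validation only when the value is None) can be sketched in pure Python. Everything here is hypothetical: check_output_optional stands in for a check_output-style decorator, and require_col is a toy validator standing in for schema.validate.

```python
import functools
from typing import Any, Callable


def check_output_optional(validate: Callable[[Any], Any], optional: bool = False):
    """Hypothetical stand-in for a check_output-style decorator."""

    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            out = fn(*args, **kwargs)
            # With optional=True a None result bypasses validation entirely.
            if out is None and optional:
                return None
            return validate(out)

        return wrapper

    return decorator


def require_col(obj):
    # Toy validator: the output must be a mapping with a "col" key.
    if not isinstance(obj, dict) or "col" not in obj:
        raise ValueError("column 'col' not in output")
    return obj


@check_output_optional(require_col, optional=True)
def maybe_rows(flag):
    return {"col": [1]} if flag else None


@check_output_optional(require_col)
def strict_rows():
    return None


print(maybe_rows(True))   # validated and returned
print(maybe_rows(False))  # None passes through without validation
```

Note that under these semantics an empty but non-None output would still be validated, which is why the thread treats allow_empty as a separate option.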

> This would be a good solution but if the output is just an empty dataframe with no column name, it will still fail.

@ClaireGouze can you provide example code for your use case? I'm trying to wrap my head around the case where a function returns an empty dataframe with no columns, in which case my intuition is that the function should return None instead of pd.DataFrame()

@cosmicBboy
Collaborator

going to work on this after 0.6.0 release, should be out by next week

@ndepaola

ndepaola commented Mar 1, 2023

what's the status of this issue? at my work, we have a data manipulation function which returns a dataframe that should follow a schema, and we use check_types to validate the dataframe against the schema - however, the validator fails when the dataframe is empty (an empty dataframe is a valid output from the function). a column that's typically typed as float gets the pandas dtype object when the dataframe is empty. we can work around this in the short-term by coercing the type on that column, but this will continue to cause issues for us going forward.

@einarjohnson

einarjohnson commented Jan 16, 2024

+1 on this, also facing this issue when empty dataframes are being used. is the suggested solution in the current version of pandera to use the required keyword and set it to False for every column? https://pandera.readthedocs.io/en/stable/dataframe_schemas.html#required

@arkban

arkban commented Jul 19, 2024

I ran into this as well.

It's an extra confusing error because the error message implies that the table is not empty, at least the error I get:

expected series 'xxx' to have type datetime64[ns], got object
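
A likely explanation for the "got object" part, assuming the empty frame was built without explicit dtypes: pandas has no values to infer a type from, so the column falls back to object. The sketch below uses only pandas, with the column name 'xxx' taken from the message above, and shows one way to build an empty frame that already carries the expected dtype.

```python
import pandas as pd

# A frame declared with column names only: nothing to infer a dtype from.
no_rows = pd.DataFrame(columns=["xxx"])

# An empty frame whose column is created with an explicit dtype.
typed = pd.DataFrame({"xxx": pd.Series(dtype="datetime64[ns]")})

print(no_rows.dtypes)
print(typed.dtypes)
```

Returning a typed empty frame like this from the producing function is a workaround on the caller's side, not a change to how the schema validates.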

6 participants