check_output : option to turn down checks if output is empty #332

ClaireGouze · 2020-11-24T20:20:33Z

I'm using the check_output function to check the columns and datatypes of a DataFrameSchema. My function's output can sometimes be an empty dataframe, which raises a SchemaError, though I would want no error in that case.

Would it be possible to have an option in the check_output function so that no error is raised if the output is empty? Or in the DataFrameSchema?

Thank you!

cosmicBboy · 2020-11-25T04:59:23Z

thanks for submitting this feature request @ClaireGouze!

I think this use case should be supported, and here are a few potential solutions:

1. add an allow_empty property to the DataFrameSchema and SeriesSchema initializers, such that empty dataframes can pass through without raising a SchemaError. This is nice because it would then cover the check_input case as well.
2. add an optional option to the check_* decorators, resulting in the same behavior.

I'm leaning toward (1), mainly because (2) sort of conflicts with the semantics of Optional[<TYPE>] in the typing module, which implies that the value can be either None or the <TYPE> specified. allow_empty, on the other hand, would hold a pandas-specific meaning, which is conceptually cleaner than overloading the "optional" terminology.

Let me know what you think!

jeffzi · 2020-11-25T10:03:52Z

> I'm using the check_output function to check column & datatypes of the DataFrameSchema

If you don't have explicit checks, i.e. just checking column names and types, you could set coerce=True. Obviously, whether that's acceptable depends on your project.

import pandera as pa
import pandas as pd


schema = pa.DataFrameSchema({"A": pa.Column(int)})


@pa.check_output(schema)
def make_empty() -> pd.DataFrame:
    return pd.DataFrame({"A": []})


try:
    make_empty()  # fails
except pa.errors.SchemaError as ex:
    print(ex)
#> error in check_output decorator of function 'make_empty': expected series 'A' to have type int64, got float64

schema_coerced = pa.DataFrameSchema({"A": pa.Column(int)}, coerce=True)


@pa.check_output(schema_coerced)
def make_empty_coerced() -> pd.DataFrame:
    return pd.DataFrame({"A": []})


make_empty_coerced()  # ok
#> Empty DataFrame
#> Columns: [A]
#> Index: []

Created on 2020-11-25 by the reprexpy package

If the DataFrame is empty, we can only validate names and types. I think an argument allow_empty should still validate types. Pandera could offer a helper method DataFrameSchema.coerce_dtypes() to let the user coerce locally when the DataFrame is empty. That way coerce can be kept to False globally if that's desirable.
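The local coercion idea can be sketched with plain pandas. `coerce_if_empty` and `EXPECTED_DTYPES` below are hypothetical names, not pandera API. The sketch assumes the expected dtypes are known up front and only casts when the frame has no rows, so non-empty data is still validated as-is.

```python
import pandas as pd

# Hypothetical mapping of column name -> expected dtype (an assumption for this sketch).
EXPECTED_DTYPES = {"A": "int64"}


def coerce_if_empty(df: pd.DataFrame, dtypes: dict) -> pd.DataFrame:
    """Cast columns to the expected dtypes, but only when the frame has no rows."""
    if len(df) > 0:
        return df  # leave non-empty data untouched so real type errors still surface
    # Only cast columns that are present; a missing column is left for validation to report.
    present = {col: dtype for col, dtype in dtypes.items() if col in df.columns}
    if not present:
        return df
    return df.astype(present)


empty = coerce_if_empty(pd.DataFrame({"A": []}), EXPECTED_DTYPES)
print(empty.dtypes)
```

With something like this, coerce could stay False on the schema itself, and the cast would apply only to the empty case.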

Regarding solution 2, one problem is that you would allow empty DataFrames locally, but later validations could fail if optional=True was not also set further down the pipeline. Moreover, DataFrameSchema.validate() would also need an optional argument if we want to keep a 1:1 mapping with decorator functionalities.

cosmicBboy · 2020-11-25T14:28:58Z

> I think an argument allow_empty should still validate types.

👍

ClaireGouze · 2020-11-26T08:56:38Z

Thanks for your reply, I think solution 1 that you mentioned would be suitable.

> If you don't have explicit checks, i.e. just checking column names and types, you could set coerce=True. Obviously, whether that's acceptable depends on your project.

This would be a good solution but if the output is just an empty dataframe with no column name, it will still fail.

jeffzi · 2020-11-26T11:01:19Z

What you are asking for is actually to completely disable validation.

I propose to introduce both arguments:

1. Argument allow_empty for DataFrameSchema/SeriesSchema, which still checks names and types on empty DataFrames. Example use cases are dry runs or reading from a source that can be empty. The semantics are that we processed the data successfully but the output is empty.
2. Argument optional for all check decorators, which disables validation when passed a None object. That behavior would be aligned with typing.Optional. The semantics are slightly different from 1: it would signal that we could not process the DataFrame, but that this is within expectations, so we do not want to raise an error.

SchemaModel coupled with the decorator check_types already implements 2.

import pandera as pa
from pandera.typing import Series, DataFrame
import pandas as pd
from typing import Optional


class Schema(pa.SchemaModel):
    A: Series[int]


@pa.check_types()
def make_empty() -> Optional[DataFrame[Schema]]:
    return pd.DataFrame()


try:
    make_empty()  # fails
except pa.errors.SchemaError as ex:
    print(ex)
#> error in check_types decorator of function 'make_empty': column 'A' not in dataframe
#> Empty DataFrame
#> Columns: []
#> Index: []


@pa.check_types()
def maybe_df() -> Optional[DataFrame[Schema]]:
    return None


maybe_df() # ok

Created on 2020-11-26 by the reprexpy package

cosmicBboy · 2020-11-26T15:41:38Z

I think the allow_empty option at the schema level and an optional option for the object-based API check_* decorators make sense.

For the latter, I'm thinking something like this:

import pandas as pd
import pandera as pa

from typing import Optional


schema = pa.DataFrameSchema({
    "col": pa.Column(int)
})


@pa.check_input(schema, optional=True)
def check_input_transform(df):  # or None
    return df


@pa.check_output(schema, optional=True)
def check_output_transform(df):
    return df  # or None


@pa.check_io(df=schema, out=schema, optional={"df": True, "out": True})
def check_io_transform(df):  # or None
    return df  # or None


@pa.check_io(
    df=schema, out=(1, schema), optional={"df": True, "out": {1: True}}
)
def check_tuple_output_transform(df):  # or None
    return "foo", df  # or None


@pa.check_io(
    df=schema, out=("bar", schema), optional={"df": True, "out": {"bar": True}}
)
def check_mapping_output_transform(df):  # or None
    return {
        "foo": 1,
        "bar": df,  # or None
    }
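The optional=True argument in the proposal above is not an existing pandera argument. As a rough sketch of the intended semantics only, the hypothetical decorator below (`check_output_optional`, with a stand-in `require_col` validator instead of a real schema) skips validation when the wrapped function returns None and validates everything else, including empty results.

```python
import functools
from typing import Any, Callable, Optional


def check_output_optional(validate: Callable[[Any], Any], optional: bool = False):
    """Hypothetical decorator: validate a function's output, optionally letting None through."""

    def decorator(fn: Callable[..., Optional[Any]]):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            out = fn(*args, **kwargs)
            if optional and out is None:
                return None  # mirrors typing.Optional: None is an accepted value
            return validate(out)  # anything else, even an empty result, is validated

        return wrapper

    return decorator


def require_col(rows: list) -> list:
    """Stand-in for schema validation: every row must have an integer 'col'."""
    for row in rows:
        if not isinstance(row.get("col"), int):
            raise ValueError(f"invalid row: {row!r}")
    return rows


@check_output_optional(require_col, optional=True)
def transform(rows):
    return rows  # or None


print(transform(None))          # validation is skipped for None
print(transform([{"col": 1}]))  # validated, then returned unchanged
```

Keeping None as the only value that bypasses validation leaves the empty-but-valid case to a schema-level option such as the proposed allow_empty.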

> This would be a good solution but if the output is just an empty dataframe with no column name, it will still fail.

@ClaireGouze can you provide example code for your use case? I'm trying to wrap my head around the case where a function returns an empty dataframe with no columns, in which case my intuition is that the function should return None instead of pd.DataFrame()

cosmicBboy · 2020-12-03T01:11:07Z

going to work on this after 0.6.0 release, should be out by next week

ndepaola · 2023-03-01T05:12:42Z

what's the status of this issue? at my work, we have a data manipulation function which returns a dataframe that should follow a schema, and we use check_types to validate the dataframe against the schema - however, the validator fails when the dataframe is empty (an empty dataframe is a valid output from the function). a column that's typically typed as float gets the pandas dtype object when the dataframe is empty. we can work around this in the short-term by coercing the type on that column, but this will continue to cause issues for us going forward.
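One way to sidestep the dtype drift described here, assuming the function controls how it builds its result, is to create the empty frame with explicit dtypes instead of relying on inference. The helper name `empty_frame` and the columns `price` and `ts` are made up for illustration.

```python
import pandas as pd

# Illustrative schema: column name -> dtype (names are assumptions, not from the issue).
DTYPES = {"price": "float64", "ts": "datetime64[ns]"}


def empty_frame(dtypes: dict) -> pd.DataFrame:
    """Build a zero-row DataFrame whose columns already carry the expected dtypes."""
    return pd.DataFrame({col: pd.Series(dtype=dtype) for col, dtype in dtypes.items()})


df = empty_frame(DTYPES)
print(df.dtypes)
```

A frame built this way should satisfy dtype checks without coerce=True, since each column keeps its declared dtype even with zero rows.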

einarjohnson · 2024-01-16T07:48:16Z

+1 on this, also facing this issue when empty dataframes are being used. is the suggested solution in the current version of pandera to use the required keyword and set required=False on every column? https://pandera.readthedocs.io/en/stable/dataframe_schemas.html#required

arkban · 2024-07-19T16:08:14Z

I ran into this as well.

It's an extra confusing error because the error message implies that the table is not empty, at least the error I get:

expected series 'xxx' to have type datetime64[ns], got object

ClaireGouze added the enhancement New feature or request label Nov 24, 2020

cosmicBboy added the help wanted Extra attention is needed label Dec 24, 2020

cosmicBboy added this to the 0.7.0 Release milestone Jan 12, 2021

cosmicBboy mentioned this issue Jan 12, 2021

Support built-in dataframe checks and dataframe-level options #383

Closed

cosmicBboy self-assigned this Jan 24, 2021

cosmicBboy modified the milestones: 0.7.0 Release, 0.8.0 release Jul 2, 2021
