check_output : option to turn down checks if output is empty #332
Comments
Thanks for submitting this feature request @ClaireGouze! I think this use case should be supported, and here are a few potential solutions:
I'm leaning toward (1), mainly because (2) sort of conflicts with the semantics of Let me know what you think! |
If you don't have explicit checks, i.e. just checking column names and types, you could set

import pandera as pa
import pandas as pd

schema = pa.DataFrameSchema({"A": pa.Column(int)})

@pa.check_output(schema)
def make_empty() -> pd.DataFrame:
    return pd.DataFrame({"A": []})

try:
    make_empty()  # fails
except pa.errors.SchemaError as ex:
    print(ex)
#> error in check_output decorator of function 'make_empty': expected series 'A' to have type int64, got float64

schema_coerced = pa.DataFrameSchema({"A": pa.Column(int)}, coerce=True)

@pa.check_output(schema_coerced)
def make_empty_coerced() -> pd.DataFrame:
    return pd.DataFrame({"A": []})

make_empty_coerced()  # ok
#> Empty DataFrame
#> Columns: [A]
#> Index: []

Created on 2020-11-25 by the reprexpy package

If the DataFrame is empty, we can only validate names and types. I think an argument
Regarding solution 2., one problem is that you would allow empty DataFrames locally but later validations could fail if |
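The dtype error in the example above comes from the empty column itself: a column built from `[]` is not inferred as an integer dtype. As a minimal sketch using plain pandas only (no pandera involved, and the name `empty` is just for illustration), an empty frame can be built with the expected dtype up front, so a names-and-types check has something correct to look at:

```python
import pandas as pd

# Build an empty column with an explicit integer dtype instead of
# relying on dtype inference from an empty list.
empty = pd.DataFrame({"A": pd.Series([], dtype="int64")})

print(empty["A"].dtype)
print(len(empty))
```

This only helps when the function producing the frame knows its column names and dtypes; it does not cover a completely column-less `pd.DataFrame()`.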
👍 |
Thanks for your reply, I think solution #1 you mentioned would be suitable.
This would be a good solution but if the output is just an empty dataframe with no column name, it will still fail. |
What you are asking for is actually to completely disable validation. I propose to introduce both arguments:
SchemaModel coupled with the decorator check_types already implements 2.

import pandera as pa
from pandera.typing import Series, DataFrame
import pandas as pd
from typing import Optional

class Schema(pa.SchemaModel):
    A: Series[int]

@pa.check_types()
def make_empty() -> Optional[DataFrame[Schema]]:
    return pd.DataFrame()

try:
    make_empty()  # fails
except pa.errors.SchemaError as ex:
    print(ex)
#> error in check_types decorator of function 'make_empty': column 'A' not in dataframe
#> Empty DataFrame
#> Columns: []
#> Index: []

@pa.check_types()
def maybe_df() -> Optional[DataFrame[Schema]]:
    return None

maybe_df()  # ok

Created on 2020-11-26 by the reprexpy package |
I think the For the latter, I'm thinking something like this:

import pandas as pd
import pandera as pa
from typing import Optional

schema = pa.DataFrameSchema({
    "col": pa.Column(int)
})

@pa.check_input(schema, optional=True)
def check_input_transform(df):  # or None
    return df

@pa.check_output(schema, optional=True)
def check_output_transform(df):
    return df  # or None

@pa.check_io(df=schema, out=schema, optional={"df": True, "out": True})
def check_output_transform(df):
    return df  # or None

@pa.check_io(
    df=schema, out=(1, schema), optional={"df": True, "out": {1: True}}
)
def check_tuple_output_transform(df):  # or None
    return "foo", df  # or None

@pa.check_io(
    df=schema, out=("bar", schema), optional={"df": True, "out": {"bar": True}}
)
def check_mapping_output_transform(df):  # or None
    return {
        "foo": 1,
        "bar": df,  # or None
    }
@ClaireGouze can you provide example code for your use case? I'm trying to wrap my head around the case where a function returns an empty dataframe with no columns, in which case my intuition is that the function should return |
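To make the proposed `optional=True` semantics concrete, here is a minimal sketch in plain Python. It is not pandera's implementation: `check_output_sketch` and `require_col` are hypothetical names, and the validator is a stand-in for something like `schema.validate`. The only behavior it illustrates is "skip validation when the function returns None and `optional` is set".

```python
from functools import wraps
from typing import Any, Callable


def check_output_sketch(validate: Callable[[Any], Any], optional: bool = False):
    """Hypothetical decorator: validate a function's output, optionally allowing None."""

    def decorator(func: Callable) -> Callable:
        @wraps(func)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            out = func(*args, **kwargs)
            # With optional=True, a None result bypasses validation entirely.
            if out is None and optional:
                return None
            return validate(out)

        return wrapper

    return decorator


def require_col(out: Any) -> Any:
    """Stand-in validator: the output must be a dict with a 'col' key."""
    if not isinstance(out, dict) or "col" not in out:
        raise TypeError("output must be a mapping with key 'col'")
    return out


@check_output_sketch(require_col, optional=True)
def maybe_output(flag: bool):
    return {"col": [1, 2]} if flag else None


@check_output_sketch(require_col)
def strict_output():
    return None
```

With these definitions, `maybe_output(False)` returns None without raising, while `strict_output()` raises because None is validated like any other value.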
going to work on this after |
what's the status of this issue? at my work, we have a data manipulation function which returns a dataframe that should follow a schema, and we use |
+1 on this, also facing this issue when empty dataframes are being used. is the suggested solution in the current version of pandera to use the |
I ran into this as well. It's an extra confusing error because the error message implies that the table is not empty, at least the error I get:
|
I'm using the check_output function to check the columns and data types of a DataFrameSchema. My function's output can sometimes be an empty dataframe, which raises a SchemaError, though I would want no error.
Would it be possible to have an option in the check_output function so that no error is raised if the output is empty? Or in the DataFrameSchema?
Thank you !
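As a user-side illustration of the requested behavior, the following is a minimal sketch of a wrapper that skips validation when the output is empty. It is an assumption-laden example, not a pandera feature: `validate_unless_empty` and `require_ints` are hypothetical names, and emptiness is tested with `len(result) == 0`, which for a pandas DataFrame means zero rows.

```python
from functools import wraps
from typing import Any, Callable


def validate_unless_empty(validate: Callable[[Any], Any]) -> Callable:
    """Hypothetical helper: run `validate` on the result only if it is non-empty."""

    def decorator(func: Callable) -> Callable:
        @wraps(func)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            result = func(*args, **kwargs)
            # len() of a DataFrame is its row count, so a frame with
            # zero rows is returned as-is without being validated.
            if len(result) == 0:
                return result
            return validate(result)

        return wrapper

    return decorator


def require_ints(rows: Any) -> Any:
    """Stand-in validator for demonstration; rejects non-integer items."""
    if not all(isinstance(r, int) for r in rows):
        raise TypeError("expected only ints")
    return rows


@validate_unless_empty(require_ints)
def load(rows):
    return rows
```

With pandera, one would presumably pass `schema.validate` in place of `require_ints`. Note the trade-off raised in the comments: an empty result then skips name and type checks too, so later validations may still fail.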