Support for multi type (Unions) in schemas and validation #1152

vianmixtkz · 2023-04-04T14:25:45Z

Is your feature request related to a problem? Please describe.

I would like pandera to support Union types; that is, validation of a Series/Column should allow multiple types.
Pydantic allows it.

Here is an example of my issue:

from typing import Union
import pandas as pd
import pandera as pa
from pandera.typing import Series

class InputSchema(pa.DataFrameModel):
    year: Series[int] = pa.Field(gt=2000, coerce=True)
    month: Series[int] = pa.Field(ge=1, le=12, coerce=True)
    day: Series[int] = pa.Field(ge=0, le=365, coerce=True)
    comment : Series[Union[str, float]] = pa.Field()

class OutputSchema(InputSchema):
    revenue: Series[float]

df = pd.DataFrame({
    "year": ["2001", "2002", "2003"],
    "month": ["3", "6", "12"],
    "day": ["200", "156", "365"],
    "comment":["test", float("nan"), "test"]
})

InputSchema(df)  # raises TypeError: Cannot interpret 'typing.Union[str, float]' as a data type

Describe the solution you'd like

I think not allowing Unions is the desired behavior for now, but could you consider an option to allow them in the future?

Describe alternatives you've considered

Split each Union column into multiple columns, one for each type. However, this is not really something that I can control (see the next section).

Additional context

I have a valid use case for this. I am using pandas to handle CSVs where some columns contain hybrid data types.
I am using pandas for the preprocessing and pydantic for the validation, and I would like to use pandera to make this process (processing + validation) more robust.

johnkangw · 2023-04-06T20:47:53Z

@vianmixtkz Great writeup. This is something that would be great for Pandera to support.

cosmicBboy · 2023-04-06T21:08:33Z

Thanks @vianmixtkz this is an interesting use case: the way pandas handles mixed-type columns is to represent the data in an object dtype column.
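To illustrate the point about object dtype, here is a small pandas-only sketch. The sample values are made up; the Series deliberately mixes a string with a non-missing float so that pandas has to fall back to object dtype.

```python
import pandas as pd

# A column mixing str and float values cannot use a single numeric or string dtype.
mixed = pd.Series(["test", 1.5, float("nan")])

# pandas stores such data in an object dtype column.
print(mixed.dtype)

# Each element keeps its own Python type inside the object column.
value_types = [type(value).__name__ for value in mixed]
print(value_types)
```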

One thing we should clarify in the semantics of this feature is the following: we can interpret Union[str, float] either as:

1. the column is either a str column or a float column
2. the column is an object column that contains either str or float values

Do we need special syntax to differentiate between these two cases, or is that something that we leave to the pandera type engine to handle? I.e.:

if the column is str dtype, then pass
if the column is float dtype, then pass
if the column is object data type, check that values are str or float. If so, then pass.
fail if none of the above conditions are met.
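The four steps above can be sketched in plain pandas as follows. This is not pandera's type engine, only an illustration of the proposed semantics for Union[str, float]; the helper name `union_column_ok` is hypothetical, and "str dtype" is assumed here to mean the pandas string extension dtype.

```python
import pandas as pd


def union_column_ok(series: pd.Series) -> bool:
    """Hypothetical helper mirroring the proposed Union[str, float] semantics."""
    dtype = series.dtype
    # Step 1: a dedicated string dtype passes.
    if isinstance(dtype, pd.StringDtype):
        return True
    # Step 2: a float dtype passes.
    if pd.api.types.is_float_dtype(dtype):
        return True
    # Step 3: an object column passes only if every value is a str or a float.
    if pd.api.types.is_object_dtype(dtype):
        return all(isinstance(value, (str, float)) for value in series)
    # Step 4: anything else fails.
    return False


print(union_column_ok(pd.Series(["a", 1.5], dtype=object)))
print(union_column_ok(pd.Series([1, 2])))
```

Note that NaN is itself a float, so an object column of strings with missing values passes step 3 under this sketch.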

vianmixtkz · 2023-04-06T21:40:14Z

What I described matches case 2. That is, in a given column, I'll have for example str on some rows and floats on other rows.
But it would be nice to support both cases anyway.

With something like:

Case 1

class InputSchema(pa.DataFrameModel):
    year: Series[int] = pa.Field(gt=2000, coerce=True)
    month: Series[int] = pa.Field(ge=1, le=12, coerce=True)
    day: Series[int] = pa.Field(ge=0, le=365, coerce=True)
    comment : Union[Series[str], Series[float]] = pa.Field() # comment is either only str or only float in a given DataFrame

Case 2

class InputSchema(pa.DataFrameModel):
    year: Series[int] = pa.Field(gt=2000, coerce=True)
    month: Series[int] = pa.Field(ge=1, le=12, coerce=True)
    day: Series[int] = pa.Field(ge=0, le=365, coerce=True)
    comment : Series[Union[str,float]] = pa.Field() # comment is a column containing str on some rows and float on other rows

And yeah, I think the behavior you are describing is what users would expect

Do we need special syntax to differentiate between these two cases, or is that something that we leave to the pandera type engine to handle? I.e.:

if the column is str dtype, then pass # passes in case 1 and 2
if the column is float dtype, then pass # passes in case 1 and 2
if the column is object data type, check that values are str or float. If so, then pass. # passes only in case 2
fail if none of the above conditions are met.

fix: unionai-oss#1152 I would like pandera to support Union Type. That is the validation of a Series/Column should allow multiple types. 1. Add a new PythonUnion type. 2. Add a new test for the new UnionType. Signed-off-by: karajan1001 <mishanyo1001@gmail.com>

aaravind100 · 2023-10-05T12:17:17Z

Just bumping this thread.

Any consensus on how to proceed? It seems like #1227 is stale.

cosmicBboy · 2024-03-30T15:56:22Z

Revisiting this issue and thinking about it a little bit, here's another proposal for this issue:

from typing import Annotated, Union

import pandera as pa
from pandera.engines.pandas_engine import Object

class Model(pa.DataFrameModel):
    union_column : Union[str, float]  # the column data type must be either a str or float

    object_column: Object = pa.Field(dtype_kwargs={"allowable_types": [str, float]})
    # or use the annotated types
    object_column: Annotated[Object, [str, float]]

This syntax is less ambiguous as to what the actual type of the column is vs. the values within it are. However, it does require importing a special Object type.

I'm still open to the more ambiguous behavior where Union[str, float] would cover all of these cases though. Open to further discussion on this!

cosmicBboy · 2024-03-30T16:38:23Z

Re: this proposal: #1152 (comment)

Unfortunately, col: Series[TYPE] and col: TYPE are equivalent in a DataFrameModel, so Union[Series[str], Series[float]] and Series[Union[str, float]] would effectively be equivalent too. Distinguishing them would introduce more complexity to the handling of types in DataFrameModel, which I don't think would be worth it.

aaravind100 · 2024-04-01T17:09:52Z

I'm not a fan of the Union[Series[str], Series[float]] case from this comment, where the series would consist of only strings or only floats. It's very ambiguous: the output would change depending on what data you pass. These could very well be their own distinct schemas.

Series[Union[str, float]] or Union[str, float] or str | float # python 3.10+, where the output could be either string or float. This case is more consistent.

erinboyle · 2024-08-23T22:41:54Z

I am pretty sure I need this as well. I'm trying to create a DataFrameModel that expects a timestamp in the index. My ideal would be to validate that it has a timezone, but not specify which timezone. Another acceptable type hint would be to accept any timestamp, tz-naive or tz-aware, and then add custom checks around timezone manually. But right now, if I use Index[pd.DatetimeTZDtype], a specific timezone must be provided, and pd.Timestamp fails if any timezone is provided. I'd love to hint a union of the two.
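The "any timezone, but some timezone" requirement can be expressed in plain pandas, which could then back a custom check. This is a sketch only: both helper names are hypothetical, and wiring them into a DataFrameModel is left out.

```python
from datetime import timedelta, timezone

import pandas as pd


def index_is_tz_aware(index: pd.Index) -> bool:
    """Hypothetical helper: True for a datetime index with any timezone attached."""
    return isinstance(index.dtype, pd.DatetimeTZDtype)


def index_is_any_timestamp(index: pd.Index) -> bool:
    """Hypothetical helper: True for tz-naive and tz-aware datetime indexes alike."""
    return pd.api.types.is_datetime64_any_dtype(index.dtype)


naive = pd.date_range("2024-01-01", periods=3)
utc = naive.tz_localize("UTC")
plus_two = naive.tz_localize(timezone(timedelta(hours=2)))

# The tz-aware check passes for either timezone without naming one.
print(index_is_tz_aware(utc), index_is_tz_aware(plus_two), index_is_tz_aware(naive))
```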

westaustint · 2024-11-26T14:39:36Z

I'd propose something that isn't exactly just checking for valid type(s): if a value doesn't match the declared data type, there would be a set of values that is still acceptable.

For example: a float column whose fields can also contain "NA".

I can't think of a personal use case for allowing all strings, but there are cases where data wasn't provided or was invalid. I still want to check all the values in the column, but I don't want to have to edit all the strings to null or 0.
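One way to read this proposal is "float, or one of a small set of sentinel strings". A pandas-only sketch of that rule follows; the helper name `float_or_sentinel` and the default sentinel set are assumptions made for illustration, not an existing pandera feature.

```python
import pandas as pd


def float_or_sentinel(series: pd.Series, sentinels=frozenset({"NA"})) -> bool:
    """Hypothetical helper: every value is a float or an allowed sentinel string."""
    return all(
        isinstance(value, float)
        or (isinstance(value, str) and value in sentinels)
        for value in series
    )


# Floats mixed with the allowed "NA" marker are accepted.
print(float_or_sentinel(pd.Series([1.5, "NA", 2.0], dtype=object)))

# Any other string is rejected, so the rest of the column is still checked.
print(float_or_sentinel(pd.Series([1.5, "n/a"], dtype=object)))
```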

vianmixtkz added the enhancement (New feature or request) label · Apr 4, 2023

karajan1001 linked a pull request that will close this issue · Jun 20, 2023: Support for Unions in schemas and validation #1227 (Open)

robertdj mentioned this issue · Mar 28, 2024: Validating datetime columns regardless of timezone #1543 (Closed)

erinboyle mentioned this issue · Aug 23, 2024: Pandera timezone-agnostic datetime type #1352 (Open)
