Decouple pandera and pandas type systems #369
I'd like to suggest another use case for a refactor of pandera dtypes. I use pandera to validate pandas DataFrames that are ultimately written as parquet files via pyarrow. Parquet supports date and decimal types, which are not natively supported by pandas but can be stored in object columns.

Example for date:

```python
from typing import Optional
import pandas as pd
import pandera as pa
class DateColumn(pa.Column):
"""Column containing date values encoded as date types (not datetime)."""
def __init__(
self,
checks: pa.schemas.CheckList = None,
nullable: bool = False,
allow_duplicates: bool = True,
coerce: bool = False,
required: bool = True,
name: Optional[str] = None,
regex: bool = False,
) -> None:
super().__init__(
            pa.Object,  # <=== object dtype, so the column can hold datetime.date values
checks=checks,
nullable=nullable,
allow_duplicates=allow_duplicates,
coerce=coerce,
required=required,
name=name,
regex=regex,
)
def coerce_dtype(self, series: pd.Series) -> pd.Series:
"""Coerce a pandas.Series to date types."""
try:
dttms = pd.to_datetime(series, infer_datetime_format=True, utc=True)
except TypeError as err:
            msg = f"Error while coercing '{self.name}' to type 'date'"
raise TypeError(msg) from err
return dttms.dt.date
schema = pa.DataFrameSchema(columns={"dt": DateColumn(coerce=True)})
df = pd.DataFrame({'dt':pd.date_range("2021-01-01", periods=1, freq="H")})
print(df)
#> dt
#> 0 2021-01-01
print(df["dt"].dtype)
#> datetime64[ns]
df = schema.validate(df)
print(df)
#> dt
#> 0 2021-01-01
print(df["dt"].dtype)
#> object
```

Created on 2021-01-07 by the reprexpy package

The issue with the above is that custom columns are not compatible with …. I suggest to move the …. #376 already lists a couple of solutions to pass arguments to dtypes in ….
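As a rough illustration of why the object-encoded dates matter for this use case (this sketch is not part of the original report, and pyarrow's exact inferred type may vary by version), pyarrow picks up Python date objects and stores them as a proper parquet date type:

```python
import datetime

import pandas as pd
import pyarrow
import pyarrow.parquet

# a column of Python date objects is held in a pandas "object" column
df = pd.DataFrame({"dt": [datetime.date(2021, 1, 1)]})
table = pyarrow.Table.from_pandas(df)
print(table.schema)  # pyarrow typically infers dt: date32[day]
pyarrow.parquet.write_table(table, "dates.parquet")  # written as a parquet date type
```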
Cool, thanks for describing this use case!
👍 to sketch out some ideas for the dtype class:

```python
import pandas as pd
from enum import Enum
# abstract class spec, should support types for other dataframe-like data structures
# e.g. spark dataframes, dask, ray, vaex, xarray, etc.
class DataType:
def __call__(self, obj): # obj should be an arbitrary object
"""Coerces object to the dtype."""
raise NotImplementedError
def __eq__(self, other):
# some default hash implementation
pass
def __hash__(self):
# some default hash implementation
pass
class PandasDataType(DataType):
def __init__(self, str_alias):
self.str_alias = str_alias
def __call__(self, obj):
# obj should be a pandas DataFrame, Series, or Index
return obj.astype(self.str_alias)
# re-implementation of dtypes.PandasDtype, which is currently an Enum class,
# preserving the PandasDtype class for backwards compatibility
class PandasDtype:
# See the pandas dtypes docs for more information:
# https://pandas.pydata.org/pandas-docs/stable/user_guide/basics.html#basics-dtypes
Bool = PandasDataType("bool")
DateTime = PandasDataType("datetime64[ns]")
Timedelta = PandasDataType("timedelta64[ns]")
# etc. for the rest of the data-types that don't need additional arguments
# use static methods for datatypes with additional arguments
@staticmethod
def DatetimeTZ(tz="utc"):
return PandasDataType(f"datetime64[ns, <{tz}>]")
@staticmethod
def Period(freq):
pass
@staticmethod
def Interval(numpy_dtype=None, tz=None, freq=None):
pass
@staticmethod
def Categorical(categories=None, ordered=False):
        pass
```
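A minimal, self-contained usage sketch of the idea above (a stand-in class, not pandera code): the dtype object doubles as a coercion callable, so coercion can simply call it.

```python
import pandas as pd

class PandasDataType:
    """Stand-in for the sketch above: a dtype that knows how to coerce pandas objects."""

    def __init__(self, str_alias):
        self.str_alias = str_alias

    def __call__(self, obj):
        # obj is a pandas DataFrame, Series, or Index
        return obj.astype(self.str_alias)

Bool = PandasDataType("bool")
s = pd.Series([0, 1, 1])
print(Bool(s).dtype)  # bool -- coercion is just obj.astype("bool")
```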
The point would be to minimize the impact on the implementation, wouldn't it? The Enum is not mentioned in the documentation (besides the API section). All examples exclusively use the aliases such as ….

This is what I have in mind:

```python
class PandasDataType(DataType):  # can be renamed to PandasDtype if it facilitates implementation
...
Bool = PandasDataType("bool")
... # other straightforward dtypes
class DatetimeTZ(PandasDataType):
def __init__(self, tz="utc"):
        super().__init__(f"datetime64[ns, <{tz}>]")
self.tz = tz # in case we need it for other methods
class Datetime(PandasDataType):
# args forwarded to pd.to_datetime, used for better coercion (if coerce=True)
def __init__(self, dayfirst=False, yearfirst=False, utc=None, format=None, exact=True, unit=None, ...):
...
def __call__(self, obj):
        return pd.to_datetime(obj, dayfirst=self.dayfirst, ...)
```
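A self-contained sketch of how such a parametrized Datetime could coerce (stand-in code, assuming the constructor arguments are simply forwarded to pd.to_datetime):

```python
import pandas as pd

class Datetime:
    """Stand-in for the Datetime dtype above: constructor args drive pd.to_datetime coercion."""

    def __init__(self, dayfirst=False, yearfirst=False, utc=None, format=None):
        self.dayfirst = dayfirst
        self.yearfirst = yearfirst
        self.utc = utc
        self.format = format

    def __call__(self, obj):
        return pd.to_datetime(
            obj, dayfirst=self.dayfirst, yearfirst=self.yearfirst, utc=self.utc, format=self.format
        )

dt = Datetime(dayfirst=True, utc=True)
print(dt(pd.Series(["01/02/2021"])).dtype)  # datetime64[ns, UTC], parsed as 2021-02-01
```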
I'm all for simplification. I think I didn't take into account the fact that most users probably never use …. Another place …. Getting away from the ….
hey @jeffzi, just wanted to bring your attention to the …. It's written by the same people who wrote ….
Thanks @cosmicBboy, I did not know about …. Before presenting my evaluation of ….
Now, regarding ….
I think I would keep the following ideas: …
Just adding my thoughts here @jeffzi for the record
Agreed! Let's go with our own class hierarchy, something like what we've discussed in #369 (comment) and #369 (comment)
Would love this, we can tackle these once the major refactor for existing dtypes is done. (would also love a …)
+1 to this idea, we can also cross that bridge when we get there. Another thought I had was having it as a class attribute:

```python
class DataFrameSchema():
    allowed_dtypes = [...]
```

Let me know if you need any help with discussing approach/architecture!
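A purely hypothetical sketch of how the allowed_dtypes class attribute above could be used (names and behavior are assumptions, not an agreed design):

```python
class DataFrameSchema:
    # hypothetical: subclasses or engine-specific schemas could narrow this list
    allowed_dtypes = ["int64", "float64", "bool", "datetime64[ns]", "object"]

    def __init__(self, columns):
        for name, dtype in columns.items():
            if dtype not in self.allowed_dtypes:
                raise TypeError(f"dtype {dtype!r} is not supported for column {name!r}")
        self.columns = columns

schema = DataFrameSchema({"x": "int64"})   # accepted
# DataFrameSchema({"y": "decimal"})        # would raise TypeError in this sketch
```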
I'm ready to share a proposal for the dtype refactor. I iterated several times on the design and I now have a good base for the discussion. Full draft implementation is in this gist. Here are the main ideas:
```python
@dataclass(frozen=True)
class Category(DataType):
categories: Tuple[Any] = None # immutable sequence to ensure safe hash
ordered: bool = False
def __post_init__(self) -> "Category":
categories = tuple(self.categories) if self.categories is not None else None
# bypass frozen dataclass
# see https://docs.python.org/3/library/dataclasses.html#frozen-instances
object.__setattr__(self, "categories", categories)
@dataclass(frozen=True)
class PandasDtype: # Generic dtype in case the user supplies an unknown dtype.
native_dtype: Any = None # get pandas native dtype (useful for strategy module)
def coerce(self, obj: PandasObject) -> PandasObject:
return obj.astype(self.native_dtype)
@PandasBackend.register( # conversion for default Category
Category, Category(), # pandera.Category
pd.CategoricalDtype, pd.CategoricalDtype()
)
@dataclass(frozen=True)
class PandasCategory(PandasDtype, Category):
def __post_init__(self) -> "PandasDtype":
super().__post_init__()
object.__setattr__(
self, "native_dtype", pd.CategoricalDtype(self.categories, self.ordered)
)
# conversion for category instance with non-default arguments
@PandasBackend.register(Category, pd.CategoricalDtype)
def _to_pandas_category(cat: pd.CategoricalDtype):
return PandasCategory(cat.categories, cat.ordered)
assert (
PandasBackend.dtype(Category) # by value
== PandasBackend.dtype(Category()) # by value
== PandasBackend.dtype(pd.CategoricalDtype) # by value
== PandasBackend.dtype(pd.CategoricalDtype()) # by value
== PandasCategory()
)
assert (
PandasBackend.dtype(pd.CategoricalDtype(["a", "b"], ordered=True)) # by type
== PandasBackend.dtype(Category(["a", "b"], ordered=True)) # by type
== PandasCategory(["a", "b"], ordered=True)
)
```

The design avoids endless if-elses because each dtype is self-contained. Hopefully that's not over-engineered. Let's discuss whether we can simplify or identify loopholes before we move on to implementing all dtypes and integrating it into pandera.
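As a side note on the __post_init__ pattern used in the proposal above, here is a minimal, self-contained sketch (not the gist code) of why object.__setattr__ is needed to normalize fields on a frozen dataclass while keeping it hashable:

```python
from dataclasses import dataclass
from typing import Any, Optional, Tuple

@dataclass(frozen=True)
class Category:
    categories: Optional[Tuple[Any, ...]] = None
    ordered: bool = False

    def __post_init__(self):
        # normal assignment raises FrozenInstanceError on a frozen dataclass,
        # so normalization has to go through object.__setattr__
        if self.categories is not None:
            object.__setattr__(self, "categories", tuple(self.categories))

print(Category(["a", "b"]))        # Category(categories=('a', 'b'), ordered=False)
print(hash(Category(("a", "b"))))  # hashable thanks to frozen=True + tuple categories
```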
great design work @jeffzi! will read this and the gist over and chew on it for a few days.
Hi, I'm highly interested in the usage of ….
@jeffzi the implementation looks good to me overall! I'm having a hard time grokking …. I wonder if we can abstract out …. Also have a few questions:
hey @ryanhaarmann, thanks that would be awesome! we'd appreciate your thoughts on this issue, but also a closely related one: #381. Namely, would it be enough to leverage a library like koalas as a validation backend engine to perform validations on spark dataframes, or would you want access to the pyspark API when e.g. defining custom validation functions? The benefit of supporting pandas-like API wrappers like koalas or modin is that pandera itself can leverage those libraries to validate at scale and reduce the complexity of supporting alternative APIs. As you can see from the description and initial thoughts in #381, supporting a different validation engine (i.e. non-pandas) will require a fair bit of design/implementation work, but may be worth it in the end.
It replicates the property PandasDtype.numpy_dtype. The pandas implementation will give back numpy or pandas dtypes, PySpark would give Spark types, etc. Currently, ….
I agree the code is confusing. In my mind, we have 2 kinds of inputs we want to accept for generating dtypes.
Another confusing part is that I wrapped those 2 mechanisms in a single decorator that automatically chooses the registration method. The idea was to hide the complexity. I agree it's too obscure, see end of this post for a solution.
I said "by value" for ….
That's because the function is forwarded to singledispatch, which would then register "self" as the dispatch type.
Agreed. We can also rename the class decorator to ….
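To make the two registration paths described above concrete, here is a simplified, hypothetical sketch (not the gist code): hashable classes and default instances live in a lookup dict of equivalents, while parametrized instances fall back to a singledispatch conversion.

```python
from functools import singledispatch

import pandas as pd

_equivalents = {}  # registered classes / default instances -> known dtype

@singledispatch
def _from_parametrized(obj):
    raise TypeError(f"no dtype conversion registered for {obj!r}")

def register(key, target):
    """Register a class or default instance as equivalent to a known dtype."""
    _equivalents[key] = target

def dtype(obj):
    """Resolve a user-supplied dtype specification."""
    try:
        return _equivalents[obj]        # direct lookup ("by value")
    except (KeyError, TypeError):       # TypeError if obj is unhashable
        return _from_parametrized(obj)  # type-based dispatch for parametrized instances

# parametrized pandas categoricals get converted by a registered function
@_from_parametrized.register(pd.CategoricalDtype)
def _(cat):
    return ("Category", tuple(cat.categories), cat.ordered)  # stand-in for PandasCategory(...)

register("category", ("Category", None, False))              # default, unparametrized alias
print(dtype("category"))
print(dtype(pd.CategoricalDtype(["a", "b"], ordered=True)))
```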
Quick update.
Examples:

```python
@PandasEngine.register_dtype(
akin=[pandera.dtype.Int64, pandera.dtype.Int64(), "int64", np.int64]
)
class PandasInt64(pandera.dtype.Int64, _PandasInt):
nullable: bool = False
@PandasEngine.register_dtype(akin=[pandera.dtype.Category, pd.CategoricalDtype])
class PandasCategory(PandasDtype, pandera.dtype.Category):
def __post_init__(self) -> "PandasDtype":
super().__post_init__()
object.__setattr__(
# _native_dtype is used for coercion in base PandasDtype
self, "_native_dtype", pd.CategoricalDtype(self.categories, self.ordered)
)
@classmethod
def from_parametrized_dtype(
cls, cat: Union[pandera.dtype.Category, pd.CategoricalDtype]
):
return PandasCategory(categories=cat.categories, ordered=cat.ordered)
from pandera.dtype import Category
from pandera.engines.pandas_engine import PandasCategory
assert (
PandasEngine.dtype(Category)
== PandasEngine.dtype(pd.CategoricalDtype)
== PandasEngine.dtype(Category()) # dispatch via from_parametrized_dtype
== PandasEngine.dtype(pd.CategoricalDtype()) # dispatch via from_parametrized_dtype
== PandasCategory()
)
```

Hopefully it's easier to understand; I'm quite happy with how it's turning out. I did not update the gist (too lazy). Now I need to refactor all the calls to ….
I know tests that involve testing types are kinda all over the place, hopefully it won't be too much of a pain to refactor 😅. One minor point: I don't have any objective points to back this up, but ….
At first I was going for "equivalent_dtypes" but it's very verbose and repeated many times. "equivalents" is perhaps a good middle-ground. English isn't my native language so I trust your judgment :)
Hi @cosmicBboy. I'm still working on this, aiming for a PR this weekend. Testing has been (very) time consuming!
thanks @jeffzi, yeah I'm sure you're uncovering all the random places there are type-related tests in the test suite 😅
fixed by #559
Is your feature request related to a problem? Please describe.
Currently, pandera's type system is strongly coupled to the pandas type system. This works well in pandera's current state since it only supports pandas dataframe validation. However, in order to obtain a broader coverage of dataframe-like data structures in the python ecosystem, I think it makes sense to slowly move towards this goal by abstracting pandera's type system so that it's not so strongly coupled with pandas' type system. The PandasDtype enum class needs to be made more flexible such that it supports types with dynamic definitions like CategoryDtype and PeriodDtype.
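For illustration (a simplified stand-in, not pandera's actual code), the core limitation is that Enum members are fixed at class-definition time, so a dtype parametrized at runtime has no member that can represent it:

```python
import pandas as pd
from enum import Enum

class PandasDtype(Enum):  # simplified stand-in for the current enum
    Bool = "bool"
    DateTime = "datetime64[ns]"
    Category = "category"  # can only represent the default, unparametrized categorical

# A user-supplied categorical carries runtime parameters that no fixed member can encode;
# mapping it to the plain "category" member loses the information needed for validation
# and coercion.
cat = pd.CategoricalDtype(categories=["a", "b"], ordered=True)
print(cat.categories.tolist(), cat.ordered)  # ['a', 'b'] True
```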
Describe the solution you'd like
TBD
Describe alternatives you've considered
TBD
Additional context
TBD