Mixins for constraints in SchemaModel #437

khwilson · 2021-03-14T03:43:47Z

Following on from the discussion in #433 and #386. Mixins are quite common in the Python ecosystem, e.g., in werkzeug and SQLAlchemy. With the new SchemaModel syntax made available in pandera 0.5.x, it may make sense to bring mixins formally into the pandera ecosystem, either as recipes in the documentation or as explicit utility classes in the package itself.

I think the most natural place they'd fit into the pandera workflow would be in constraint specification, especially wide constraints. For example, in tidy data, it is common for every table to have a primary key: a collection of columns whose combined values occur only once in a given table and so serve, effectively, as a unique identifier for each row. Currently, specifying this constraint in pandera looks something like this:

import pandas as pd
import pandera as pa
from pandera.typing import DateTime, Series

class Transaction(pa.SchemaModel):
    user_id: Series[int]
    occurred_at: Series[DateTime]
    value: Series[float]

    @pa.dataframe_check
    @classmethod
    def check_primary_key(cls, df: pd.DataFrame) -> bool:
        """ Verify that each pair of (user_id, occurred_at) is unique """
        return (df.groupby(['user_id', 'occurred_at']).size() <= 1).all()

But this could be pulled out into a mixin, e.g.,

from typing import Tuple

def primary_key_mixin(*columns: str) -> type:
    class _PrimaryKeyMixin(pa.SchemaModel):
        __primary_key__: Tuple[str, ...] = tuple(columns)

        @pa.dataframe_check
        @classmethod
        def check_primary_key(cls, df: pd.DataFrame) -> bool:
            """ Verify that each tuple of [columns] is unique """
            return (df.groupby(list(cls.__primary_key__)).size() <= 1).all()

    return _PrimaryKeyMixin


class Transaction(primary_key_mixin('user_id', 'occurred_at')):
    user_id: Series[int]
    occurred_at: Series[DateTime]
    value: Series[float]
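
The uniqueness test at the heart of `check_primary_key` can be tried on a plain DataFrame, without pandera. The sketch below is illustrative only: the sample data and the helper name `is_primary_key` are made up for this example. It also shows `DataFrame.duplicated` as another way to write the check, which should agree with the groupby version as long as the key columns contain no missing values (groupby drops NaN keys by default).

```python
import pandas as pd


def is_primary_key(df: pd.DataFrame, columns: list) -> bool:
    """Return True if each combination of `columns` occurs at most once.

    Hypothetical helper for illustration; it reuses the expression
    from check_primary_key.
    """
    return bool((df.groupby(columns).size() <= 1).all())


# Made-up sample data: the first two rows share (user_id, occurred_at).
transactions = pd.DataFrame(
    {
        "user_id": [1, 1, 2],
        "occurred_at": pd.to_datetime(["2021-03-14", "2021-03-14", "2021-03-15"]),
        "value": [9.99, 5.00, 1.50],
    }
)

key = ["user_id", "occurred_at"]

# The duplicated pair makes the check fail ...
ok_with_duplicate = is_primary_key(transactions, key)
# ... and it passes once the repeated key is dropped.
ok_after_dedup = is_primary_key(transactions.drop_duplicates(subset=key), key)

# Alternative formulation: no row repeats an earlier row's key values.
duplicated_says_ok = not transactions.duplicated(subset=key).any()
```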

khwilson · 2021-03-15T13:38:17Z

FWIW, an alternative version of this that avoids the mypy dynamic class issue brought up in #433 would look something like:

from typing import List, Optional, Tuple

import pandas as pd
import pandera as pa
from pandera.typing import DateTime, Series


class PrimaryKeyMixin(pa.SchemaModel):
    __primary_key__: Tuple[str, ...] = ()

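    def __init_subclass__(cls, primary_key: Optional[List[str]] = None, **kwargs):
        super().__init_subclass__(**kwargs)
        # Only override when a key is given, so subclasses inherit the parent's key
        if primary_key is not None:
            cls.__primary_key__ = tuple(primary_key)

    @pa.dataframe_check
    @classmethod
    def check_primary_key(cls, df: pd.DataFrame) -> bool:
        """ Verify that each tuple of primary key columns is unique """
        if not cls.__primary_key__:
            return True  # no key declared, nothing to check
        return (df.groupby(list(cls.__primary_key__)).size() <= 1).all()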


class Transaction(PrimaryKeyMixin, primary_key=['user_id', 'occurred_at']):
    user_id: Series[int]
    occurred_at: Series[DateTime]
    value: Series[float]
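
The class-keyword pattern relies only on standard Python: keyword arguments in a class statement are passed to the parent's `__init_subclass__`. Here is a minimal sketch of that mechanism without pandera; `KeyedBase`, `Orders`, and `ArchivedOrders` are hypothetical names. One choice made in this sketch, as an assumption rather than a requirement of the pattern, is that a subclass that omits `primary_key` keeps the key it inherits.

```python
from typing import List, Optional, Tuple


class KeyedBase:
    """Hypothetical stand-in for a SchemaModel-based mixin (no pandera here)."""

    __primary_key__: Tuple[str, ...] = ()

    def __init_subclass__(cls, primary_key: Optional[List[str]] = None, **kwargs):
        # Forward remaining keywords so cooperative base classes still work.
        super().__init_subclass__(**kwargs)
        # Assumption of this sketch: only overwrite when a key is passed,
        # so subclasses that omit the keyword inherit their parent's key.
        if primary_key is not None:
            cls.__primary_key__ = tuple(primary_key)


class Orders(KeyedBase, primary_key=["user_id", "occurred_at"]):
    pass


class ArchivedOrders(Orders):  # no keyword: inherits the key from Orders
    pass


print(KeyedBase.__primary_key__)       # ()
print(Orders.__primary_key__)          # ('user_id', 'occurred_at')
print(ArchivedOrders.__primary_key__)  # ('user_id', 'occurred_at')
```

If the assignment were unconditional, e.g. `cls.__primary_key__ = tuple(primary_key or [])`, `ArchivedOrders` would end up with an empty key, because `__init_subclass__` runs for every subclass, not only the direct ones.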
