Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Mixins with only dataframe_checks cause error in SchemaModel #433

Closed
2 of 3 tasks
khwilson opened this issue Mar 11, 2021 · 7 comments · Fixed by #434
Closed
2 of 3 tasks

Mixins with only dataframe_checks cause error in SchemaModel #433

khwilson opened this issue Mar 11, 2021 · 7 comments · Fixed by #434
Labels
bug Something isn't working

Comments

@khwilson
Copy link

Describe the bug
When creating a SchemaModel, if it only contains dataframe_checks then the to_schema function fails when it tries to take vars(config).

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of pandas.
  • (optional) I have confirmed this bug exists on the master branch of pandas.

Code Sample, a copy-pastable example

This fails

from typing import List

import pandas as pd
import pandera as pa


def primary_key_mixin(columns: List[str]):
    class _PrimaryKeyMixin(pa.SchemaModel):
        @pa.dataframe_check
        @classmethod
        def check_primary_key(cls, df: pd.DataFrame) -> bool:
            return (df.groupby(columns).size() <= 1).all()

    return _PrimaryKeyMixin


class TestSchema(primary_key_mixin(['primary_key'])):
    primary_key: pa.typing.Series[int]
    other_value: pa.typing.Series[str]


df = pd.DataFrame({
    'primary_key': [1, 2, 3],
    'other_value': ['a', 'b', 'c'],
})

TestSchema.validate(df)

With traceback (replaced some directory names for privacy):

Traceback (most recent call last):
  File "test.py", line 27, in <module>
    TestSchema.validate(df)
  File ".../.venv/lib/python3.8/site-packages/pandera/model.py", line 184, in validate
    return cls.to_schema().validate(
  File ".../.venv/lib/python3.8/site-packages/pandera/model.py", line 150, in to_schema
    cls.__config__ = cls._collect_config()
  File ".../.venv/lib/python3.8/site-packages/pandera/model.py", line 331, in _collect_config
    base_options = _extract_config_options(config)
  File ".../.venv/lib/python3.8/site-packages/pandera/model.py", line 97, in _extract_config_options
    for name, value in vars(config).items()
TypeError: vars() argument must have __dict__ attribute

Expected behavior

The above should run to completion

Desktop (please complete the following information):

  • OS: MacOS 10.15.7
  • Browser N/A
  • Version pandera 0.6.2; Python 3.8.6

Additional context

If you add a single class-level variable with an annotation of its own it succeeds, e.g.:

from typing import List, Tuple

import pandas as pd
import pandera as pa


def primary_key_mixin(columns: List[str]):
    class _PrimaryKeyMixin(pa.SchemaModel):
        __primary_key__: Tuple[str] = tuple(columns)

        @pa.dataframe_check
        @classmethod
        def check_primary_key(cls, df: pd.DataFrame) -> bool:
            return (df.groupby(columns).size() <= 1).all()

    return _PrimaryKeyMixin


class TestSchema(primary_key_mixin(['primary_key'])):
    primary_key: pa.typing.Series[int]
    other_value: pa.typing.Series[str]


df = pd.DataFrame({
    'primary_key': [1, 2, 3],
    'other_value': ['a', 'b', 'c'],
})

TestSchema.validate(df)

However, if you do NOT have an annotation, it still fails:

from typing import List, Tuple

import pandas as pd
import pandera as pa


def primary_key_mixin(columns: List[str]):
    class _PrimaryKeyMixin(pa.SchemaModel):
        __primary_key__ = tuple(columns)

        @pa.dataframe_check
        @classmethod
        def check_primary_key(cls, df: pd.DataFrame) -> bool:
            return (df.groupby(columns).size() <= 1).all()

    return _PrimaryKeyMixin


class TestSchema(primary_key_mixin(['primary_key'])):
    primary_key: pa.typing.Series[int]
    other_value: pa.typing.Series[str]


df = pd.DataFrame({
    'primary_key': [1, 2, 3],
    'other_value': ['a', 'b', 'c'],
})

TestSchema.validate(df)
#Traceback (most recent call last):
# File "test.py", line 29, in <module>
#    TestSchema.validate(df)
#  File ".../.venv/lib/python3.8/site-packages/pandera/model.py", line 184, in validate
#    return cls.to_schema().validate(
#  File ".../.venv/lib/python3.8/site-packages/pandera/model.py", line 150, in to_schema
#    cls.__config__ = cls._collect_config()
#  File ".../.venv/lib/python3.8/site-packages/pandera/model.py", line 331, in _collect_config
#    base_options = _extract_config_options(config)
#  File ".../.venv/lib/python3.8/site-packages/pandera/model.py", line 97, in _extract_config_options
#    for name, value in vars(config).items()
#TypeError: vars() argument must have __dict__ attribute
@khwilson khwilson added the bug Something isn't working label Mar 11, 2021
@khwilson
Copy link
Author

First off, didn't get to say it in the formal report above, but thank you for putting your time into this wonderful library!

To business, it turns out the error turns out to be simpler. You can reproduce the error for any schema with no annotations. For instance, the following code snippet produces a similar error:

import pandera as pa

class ASchema(pa.SchemaModel):
    pass

ASchema.to_schema()
# Traceback (most recent call last):
#   File "test.py", line 10, in <module>
#     ASchema.to_schema()
#   File ".../pandera/model.py", line 150, in to_schema
#     cls.__config__ = cls._collect_config()
#   File ".../pandera/model.py", line 331, in _collect_config
#     base_options = _extract_config_options(config)
#   File ".../pandera/model.py", line 97, in _extract_config_options
#     for name, value in vars(config).items()
# TypeError: vars() argument must have __dict__ attribute

The issue seems to stem from this line: https://github.com/pandera-dev/pandera/blob/master/pandera/model.py#L126

Apparently, at least in Python 3.8.6, if no annotations are specified in the subclass, cls.__annotations__ is (maybe?) the superclass's annotated fields. But if even a single annotation is specified, the __annotations__ dictionary contains only the annotated fields of the subclass.

As an example, if we put a print(cls.__annotations__) right before the for loop there, we get the following:

import pandera as pa

class ASchema(pa.SchemaModel):
    pass
# {'Config': typing.Type[pandera.model.BaseConfig],
#  '__checks__': typing.Dict[str, typing.List[pandera.checks.Check]],
#  '__config__': typing.Union[typing.Type[pandera.model.BaseConfig], NoneType],
#  '__dataframe_checks__': typing.List[pandera.checks.Check],
#  '__fields__': typing.Dict[str, typing.Tuple[pandera.typing.AnnotationInfo, pandera.model_components.FieldInfo]],
#  '__schema__': typing.Union[pandera.schemas.DataFrameSchema, NoneType]}

But if we add an annotation we get:

import pandera as pa

class ASchema(pa.SchemaModel):
    a: pa.typing.Series[int]
# {'a': pandera.typing.Series[int]}

One possible fix would be to use __new__(cls, name, bases, dct) instead of __init_subclass__ and explicitly check if any(cls.__annotations__ is base.__annotations__ for base in bases). But that seems quite brittle.

@jeffzi
Copy link
Collaborator

jeffzi commented Mar 12, 2021

Thanks @khwilson for the bug report and concise examples.

You are right, the problem comes from __annotations__ returning its super's annotations (i.e SchemaModel) if the class doesn't have annotations itself. Config is then misinterpreted as a missing field and overwritten.

I've just pushed a fix. I simply filter out annotations that starts with "_" and "Config", as it is done when collecting regular fields.

Btw, your primary_key_mixin is very nice 🤩

@khwilson
Copy link
Author

Excellent! One quick comment on your solution, though: it's not common practice but I assume there will be times people have fields that start with _. Perhaps at least updating the documentation to reflect that?

And I'm happy to do a PR with the primary_key_mixin if you wanted to add it to the main library. I've been getting some decent mileage out of it. :-)

@jeffzi
Copy link
Collaborator

jeffzi commented Mar 12, 2021

Actually private annotations have always been ignored since SchemaModel was introduced. I realized now that it was not to specified in the docs. I'll do that !

I think the primary key would be better as a recipe. It seems very specific for pandera itself, but @cosmicBboy might have another opinion. It would be nice if you could post the template in the discussions. I'm sure many people would find it useful.

One thing is that mypy complains: error: Unsupported dynamic base class "primary_key_mixin". We can continue in "discussions" if you don't mind :)

@cosmicBboy
Copy link
Collaborator

thanks @khwilson for pointing this out!

Btw, your primary_key_mixin is very nice 🤩

+1, love how flexible this syntax is turning out to be :)

Thanks @jeffzi just ✅ your PR.

@khwilson
Copy link
Author

Thanks, y'all! Happy to start a discussion about the mixin!

@cosmicBboy
Copy link
Collaborator

fixed by #434

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants