-
-
Notifications
You must be signed in to change notification settings - Fork 326
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor inference, schema_statistics, strategies and io using the DataType hierarchy #504
Conversation
Codecov Report
@@ Coverage Diff @@
## dtypes #504 +/- ##
=========================================
Coverage ? 97.91%
=========================================
Files ? 24
Lines ? 3116
Branches ? 0
=========================================
Hits ? 3051
Misses ? 65
Partials ? 0 Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
great work! 🚀
have a few questions but it looks good
@@ -9,9 +9,10 @@ | |||
import pytest | |||
from packaging import version | |||
|
|||
import pandera as pa | |||
import pandera |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
any particular reason for this import change? seems like it incurs a lot of changes that don't seem relevant to the overall diff.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got an error while serializing to yaml about global pandera not available. I was perhaps too quick to "fix" that import error. I'll investigate the root cause.
@@ -407,13 +421,13 @@ def _custom_check(series: pd.Series) -> pd.Series: | |||
) | |||
def test_series_strategy(data): | |||
"""Test SeriesSchema strategy.""" | |||
series_schema = pa.SeriesSchema(pa.Int, pa.Check.gt(0)) | |||
series_schema = pa.SeriesSchema(pa.Int(), pa.Check.gt(0)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[question] is pa.Int()
and more generally pa.[DataType]()
the way the public-facing pandera data types are used? i.e. could I do pa.Int
still?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It depends on the situation. Externally, i.e. in schemas init you can pass the class. It will be handed to pandas_engine.Engine.dtype which will instantiate it, if there is a default constructor (won't work for parametrized dtypes with no defaults). Internally, we need an instance to call coerce, check, and eq methods.
TLDR User facing code doesn't change. I will have a pass and standardize to DataType class for schema creation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TLDR User facing code doesn't change
💯
strategies.dataframe_strategy( | ||
pdtype, strategies.pandas_dtype_strategy(pdtype) | ||
) | ||
# with pytest.raises(pa.errors.BaseStrategyOnlyError): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
uncomment?
let's go ahead and drop |
I removed legacy pandas and py 3.6 from CI. While fixing doc tests, I encountered a breaking change related to handling of import numpy as np
import pandas as pd
import pandera as pa
from pandera import Check, Column, DataFrameSchema
df = pd.DataFrame({"column1": [5, 1, np.nan]})
null_schema = DataFrameSchema({
"column1": Column(pa.Int, Check(lambda x: x > 0), nullable=True)
})
null_schema.validate(df)
#> column1
0 5.0
1 1.0
2 NaN
df.info()
#> <class 'pandas.core.frame.DataFrame'>
#> RangeIndex: 3 entries, 0 to 2
#> Data columns (total 1 columns):
#> # Column Non-Null Count Dtype
#> --- ------ -------------- -----
#> 0 column1 2 non-null float64
#> dtypes: float64(1)
#> memory usage: 152.0 bytes That fails with new DataType because the underlying type is I suggest the following:
We could have |
I think this breaking change is acceptable. The current behavior (doing a weird temporary type coercion for non-nan values) I think was one of those ad-hoc decisions I made that doesn't really make sense... I actually the new behavior is more intuitive: if a user wants to validate nullable integers they can specify the pandas-native nullable integer and either:
A user-facing nicety would be to (i) throw an exception or (ii) automatically change the type if the user specifies a non-pandas-native int type ( |
I agree raising an appropriate error message would be helpful to guide the user. However, there are 2 obstacles:
I can't think of another similar use case. Unless you have an idea to circumvent the issues above, I would let the user figure out that the dtype is float because of the presence of Nans. After all, it's a well-known limitation of numpy/pandas. |
looks like there's a Windows-related issue with the default types of the built-in schema = <Schema DataFrameSchema(columns={'a': <Schema Column(name=a, type=DataType(int32))>, 'b': <Schema Column(name=b, type=DataType(int32))>}, checks=[], index=None, coerce=False, dtype=None,strict=False,name=None,ordered=False)>
@pytest.mark.parametrize(
"data, error",
[
[
pd.DataFrame(
[[1, 2, 3], [4, 5, 6], [7, 8, 9]], columns=["a", "a", "b"]
),
None,
],
[
pd.DataFrame(
[[1, 2, 3], list("xyz"), [7, 8, 9]], columns=["a", "a", "b"]
),
errors.SchemaError,
],
],
)
@pytest.mark.parametrize(
"schema",
[
DataFrameSchema({"a": Column(int), "b": Column(int)}),
DataFrameSchema({"a": Column(int, coerce=True), "b": Column(int)}),
DataFrameSchema({"a": Column(int, regex=True), "b": Column(int)}),
],
)
def test_dataframe_duplicated_columns(data, error, schema):
"""Test that schema can handle dataframes with duplicated columns."""
if error is None:
> assert isinstance(schema(data), pd.DataFrame)
tests\core\test_schemas.py:1500:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
pandera\schemas.py:630: in __call__
dataframe, head, tail, sample, random_state, lazy, inplace
pandera\schemas.py:575: in validate
error_handler.collect_error("schema_component_check", err)
pandera\error_handlers.py:32: in collect_error
raise schema_error from original_exc
pandera\schemas.py:571: in validate
inplace=True,
pandera\schemas.py:1796: in __call__
check_obj, head, tail, sample, random_state, lazy, inplace
pandera\schema_components.py:206: in validate
check_obj[column_name].iloc[:, [i]], column_name
pandera\schema_components.py:189: in validate_column
inplace=inplace,
pandera\schemas.py:1739: in validate
check=f"dtype('{self.dtype}')",
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = <pandera.error_handlers.SchemaErrorHandler object at 0x0000028AC5111888>
reason_code = 'wrong_dtype'
schema_error = SchemaError("expected series 'a' to have type int32, got int64")
original_exc = None
def collect_error(
self,
reason_code: str,
schema_error: SchemaError,
original_exc: BaseException = None,
):
"""Collect schema error, raising exception if lazy is False.
:param reason_code: string representing reason for error
:param schema_error: ``SchemaError`` object.
"""
if not self._lazy:
> raise schema_error from original_exc
E pandera.errors.SchemaError: expected series 'a' to have type int32, got int64 |
023847f
to
e87398f
Compare
lol, okay so after much struggle I got CI tests to pass... here's a summary of the issues with windows:
The way the old type system handled this discrepancy is to use the pandas default dtype (int64) when passing in The problem with this is unexpected behavior for Windows users. There haven't been any issues/bugs filed, so that tells me the problem isn't that serious, but basically: # windows
pd.Series([1,2,3]) # int64
pd.Series([1,2,3], dtype=int) # int32
pd.Series([1,2,3], dtype="int") # int32
pd.Series([1,2,3], dtype=np.uint) # uint32
pd.Series([1,2,3], dtype="uint") # uint32 The hack that makes these tests pass with the new type system (also the old type system behavior) is to map
A more principled approach would be to revert to changes to 3ce0210 (before all my windows-hacking) and update the unit tests to explicitly specify the types whenever defining test data, so edit: I think at least for this PR we can maintain the current behavior of default to int64 regardless of the platform... testing on windows without a windows machine is a pain, and we can re-visit this later of people complain :) |
oof, thanks again for the much needed help.
I do have a dual-boot machine linux/windows but I'd need to set up a development environment on windows. I agree it's not a priority now. We can merge this first if you are happy with the current status :) I can add proper documentation over the weekend and finally wind down this refactor ! I'm a bit afraid of breaking user's worfklow because the changes are massive... I'm fully expecting bug reports when this goes live. |
cool! I'm not too worried about breaking changes... the only major one is |
…taType hierarchy (#504) * fix pandas_engine.Interval * fix Timedelta64 registration with pandas_engine.Engine * add DataType helpers * add DataType.continuous attribute * add dtypes.is_numeric * refactor schema_statistics based on DataType hierarchy * refactor schema_inference based on DataType hierarchy * fix numpy_engine.Timedelta64.type * add is_subdtype helper * add Engine.get_registered_dtypes * fix Engine error when registering a base DataType * fix pandas_engine DateTime string alias * clean up test_dtypes * fix test_extensions * refactor strategies based on DataType hierarchy * refactor io based on DataType hierarchy * replace dtypes module by new DataType hierarchy * fix black * delete dtypes_.py * drop legacy pandas and python 3.6 from CI * fix mypy errors * fix ci-docs * fix conda dependencies * fix lint, update noxfile * simplify nox tests, fix test_io * update ci build * update nox * pin nox, handle windows data types * fix windows platform * fix pandas_engine on windows platform * fix test_dtypes on windows platform * force pip on docs CI * test out windows dtype stuff * more messing around with windows * more debugging * debugging * debugging * debugging * debugging * debugging * debugging * debugging * debugging * debugging * debugging * revert ci * increase cache * testing Co-authored-by: cosmicBboy <niels.bantilan@gmail.com>
* refactor PandasDtype into class hierarchy supported by engines * refactor DataFrameSchema based on DataType hierarchy * refactor SchemaModel based on DataType hierarchy * revert fix coerce=True and dtype=None should be a noop * apply code style * fix running tests/core with nox * consolidate dtype names * consolidate engine internal naming * disable inherited __init__ with immutable(init=False) * delete duplicated immutable * disambiguate dtype variables * add warning on base pandas_engine, numpy_engine.DataType init * fix pylint, mypy errors * fix DataFrameSchema.dtypes return type * enable CI on dtypes branch * Refactor inference, schema_statistics, strategies and io using the DataType hierarchy (#504) * fix pandas_engine.Interval * fix Timedelta64 registration with pandas_engine.Engine * add DataType helpers * add DataType.continuous attribute * add dtypes.is_numeric * refactor schema_statistics based on DataType hierarchy * refactor schema_inference based on DataType hierarchy * fix numpy_engine.Timedelta64.type * add is_subdtype helper * add Engine.get_registered_dtypes * fix Engine error when registering a base DataType * fix pandas_engine DateTime string alias * clean up test_dtypes * fix test_extensions * refactor strategies based on DataType hierarchy * refactor io based on DataType hierarchy * replace dtypes module by new DataType hierarchy * fix black * delete dtypes_.py * drop legacy pandas and python 3.6 from CI * fix mypy errors * fix ci-docs * fix conda dependencies * fix lint, update noxfile * simplify nox tests, fix test_io * update ci build * update nox * pin nox, handle windows data types * fix windows platform * fix pandas_engine on windows platform * fix test_dtypes on windows platform * force pip on docs CI * test out windows dtype stuff * more messing around with windows * more debugging * debugging * debugging * debugging * debugging * debugging * debugging * debugging * debugging * debugging * debugging * revert ci * increase cache * testing Co-authored-by: cosmicBboy <niels.bantilan@gmail.com> * Add DataTypes documentation (#536) * delete print statements * pin furo * fix generated docs not removed by nox * re-organize API section * replace aliased pandas_engine data types with their aliases * drop warning when calling Engine.register_dtype without arguments * add data types to api reference doc * add document for DataType refactor * unpin sphinx and drop sphinx_rtd_theme * add xdoctest * ignore prompt when copying example from doc * add doctest builder when running sphinx-build locally * fix dtypes doc examples * fix pandas_engine.DataType.check * fix pylint * remove whitespaces in dtypes doc * Update docs/source/dtypes.rst * Update dtypes.rst * update docs structure * update nox file * force pip on doctests * update test_schemas * fix docs session not overriding html with doctest output Co-authored-by: Niels Bantilan <niels.bantilan@gmail.com> * add deprecation warnings for pandas_dtype and PandasDtype enum (#547) * remove auto-generated docs * add deprecation warnings, support pandas>=1.3.0 * add deprecation warnings for PandasDtype enum * fix sphinx * fix windows * fix windows * add support for pyarrow backed string data type (#548) * add support for pyarrow backed string data type * fix regression for pandas < 1.3.0 * add verbosity to test run * loosen strategies unit tests deadline, exclude windows ci * loosen test_strategies.py tests * use "dev" hypothesis profile for python 3.7 * add pandas==1.2.5 test * fix ci * ci typo * don't install environment.yml on unit tests * install nox in ci * remove environment.yml * update environment in ci Co-authored-by: cosmicBboy <niels.bantilan@gmail.com> Co-authored-by: Jean-Francois Zinque <jzinque@gmail.com>
* refactor PandasDtype into class hierarchy supported by engines * refactor DataFrameSchema based on DataType hierarchy * refactor SchemaModel based on DataType hierarchy * revert fix coerce=True and dtype=None should be a noop * apply code style * fix running tests/core with nox * consolidate dtype names * consolidate engine internal naming * disable inherited __init__ with immutable(init=False) * delete duplicated immutable * disambiguate dtype variables * add warning on base pandas_engine, numpy_engine.DataType init * fix pylint, mypy errors * fix DataFrameSchema.dtypes return type * enable CI on dtypes branch * Refactor inference, schema_statistics, strategies and io using the DataType hierarchy (#504) * fix pandas_engine.Interval * fix Timedelta64 registration with pandas_engine.Engine * add DataType helpers * add DataType.continuous attribute * add dtypes.is_numeric * refactor schema_statistics based on DataType hierarchy * refactor schema_inference based on DataType hierarchy * fix numpy_engine.Timedelta64.type * add is_subdtype helper * add Engine.get_registered_dtypes * fix Engine error when registering a base DataType * fix pandas_engine DateTime string alias * clean up test_dtypes * fix test_extensions * refactor strategies based on DataType hierarchy * refactor io based on DataType hierarchy * replace dtypes module by new DataType hierarchy * fix black * delete dtypes_.py * drop legacy pandas and python 3.6 from CI * fix mypy errors * fix ci-docs * fix conda dependencies * fix lint, update noxfile * simplify nox tests, fix test_io * update ci build * update nox * pin nox, handle windows data types * fix windows platform * fix pandas_engine on windows platform * fix test_dtypes on windows platform * force pip on docs CI * test out windows dtype stuff * more messing around with windows * more debugging * debugging * debugging * debugging * debugging * debugging * debugging * debugging * debugging * debugging * debugging * revert ci * increase cache * testing Co-authored-by: cosmicBboy <niels.bantilan@gmail.com> * Add DataTypes documentation (#536) * delete print statements * pin furo * fix generated docs not removed by nox * re-organize API section * replace aliased pandas_engine data types with their aliases * drop warning when calling Engine.register_dtype without arguments * add data types to api reference doc * add document for DataType refactor * unpin sphinx and drop sphinx_rtd_theme * add xdoctest * ignore prompt when copying example from doc * add doctest builder when running sphinx-build locally * fix dtypes doc examples * fix pandas_engine.DataType.check * fix pylint * remove whitespaces in dtypes doc * Update docs/source/dtypes.rst * Update dtypes.rst * update docs structure * update nox file * force pip on doctests * update test_schemas * fix docs session not overriding html with doctest output Co-authored-by: Niels Bantilan <niels.bantilan@gmail.com> * add deprecation warnings for pandas_dtype and PandasDtype enum (#547) * remove auto-generated docs * add deprecation warnings, support pandas>=1.3.0 * add deprecation warnings for PandasDtype enum * fix sphinx * fix windows * fix windows * add support for pyarrow backed string data type (#548) * add support for pyarrow backed string data type * fix regression for pandas < 1.3.0 * add verbosity to test run * loosen strategies unit tests deadline, exclude windows ci * loosen test_strategies.py tests * use "dev" hypothesis profile for python 3.7 * add pandas==1.2.5 test * fix ci * ci typo * don't install environment.yml on unit tests * install nox in ci * remove environment.yml * update environment in ci Co-authored-by: cosmicBboy <niels.bantilan@gmail.com> Co-authored-by: Jean-Francois Zinque <jzinque@gmail.com>
* Feature/420 (#454) * parse frictionless schema - using frictionless-py for some of the heavy lifting - accept yaml/json/frictionless schema files/objects directly - frictionless becomes a new requirement for io - apply pre-commit formatting updates to other code in pandera.io - add test to validate schema parsing, from yaml and json sources * improve documentation * update docstrings per code review Co-authored-by: Niels Bantilan <niels.bantilan@gmail.com> * add type hints * standardise class properties for easier re-use in future * simplify key check * add missing alternative type * update docstring * align name with Column arg * fix NaN check * fix type assertion * create empty dict if constraints not provided Co-authored-by: Niels Bantilan <niels.bantilan@gmail.com> * decouple pandera and pandas dtypes (#559) * refactor PandasDtype into class hierarchy supported by engines * refactor DataFrameSchema based on DataType hierarchy * refactor SchemaModel based on DataType hierarchy * revert fix coerce=True and dtype=None should be a noop * apply code style * fix running tests/core with nox * consolidate dtype names * consolidate engine internal naming * disable inherited __init__ with immutable(init=False) * delete duplicated immutable * disambiguate dtype variables * add warning on base pandas_engine, numpy_engine.DataType init * fix pylint, mypy errors * fix DataFrameSchema.dtypes return type * enable CI on dtypes branch * Refactor inference, schema_statistics, strategies and io using the DataType hierarchy (#504) * fix pandas_engine.Interval * fix Timedelta64 registration with pandas_engine.Engine * add DataType helpers * add DataType.continuous attribute * add dtypes.is_numeric * refactor schema_statistics based on DataType hierarchy * refactor schema_inference based on DataType hierarchy * fix numpy_engine.Timedelta64.type * add is_subdtype helper * add Engine.get_registered_dtypes * fix Engine error when registering a base DataType * fix pandas_engine DateTime string alias * clean up test_dtypes * fix test_extensions * refactor strategies based on DataType hierarchy * refactor io based on DataType hierarchy * replace dtypes module by new DataType hierarchy * fix black * delete dtypes_.py * drop legacy pandas and python 3.6 from CI * fix mypy errors * fix ci-docs * fix conda dependencies * fix lint, update noxfile * simplify nox tests, fix test_io * update ci build * update nox * pin nox, handle windows data types * fix windows platform * fix pandas_engine on windows platform * fix test_dtypes on windows platform * force pip on docs CI * test out windows dtype stuff * more messing around with windows * more debugging * debugging * debugging * debugging * debugging * debugging * debugging * debugging * debugging * debugging * debugging * revert ci * increase cache * testing Co-authored-by: cosmicBboy <niels.bantilan@gmail.com> * Add DataTypes documentation (#536) * delete print statements * pin furo * fix generated docs not removed by nox * re-organize API section * replace aliased pandas_engine data types with their aliases * drop warning when calling Engine.register_dtype without arguments * add data types to api reference doc * add document for DataType refactor * unpin sphinx and drop sphinx_rtd_theme * add xdoctest * ignore prompt when copying example from doc * add doctest builder when running sphinx-build locally * fix dtypes doc examples * fix pandas_engine.DataType.check * fix pylint * remove whitespaces in dtypes doc * Update docs/source/dtypes.rst * Update dtypes.rst * update docs structure * update nox file * force pip on doctests * update test_schemas * fix docs session not overriding html with doctest output Co-authored-by: Niels Bantilan <niels.bantilan@gmail.com> * add deprecation warnings for pandas_dtype and PandasDtype enum (#547) * remove auto-generated docs * add deprecation warnings, support pandas>=1.3.0 * add deprecation warnings for PandasDtype enum * fix sphinx * fix windows * fix windows * add support for pyarrow backed string data type (#548) * add support for pyarrow backed string data type * fix regression for pandas < 1.3.0 * add verbosity to test run * loosen strategies unit tests deadline, exclude windows ci * loosen test_strategies.py tests * use "dev" hypothesis profile for python 3.7 * add pandas==1.2.5 test * fix ci * ci typo * don't install environment.yml on unit tests * install nox in ci * remove environment.yml * update environment in ci Co-authored-by: cosmicBboy <niels.bantilan@gmail.com> Co-authored-by: Jean-Francois Zinque <jzinque@gmail.com> * improve coverage * fix docs * add pandas accessor tests * pin sphinx * fix lint Co-authored-by: Tom Collingwood <38299499+TColl@users.noreply.github.com> Co-authored-by: Jean-Francois Zinque <jzinque@gmail.com>
This PR is a follow-up to the dtypes refactor started in #490 and initially discussed in #369.
pandas_engine.Engine.numpy_dtype()
to help strategies convertingDataType
sNext PR will add proper documentation. I think we can merge this PR even if legacy pandas tests don't pass. I can take care of removing py3.6 and pandas 0.25 in another PR. Then I'll come back to verifying DataType coverage and tests.
Breaking changes
pandas_dtype
property was renamed todtype
. We should add a logic to handlepandas_dtype
with a deprecation warning before dropping support.Other modules should work as before.
Let me know if you see opportunities for improvements. I've been working on this for a while and may not notice obvious shortcomings 🥺