Refactor inference, schema_statistics, strategies and io using the DataType hierarchy #504

Merged
merged 48 commits into from
Jun 17, 2021

jeffzi
Collaborator

@jeffzi jeffzi commented May 28, 2021

This PR is a follow-up to the dtypes refactor started in #490 and initially discussed in #369.

  • DataFrameSchema api
  • SchemaModel api
  • decorators
  • inference
  • schema_statistics
  • is_float(), is_continuous(), etc.
  • strategies: added pandas_engine.Engine.numpy_dtype() to help strategies convert DataTypes
  • docstrings: partial
  • documentation

Next PR will add proper documentation. I think we can merge this PR even if legacy pandas tests don't pass. I can take care of removing py3.6 and pandas 0.25 in another PR. Then I'll come back to verifying DataType coverage and tests.

Breaking changes

  • Schemas serialized with prior versions cannot be deserialized because the pandas_dtype property was renamed to dtype. We should add logic to handle pandas_dtype with a deprecation warning before dropping support.
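
A minimal sketch of the compatibility logic suggested above, assuming a serialized column is a plain dict. `migrate_column_spec` is a hypothetical helper used for illustration only, not part of pandera:

```python
import warnings


def migrate_column_spec(spec: dict) -> dict:
    """Return a copy of ``spec`` that uses the new ``dtype`` key.

    Hypothetical helper: accepts the legacy ``pandas_dtype`` key and
    emits a DeprecationWarning instead of failing.
    """
    migrated = dict(spec)
    if "pandas_dtype" in migrated:
        warnings.warn(
            "'pandas_dtype' is deprecated, use 'dtype' instead",
            DeprecationWarning,
            stacklevel=2,
        )
        legacy = migrated.pop("pandas_dtype")
        # Prefer an explicit 'dtype' if both keys are present.
        migrated.setdefault("dtype", legacy)
    return migrated


old_spec = {"pandas_dtype": "int64", "nullable": False}
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    new_spec = migrate_column_spec(old_spec)
```

The input dict is copied rather than mutated, so a caller can still inspect the original payload after migration.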

Other modules should work as before.

Let me know if you see opportunities for improvements. I've been working on this for a while and may not notice obvious shortcomings 🥺

@jeffzi jeffzi requested a review from cosmicBboy May 28, 2021 22:36
@codecov

codecov bot commented May 28, 2021

Codecov Report

❗ No coverage uploaded for pull request base (dtypes@4b2101d). Click here to learn what that means.
The diff coverage is n/a.

@@            Coverage Diff            @@
##             dtypes     #504   +/-   ##
=========================================
  Coverage          ?   97.91%           
=========================================
  Files             ?       24           
  Lines             ?     3116           
  Branches          ?        0           
=========================================
  Hits              ?     3051           
  Misses            ?       65           
  Partials          ?        0           


Collaborator

@cosmicBboy cosmicBboy left a comment

great work! 🚀

have a few questions but it looks good

@@ -9,9 +9,10 @@
import pytest
from packaging import version

import pandera as pa
import pandera
Collaborator

any particular reason for this import change? seems like it incurs a lot of changes that don't seem relevant to the overall diff.

Collaborator Author

Got an error while serializing to yaml about global pandera not available. I was perhaps too quick to "fix" that import error. I'll investigate the root cause.

@@ -407,13 +421,13 @@ def _custom_check(series: pd.Series) -> pd.Series:
 )
 def test_series_strategy(data):
     """Test SeriesSchema strategy."""
-    series_schema = pa.SeriesSchema(pa.Int, pa.Check.gt(0))
+    series_schema = pa.SeriesSchema(pa.Int(), pa.Check.gt(0))
Collaborator

@cosmicBboy cosmicBboy May 29, 2021

[question] is pa.Int() and more generally pa.[DataType]() the way the public-facing pandera data types are used? i.e. could I do pa.Int still?

Collaborator Author

@jeffzi jeffzi May 31, 2021

It depends on the situation. Externally, i.e. in the schemas' __init__, you can pass the class. It will be handed to pandas_engine.Engine.dtype, which will instantiate it if there is a default constructor (this won't work for parametrized dtypes with no defaults). Internally, we need an instance to call the coerce, check, and __eq__ methods.

TLDR User facing code doesn't change. I will do a pass and standardize on the DataType class for schema creation.
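
To illustrate the class-versus-instance handling described above, here is a rough sketch with stand-in classes. `DataType`, `Int`, `Category` and `resolve_dtype` are hypothetical names that only mimic what pandas_engine.Engine.dtype is described as doing; they are not pandera's implementation:

```python
import inspect


class DataType:
    """Stand-in base class for this sketch."""

    def __eq__(self, other):
        # Two dtypes are equal when they share a class and parameters.
        return type(self) is type(other) and vars(self) == vars(other)


class Int(DataType):
    """Has a default constructor, so the class itself can be passed."""

    def __init__(self, bit_width: int = 64):
        self.bit_width = bit_width


class Category(DataType):
    """Parametrized dtype with no defaults: needs an instance."""

    def __init__(self, categories):
        self.categories = list(categories)


def resolve_dtype(dtype):
    """Return a DataType instance from either a class or an instance."""
    if isinstance(dtype, DataType):
        return dtype
    if inspect.isclass(dtype) and issubclass(dtype, DataType):
        try:
            # Only works when every constructor argument has a default.
            return dtype()
        except TypeError as exc:
            raise TypeError(
                f"{dtype.__name__} needs arguments; pass an instance instead"
            ) from exc
    raise TypeError(f"unsupported dtype: {dtype!r}")
```

With these stand-ins, `resolve_dtype(Int)` and `resolve_dtype(Int())` yield equal instances, while `resolve_dtype(Category)` raises because the class cannot be built without arguments.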

Collaborator

TLDR User facing code doesn't change

💯

strategies.dataframe_strategy(
pdtype, strategies.pandas_dtype_strategy(pdtype)
)
# with pytest.raises(pa.errors.BaseStrategyOnlyError):
Collaborator

uncomment?

@cosmicBboy
Collaborator

let's go ahead and drop 0.25.3 and python 3.6 in the CI and update setup.py/environment.yml/requirements.txt to reflect new version constraints.

@jeffzi
Collaborator Author

jeffzi commented Jun 2, 2021

I removed legacy pandas and py 3.6 from CI.

While fixing doc tests, I encountered a breaking change related to the handling of pa.Int. The new DataTypes simply box/wrap native dtypes in order to standardize the interface across libraries. By contrast, the old pa.Int represented a "semantic" integer, not a concrete numpy.int_. See the example below, which shows the existing behavior:

import numpy as np
import pandas as pd
import pandera as pa

from pandera import Check, Column, DataFrameSchema

df = pd.DataFrame({"column1": [5, 1, np.nan]})

null_schema = DataFrameSchema({
    "column1": Column(pa.Int, Check(lambda x: x > 0), nullable=True)
})

null_schema.validate(df)
#>    column1
#> 0      5.0
#> 1      1.0
#> 2      NaN
df.info()
#> <class 'pandas.core.frame.DataFrame'>
#> RangeIndex: 3 entries, 0 to 2
#> Data columns (total 1 columns):
#>  #   Column   Non-Null Count  Dtype  
#> ---  ------   --------------  -----  
#>  0   column1  2 non-null      float64
#> dtypes: float64(1)
#> memory usage: 152.0 bytes

That fails with the new DataType because the underlying dtype is np.float64, and DataType.check() only sees the dtype. To reproduce the existing behavior, DataType.check() would need to accept a pandas.Series.

I suggest the following:

  • Keep the breaking change and document it. It's probably an edge case that few users rely upon. Moreover, dropping pandas 0.25 means that we have access to nullable integers which should be preferred in that situation.
  • Wait for the abstraction of validation logic (Abstract out validation logic to support non-pandas dataframes, e.g. spark, dask, etc #381) before implementing abstract data types such as Path, Url, etc. in the fashion of visions. At that point, we would have a better idea of how to tackle abstract data types (e.g. let DataType.check() access data on top of native dtypes).

We could have pa.Int mapped to nullable pandas.IntegerDtype but I'd rather follow pandas' defaults.

@cosmicBboy
Collaborator

cosmicBboy commented Jun 3, 2021

I think this breaking change is acceptable. The current behavior (doing a weird temporary type coercion for non-nan values) was, I think, one of those ad-hoc decisions I made that doesn't really make sense... I actually think the new behavior is more intuitive: if a user wants to validate nullable integers they can specify the pandas-native nullable integer and either:

  1. set the datatype manually on the pandas side (e.g. pd.Series([1, 2, np.nan], dtype="Int64"))
  2. set coerce=True
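
Both options can be tried directly in pandas. The following is a short sketch, assuming a pandas version with nullable integer support. The coerce=True path is approximated here with astype, which is an assumption about what coercion amounts to rather than a description of pandera internals:

```python
import numpy as np
import pandas as pd

# Default inference: the NaN forces a float64 column.
inferred = pd.Series([1, 2, np.nan])

# Option 1: declare the nullable integer dtype up front.
declared = pd.Series([1, 2, np.nan], dtype="Int64")

# Option 2: convert existing float data to the nullable integer dtype.
coerced = inferred.astype("Int64")

print(inferred.dtype, declared.dtype, coerced.dtype)
```

In both nullable series the missing entry is kept as a missing value instead of forcing the whole column to float.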

A user-facing nicety would be to (i) throw an exception or (ii) automatically change the type if the user specifies a non-pandas-native int type (int or the numpy ints) and sets nullable=True (we can do this in another PR). I'd err on the side of (i) just because I tend to like minimizing magic, at least when it comes to the problem domain pandera is solving.
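
Option (i) could look roughly like the following. This is only a sketch: `check_nullable_int` and the `NULLABLE_EQUIVALENT` table are hypothetical, and dtypes are represented as plain strings for brevity:

```python
# NumPy-style integer dtypes cannot hold missing values; map each one to
# the name of the pandas nullable integer dtype that can.
NULLABLE_EQUIVALENT = {
    **{f"int{bits}": f"Int{bits}" for bits in (8, 16, 32, 64)},
    **{f"uint{bits}": f"UInt{bits}" for bits in (8, 16, 32, 64)},
}


def check_nullable_int(dtype_name: str, nullable: bool) -> None:
    """Raise if a non-nullable integer dtype is declared with nullable=True."""
    if nullable and dtype_name in NULLABLE_EQUIVALENT:
        raise ValueError(
            f"dtype '{dtype_name}' cannot hold missing values; "
            f"use '{NULLABLE_EQUIVALENT[dtype_name]}' or set nullable=False"
        )
```

Because this guard only looks at the declared dtype and the nullable flag, it could run when the schema is defined, before any data is seen.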

@jeffzi
Collaborator Author

jeffzi commented Jun 10, 2021

I agree raising an appropriate error message would be helpful to guide the user. However, there are 2 obstacles:

  1. DataType.check only receives the dtype to check. It can only know that you are checking a float against an integer but cannot verify whether the data contains NaNs.
  2. check returns a boolean and cannot communicate a specific issue to its caller (DataFrameSchema.validate).

I can't think of another similar use case. Unless you have an idea to circumvent the issues above, I would let the user figure out that the dtype is float because of the presence of NaNs. After all, it's a well-known limitation of numpy/pandas.
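
The two obstacles can be made concrete with a small sketch. Both functions below are hypothetical and use dtype names as strings: `check` mirrors the dtype-only, boolean-returning contract described above, and `check_with_reason` shows what a more informative variant would need, namely access to the data and a richer return value:

```python
import math


def check(expected: str, actual: str) -> bool:
    """Dtype-only check: sees the dtype names, never the data."""
    return expected == actual


def check_with_reason(expected: str, actual: str, values=None):
    """Hypothetical variant returning (ok, reason) and optionally seeing data."""
    if expected == actual:
        return True, None
    looks_like_nan_upcast = (
        values is not None
        and expected.startswith("int")
        and actual.startswith("float")
        and any(math.isnan(v) for v in values)
    )
    if looks_like_nan_upcast:
        return False, f"expected {expected}, got {actual}: data contains NaN"
    return False, f"expected {expected}, got {actual}"


values = [5.0, 1.0, float("nan")]
plain_result = check("int64", "float64")
ok, reason = check_with_reason("int64", "float64", values)
```

The plain check can only report that the dtypes differ; explaining why requires changing both the inputs and the return type of the check.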

@cosmicBboy
Collaborator

cosmicBboy commented Jun 17, 2021

looks like there's a Windows-related issue with the default types of the built-in int, etc. types... it's not ideal since I can't test any of this locally :/

schema = <Schema DataFrameSchema(columns={'a': <Schema Column(name=a, type=DataType(int32))>, 'b': <Schema Column(name=b, type=DataType(int32))>}, checks=[], index=None, coerce=False, dtype=None,strict=False,name=None,ordered=False)>

    @pytest.mark.parametrize(
        "data, error",
        [
            [
                pd.DataFrame(
                    [[1, 2, 3], [4, 5, 6], [7, 8, 9]], columns=["a", "a", "b"]
                ),
                None,
            ],
            [
                pd.DataFrame(
                    [[1, 2, 3], list("xyz"), [7, 8, 9]], columns=["a", "a", "b"]
                ),
                errors.SchemaError,
            ],
        ],
    )
    @pytest.mark.parametrize(
        "schema",
        [
            DataFrameSchema({"a": Column(int), "b": Column(int)}),
            DataFrameSchema({"a": Column(int, coerce=True), "b": Column(int)}),
            DataFrameSchema({"a": Column(int, regex=True), "b": Column(int)}),
        ],
    )
    def test_dataframe_duplicated_columns(data, error, schema):
        """Test that schema can handle dataframes with duplicated columns."""
        if error is None:
>           assert isinstance(schema(data), pd.DataFrame)

tests\core\test_schemas.py:1500: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
pandera\schemas.py:630: in __call__
    dataframe, head, tail, sample, random_state, lazy, inplace
pandera\schemas.py:575: in validate
    error_handler.collect_error("schema_component_check", err)
pandera\error_handlers.py:32: in collect_error
    raise schema_error from original_exc
pandera\schemas.py:571: in validate
    inplace=True,
pandera\schemas.py:1796: in __call__
    check_obj, head, tail, sample, random_state, lazy, inplace
pandera\schema_components.py:206: in validate
    check_obj[column_name].iloc[:, [i]], column_name
pandera\schema_components.py:189: in validate_column
    inplace=inplace,
pandera\schemas.py:1739: in validate
    check=f"dtype('{self.dtype}')",
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <pandera.error_handlers.SchemaErrorHandler object at 0x0000028AC5111888>
reason_code = 'wrong_dtype'
schema_error = SchemaError("expected series 'a' to have type int32, got int64")
original_exc = None

    def collect_error(
        self,
        reason_code: str,
        schema_error: SchemaError,
        original_exc: BaseException = None,
    ):
        """Collect schema error, raising exception if lazy is False.
    
        :param reason_code: string representing reason for error
        :param schema_error: ``SchemaError`` object.
        """
        if not self._lazy:
>           raise schema_error from original_exc
E           pandera.errors.SchemaError: expected series 'a' to have type int32, got int64

@cosmicBboy
Collaborator

cosmicBboy commented Jun 17, 2021

lol, okay so after much struggle I got CI tests to pass... here's a summary of the issues with windows:

  • the default numpy dtype for int is int32, so np.dtype(int) == np.int32, and pd.Series([1], dtype=int).dtype == np.int32. This also applies to "uint"
  • however, the default pandas dtype when you pass in a list of ints without specifying the dtype is np.int64, i.e. pd.Series([1]).dtype == np.int64. This discrepancy isn't an issue for uint since there's no way of getting a uint Series by default, so pd.Series([1], dtype="uint").dtype == np.uint32

The way the old type system handled this discrepancy is to use the pandas default dtype (int64) when passing in int or "int" as a column's data type.

The problem with this is unexpected behavior for Windows users. There haven't been any issues/bugs filed, so that tells me the problem isn't that serious, but basically:

# windows
pd.Series([1,2,3])  # int64
pd.Series([1,2,3], dtype=int)  # int32
pd.Series([1,2,3], dtype="int")  # int32

pd.Series([1,2,3], dtype=np.uint)  # uint32
pd.Series([1,2,3], dtype="uint")  # uint32

The hack that makes these tests pass with the new type system (also the old type system behavior) is to map int and "int" to int64 instead of int32. However, this doesn't seem ideal because:

  1. it isn't how pandas actually behaves in Windows
  2. unit tests work around the pd.Series([1,2,3]) vs pd.Series([1,2,3], dtype=int) discrepancy by making special affordances to ints/uints.

A more principled approach would be to revert the changes back to 3ce0210 (before all my windows-hacking) and update the unit tests to explicitly specify the types whenever defining test data, so pd.Series([1,2,3], dtype=int) instead of pd.Series([1,2,3]) (there are a bunch of tests that do the latter).

edit: I think at least for this PR we can maintain the current behavior of defaulting to int64 regardless of the platform... testing on windows without a windows machine is a pain, and we can re-visit this later if people complain :)
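
As background for the int64 pinning described above: NumPy releases before 2.0 tied the default integer to the platform's C long, which is 32 bits wide on 64-bit Windows. The sketch below uses only the standard library to show that width and a hypothetical resolver, `resolve_int_alias`, that pins int and "int" to int64 on every platform. Dtypes are plain strings here and none of these names come from pandera:

```python
import ctypes

# Bit width of the platform's C long: 32 on 64-bit Windows, 64 on most
# 64-bit Linux and macOS builds.
C_LONG_BITS = ctypes.sizeof(ctypes.c_long) * 8


def platform_default_int() -> str:
    """Name of the integer dtype that follows the C long width."""
    return f"int{C_LONG_BITS}"


def resolve_int_alias(alias) -> str:
    """Hypothetical resolver that pins int and "int" to int64 everywhere."""
    if alias is int or alias == "int":
        return "int64"
    return str(alias)
```

Pinning the alias keeps schemas portable across platforms, at the cost of diverging from what an explicit dtype=int produces on Windows, which is the trade-off discussed in this comment.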

@jeffzi
Collaborator Author

jeffzi commented Jun 17, 2021

oof, thanks again for the much needed help.

testing on windows without a windows machine is a pain

I do have a dual-boot machine linux/windows but I'd need to set up a development environment on windows. I agree it's not a priority now. We can merge this first if you are happy with the current status :)

I can add proper documentation over the weekend and finally wind down this refactor! I'm a bit afraid of breaking users' workflows because the changes are massive... I'm fully expecting bug reports when this goes live.

@cosmicBboy
Collaborator

cool! I'm not too worried about breaking changes... the only major one is PandasDtype, but other than that I think the tests should have caught any regression that might have happened. Merging now, thanks again for your work on this!

@cosmicBboy cosmicBboy merged commit 460663d into unionai-oss:dtypes Jun 17, 2021
@cosmicBboy cosmicBboy linked an issue Jun 18, 2021 that may be closed by this pull request
cosmicBboy added a commit that referenced this pull request Jul 2, 2021
…taType hierarchy (#504)

* fix pandas_engine.Interval

* fix Timedelta64 registration with pandas_engine.Engine

* add DataType helpers

* add DataType.continuous attribute

* add dtypes.is_numeric

* refactor schema_statistics based on DataType hierarchy

* refactor schema_inference based on DataType hierarchy

* fix numpy_engine.Timedelta64.type

* add is_subdtype helper

* add Engine.get_registered_dtypes

* fix Engine error when registering a base DataType

* fix pandas_engine DateTime string alias

* clean up test_dtypes

* fix test_extensions

* refactor strategies based on DataType hierarchy

* refactor io based on DataType hierarchy

* replace dtypes module by new DataType hierarchy

* fix black

* delete dtypes_.py

* drop legacy pandas and python 3.6 from CI

* fix mypy errors

* fix ci-docs

* fix conda dependencies

* fix lint, update noxfile

* simplify nox tests, fix test_io

* update ci build

* update nox

* pin nox, handle windows data types

* fix windows platform

* fix pandas_engine on windows platform

* fix test_dtypes on windows platform

* force pip on docs CI

* test out windows dtype stuff

* more messing around with windows

* more debugging

* debugging

* debugging

* debugging

* debugging

* debugging

* debugging

* debugging

* debugging

* debugging

* debugging

* revert ci

* increase cache

* testing

Co-authored-by: cosmicBboy <niels.bantilan@gmail.com>
cosmicBboy added a commit that referenced this pull request Jul 15, 2021
* refactor PandasDtype into class hierarchy supported by engines

* refactor DataFrameSchema based on DataType hierarchy

* refactor SchemaModel based on DataType hierarchy

* revert fix coerce=True and dtype=None should be a noop

* apply code style

* fix running tests/core with nox

* consolidate dtype names

* consolidate engine internal naming

* disable inherited __init__ with immutable(init=False)

* delete duplicated immutable

* disambiguate dtype variables

* add warning on base pandas_engine, numpy_engine.DataType init

* fix pylint, mypy errors

* fix DataFrameSchema.dtypes return type

* enable CI on dtypes branch

* Refactor inference, schema_statistics, strategies and io using the DataType hierarchy (#504)

* Add DataTypes documentation (#536)

* delete print statements

* pin furo

* fix generated docs not removed by nox

* re-organize API section

* replace aliased pandas_engine data types with their aliases

* drop warning when calling Engine.register_dtype without arguments

* add data types to api reference doc

* add document for DataType refactor

* unpin sphinx and drop sphinx_rtd_theme

* add xdoctest

* ignore prompt when copying example from doc

* add doctest builder when running sphinx-build locally

* fix dtypes doc examples

* fix pandas_engine.DataType.check

* fix pylint

* remove whitespaces in dtypes doc

* Update docs/source/dtypes.rst

* Update dtypes.rst

* update docs structure

* update nox file

* force pip on doctests

* update test_schemas

* fix docs session not overriding html with doctest output

Co-authored-by: Niels Bantilan <niels.bantilan@gmail.com>

* add deprecation warnings for pandas_dtype and PandasDtype enum (#547)

* remove auto-generated docs

* add deprecation warnings, support pandas>=1.3.0

* add deprecation warnings for PandasDtype enum

* fix sphinx

* fix windows

* fix windows

* add support for pyarrow backed string data type (#548)

* add support for pyarrow backed string data type

* fix regression for pandas < 1.3.0

* add verbosity to test run

* loosen strategies unit tests deadline, exclude windows ci

* loosen test_strategies.py tests

* use "dev" hypothesis profile for python 3.7

* add pandas==1.2.5 test

* fix ci

* ci typo

* don't install environment.yml on unit tests

* install nox in ci

* remove environment.yml

* update environment in ci

Co-authored-by: cosmicBboy <niels.bantilan@gmail.com>

Co-authored-by: Jean-Francois Zinque <jzinque@gmail.com>
cosmicBboy added a commit that referenced this pull request Jul 22, 2021
cosmicBboy added a commit that referenced this pull request Jul 24, 2021
* Feature/420 (#454)

* parse frictionless schema

- using frictionless-py for some of the heavy lifting
- accept yaml/json/frictionless schema files/objects directly
- frictionless becomes a new requirement for io
- apply pre-commit formatting updates to other code in pandera.io
- add test to validate schema parsing, from yaml and json sources

* improve documentation

* update docstrings per code review

Co-authored-by: Niels Bantilan <niels.bantilan@gmail.com>

* add type hints

* standardise class properties for easier re-use in future

* simplify key check

* add missing alternative type

* update docstring

* align name with Column arg

* fix NaN check

* fix type assertion

* create empty dict if constraints not provided

Co-authored-by: Niels Bantilan <niels.bantilan@gmail.com>

* decouple pandera and pandas dtypes (#559)

* improve coverage

* fix docs

* add pandas accessor tests

* pin sphinx

* fix lint

Co-authored-by: Tom Collingwood <38299499+TColl@users.noreply.github.com>
Co-authored-by: Jean-Francois Zinque <jzinque@gmail.com>
Development

Successfully merging this pull request may close these issues.

Decouple pandera and pandas type systems
2 participants