Refactor inference, schema_statistics, strategies and io using the DataType hierarchy #504

Merged 48 commits on Jun 17, 2021
48 commits
a6bda97
fix pandas_engine.Interval
May 24, 2021
3b55a41
fix Timedelta64 registration with pandas_engine.Engine
May 24, 2021
fc813f2
add DataType helpers
May 24, 2021
3296dd1
add DataType.continuous attribute
May 24, 2021
9f1902b
add dtypes.is_numeric
May 24, 2021
9417102
refactor schema_statistics based on DataType hierarchy
May 24, 2021
19bf63d
refactor schema_inference based on DataType hierarchy
May 24, 2021
a9cde9d
fix numpy_engine.Timedelta64.type
May 25, 2021
3300a72
add is_subdtype helper
May 25, 2021
73dbfc1
add Engine.get_registered_dtypes
May 25, 2021
cc02d2b
fix Engine error when registering a base DataType
May 28, 2021
e722afd
fix pandas_engine DateTime string alias
May 28, 2021
3bf126d
clean up test_dtypes
May 28, 2021
2faaa74
fix test_extensions
May 28, 2021
fecc9d3
refactor strategies based on DataType hierarchy
May 28, 2021
cb8776c
refactor io based on DataType hierarchy
May 28, 2021
34b9ecd
replace dtypes module by new DataType hierarchy
May 28, 2021
47b5ba0
fix black
May 28, 2021
ea7bccc
delete dtypes_.py
May 29, 2021
6682b15
drop legacy pandas and python 3.6 from CI
Jun 14, 2021
309fe50
fix mypy errors
Jun 14, 2021
a5017de
fix ci-docs
Jun 14, 2021
d2fb74e
fix conda dependencies
Jun 15, 2021
ca3bc6d
fix lint, update noxfile
cosmicBboy Jun 16, 2021
1126617
simplify nox tests, fix test_io
cosmicBboy Jun 16, 2021
f3af374
update ci build
cosmicBboy Jun 16, 2021
5eff814
update nox
cosmicBboy Jun 16, 2021
dc0045f
pin nox, handle windows data types
cosmicBboy Jun 16, 2021
265559a
fix windows platform
Jun 16, 2021
14f6e7f
fix pandas_engine on windows platform
Jun 16, 2021
7d94644
fix test_dtypes on windows platform
Jun 16, 2021
3ce0210
force pip on docs CI
cosmicBboy Jun 17, 2021
a4eee97
test out windows dtype stuff
cosmicBboy Jun 17, 2021
eca4241
more messing around with windows
cosmicBboy Jun 17, 2021
186554c
more debugging
cosmicBboy Jun 17, 2021
e87398f
debugging
cosmicBboy Jun 17, 2021
a080274
debugging
cosmicBboy Jun 17, 2021
4d95f4c
debugging
cosmicBboy Jun 17, 2021
d0cd0ec
debugging
cosmicBboy Jun 17, 2021
e9f74a4
debugging
cosmicBboy Jun 17, 2021
3ccfb00
debugging
cosmicBboy Jun 17, 2021
c1a7bc5
debugging
cosmicBboy Jun 17, 2021
a2415fb
debugging
cosmicBboy Jun 17, 2021
f8162ab
debugging
cosmicBboy Jun 17, 2021
34c4d62
debugging
cosmicBboy Jun 17, 2021
7e3db37
revert ci
cosmicBboy Jun 17, 2021
1120eda
increase cache
cosmicBboy Jun 17, 2021
1edce1d
testing
cosmicBboy Jun 17, 2021
41 changes: 19 additions & 22 deletions .github/workflows/ci-tests.yml
@@ -2,24 +2,24 @@ name: CI Tests
  on:
  push:
  branches:
- - master
- - dev
- - bugfix
- - 'release/*'
- - dtypes
+ - master
+ - dev
+ - bugfix
+ - "release/*"
+ - dtypes
  pull_request:
  branches:
- - master
- - dev
- - bugfix
- - 'release/*'
- - dtypes
+ - master
+ - dev
+ - bugfix
+ - "release/*"
+ - dtypes

  env:
  DEFAULT_PYTHON: 3.8
  CI: "true"
  # Increase this value to reset cache if environment.yml has not changed
- CACHE_VERSION: 2
+ CACHE_VERSION: 3

  jobs:
  codestyle:
@@ -73,7 +73,7 @@ jobs:
  strategy:
  fail-fast: false
  matrix:
- python-version: ["3.6", "3.7", "3.8", "3.9"]
+ python-version: ["3.7", "3.8", "3.9"]
  defaults:
  run:
  shell: bash -l {0}
@@ -135,16 +135,13 @@ jobs:

  tests:
  name: >
- CI Tests (${{ matrix.python-version }},
- ${{ matrix.os }},
- pandas-${{ matrix.pandas-version }})
+ CI Tests (${{ matrix.python-version }}, ${{ matrix.os }})
  runs-on: ${{ matrix.os }}
  strategy:
  fail-fast: false
  matrix:
  os: ["ubuntu-latest", "macos-latest", "windows-latest"]
- python-version: ["3.6", "3.7", "3.8", "3.9"]
- pandas-version: ["latest", "0.25.3"]
+ python-version: ["3.7", "3.8", "3.9"]

  defaults:
  run:
@@ -186,28 +183,28 @@ jobs:
  nox
  -db conda -r -v
  --non-interactive
- --session "tests-${{ matrix.python-version }}(extra='core', pandas='${{ matrix.pandas-version }}')"
+ --session "tests-${{ matrix.python-version }}(extra='core')"

  - name: Unit Tests - Hypotheses
  run: >
  nox
  -db conda -r -v
  --non-interactive
- --session "tests-${{ matrix.python-version }}(extra='hypotheses', pandas='${{ matrix.pandas-version }}')"
+ --session "tests-${{ matrix.python-version }}(extra='hypotheses')"

  - name: Unit Tests - IO
  run: >
  nox
  -db conda -r -v
  --non-interactive
- --session "tests-${{ matrix.python-version }}(extra='io', pandas='${{ matrix.pandas-version }}')"
+ --session "tests-${{ matrix.python-version }}(extra='io')"

  - name: Unit Tests - Strategies
  run: >
  nox
  -db conda -r -v
  --non-interactive
- --session "tests-${{ matrix.python-version }}(extra='strategies', pandas='${{ matrix.pandas-version }}')"
+ --session "tests-${{ matrix.python-version }}(extra='strategies')"

  - name: Upload coverage to Codecov
  uses: "codecov/codecov-action@v1"
@@ -217,4 +214,4 @@ jobs:
  nox
  -db conda -r -v
  --non-interactive
- --session "docs-${{ matrix.python-version }}(pandas='${{ matrix.pandas-version }}')"
+ --session "docs-${{ matrix.python-version }}"
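
To make the effect of the matrix change in this workflow concrete, the sketch below counts the `tests` job combinations before and after Python 3.6 and the `pandas-version` axis were dropped. It only uses values visible in the diff; the `expand_matrix` helper is hypothetical and is not part of the workflow, nox, or GitHub Actions.

```python
from itertools import product


def expand_matrix(matrix):
    """Return every combination of a build matrix as a list of dicts.

    Hypothetical helper that mimics how a CI matrix fans out into jobs.
    """
    keys = list(matrix)
    return [dict(zip(keys, combo)) for combo in product(*matrix.values())]


# Matrix of the `tests` job before this PR (values copied from the diff).
old_matrix = {
    "os": ["ubuntu-latest", "macos-latest", "windows-latest"],
    "python-version": ["3.6", "3.7", "3.8", "3.9"],
    "pandas-version": ["latest", "0.25.3"],
}

# Matrix after this PR: Python 3.6 and the pandas axis are gone.
new_matrix = {
    "os": ["ubuntu-latest", "macos-latest", "windows-latest"],
    "python-version": ["3.7", "3.8", "3.9"],
}

old_jobs = expand_matrix(old_matrix)
new_jobs = expand_matrix(new_matrix)

# Session names follow the pattern used in the workflow's nox invocations.
new_sessions = [
    f"tests-{job['python-version']}(extra='core')" for job in new_jobs
]

print(len(old_jobs), len(new_jobs))
```

Under these assumptions the matrix shrinks from 3 × 4 × 2 combinations to 3 × 3, which is also why the session names no longer carry a `pandas=` parameter.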
2 changes: 1 addition & 1 deletion docs/source/API_reference.rst
@@ -93,7 +93,7 @@ Pandas Data Types
  :template: pandas_dtype_class.rst
  :nosignatures:

- pandera.dtypes.PandasDtype
+ pandera.dtypes.DataType


  Decorators
51 changes: 0 additions & 51 deletions docs/source/_templates/enum_class.rst

This file was deleted.

25 changes: 0 additions & 25 deletions docs/source/_templates/pandas_dtype_class.rst

This file was deleted.

2 changes: 1 addition & 1 deletion docs/source/conf.py
@@ -9,14 +9,14 @@
import doctest
import inspect
import logging as pylogging
import subprocess

# If extensions (or modules to document with autodoc) are in another directory,
# add these directories to sys.path here. If the directory is relative to the
# documentation root, use os.path.abspath to make it absolute, like shown here.
#
import os
import shutil
import subprocess
import sys

from sphinx.util import logging
43 changes: 20 additions & 23 deletions docs/source/dataframe_schemas.rst
@@ -80,7 +80,7 @@ nullable. In order to accept null values, you need to explicitly specify
  df = pd.DataFrame({"column1": [5, 1, np.nan]})

  non_null_schema = DataFrameSchema({
- "column1": Column(pa.Int, Check(lambda x: x > 0))
+ "column1": Column(pa.Float, Check(lambda x: x > 0))
  })

  non_null_schema.validate(df)
@@ -91,18 +91,11 @@ nullable. In order to accept null values, you need to explicitly specify
  ...
  SchemaError: non-nullable series contains null values: {2: nan}

- .. note:: Due to a known limitation in
- `pandas prior to version 0.24.0 <https://pandas.pydata.org/pandas-docs/stable/user_guide/integer_na.html>`_,
- integer arrays cannot contain ``NaN`` values, so this schema will return
- a DataFrame where ``column1`` is of type ``float``.
- :class:`~pandera.dtypes.PandasDtype` does not currently support the nullable integer
- array type, but you can still use the "Int64" string alias for nullable
- integer arrays

  .. testcode:: null_values_in_columns

  null_schema = DataFrameSchema({
- "column1": Column(pa.Int, Check(lambda x: x > 0), nullable=True)
+ "column1": Column(pa.Float, Check(lambda x: x > 0), nullable=True)
  })

  print(null_schema.validate(df))
@@ -401,7 +394,7 @@ schema, specify ``strict=True``:

  Traceback (most recent call last):
  ...
- SchemaError: column 'column2' not in DataFrameSchema {'column1': <Schema Column: 'None' type=int>}
+ SchemaError: column 'column2' not in DataFrameSchema {'column1': <Schema Column: 'None' type=DataType(int64)>}

  Alternatively, if your DataFrame contains columns that are not in the schema,
  and you would like these to be dropped on validation,
@@ -626,13 +619,17 @@ Some examples of where this can be provided to pandas are:
  },
  )

- df = pd.DataFrame.from_dict(
- {
- "a": {"column1": 1, "column2": "valueA", "column3": True},
- "b": {"column1": 1, "column2": "valueB", "column3": True},
- },
- orient="index"
- ).astype(schema.dtype).sort_index(axis=1)
+ df = (
+ pd.DataFrame.from_dict(
+ {
+ "a": {"column1": 1, "column2": "valueA", "column3": True},
+ "b": {"column1": 1, "column2": "valueB", "column3": True},
+ },
+ orient="index",
+ )
+ .astype({col: str(dtype) for col, dtype in schema.dtypes.items()})
+ .sort_index(axis=1)
+ )

  print(schema.validate(df))
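
The new `.astype(...)` call in this hunk turns each column's data type object into a string before handing the mapping to pandas. The sketch below reproduces just that dictionary comprehension without pandera installed. `StandInDataType` is a hypothetical placeholder, not pandera's `DataType`, and the aliases (`int64`, `str`, `bool`) are illustrative assumptions rather than values taken from the PR.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StandInDataType:
    """Hypothetical stand-in for a pandera data type; not the real class."""

    alias: str

    def __str__(self) -> str:
        # Assumption suggested by the diff: str() of a data type yields an
        # alias that pandas accepts in DataFrame.astype.
        return self.alias


# Stand-in for `schema.dtypes`: column name -> data type object.
schema_dtypes = {
    "column1": StandInDataType("int64"),
    "column2": StandInDataType("str"),
    "column3": StandInDataType("bool"),
}

# Same comprehension as in the updated docs example.
astype_mapping = {col: str(dtype) for col, dtype in schema_dtypes.items()}

print(astype_mapping)
```

The resulting plain `dict` of strings is what gets passed to `DataFrame.astype`, so pandas never needs to understand the data type objects themselves.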

@@ -718,11 +715,11 @@ data pipeline:

  <Schema DataFrameSchema(
  columns={
- 'col1': <Schema Column(name=col1, type=int)>
+ 'col1': <Schema Column(name=col1, type=DataType(int64))>
  },
  checks=[],
  coerce=False,
- pandas_dtype=None,
+ dtype=None,
  index=None,
  strict=True
  name=None,
@@ -756,15 +753,15 @@ the pipeline output.

  <Schema DataFrameSchema(
  columns={
- 'column2': <Schema Column(name=column2, type=float)>
+ 'column2': <Schema Column(name=column2, type=DataType(float64))>
  },
  checks=[],
  coerce=True,
- pandas_dtype=None,
+ dtype=None,
  index=<Schema MultiIndex(
  indexes=[
- <Schema Index(name=column3, type=int)>
- <Schema Index(name=column1, type=int)>
+ <Schema Index(name=column3, type=DataType(int64))>
+ <Schema Index(name=column1, type=DataType(int64))>
  ]
  coerce=False,
  strict=False,
10 changes: 5 additions & 5 deletions docs/source/extensions.rst
@@ -94,20 +94,20 @@ The corresponding strategy for this check would be:
  import pandera.strategies as st

  def equals_strategy(
- pandas_dtype: pa.PandasDtype,
+ pandera_dtype: pa.DataType,
  strategy: Optional[st.SearchStrategy] = None,
  *,
  value,
  ):
  if strategy is None:
  return st.pandas_dtype_strategy(
- pandas_dtype, strategy=hypothesis.strategies.just(value),
+ pandera_dtype, strategy=hypothesis.strategies.just(value),
  )
  return strategy.filter(lambda x: x == value)

  As you may notice, the ``pandera`` strategy interface has two arguments
  followed by keyword-only arguments that match the check function keyword-only
- check statistics. The ``pandas_dtype`` positional argument is useful for
+ check statistics. The ``pandera_dtype`` positional argument is useful for
  ensuring the correct data type. In the above example, we're using the
  :func:`~pandera.strategies.pandas_dtype_strategy` strategy to make sure the
  generated ``value`` is of the correct data type.
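
The signature layout described in this hunk (two leading arguments, then keyword-only arguments named after the check statistics) can be checked mechanically with the standard library. The sketch below is an illustration only: `follows_strategy_interface` is a hypothetical helper that pandera does not provide, and the function bodies are elided because only the parameter layout matters here.

```python
import inspect


def follows_strategy_interface(strategy_fn, check_statistics):
    """Return True if ``strategy_fn`` has exactly two leading positional
    parameters and keyword-only parameters matching ``check_statistics``.

    Hypothetical helper for illustration; not part of pandera.
    """
    params = list(inspect.signature(strategy_fn).parameters.values())
    positional = [
        p for p in params if p.kind is inspect.Parameter.POSITIONAL_OR_KEYWORD
    ]
    keyword_only = [
        p.name for p in params if p.kind is inspect.Parameter.KEYWORD_ONLY
    ]
    return len(positional) == 2 and set(keyword_only) == set(check_statistics)


# Parameter layout copied from the docs example above.
def equals_strategy(pandera_dtype, strategy=None, *, value):
    ...


# Counter-example: without the bare ``*``, ``value`` is not keyword-only.
def missing_star(pandera_dtype, strategy=None, value=None):
    ...


print(follows_strategy_interface(equals_strategy, ["value"]))
print(follows_strategy_interface(missing_star, ["value"]))
```

The first function matches the described layout and the second does not, because dropping the bare `*` turns `value` into a third positional parameter.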
@@ -147,15 +147,15 @@ would look like:
  :skipif: SKIP_STRATEGY

  def in_between_strategy(
- pandas_dtype: pa.PandasDtype,
+ pandera_dtype: pa.DataType,
  strategy: Optional[st.SearchStrategy] = None,
  *,
  min_value,
  max_value
  ):
  if strategy is None:
  return st.pandas_dtype_strategy(
- pandas_dtype,
+ pandera_dtype,
  min_value=min_value,
  max_value=max_value,
  exclude_min=False,
6 changes: 3 additions & 3 deletions docs/source/index.rst
@@ -155,7 +155,7 @@ Quick Start
  You can pass the built-in python types that are supported by
  pandas, or strings representing the
  `legal pandas datatypes <https://pandas.pydata.org/docs/user_guide/basics.html#dtypes>`_,
- or pandera's ``PandasDtype`` enum:
+ or pandera's ``DataType``:

  .. testcode:: quick_start

@@ -171,13 +171,13 @@ or pandera's ``PandasDtype`` enum:
  # pandas > 1.0.0 support native "string" type
  "str_column2": pa.Column("str"),

- # pandera PandasDtype enum
+ # pandera DataType
  "int_column3": pa.Column(pa.Int),
  "float_column3": pa.Column(pa.Float),
  "str_column3": pa.Column(pa.String),
  })

- For more details on data types, see :class:`~pandera.dtypes.PandasDtype`
+ For more details on data types, see :class:`~pandera.dtypes.DataType`


  Schema Model