Abstract out validation logic to support non-pandas dataframes, e.g. spark, dask, etc #381
**Initial Thoughts**

Currently, the schema and check classes conflate the specification of schema properties with the validation of those properties on some data. We may want to separate these two concerns.
Here's a high-level sketch of the API:

```python
# pandera contributor to codebase or custom third-party engine
class MySpecialDataFrameValidationEngine(ValidationEngine):
    # implement a bunch of stuff
    ...

register_validation_engine(MySpecialDataFrameValidationEngine)

# end-user interaction, with hypothetical special_dataframe package.
from special_dataframe import MySpecialDataFrame

special_df = MySpecialDataFrame(...)
schema = pa.DataFrameSchema({...})
schema.validate(special_df)
```
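To make "implement a bunch of stuff" a bit more concrete, here's a minimal sketch of what an engine's hooks could look like. Everything here is hypothetical: `ValidationEngine`, the hook names, and `SchemaError` dispatching are illustrative, not an existing pandera API.

```python
# purely illustrative sketch -- all names here are hypothetical, not pandera API
class MySpecialDataFrameValidationEngine(ValidationEngine):
    # dataframe types this engine knows how to validate
    handles = (MySpecialDataFrame,)

    def check_column_presence(self, df, schema):
        # verify that all columns declared in the schema exist on the dataframe
        missing = set(schema.columns) - set(df.columns)
        if missing:
            raise SchemaError(f"missing columns: {missing}")

    def check_dtypes(self, df, schema):
        # coerce/verify dtypes using the backend's native dtype system
        ...

    def run_checks(self, df, checks):
        # execute element- and dataframe-level Check objects against df
        ...
```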
I think those operations can be handled by […]. We could merge the idea of […]. Question: what to do with […]?
Any ETA on Modin support?
hey @crypdick once #504 is merged (should be in the next few days) I'm going to tackle this issue. The plan right now is to make a […]. I've done a little bit of prototyping of the new validation engine, but it still needs a bunch of work... I'm going to push for a finished solution before scipy conf this year, so ETA mid-July?
Went through the discussion and we'd certainly be interested in contributing a Fugue backend.
Hi, I was just wondering if it's possible to use pandera with numpy + xarray?
@JackKelly I'd love to add support for numpy + xarray, but unfortunately it's currently not possible. After this PR is merged (still WIP) we'll have a much better interface for extending pandera to other non-pandas data structures; numpy and xarray would be natural to support in pandera. Out of curiosity (looking at openclimatefix/nowcasting_dataset#211): is your primary use-case to check the data types and dimensions of xarray objects?
Thanks loads for the reply! No worries at all! Yes, our primary use-case is to check the data type, dimensions, and values of xarray Datasets and DataArrays.
Great! Will keep this in mind for when we get there. Also, once pandera schemas can be used as valid pydantic types (#453), the solution you outline here would be pretty straightforward to port over to pandera, making for a pretty concise schema definition... I'm imagining a user-API like:

```python
# DataArray and NDField are imagined APIs, not yet part of pandera
from typing import Optional

import pandera as pa
import pydantic

class ImageDataset(pa.SchemaModel):
    data: DataArray[int] = NDField(dims=("time", "x", "y"))
    x_coords: Optional[DataArray[int]] = NDField(dims=("index",))
    y_coords: Optional[DataArray[int]] = NDField(dims=("index",))

class Example(pydantic.BaseModel):
    """A single machine learning training example."""
    satellite: Optional[ImageDataset]
    nwp: Optional[ImageDataset]
```
That looks absolutely perfect, thank you!
Hi all. I wanted to share a little experiment we've been playing with, xarray-schema, which provides schema validation logic for Xarray objects. We've been following this thread closely and we're looking at ways to integrate what we've done with pandera/pydantic.
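Usage looks roughly like the following — a sketch based on the project's README at the time; the exact argument names and behavior are assumptions, so check the current docs:

```python
# rough sketch of xarray-schema usage; argument names are assumptions,
# consult the project's README for the exact current API
import numpy as np
import xarray as xr
from xarray_schema import DataArraySchema

schema = DataArraySchema(dtype=np.integer, dims=["x", "y"])
da = xr.DataArray(np.zeros((3, 3), dtype=int), dims=("x", "y"))
schema.validate(da)  # raises on dtype/dims mismatch
```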
wow @jhamman this looks amazing! I'd love to integrate, do you want to find a time to chat? Also feel free to join the discord community if you want to discuss further there: https://discord.gg/vyanhWuaKB
Thanks for your email, Niels.

Overall, I think for 90% of the processing I've seen done in Pandas, PETL is a better choice. For the remaining 10%, Pandas is needed, more in the way NumPy is. Having schemas for PETL would be awesome. Supporting it should be much easier than supporting Pandas. As I mentioned, it doesn't define custom data types, and its data representation model is really straightforward: lists of (lists or tuples) of any Python objects.
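To illustrate that representation model (a minimal sketch; PETL support in pandera is hypothetical, so this only shows the plain data structure PETL operates on):

```python
import petl as etl

# a PETL table is just an iterable of rows: the first row is the header,
# and cells can hold any Python objects
table = [
    ["id", "name", "score"],
    [1, "alice", 0.9],
    [2, "bob", 0.7],
]

print(etl.header(table))  # ('id', 'name', 'score')
print(etl.nrows(table))   # 2
```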
What would be required to ensure we can add a GeoDataFrame type from GeoPandas with a Pydantic BaseModel? I would like to do:

```python
from typing import Optional

import pandera as pa
import pydantic
from pandera.typing import Series
from pandera.typing.geopandas import GeoDataFrame, GeoSeries
from shapely.geometry import Polygon

class BaseGeoDataFrameSchema(pa.SchemaModel):
    geometry: GeoSeries
    properties: Optional[Series[str]]

class Inputs(pydantic.BaseModel):
    gdf: GeoDataFrame[BaseGeoDataFrameSchema]
    # TypeError: Fields of type "<class 'pandera.typing.geopandas.GeoDataFrame'>" are not supported.

gdf = GeoDataFrame[BaseGeoDataFrameSchema](
    {"geometry": [Polygon(((0, 0), (0, 1), (1, 1), (1, 0)))], "extra": [1]},
    crs=4326,
)
validated_inputs = Inputs(gdf=gdf)
```
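Until that's supported natively, one possible workaround is to let pydantic accept arbitrary types and run the pandera schema inside a validator — a sketch, assuming pydantic v1 semantics:

```python
# sketch of a possible workaround, assuming pydantic v1
import geopandas as gpd
import pydantic

class Inputs(pydantic.BaseModel):
    gdf: gpd.GeoDataFrame  # plain geopandas type; pandera check runs below

    class Config:
        arbitrary_types_allowed = True

    @pydantic.validator("gdf")
    def _validate_gdf(cls, v):
        # SchemaModel.validate raises a SchemaError on failure
        return BaseGeoDataFrameSchema.validate(v)
```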
hi all, pinging this issue to point everyone to this PR: #913. It's a WIP PR laying the groundwork for improving the extensibility of pandera's abstractions. I'd very much appreciate people's feedback on this, nothing is set in stone yet! I'll be adding additional details to the PR description in the next few days, but for now it outlines the main changes at a high level. Please chime in with your thoughts/comments!
**Is your feature request related to a problem? Please describe.**

Extending pandera to non-pandas dataframe-like structures is a challenge today because the schema and schema component class definitions are strongly coupled with the pandas API. For example, the `DataFrameSchema.validate` method assumes that validated objects follow the pandas API.

**Potential Solutions**

1. Abstract out `Schema`, `SchemaComponent`, and `Check` abstract base classes so that core and third-party pandera schemas can be easily developed on top of them. Subclasses of these base classes would implement the validation logic for a specific library, e.g. `SparkSchema`, `PandasSchema`, etc.
2. Dispatch to the appropriate validation engine based on the type of `obj` when `schema.validate(obj)` is called. A sketch of this idea follows the solution discussion below.
3. As in (2), but also add an `engine: str` option, to explicitly specify which engine to use. (q: should this be in `__init__` or `validate` or both?)

**Describe the solution you'd like**
Because this is quite a momentous change in pandera's scope (to support not just pandas dataframes), I'll first re-iterate the design philosophy of pandera:

[…]
In keeping with these principles, I propose going with solution (2), in order to prevent an increase in the complexity and surface area of the user-facing API (`DaskSchema`, `PandasSchema`, `SparkSchema`, `VaexSchema`, etc).
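To make the dispatch idea in solution (2) concrete, here's a minimal sketch of what type-based engine lookup could look like — all names are illustrative, not actual pandera internals:

```python
# illustrative sketch of type-based engine dispatch; names are hypothetical
_ENGINES: dict = {}

def register_validation_engine(df_type, engine):
    """Associate a dataframe type with the engine that validates it."""
    _ENGINES[df_type] = engine

def get_engine(obj):
    """Find the registered engine for obj's type."""
    for df_type, engine in _ENGINES.items():
        if isinstance(obj, df_type):
            return engine
    raise TypeError(f"no validation engine registered for {type(obj)}")

# schema.validate(obj) would then delegate:
#     engine = get_engine(obj)
#     return engine.validate(schema, obj)
```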
**edit:**

Actually with solution (1), one approach that would keep the API surface area small is to use a subpackage pattern that replicates the pandera interface but with the alternative backend:
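For instance, something like this — the `pandera.spark` module path is hypothetical, just to illustrate the pattern:

```python
# hypothetical subpackage pattern -- same pandera API, different backend;
# pandera.spark does not exist today
import pandera.spark as pa

schema = pa.DataFrameSchema({"x": pa.Column(int)})
schema.validate(spark_df)  # spark_df: a pyspark.sql.DataFrame
```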
Etc...
Will need to think through the pros and cons of 1 vs 2 some more...
Re: data synthesis strategies, which are used purely for testing and not meant to generate massive amounts of data: we could just fall back on pandas and convert the synthesized data to the corresponding dataframe type, assuming the df library supports this, e.g. spark.createDataFrame.
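A sketch of that fallback, assuming pandera's hypothesis-backed `schema.example` for synthesis and pyspark's ability to build a DataFrame from a pandas one:

```python
# sketch: synthesize test data with pandas-based strategies, then convert
# to the target dataframe type (pyspark accepts pandas DataFrames directly)
import pandera as pa
from pyspark.sql import SparkSession

schema = pa.DataFrameSchema({"x": pa.Column(int, pa.Check.ge(0))})

pandas_df = schema.example(size=5)  # requires the hypothesis extra
spark = SparkSession.builder.getOrCreate()
spark_df = spark.createDataFrame(pandas_df)
```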