Abstract out validation logic to support non-pandas dataframes, e.g. spark, dask, etc #381
**Initial Thoughts**

Currently, the schema and check classes conflate the specification of schema properties with the validation of those properties on some data. We may want to separate these two concerns.
Here's a high-level sketch of the API:

```python
# pandera contributor to codebase or custom third-party engine
class MySpecialDataFrameValidationEngine(ValidationEngine):
    # implement a bunch of stuff
    ...

register_validation_engine(MySpecialDataFrameValidationEngine)

# end-user interaction, with hypothetical special_dataframe package.
from special_dataframe import MySpecialDataFrame

special_df = MySpecialDataFrame(...)
schema = pa.DataFrameSchema({...})
schema.validate(special_df)
```
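To make "implement a bunch of stuff" a bit more concrete, here's a minimal sketch of what an engine's hooks could look like. Everything here is hypothetical: `ValidationEngine`, the hook names, and `SchemaError` dispatching are illustrative, not an existing pandera API.

```python
# purely illustrative sketch -- all names here are hypothetical, not pandera API
class MySpecialDataFrameValidationEngine(ValidationEngine):
    # dataframe types this engine knows how to validate
    handles = (MySpecialDataFrame,)

    def check_column_presence(self, df, schema):
        # verify that all columns declared in the schema exist on the dataframe
        missing = set(schema.columns) - set(df.columns)
        if missing:
            raise SchemaError(f"missing columns: {missing}")

    def check_dtypes(self, df, schema):
        # coerce/verify dtypes using the backend's native dtype system
        ...

    def run_checks(self, df, checks):
        # execute element- and dataframe-level Check objects against df
        ...
```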
I think those operations can be handled by […]. We could merge the idea of […]. Question: what to do with […]?
Any ETA on Modin support?
hey @crypdick once #504 is merged (should be in the next few days) I'm going to tackle this issue. The plan right now is to make a […]. I've done a little bit of prototyping of the new validation engine, but it still needs a bunch of work... I'm going to push for a finished solution before scipy conf this year, so ETA mid-July?
Went through the discussion and we'd certainly be interested in contributing a Fugue backend.
Hi, I was just wondering if it's possible to use pandera with numpy + xarray?
@JackKelly I'd love to add support for numpy + xarray, but unfortunately it's currently not possible. After this PR is merged (still WIP) we'll have a much better interface for extending pandera to other non-pandas data structures; numpy and xarray would be natural to support in pandera. Out of curiosity (looking at openclimatefix/nowcasting_dataset#211): is your primary use-case to check the data types and dimensions of xarray objects?
Thanks loads for the reply! No worries at all! Yes, our primary use-case is to check the data type, dimensions, and values of xarray Datasets and DataArrays.
Great! Will keep this in mind for when we get there. Also, once pandera schemas can be used as valid pydantic types (#453), the solution you outline here would be pretty straightforward to port over to pandera, making for a pretty concise schema definition... I'm imagining a user-API like:

```python
# DataArray and NDField are imagined APIs, not yet part of pandera
from typing import Optional

import pandera as pa
import pydantic

class ImageDataset(pa.SchemaModel):
    data: DataArray[int] = NDField(dims=("time", "x", "y"))
    x_coords: Optional[DataArray[int]] = NDField(dims=("index",))
    y_coords: Optional[DataArray[int]] = NDField(dims=("index",))

class Example(pydantic.BaseModel):
    """A single machine learning training example."""
    satellite: Optional[ImageDataset]
    nwp: Optional[ImageDataset]
```
That looks absolutely perfect, thank you!
Hi all. I wanted to share a little experiment we've been playing with, xarray-schema, which provides schema validation logic for Xarray objects. We've been following this thread closely and we're looking at ways to integrate what we've done with pandera/pydantic.
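Usage looks roughly like the following — a sketch based on the project's README at the time; the exact argument names and behavior are assumptions, so check the current docs:

```python
# rough sketch of xarray-schema usage; argument names are assumptions,
# consult the project's README for the exact current API
import numpy as np
import xarray as xr
from xarray_schema import DataArraySchema

schema = DataArraySchema(dtype=np.integer, dims=["x", "y"])
da = xr.DataArray(np.zeros((3, 3), dtype=int), dims=("x", "y"))
schema.validate(da)  # raises on dtype/dims mismatch
```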
wow @jhamman this looks amazing! I'd love to integrate, do you want to find a time to chat? Also feel free to join the discord community if you want to discuss further there: https://discord.gg/vyanhWuaKB
Thanks for your email, Niels.

Overall, I think for 90% of the processing I've seen done in Pandas, PETL is a better choice. For the remaining 10%, Pandas is needed, more in the way NumPy is. Having schemas for PETL would be awesome. Supporting it should be much easier than supporting Pandas. As I mentioned, it doesn't define custom data types, and its data representation model is really straightforward: lists of (lists or tuples) of any Python objects.
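To illustrate that representation model (a minimal sketch; PETL support in pandera is hypothetical, so this only shows the plain data structure PETL operates on):

```python
import petl as etl

# a PETL table is just an iterable of rows: the first row is the header,
# and cells can hold any Python objects
table = [
    ["id", "name", "score"],
    [1, "alice", 0.9],
    [2, "bob", 0.7],
]

print(etl.header(table))  # ('id', 'name', 'score')
print(etl.nrows(table))   # 2
```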
What would be required to ensure we can add a GeoDataFrame type from GeoPandas with a Pydantic BaseModel? I would like to do:

```python
from typing import Optional

import pandera as pa
import pydantic
from pandera.typing import Series
from pandera.typing.geopandas import GeoDataFrame, GeoSeries
from shapely.geometry import Polygon

class BaseGeoDataFrameSchema(pa.SchemaModel):
    geometry: GeoSeries
    properties: Optional[Series[str]]

class Inputs(pydantic.BaseModel):
    gdf: GeoDataFrame[BaseGeoDataFrameSchema]
    # TypeError: Fields of type "<class 'pandera.typing.geopandas.GeoDataFrame'>" are not supported.

gdf = GeoDataFrame[BaseGeoDataFrameSchema](
    {"geometry": [Polygon(((0, 0), (0, 1), (1, 1), (1, 0)))], "extra": [1]},
    crs=4326,
)
validated_inputs = Inputs(gdf=gdf)
```
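Until that's supported natively, one possible workaround is to let pydantic accept arbitrary types and run the pandera schema inside a validator — a sketch, assuming pydantic v1 semantics:

```python
# sketch of a possible workaround, assuming pydantic v1
import geopandas as gpd
import pydantic

class Inputs(pydantic.BaseModel):
    gdf: gpd.GeoDataFrame  # plain geopandas type; pandera check runs below

    class Config:
        arbitrary_types_allowed = True

    @pydantic.validator("gdf")
    def _validate_gdf(cls, v):
        # SchemaModel.validate raises a SchemaError on failure
        return BaseGeoDataFrameSchema.validate(v)
```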
hi all, pinging this issue to point everyone to this PR: #913. It's a WIP PR laying the groundwork for improving the extensibility of pandera's abstractions. I'd very much appreciate people's feedback on this, nothing is set in stone yet! I'll be adding additional details to the PR description in the next few days, but for now it outlines the main changes at a high level. Please chime in with your thoughts/comments!
**Is your feature request related to a problem? Please describe.**

Extending pandera to non-pandas dataframe-like structures is a challenge today because the schema and schema component class definitions are strongly coupled with the pandas API. For example, the `DataFrameSchema.validate` method assumes that validated objects follow the pandas API.

**Potential Solutions**

1. Abstract out `Schema`, `SchemaComponent`, and `Check` abstract base classes so that core and third-party pandera schemas can be easily developed on top of them. Subclasses of these base classes would implement the validation logic for a specific library, e.g. `SparkSchema`, `PandasSchema`, etc.
2. Dispatch to the appropriate validation engine based on the type of `obj` when `schema.validate(obj)` is called. A sketch of this idea follows the solution discussion below.
3. As in (2), but also add an `engine: str` option, to explicitly specify which engine to use. (q: should this be in `__init__` or `validate` or both?)

**Describe the solution you'd like**
Because this is quite a momentous change in pandera's scope (to support not just pandas dataframes), I'll first re-iterate the design philosophy of pandera:

[…]
In keeping with these principles, I propose going with solution (2), in order to prevent an increase in the complexity and surface area of the user-facing API (`DaskSchema`, `PandasSchema`, `SparkSchema`, `VaexSchema`, etc).
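To make the dispatch idea in solution (2) concrete, here's a minimal sketch of what type-based engine lookup could look like — all names are illustrative, not actual pandera internals:

```python
# illustrative sketch of type-based engine dispatch; names are hypothetical
_ENGINES: dict = {}

def register_validation_engine(df_type, engine):
    """Associate a dataframe type with the engine that validates it."""
    _ENGINES[df_type] = engine

def get_engine(obj):
    """Find the registered engine for obj's type."""
    for df_type, engine in _ENGINES.items():
        if isinstance(obj, df_type):
            return engine
    raise TypeError(f"no validation engine registered for {type(obj)}")

# schema.validate(obj) would then delegate:
#     engine = get_engine(obj)
#     return engine.validate(schema, obj)
```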
**edit:**

Actually with solution (1), one approach that would keep the API surface area small is to use a subpackage pattern that replicates the pandera interface but with the alternative backend:
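For instance, something like this — the `pandera.spark` module path is hypothetical, just to illustrate the pattern:

```python
# hypothetical subpackage pattern -- same pandera API, different backend;
# pandera.spark does not exist today
import pandera.spark as pa

schema = pa.DataFrameSchema({"x": pa.Column(int)})
schema.validate(spark_df)  # spark_df: a pyspark.sql.DataFrame
```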
Etc...
Will need to think through the pros and cons of 1 vs 2 some more...
Re: data synthesis strategies, which are used purely for testing and not meant to generate massive amounts of data: we could just fall back on pandas and convert the synthesized data to the corresponding dataframe type, assuming the df library supports this, e.g. spark.createDataFrame.
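A sketch of that fallback, assuming pandera's hypothesis-backed `schema.example` for synthesis and pyspark's ability to build a DataFrame from a pandas one:

```python
# sketch: synthesize test data with pandas-based strategies, then convert
# to the target dataframe type (pyspark accepts pandas DataFrames directly)
import pandera as pa
from pyspark.sql import SparkSession

schema = pa.DataFrameSchema({"x": pa.Column(int, pa.Check.ge(0))})

pandas_df = schema.example(size=5)  # requires the hypothesis extra
spark = SparkSession.builder.getOrCreate()
spark_df = spark.createDataFrame(pandas_df)
```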