
Support for polars #1064

Closed
fzyzcjy opened this issue Jan 3, 2023 · 67 comments
Labels
enhancement New feature or request

Comments

@fzyzcjy commented Jan 3, 2023

Hi, thanks for the lib! I wonder if it can support type checking for polars?

@fzyzcjy added the enhancement (New feature or request) label on Jan 3, 2023
@cosmicBboy (Collaborator)

Hi @fzyzcjy would love to support polars! Doing so is currently blocked by #381, which I'm trying to get done ASAP, as it'll unblock support for a lot of different data frameworks, including polars.

@fzyzcjy (Author) commented Jan 3, 2023

Thanks, looking forward to it!

@igmriegel

> Thanks, looking forward to it!

Me too!!

@AndriiG13 (Contributor)

> Thanks, looking forward to it!

Same here, would also be happy to contribute to this one!

@francesco086

Same, happy to help :)

@igmriegel

Hello, I don't really know if it helps, but I wanted to share this project

https://github.com/kolonialno/patito

They paired Pydantic and Polars, and they offer some of the functionality Pandera offers.
Maybe we could fork something or use it as inspiration?

I'm not really experienced, but I'm willing to help too. 😃

@cosmicBboy (Collaborator) commented Feb 9, 2023

hi all! so since merging the pandera internals re-write: #913

Support for polars is technically unblocked! I'm still working on the docs for extending pandera with custom schema specs and backends, but basically here's a rough roadmap for supporting polars:

Support for pandera[polars]

Support for polars can come in two phases, both of which are actually independent from each other.

  1. Add support for an ibis backend. Their polars support is currently experimental, but this is a high-leverage integration that also supports a bunch of other execution backends. The idea here is that if pandera supports ibis as a backend, users get access to a bunch of other backends such as duckdb, mysql, postgres, etc. In-database validation, here we come! 🚀
  2. Support a native pandera polars backend, independent of ibis. This will be useful if folks don't want to depend on ibis and want to write custom checks with the polars API.

The limitation of (1) would be that if you want to write custom checks, you'd have to do it with the ibis API. With (2), you'd be able to write custom checks (e.g. here) with the polars API.

What if I want to use pandera to validate polars dataframes but don't want the pandas dependency?

Currently pandera has a hard dependency on pandas, which is pretty much ubiquitous in data eng/data science/ML stacks, but in case folks want to use pandera-polars in a limited context (e.g. AWS Lambda) and want to minimize dependencies, there is a longer term plan for this. Basically, we can either:

  1. Do a breaking pandera==1.0.0 release, where users have to explicitly install pandera[pandas] for pandas DataFrame validation, then organize the library to support other backends, e.g. pandera[polars], pandera[ibis], etc.
  2. Rip out the pandera.core and pandera.backend modules into an upstream library pandera-core, so that a contrib or plugin package, e.g. pandera-polars, doesn't have to depend on pandas and can be installed independently as pip install pandera-polars

Do any of you have any thoughts on this? @fzyzcjy @igmriegel @AndriiG13 @francesco086

@fzyzcjy (Author) commented Feb 10, 2023

I use both polars and pandas, so I don't have a strong preference; both options are totally acceptable. Good job, and I cannot wait to use it!

@gab23r (Contributor) commented Feb 10, 2023

As a user it would be nice to have only one package, and this package would have no strict dependencies. So I am clearly in favor of option 1 here! Otherwise we would end up with pandera-polars, pandera-ibis, pandera-vaex, ... The list will grow, and it will be more complicated to manage from the user's perspective.

@francesco086 commented Feb 10, 2023

> Do any of you have any thoughts on this? @fzyzcjy @igmriegel @AndriiG13 @francesco086

Thinking out loud (and please correct me if I am wrong):

  1. Option 1 means all the code will be in one single repository. Option 2 means that there will be many repos with inter-dependencies (e.g. pandera-pandas will have a dependency on pandera-core as specified in pyproject.toml). This means the various pandera-*df_engine* packages can be developed at different speeds. If you make a major release of pandera-core you don't need to immediately update all engines, as they can keep relying on the old version. -> Option 2 is more modular
  2. Option 1 means that when I install pandera (without extras) and try to do something that requires an extra, I will get an error informing me that I need to install the extra, and I can do it straight away. With option 2 I could have compatibility issues, if for example pandera-pandas and pandera-polars rely on incompatible versions of pandera-core (or others). -> Option 1 makes it easier to work with many df engines in the same venv
  3. Option 2 means smaller codebases that are probably easier to get into and manage (to attract contributors)
  4. Option 2 forces pandera-core to have a neat public interface that can be re-used without "hacks"

All in all I am in favor of Option 2. My point 2 above is not very important in my opinion; if you really need to work with two different df engines, you can always do it in two separate venvs.

@AndriiG13 (Contributor)

As a user I think both phases make sense. As I understand it, the ibis support would be especially nice for folks who are using different df engines in their project, since they can reuse checks defined in the ibis API across the engines.

At the same time I think it's good to have a Polars native solution.

So I like both, but frankly I'm ignorant to the possible package management implications mentioned by others above.

@igmriegel commented Feb 12, 2023

> 2. Support a native pandera polars backend, independent of ibis. This will be useful if folks don't want to depend on ibis and want to write custom checks with the polars API.

I think we should not bring in a third dependency to be able to serve Polars, and I agree with francesco086's considerations about pandera-core.

@cosmicBboy (Collaborator)

Cool, thanks for the discussion all!

So re: the polars-support roadmap, I'll plan on working on the ibis backend integration as an n=2 sample of how well the pandera core/backend abstractions fit into supporting another non-pandas-API framework.

Help Needed!

Will definitely need some help designing/implementing the polars-native backend: I'll need to ramp up on the Python polars API myself, but would anyone on this thread be willing to help out?

Design

  1. assessing whether the attributes of DataFrameSchema and Columns fit with the polars API. I'm aware that it doesn't have a notion of Index (which sounds awesome actually 😎), but if there's anything besides columns that needs to be validated in a polars dataframe, that would be good to know
  2. assessing whether the backend model fits into how polars works

Implementation

Eventually I'll also need help implementing:

Please give a 👍 to this comment if you'll be able to help with one or more of the above

@StefanBRas

Polars itself ships data synthesis functions for use with hypothesis: API reference link.

@ritchie46

One thing that would also be cool is validating polars LazyFrames. A LazyFrame is a promise of a computation/query plan. But polars knows the schema for every step in the plan, so we can validate before running the queries. I think this can be very valuable in ETL: your pipeline is validated before it runs, not 20 minutes in.

@francesco086

Just want to mention that I really would like to help, but I am not familiar with polars (yet). So I think in this first phase I am probably not useful. I am very much willing to learn what is needed and implement following your directions :) (please use me!)

@ritchie46

> Just want to mention that I really would like to help, but I am not familiar with polars (yet). So I think in this first phase I am probably not useful. I am very much willing to learn what is needed and implement following your directions :) (please use me!)

If you are familiar with pandera, please join our Discord. We can open a pandera thread and help one snippet at a time.

@francesco086

> One thing that would also be cool is validating polars LazyFrames. A LazyFrame is a promise of a computation/query plan. But polars knows the schema for every step in the plan, so we can validate before running the queries. I think this can be very valuable in ETL. Your pipeline is validated before it runs, and not 20mins in.

@ritchie46 One important aspect to keep in mind is that pandera has schema models, which are much more than column names and types. For example, a pandera schema could describe and check the constraint col_a + col_b = col_c.
So I am not sure about LazyFrames; I don't think it is possible to validate them before actually doing a computation.

@cosmicBboy (Collaborator)

> One thing that would also be cool is validating polars LazyFrames. A LazyFrame is a promise of a computation/query plan. But polars knows the schema for every step in the plan, so we can validate before running the queries.

I think this will be very valuable for type checking and perhaps other dataframe metadata checks, though a limitation would be that it wouldn't be able to apply checks on actual values (e.g. pandera.Check.ge(0)) before running the code (unless I'm missing something conceptually). This is fine, I think, as long as the UX for applying pandera schemas to LazyFrames is clarified to cover only dataframe metadata (data types, column names, etc.)

Regardless, LazyFrame validation would definitely be a huge plus!

@kuatroka commented Feb 23, 2023

> 1. Support a native pandera polars backend, independent of ibis. This will be useful if folks don't want to depend on ibis and want to write custom checks with the polars API.

If you look to future-proof pandera, expand the number of users, and be strategic about it, the way to go would be to go full throttle on independent polars support and prioritise it over ibis. The ibis project is roughly 9 years old and has 2.5K stars. polars is roughly 3 years old and has 14.3K stars. Many people say stars are a vanity metric, but I disagree. They are still metrics, and they do show what is being used more.

To be clear, ibis is great and polars is great, but if it's about sequencing new feature development and allocating effort per unit of immediate usefulness, and about reaching the widest possible number of users, I'd suggest going for independent, full polars support first.

P.S. pandera is just awesome.

@cosmicBboy (Collaborator)

@kuatroka good feedback!

My short-term priority is still to add an ibis backend, since the motivation there is to enable in-database validation for a number of supported DBs (postgres, mysql, etc). Support for this has been in demand for a while now; the nice side effect is that it adds (experimental) support for polars.

That said, I'm all for polars support. Would love community contributions on this, but I owe all of you a comprehensive set of docs first on how to extend pandera with your own schema specification and backends (I'm working on this now!)

@the-matt-morris (Contributor)

Jumping in a little late here, but as a user of both pandera and polars (I love both libraries), I'd be willing to contribute to make this happen, so that I don't have to add pandas as a dependency in my pipelines just to perform the validation portion on pandas dataframes!

@blais commented Apr 8, 2023

Adding a +1 for Polars schemas!

@kykyi (Contributor) commented Jun 4, 2023

@cosmicBboy keen to help 🚀

Sounds like you are prioritising the ibis integration, which I'd be keen to look at. Do you have any work in progress on this yet?

Or, if you want to focus on ibis, I could start spiking out what an independent polars module could look like 👌

@kykyi (Contributor) commented Jun 5, 2023

If I understand correctly, it will be a matter of filling out the yellow PolarsSchemaBackend and IbisBackend branches?
[diagram: pandera backend class hierarchy]

@lior5654 commented Jul 26, 2023

Important Note:

I think the DataFrameModel definition should be as agnostic as possible to the dataframe library used.

This would allow writing a schema once; then one can seamlessly switch between pandas, polars, pyspark, dask, etc.

Note: Of course, except for "edge cases" (indices, struct types, etc.).

@cosmicBboy (Collaborator)

> This would allow writing a schema once; then one can seamlessly switch between pandas, polars, pyspark, dask, etc.

I think this is a worthy goal, barring a few technical challenges in making this all work nicely with multiple dataframe generic types (see this issue).

For now, though, each library can get its own DataFrameModel type, and these can eventually all merge together into the one DataFrameModel to rule them all.

@rmorshea commented Sep 5, 2023

I haven't read through this whole conversation, but I wanted to drop a link to this DataFrame API standard in case it hadn't been mentioned, since it might help in creating "one DataFrameModel to rule them all".

@cosmicBboy (Collaborator)

@rmorshea I've been keeping tabs on that project! How mature would you say it is, i.e. is it ready for prime time?

@rmorshea commented Sep 21, 2023

According to the README it's not out of the draft stage. This issue from 3 weeks ago seems to suggest that things haven't quite crystallized, but it'd probably be best to ask the folks driving the project forward what the status is. If people from Pandera feel they have a vested interest in a standard like that, I'm sure it would benefit from more contributors.

@FilipAisot (Contributor)

Are we starting with this thing? I am ready to do some work! Let's get the ball rolling.

@cosmicBboy (Collaborator) commented Sep 28, 2023

@FilipAisot yes! I was pulled in a different direction for the past few weeks, but will have some bandwidth now to help push this along.

I just made a new polars-dev branch to keep track of all the work for polars support. I'll be pushing up a few changes by the end of this week with stub modules for all the basic pieces needed; then we can divvy up the work across the schema, components, checks, model, and type engine as described here

@cosmicBboy (Collaborator) commented Oct 7, 2023

Okay, to all the folks interested in contributing to this effort: let's kick-off development work to support polars LazyFrames!

Head over here if you just want to start digging into the code
👉 #1373

The PR contains basic functionality and unit tests for supporting pl.LazyFrame validation.

Efforts

The major pieces of work are:

  1. Implement an api and backends module for each polars data structure we want to support. Basic pl.LazyFrame support is here: DataFrameSchema, DataFrameSchemaBackend.
  2. Built-in checks: this would cover the currently available built-in checks. See the ge check here.
  3. Pandera type system integration: pandera has a type system for machine and logical datatypes (see here for details). This will essentially be a mapping between polars datatypes and the pandera standard data types. Since polars uses Arrow, a widely-used data type system, it would be a good time to implement this.
  4. Implement DataFrameModel support for LazyFrames. This would allow for dataclass-like schema definitions for dataframes.
  5. Consolidate DataFrameSchema API: This is sort of a meta task after 1-4 are more complete, but this would involve attempting to create a common, shared DataFrameSchema definition such that a single schema can validate pandas, pyspark, and polars DataFrames (this is something I can own).

For the rest of 1-4, if anyone's down to contribute to one or more of these efforts please say so in the comments below, I can help point you the right direction and discuss (perhaps in discord if you want to sync up there)

Initial Prototype

The PR referenced above currently contains a basic proof of concept.

For now, you can pipe schemas through a query, which will implicitly call ldf.collect() for all of the metadata and data value checks:

import polars as pl
import pandera.polars as pa
from pandera import Check as C

ldf = pl.DataFrame({"string_col": ["a", "b", "c"], "int_col": [0, 1, 2]}).lazy()

schema = pa.DataFrameSchema(
    {
        "string_col": pa.Column(pl.Utf8),
        "int_col": pa.Column(pl.Int64, C.ge(0)),
    }
)

q = ldf.pipe(schema.validate)
df = q.collect()

Raising an error:

invalid_ldf = pl.DataFrame({"string_col": ["a", "b", "c"], "int_col": [-1, 1, 2]}).lazy()
q = invalid_ldf.pipe(schema.validate, lazy=True)
q.collect()

SchemaErrors: Schema None: A total of 1 errors were found.

shape: (2, 5)
          ┌──────────────┬────────────────┬─────────┬─────────────────────────────┬──────────────┐
          │ failure_case ┆ schema_context ┆ column  ┆ check                       ┆ check_number │
          │ ---          ┆ ---            ┆ ---     ┆ ---                         ┆ ---          │
          │ i64          ┆ str            ┆ str     ┆ str                         ┆ i32          │
          ╞══════════════╪════════════════╪═════════╪═════════════════════════════╪══════════════╡
          │ -1           ┆ Column         ┆ int_col ┆ greater_than_or_equal_to(0) ┆ 0            │
          └──────────────┴────────────────┴─────────┴─────────────────────────────┴──────────────┘

In exploring polars' programming model, there are some cool things we can do with the pandera internals, like decoupling validation at query definition time (just checking the column data types) from validation at query collection time (the data value checks that pandera does). I think this is a great follow-up effort once the basic functionality is implemented.

@FilipAisot (Contributor)

Happy to be of help @cosmicBboy. Point me in any direction you see fit. We can also discuss it on Discord.

@AndriiG13 (Contributor)

I would definitely need some time to go over the code to get an understanding, but I'm keen to look into 'Built-in checks'!

@ilyanoskov

This is very much needed 🙏

@cosmicBboy (Collaborator)

@ilyanoskov heard! I took a few weeks break from pandera, but am back now and will continue work on this

@ilyanoskov

@cosmicBboy thank you very much for all your amazing work with Pandera!

@leycec commented Feb 19, 2024

@beartype lead @leycec here. @beartype has officially supported Pandera for a few release cycles now. We're Team Pandera.

I'm increasingly fielding feature requests like beartype/beartype#329, where users are begging for generic typing of Pandas and Polars DataFrame objects. Polars is rapidly eating Pandas' lunch, thanks to being intrinsically multithreaded and stupidly fast. This is sorta like how JAX rapidly ate NumPy and SciPy's lunch... and for the exact same reason.

tl;dr: When Pandera does this, Pandera wins GitHub. Please win GitHub.

@cosmicBboy (Collaborator)

alright folks! With the docs update PR #1613 and many bugfixes that were unearthed during the beta, official polars support is ready for prime time 🚀

Gonna cut a 0.19.0 release tonight. I suspect there will be more bugs after this, so please give it a try and report them here!

@blais commented May 6, 2024

That's really great!

@yehoshuadimarsky

amazing!

@cosmicBboy (Collaborator) commented May 6, 2024

Here it is: https://github.com/unionai-oss/pandera/releases/tag/v0.19.0 🚀. Again I wanted to thank everyone who contributed PRs, filed bug reports, and provided overall good vibes to supporting this feature 🙂 was super fun for me to learn polars.

Please open bug reports, feature requests, and PRs (especially for things you may want from pandera's existing feature set that aren't currently supported).

@kszlim commented May 23, 2024

Curious if anyone knows whether https://pandera.readthedocs.io/en/stable/pydantic_integration.html#pydantic-integration is going to be supported for polars, and whether there's a tracking issue for that?

@cosmicBboy (Collaborator)

@kszlim this wasn't in scope for the initial integration, but feel free to make an issue!

@philiporlando (Contributor)

A lot of breaking changes have been introduced in the polars 1.0 release. Are there plans for pandera to support this major release?

@cosmicBboy (Collaborator)

> A lot of breaking changes have been introduced in the polars 1.0 release. Are there plans for pandera to support this major release?

We should absolutely support polars 1.0. Can you make an issue outlining what the breaking changes are with respect to the parts of the API used in pandera?
