
Support for polars #1064

Closed
fzyzcjy opened this issue Jan 3, 2023 · 67 comments
Labels
enhancement New feature or request

Comments

@fzyzcjy commented Jan 3, 2023

Hi, thanks for the lib! I wonder if it can support type checking for polars?

@fzyzcjy added the enhancement (New feature or request) label on Jan 3, 2023
@cosmicBboy (Collaborator)

Hi @fzyzcjy would love to support polars! Doing so is currently blocked by #381, which I'm trying to get done ASAP, as it'll unblock support for a lot of different data frameworks, including polars.

@fzyzcjy (Author) commented Jan 3, 2023

Thanks, looking forward to it!

@igmriegel

> Thanks, looking forward to it!

Me too!!

@AndriiG13 (Contributor)

> Thanks, looking forward to it!

Same here, would also be happy to contribute to this one!

@francesco086

Same, happy to help :)

@igmriegel

Hello, I don't really know if it helps, but I wanted to share this project

https://github.com/kolonialno/patito

They paired Pydantic and Polars, and they offer some of the functionality Pandera offers.
Maybe we could fork something or use it as inspiration?

I'm not really experienced, but I'm willing to help too. 😃

@cosmicBboy (Collaborator) commented Feb 9, 2023

hi all! so since merging the pandera internals re-write: #913

Support for polars is technically unblocked! I'm still working on the docs for extending pandera with custom schema specs and backends, but basically here's a rough roadmap for supporting polars:

Support for pandera[polars]

Support for polars can come in two phases, both of which are actually independent from each other.

  1. Add support for an ibis backend. Their polars support is currently experimental, but this is a high-leverage integration that also supports a bunch of other execution backends. The idea here is that if pandera supports ibis as a backend, users get access to a bunch of other backends such as duckdb, mysql, postgres, etc. In-database validation, here we come! 🚀
  2. Support a native pandera polars backend, independent of ibis. This will be useful if folks don't want to depend on ibis and want to write custom checks with the polars API.

The limitation of (1) would be that if you want to write custom checks, you'd have to do it with the ibis API. With (2), you'd be able to write custom checks (e.g. here) with the polars API.

What if I want to use pandera to validate polars dataframes but don't want the pandas dependency?

Currently pandera has a hard dependency on pandas, which is pretty much ubiquitous in data eng/data science/ML stacks, but in case folks want to use pandera-polars in a limited context (e.g. AWS Lambda) and want to minimize dependencies, there is a longer term plan for this. Basically, we can either:

  1. Do a breaking pandera==1.0.0 release, where users have to explicitly install pandera[pandas] for pandas DataFrame validation, then organize the library to support other backends, e.g. pandera[polars], pandera[ibis], etc.
  2. Rip out the pandera.core and pandera.backend modules into an upstream library pandera-core, so that a contrib or plugin package, e.g. pandera-polars, doesn't have to depend on pandas and can be installed independently as pip install pandera-polars

Do any of you have any thoughts on this? @fzyzcjy @igmriegel @AndriiG13 @francesco086

@fzyzcjy (Author) commented Feb 10, 2023

I use both polars and pandas, so I don't have a strong preference; both options are totally acceptable. Good job, and I cannot wait to use it!

@gab23r (Contributor) commented Feb 10, 2023

As a user it would be nice to have only one package, and this package would have no strict dependencies. So I am clearly in favor of option 1 here! Otherwise we would end up with pandera-polars, pandera-ibis, pandera-vaex, ... The list will grow, and it will be more complicated to manage from the user's perspective.

@francesco086 commented Feb 10, 2023

> Do any of you have any thoughts on this? @fzyzcjy @igmriegel @AndriiG13 @francesco086

Thinking out loud (and please correct me if I am wrong):

  1. Option 1 means all the code will be in one single repository. Option 2 means that there will be many repos with inter-dependencies (e.g. pandera-pandas will have a dependency on pandera-core as specified in pyproject.toml). This means the various pandera-*df_engine* packages can be developed at different speeds. If you make a major release of pandera-core you don't need to immediately update all engines, as they can keep relying on the old version. -> Option 2 is more modular
  2. Option 1 means that when I install pandera (without extras) and try to do something that requires an extra, I will get an error informing me that I need to install the extra, and I can do it straight away. With option 2 I could have compatibility issues, if for example pandera-pandas and pandera-polars rely on incompatible versions of pandera-core (or others). -> Option 1 makes it easier to work with many df engines in the same venv
  3. Option 2 means smaller codebases that are probably easier to get into and manage (to attract contributors)
  4. Option 2 forces pandera-core to have a neat public interface that can be re-used without "hacks"

All in all I am in favor of Option 2. My point 2 above is not very important in my opinion; if you really need to work with two different df engines, you can always do it in two separate venvs.

@AndriiG13 (Contributor)

As a user I think both phases make sense. As I understand it, the ibis support would be especially nice for folks who are using different df engines in their project, since they can reuse checks defined in the ibis API across the engines.

At the same time I think it's good to have a Polars native solution.

So I like both, but frankly I'm ignorant to the possible package management implications mentioned by others above.

@igmriegel commented Feb 12, 2023

> 2. Support a native pandera polars backend, independent of ibis. This will be useful if folks don't want to depend on ibis and want to write custom checks with the polars API.

I think we should not bring in a third dependency to be able to serve Polars, and I agree with francesco086's considerations about pandera-core.

@cosmicBboy (Collaborator)

Cool, thanks for the discussion all!

So re: the polars-support roadmap, I'll plan on working on the ibis backend integration as an n=2 sample of how well the pandera core/backend abstractions fit into supporting another non-pandas-API framework.

Help Needed!

Will definitely need some help designing/implementing the polars-native backend: I'll need to ramp up on the Python polars API myself, but would anyone on this thread be willing to help out?

Design

  1. assessing whether the attributes of DataFrameSchema and Columns fit with the polars API. I'm aware that it doesn't have a notion of Index (which sounds awesome actually 😎), but if there's anything besides columns that needs to be validated in a polars dataframe, that would be good to know
  2. assessing whether the backend model fits into how polars works

Implementation

Eventually I'll also need help implementing:

Please give a 👍 to this comment if you'll be able to help with one or more of the above

@StefanBRas

Polars itself ships data synthesis functions for use with hypothesis: API reference link.

@ritchie46

One thing that would also be cool is validating polars LazyFrames. A LazyFrame is a promise of a computation/query plan. But polars knows the schema for every step in the plan, so we can validate before running the queries. I think this can be very valuable in ETL: your pipeline is validated before it runs, not 20 minutes in.

@francesco086

Just want to mention that I really would like to help, but I am not familiar with polars (yet). So I think in this first phase I am probably not useful. I am very much willing to learn what is needed and implement following your directions :) (please use me!)

@ritchie46

> Just want to mention that I really would like to help, but I am not familiar with polars (yet). So I think in this first phase I am probably not useful. I am very much willing to learn what is needed and implement following your directions :) (please use me!)

If you are familiar with pandera, please join our Discord. We can open a pandera thread and help one snippet at a time.

@francesco086

> One thing that would also be cool is validating polars LazyFrames. A LazyFrame is a promise of a computation/query plan. But polars knows the schema for every step in the plan, so we can validate before running the queries. I think this can be very valuable in ETL. Your pipeline is validated before it runs, and not 20mins in.

@ritchie46 One important aspect to keep in mind is that pandera has schema models, which are much more than column names and types. For example, a pandera schema could describe and check the constraint col_a + col_b = col_c.
So I am not sure about LazyFrames; I don't think it is possible to validate them before actually doing a computation.

@cosmicBboy (Collaborator)

> One thing that would also be cool is validating polars LazyFrames. A LazyFrame is a promise of a computation/query plan. But polars knows the schema for every step in the plan, so we can validate before running the queries.

I think this will be very valuable for type checking and perhaps other dataframe metadata checks, though a limitation would be that it wouldn't be able to apply checks on actual values (e.g. pandera.Check.ge(0)) before running the code (unless I'm missing something conceptually). This is fine, I think, as long as the UX for applying pandera schemas to LazyFrames is clarified to cover only dataframe metadata (data types, column names, etc.)

Regardless, LazyFrame validation would definitely be a huge plus!

@kuatroka commented Feb 23, 2023

> 1. Support a native pandera polars backend, independent of ibis. This will be useful if folks don't want to depend on ibis and want to write custom checks with the polars API.

If you look to future-proof pandera, expand the number of users, and be strategic about it, the way to go would be to go full throttle on independent polars support and prioritise it over ibis. The ibis project is roughly 9 years old and has 2.5K stars. polars is roughly 3 years old and has 14.3K stars. Many people say stars are a vanity metric, but I disagree. They are still metrics, and they do show what is being used more.

To be clear, ibis is great and polars is great, but if it's about sequencing new feature development and allocating effort per unit of immediate usefulness, and about reaching the widest possible number of users, I'd suggest going for independent, full polars support first.

P.S. pandera is just awesome.

@cosmicBboy (Collaborator)

@kuatroka good feedback!

My short-term priority is still to add an ibis backend, since the motivation there is to enable in-database validation for a number of supported DBs (postgres, mysql, etc). Support for this has been in demand for a while now; the nice side effect is that it adds (experimental) support for polars.

That said, I'm all for polars support. Would love community contributions on this, but I owe all of you a comprehensive set of docs first on how to extend pandera with your own schema specification and backends (I'm working on this now!)

@the-matt-morris (Contributor)

Jumping in a little late here, but as a user of both pandera and polars (I love both libraries), I'd be willing to contribute to make this happen, so that I don't have to add pandas as a dependency in my pipelines just to perform the validation portion on pandas dataframes!

@blais commented Apr 8, 2023

Adding a +1 for Polars schemas!

@kykyi (Contributor) commented Jun 4, 2023

@cosmicBboy keen to help 🚀

Sounds like you are prioritising the ibis integration, which I'd be keen to look at. Do you have any work in progress on this yet?

Or, if you want to focus on ibis, I could start spiking out what an independent polars module could look like 👌

@kykyi (Contributor) commented Jun 5, 2023

If I understand correctly, it will be a matter of filling out the yellow PolarsSchemaBackend and IbisBackend branches?
[diagram: pandera backend class hierarchy]

@lior5654 commented Jul 26, 2023

Important Note:

I think the DataFrameModel definition should be as agnostic as possible to the dataframe library used.

This would allow writing a schema once; then one can seamlessly switch between pandas, polars, pyspark, dask, etc.

Note: Of course, except for "edge cases" (indices, struct types, etc.).

@cosmicBboy (Collaborator)

> This would allow writing a schema once; then one can seamlessly switch between pandas, polars, pyspark, dask, etc.

I think this is a worthy goal, barring a few technical challenges in making this all work nicely with multiple dataframe generic types (see this issue).

For now, though, each library can get its own DataFrameModel type, and these can eventually all merge together into the one DataFrameModel to rule them all.

@rmorshea commented Sep 5, 2023

I haven't read through this whole conversation, but I wanted to drop a link to this DataFrame API standard in case it hadn't been mentioned, since it might help in creating "one DataFrameModel to rule them all".

@cosmicBboy (Collaborator)

@rmorshea I've been keeping tabs on that project! How mature would you say it is, i.e. is it ready for prime time?

@rmorshea commented Sep 21, 2023

According to the README it's not out of the draft stage. This issue from 3 weeks ago seems to suggest that things haven't quite crystallized, but it'd probably be best to ask the folks driving the project forward what the status is. If people from Pandera feel they have a vested interest in a standard like that, I'm sure it would benefit from more contributors.

@FilipAisot (Contributor)

Are we starting with this thing? I am ready to do some work! Let's get the ball rolling.

@cosmicBboy (Collaborator) commented Sep 28, 2023

@FilipAisot yes! I was pulled in a different direction for the past few weeks, but will have some bandwidth now to help push this along.

I just made a new polars-dev branch to keep track of all the work for polars support. I'll be pushing up a few changes by the end of this week with stub modules for all the basic pieces needed; then we can divvy up the work across the schema, components, checks, model, and type engine as described here

@cosmicBboy (Collaborator) commented Oct 7, 2023

Okay, to all the folks interested in contributing to this effort: let's kick-off development work to support polars LazyFrames!

Head over here if you just want to start digging into the code
👉 #1373

The PR contains basic functionality and unit tests for supporting pl.LazyFrame validation.

Efforts

The major pieces of work are:

  1. Implement an api and backends module for each polars data structure we want to support. Basic pl.LazyFrame support is here: DataFrameSchema, DataFrameSchemaBackend.
  2. Built-in checks: this would cover the currently available built-in checks. See the ge check here.
  3. Pandera type system integration: pandera has a type system for machine and logical datatypes (see here for details). This will essentially be a mapping between polars datatypes and the pandera standard data types. Since polars uses Arrow, a widely-used data type system, it would be a good time to implement this.
  4. Implement DataFrameModel support for LazyFrames. This would allow for dataclass-like schema definitions for dataframes.
  5. Consolidate DataFrameSchema API: This is sort of a meta task after 1-4 are more complete, but this would involve attempting to create a common, shared DataFrameSchema definition such that a single schema can validate pandas, pyspark, and polars DataFrames (this is something I can own).

For the rest of 1-4, if anyone's down to contribute to one or more of these efforts please say so in the comments below, I can help point you the right direction and discuss (perhaps in discord if you want to sync up there)

Initial Prototype

The PR referenced above currently contains a basic proof of concept.

For now, you can pipe schemas through a query, which will implicitly call ldf.collect() for all of the metadata and data value checks:

import polars as pl
import pandera.polars as pa
from pandera import Check as C

ldf = pl.DataFrame({"string_col": ["a", "b", "c"], "int_col": [0, 1, 2]}).lazy()

schema = pa.DataFrameSchema(
    {
        "string_col": pa.Column(pl.Utf8),
        "int_col": pa.Column(pl.Int64, C.ge(0)),
    }
)

q = ldf.pipe(schema.validate)
df = q.collect()

Raising an error:

invalid_ldf = pl.DataFrame({"string_col": ["a", "b", "c"], "int_col": [-1, 1, 2]}).lazy()
q = invalid_ldf.pipe(schema.validate, lazy=True)
q.collect()

SchemaErrors: Schema None: A total of 1 errors were found.

shape: (2, 5)
          ┌──────────────┬────────────────┬─────────┬─────────────────────────────┬──────────────┐
          │ failure_case ┆ schema_context ┆ column  ┆ check                       ┆ check_number │
          │ ---          ┆ ---            ┆ ---     ┆ ---                         ┆ ---          │
          │ i64          ┆ str            ┆ str     ┆ str                         ┆ i32          │
          ╞══════════════╪════════════════╪═════════╪═════════════════════════════╪══════════════╡
          │ -1           ┆ Column         ┆ int_col ┆ greater_than_or_equal_to(0) ┆ 0            │
          └──────────────┴────────────────┴─────────┴─────────────────────────────┴──────────────┘

In exploring polars' programming model, there are some cool things we can do with the pandera internals, like decoupling validation at query definition time (just checking the column data types) from validation at query collection time (the data value checks that pandera does). I think this is a great follow-up effort once the basic functionality is implemented.

@FilipAisot (Contributor)

Happy to be of help @cosmicBboy. Point me in any direction you see fit. We can also discuss it on Discord.

@AndriiG13 (Contributor)

I would definitely need some time to go over the code to get an understanding, but I'm keen to look into 'Built-in checks'!

@ilyanoskov

This is very much needed 🙏

@cosmicBboy (Collaborator)

@ilyanoskov heard! I took a few weeks break from pandera, but am back now and will continue work on this

@ilyanoskov

@cosmicBboy thank you very much for all your amazing work with Pandera!

@leycec commented Feb 19, 2024

@beartype lead @leycec here. @beartype has officially supported Pandera for a few release cycles now. We're Team Pandera.

I'm increasingly fielding feature requests like beartype/beartype#329, where users are begging for generic typing of Pandas and Polars DataFrame objects. Polars is rapidly eating Pandas' lunch, thanks to being intrinsically multithreaded and stupidly fast. This is sorta like how JAX rapidly ate NumPy and SciPy's lunch... and for the exact same reason.

tl;dr: When Pandera does this, Pandera wins GitHub. Please win GitHub.

@cosmicBboy (Collaborator)

alright folks! With the docs update PR #1613 and many bugfixes that were unearthed during the beta, official polars support is ready for prime time 🚀

Gonna cut a 0.19.0 release tonight. I suspect there will be more bugs after this, so please give it a try and report them here!

@blais commented May 6, 2024

That's really great!

@yehoshuadimarsky

amazing!

@cosmicBboy (Collaborator) commented May 6, 2024

Here it is: https://github.com/unionai-oss/pandera/releases/tag/v0.19.0 🚀. Again I wanted to thank everyone who contributed PRs, filed bug reports, and provided overall good vibes to supporting this feature 🙂 was super fun for me to learn polars.

Please open bug reports, feature requests, and PRs (especially for things you may want from pandera's existing feature set that aren't currently supported).

@kszlim commented May 23, 2024

Curious if anyone knows whether https://pandera.readthedocs.io/en/stable/pydantic_integration.html#pydantic-integration is going to be supported for polars, and whether there's a tracking issue for that?

@cosmicBboy (Collaborator)

@kszlim this wasn't in scope for the initial integration, but feel free to make an issue!

@philiporlando (Contributor)

A lot of breaking changes have been introduced in the polars 1.0 release. Are there plans for pandera to support this major release?

@cosmicBboy (Collaborator)

> A lot of breaking changes have been introduced in the polars 1.0 release. Are there plans for pandera to support this major release?

We should absolutely support polars 1.0. Can you make an issue outlining what the breaking changes are with respect to the parts of the API used in pandera?
