Support for polars
#1064
Comments
Thanks, looking forward to it! |
Me too!! |
Same here, would also be happy to contribute to this one! |
Same, happy to help :) |
Hello, I don't really know if it helps, but I wanted to share this project: https://github.com/kolonialno/patito. They paired Pydantic and Polars, and they offer some of the functionality Pandera offers. I'm not really experienced, but I'm willing to help too. 😃 |
hi all! so since merging the pandera internals re-write (#913), support for polars is technically unblocked! I'm still working on the docs for extending pandera with custom schema specs and backends, but basically here's a rough roadmap for supporting polars: Support for
|
I use both polars and pandas, so do not have any thoughts - all are totally acceptable. Good job and cannot wait to use it! |
As a user it would be nice to have only one package. And this package would have no strict dependencies. So I am clearly in favor of option one here! Otherwise we would end up with |
Thinking out loud (and please correct me if I am wrong):
All in all I am in favor of Option 2. My point 2. above is not very important in my opinion: if you really need to work with two different df engines, you can always do it in two separate venvs. |
As a user I think both phases make sense. As I understand it, the ibis support would be especially nice for folks who are using different df engines in their project, since they can reuse checks defined in the ibis API across the engines. At the same time I think it's good to have a Polars-native solution. So I like both, but frankly I'm ignorant of the possible package management implications mentioned by others above. |
I think we should not bring in a third dependency to be able to serve Polars, and I agree with francesco086's considerations about pandera-core. |
Cool, thanks for the discussion all! So re: the polars-support roadmap, I'll plan on working on the ibis backend integration as an n=2 sample for how well the pandera core/backend abstractions fit into supporting another non-pandas-API framework.

Help Needed!

Will definitely need some help designing/implementing the polars-native backend: will need to ramp up on the python polars API myself, but would anyone on this thread be willing to help out?

Design

Implementation

Eventually will also need help implementing:
Please give a 👍 to this comment if you'll be able to help with one or more of the above |
Polars themselves ship data synthesis functions for use with |
One thing that would also be cool is validating polars |
Just want to mention that I really would like to help, but I am not familiar with polars (yet). So I think in this first phase I am probably not useful. I am very much willing to learn what is needed and implement following your directions :) (please use me!) |
If you are familiar with pandera, please join our discord. We can open a pandera thread and we can help one snippet at a time. |
@ritchie46 One important aspect to keep in mind is that pandera has schema models, which is much more than column names and types. For example, a pandera schema could describe and check the constraint |
I think this will be very valuable type checking and perhaps other dataframe metadata, though a limitation would be that it wouldn't be able to apply checks on actual values (e.g. Regardless, |
If you look to future-proof To be clear, P.D. |
@kuatroka good feedback! My short-term priority is still to add an That said, I'm all for |
Jumping in a little late here, but as a user of both |
Adding a +1 for Polars schemas! |
@cosmicBboy keen to help 🚀 Sounds like you are prioritising the Or if you are wanting to focus on |
Important Note: I think the DataFrameModel definition should be This would allow writing a schema once, and then one can seamlessly switch between pandas, polars, pyspark, dask, etc. Note: Of course, except "edge cases" (indices, struct types, etc.). |
I think this is a worthy goal, barring a few technical challenges on making this all work nice with multiple dataframe generic types, see this issue. For now, though, each library can get its own |
Haven't read through this whole conversation, but I wanted to drop a link to this DataFrame API standard in case it hadn't been mentioned and, if it hadn't, so that it might help in creating "one |
@rmorshea I've been keeping tabs on that project! How mature would you say it is i.e. is it ready for prime time? |
According to the README it's not out of the draft stage. This issue from 3 weeks ago seems to suggest that things haven't quite crystalized, but it'd probably be best to ask the folks driving the project forward what the status is. If people from Pandera feel like they have a vested interest in a standard like that, I'm sure it would benefit from more contributors. |
Are we starting with this thing? I am ready to do some work! Let's get the ball rolling. |
@FilipAisot yes! I was pulled a different direction for the past few weeks, but will have some bandwidth now to help push this along. I just made a new polars-dev branch to keep track of all the work for polars support, I'll be pushing up a few changes by the end of this week with stub modules for all the basic pieces needed, then we can divvy up the work across the schema, components, checks, model, and type engine as described here |
Okay, to all the folks interested in contributing to this effort: let's kick off development work to support polars LazyFrames! Head over here if you just want to start digging into the code. The PR contains basic functionality and unit tests for supporting

Efforts

The major pieces of work are:
For the rest of 1-4, if anyone's down to contribute to one or more of these efforts please say so in the comments below, I can help point you the right direction and discuss (perhaps in discord if you want to sync up there)

Initial Prototype

The PR referenced above currently contains a basic proof of concept. For now, you can pipe schemas through a query, which implicitly will call

import polars as pl
import pandera.polars as pa
from pandera import Check as C

ldf = pl.DataFrame({"string_col": ["a", "b", "c"], "int_col": [0, 1, 2]}).lazy()
schema = pa.DataFrameSchema(
    {
        "string_col": pa.Column(pl.Utf8),
        "int_col": pa.Column(pl.Int64, C.ge(0)),
    }
)
q = ldf.pipe(schema.validate)
df = q.collect()

Raise error:

invalid_ldf = pl.DataFrame({"string_col": ["a", "b", "c"], "int_col": [-1, 1, 2]}).lazy()
q = invalid_ldf.pipe(schema.validate, lazy=True)
q.collect()
In exploring polars' programming model, there are some cool things we can do with the pandera internals to do things like decoupling validation at query definition time (just checking the column data types) vs query |
Happy to be of help @cosmicBboy. Point me in any direction you see fit. We can also discuss it on Discord. |
I would definitely need some time to go over the code to get an understanding, but I'm keen to look into 'Built-in checks'! |
This is very much needed 🙏 |
@ilyanoskov heard! I took a few weeks break from pandera, but am back now and will continue work on this |
@cosmicBboy thank you very much for all your amazing work with Pandera! |
@beartype lead @leycec here. @beartype has officially supported Pandera for a few release cycles now. We're Team Pandera. I'm increasingly fielding feature requests like beartype/beartype#329, where users are begging for generic typing of Pandas and Polars tl;dr: When Pandera does this, Pandera wins GitHub. Please win GitHub. |
alright folks! With the docs update PR #1613 and many bugfixes that were unearthed during the beta, official polars support is ready for prime time 🚀 Gonna cut a 0.19.0 release tonight. I suspect there will be more bugs after this, so please give it a try and report them here! |
That's really great! |
amazing! |
Here it is: https://github.com/unionai-oss/pandera/releases/tag/v0.19.0 🚀. Again I wanted to thank everyone who contributed PRs, filed bug reports, and provided overall good vibes to supporting this feature 🙂 was super fun for me to learn polars. Please open bug reports, feature requests, and PRs (especially things that you may want from pandera's existing feature set that isn't currently supported). |
Curious if anyone knows whether https://pandera.readthedocs.io/en/stable/pydantic_integration.html#pydantic-integration is going to be supported for polars and whether there's a tracking issue for that? |
@kszlim this wasn't in scope for the initial integration, but feel free to make an issue! |
A lot of breaking changes have been introduced in the polars 1.0 release. Are there plans for pandera to support this major release? |
We should absolutely support polars 1. Can you make an issue outlining what the breaking changes are with respect to the parts of the api used in pandera? |
Hi, thanks for the lib! I wonder if it can support type checking for polars?