-
Notifications
You must be signed in to change notification settings - Fork 432
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Consider adding Polars as possible data processing framework. #310
Comments
Yes, please do, we will be more than happy to merge the PR. Make sure you update the readme to mention Polars as well :) |
Sorry for the delay, I will work on is. Is it ok to also add an implementation based on arrow2 (under a feature flag)? |
@ritchie46 yes, we want to move to arrow2 in the long run as well :) but it would be good if we can manage it with a feature flag because it's going to take some amount of time before we can migrate kafka-delta-ingest to arrow2. |
If arrow2 feature flag is not possible, we can also maintain an official arrow2 branch |
Great! Polars already runs great on arrow2, and this way, I hope to improve third party support as well. :)
Yes, I don't know how deep embedded arrow is in delta-rs? |
It's not deep, I think the migration should be pretty straight forward. The hard part is isolating it into a feature flag so we can support both arrow and arrow2 in the short term. |
Would love to see this (specifically an arrow2 feature flag), could consolidate a lot of my custom code around parquet partitioning and writing to object storage + gain the benefits of delta lake on top of that. |
I got bounced on the arrow2 migration. It was not so trivial to feature gate indeed. So that would probably need a separate branch? |
@ritchie46 I actually got something mostly working 2 weeks ago, but got distracted with datafusion and roapi so wasn't able to complete the PoC. Let me try to get something pushed out this weekend so we can take a look at it together to see if that's the right path forward. |
@ritchie46 I sent my PoC to #465. |
Wow, you were even able to feature gate it. kudos! 🙌 |
@ritchie46 and @houqp thanks a lot for pushing the PoC. I was wondering if this is feature complete and any example I could refer to get started? |
@mohitreddy1996 - if it is about reading delta tables with polars, the methods read_delta and scan_delta were recently added. Beyond that we are also looking into making internal data-processing more flexible in terms of the processing engine being used. To make this work with polars the arrow2/parquet2 creates need to be used internally, which is possible today. But fully integrating polars internally still requires some major work and we unfortunately cannot give a concrete date if and when that will be supported. |
Are there still considerations to move to using polars for internal operations that require a query execution engine? |
I don't think we have any plans to more forwards on this. I don't think we have too much to gain by moving from using DataFusion to Polars. DataFusion's primary user base is engineers building systems, whereas IMO Polars is much more focused on providing interactive APIs for end users; IMO DataFusion will serve us better in the long term. Plus Polars supports a more narrow range of Arrow data types than DataFusion, so it might present some challenges when integrating with the rest of the Arrow ecosystem. If no one objects, I'd like to close this issue. |
How far were you with the ADBC implementation? I remember that in the design document this will allow the executions to be passed to other ADBC compatible libraries. If that's still the plan, then you can probably close this issue. |
+1 for closing this for now as per @wjones127's comments. |
I see you already support
DataFusion
andRust DataFrame
as well, which both useapache arrow
. It would be nice if we could read/write from/to Polars as well.If you'd be willing to, I could initialize a PR that sets that up.
The text was updated successfully, but these errors were encountered: