Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consider adding Polars as possible data processing framework. #310

Closed
ritchie46 opened this issue Jul 10, 2021 · 17 comments
Closed

Consider adding Polars as possible data processing framework. #310

ritchie46 opened this issue Jul 10, 2021 · 17 comments
Labels
enhancement New feature or request

Comments

@ritchie46
Copy link

I see you already support DataFusion and Rust DataFrame as well, which both use apache arrow. It would be nice if we could read/write from/to Polars as well.

If you'd be willing to, I could initialize a PR that sets that up.

@ritchie46 ritchie46 added the enhancement New feature or request label Jul 10, 2021
@houqp
Copy link
Member

houqp commented Jul 10, 2021

Yes, please do, we will be more than happy to merge the PR. Make sure you update the readme to mention Polars as well :)

@ritchie46
Copy link
Author

Sorry for the delay, I will work on is. Is it ok to also add an implementation based on arrow2 (under a feature flag)?

@houqp
Copy link
Member

houqp commented Sep 12, 2021

@ritchie46 yes, we want to move to arrow2 in the long run as well :) but it would be good if we can manage it with a feature flag because it's going to take some amount of time before we can migrate kafka-delta-ingest to arrow2.

@houqp
Copy link
Member

houqp commented Sep 12, 2021

If arrow2 feature flag is not possible, we can also maintain an official arrow2 branch

@ritchie46
Copy link
Author

@ritchie46 yes, we want to move to arrow2 in the long run as well :)

Great! Polars already runs great on arrow2, and this way, I hope to improve third party support as well. :)

If arrow2 feature flag is not possible, we can also maintain an official arrow2 branch

Yes, I don't know how deep embedded arrow is in delta-rs?

@houqp
Copy link
Member

houqp commented Sep 12, 2021

Yes, I don't know how deep embedded arrow is in delta-rs?

It's not deep, I think the migration should be pretty straight forward. The hard part is isolating it into a feature flag so we can support both arrow and arrow2 in the short term.

@wseaton
Copy link

wseaton commented Sep 13, 2021

Would love to see this (specifically an arrow2 feature flag), could consolidate a lot of my custom code around parquet partitioning and writing to object storage + gain the benefits of delta lake on top of that.

@ritchie46
Copy link
Author

I got bounced on the arrow2 migration. It was not so trivial to feature gate indeed. So that would probably need a separate branch?

@houqp
Copy link
Member

houqp commented Oct 16, 2021

@ritchie46 I actually got something mostly working 2 weeks ago, but got distracted with datafusion and roapi so wasn't able to complete the PoC. Let me try to get something pushed out this weekend so we can take a look at it together to see if that's the right path forward.

@houqp
Copy link
Member

houqp commented Oct 17, 2021

@ritchie46 I sent my PoC to #465.

@ritchie46
Copy link
Author

@ritchie46 I sent my PoC to #465.

Wow, you were even able to feature gate it. kudos! 🙌

@mohitreddy1996
Copy link

@ritchie46 and @houqp thanks a lot for pushing the PoC. I was wondering if this is feature complete and any example I could refer to get started?

@roeap
Copy link
Collaborator

roeap commented Jan 23, 2023

@mohitreddy1996 - if it is about reading delta tables with polars, the methods read_delta and scan_delta were recently added.

Beyond that we are also looking into making internal data-processing more flexible in terms of the processing engine being used. To make this work with polars the arrow2/parquet2 creates need to be used internally, which is possible today. But fully integrating polars internally still requires some major work and we unfortunately cannot give a concrete date if and when that will be supported.

@ion-elgreco
Copy link
Collaborator

@mohitreddy1996 - if it is about reading delta tables with polars, the methods read_delta and scan_delta were recently added.

Beyond that we are also looking into making internal data-processing more flexible in terms of the processing engine being used. To make this work with polars the arrow2/parquet2 creates need to be used internally, which is possible today. But fully integrating polars internally still requires some major work and we unfortunately cannot give a concrete date if and when that will be supported.

Are there still considerations to move to using polars for internal operations that require a query execution engine?

@wjones127
Copy link
Collaborator

Are there still considerations to move to using polars for internal operations that require a query execution engine?

I don't think we have any plans to more forwards on this.

I don't think we have too much to gain by moving from using DataFusion to Polars. DataFusion's primary user base is engineers building systems, whereas IMO Polars is much more focused on providing interactive APIs for end users; IMO DataFusion will serve us better in the long term. Plus Polars supports a more narrow range of Arrow data types than DataFusion, so it might present some challenges when integrating with the rest of the Arrow ecosystem.

If no one objects, I'd like to close this issue.

@ion-elgreco
Copy link
Collaborator

ion-elgreco commented Jul 29, 2023

Are there still considerations to move to using polars for internal operations that require a query execution engine?

I don't think we have any plans to more forwards on this.

I don't think we have too much to gain by moving from using DataFusion to Polars. DataFusion's primary user base is engineers building systems, whereas IMO Polars is much more focused on providing interactive APIs for end users; IMO DataFusion will serve us better in the long term. Plus Polars supports a more narrow range of Arrow data types than DataFusion, so it might present some challenges when integrating with the rest of the Arrow ecosystem.

If no one objects, I'd like to close this issue.

How far were you with the ADBC implementation? I remember that in the design document this will allow the executions to be passed to other ADBC compatible libraries. If that's still the plan, then you can probably close this issue.

@roeap
Copy link
Collaborator

roeap commented Aug 10, 2023

+1 for closing this for now as per @wjones127's comments.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

7 participants