-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Updating arrow2 branch #2709
Comments
Would it make sense to consider having a separate, optional, |
cc @Igosuki as well who may have some thoughts |
Hello, sorry for my absence. Maintaining full compatibility requires constant updates, and updating the arrow2 branch is often in the hundred files because of the API surface conflict. So adding a new crate where only arrow2 itself needs to be updated is a good idea in my opinion. |
+1 on @andygrove 's idea for isolating changes to a self-contained crate. @dbr if you are interested in that, please feel free to start that project at the datafusion contrib github org. If you are only interested in updating arrow2 version in the current arrow2 branch, then you can just send a PR to the arrow2 branch :) |
Should we have a discussion/vote about whether this should be in |
@alamb @jorgecarleitao Do you have opinions on this? |
I do not have any preference, but I also don't really spend any effort on arrow2 or the arrow2 branch in datafusion -- perhaps @Igosuki or others who have expended significant effort would be better to weigh on |
@andygrove, thanks for the ping! I do not have a preference - I am very happy to support anyone that wishes to use arrow2/parquet2, and I am very thankful for everyone that has been maintaining the arrow2 branch. I unfortunately do not have the time left to spearhead it. @dbr I would be very happy to support you in the migration - let me know how I can help and feel free to assign tasks directly. Arrow2's core API is essentially stable and that we we usually have small breaking changes in IO, so most of the changes should be on the datafusion's side. For full transparency: imo the primary question here is one of control - although we demonstrated that arrow2 had superior support, safety, performance, documentation, and UX, I sensed that the main requirement was that arrow2 was donated to Apache, which for me was unacceptable. This demotivated me from contributing here. I understand that companies that have a strong dependency on DataFusion need control over its core dependency. For me it demos that we are a bit past the idea that people in Apache are individual contributors. The core development of Polars and Databend have been more pragmatic and I have thus been more aware of the requests there. However, if there is anything needed, just drop an issue and I will gladly take it, irrespectively of where the development happens (here or in a separate repo / org). I continue to believe that DataFusion would benefit immensely from arrow2's current and future capabilities - arrow2 is the backend of likely the fastest in-node dataframe APIs out there and just this week we exposed a new API that enable compute to be 2x-5x faster ;) |
I believe there are differing opinions on some of these items I do think there is broad agreement that the governance model of arrow2 (benign dictator) is not suitable for all users for a variety of reasons. |
The main issue a few months ago was that the datafusion codebase was not split like it is now, so it created conflicts in the various places where it wasn't possible to confine differences between arrow and arrow2::io to extension traits. So my question is, can a contrib crate exist since arrow2 is essentially required at the core ? |
I agree. Although there is some great work happening in arrow2, I am not motivated (or even permitted) to contribute there. I am heavily invested in working with the Apache Arrow community to continue to improve things here together. The Apache governance model is not perfect but it is working well enough from my point of view. I am a big believer in community over code. |
Regardless of where the code lives, I would view us as having two main options:
|
One approach that @jimexist started with, that I thought was promising was to use For example, I would be quite supportive of such a strategy and would be a nice way to let people use which every library they wanted |
Bonus points if it could one day support a GPU implementation of Arrow as well 😉 |
This indirection might also help to decouple downstream's arrow version with DF's. |
The current arrow2 branch was not created with the intention to keep it as a long running branch, instead, the original plan was to use it as a tool to validate (#1652) whether we will be able to get a significant performance boost by switching to arrow2, and if so, convince the rest of the community to make that jump :) Unfortunately, due to personal reasons, I don't have the time to keep pushing this forward anymore. I agree with everyone that coming up with a zero cost arrow trait abstraction would be the best path forward, especially considering the possibility of adding arrow-gpu in the future! |
@alamb AFAIK governance has not been a factor when considering dependencies in DataFusion, but maybe I missed it when that decision taken. Specifically, going through some of its dependencies:
"arrow2 - unclear governance" is imo a more adequate characterization here - I have been contributing the most to it, so it is natural that I have been taking most of the decisions. Ritchie from Polars, @Dandandan, @alamb and others have full write permissions to the repo; @houqp and Ritchie have permissions to publish to crates.io. I would be happy to set/modify a governance to arrow2 to accommodate this community's needs, but I would need to know what is the exact requirement, because being part of Apache does not seem to be one today. Wrt to community over code, I certainly agree with you, @andygrove - I just believe that the Apache Foundation does not own the monopoly of that practice - there are many projects and organizations that fulfill that without forcing projects to a structure like Apache forces, in particular how contributions to some part of the project induce control over the whole project (being PMC). I agree that going through a trait abstraction is a way. Creating an "arrow2arrow" as suggested in jorgecarleitao/arrow2#629 would also be a way. I would be willing to create and maintain such a crate if that would help this community, but again, we would end up with the same question - it would not be Apache governed, can it be a dependency? |
To conflate arrow which is core to both the in-memory layout and query computation, with something fairly self-contained like It is fairly common for work in arrow-datafusion to build upon/require work in arrow-rs, whereas the same is not true of those listed dependencies. Given we are still talking about this I think we have clearly demonstrated arrow is not a trivial dependency, and therefore warrants some additional care and attention...
If practicable this sounds like a good idea to me. It would also fit with my ongoing efforts, articulated here, to gradually bring incubated ideas from arrow2 to arrow-rs. I'm actually pretty happy with the progress that has been made so far, and having an easy way to compare the two implementations I think would prove insightful. In particular I have been inspired by the way the arrow-rs community has worked together to drive progress, through code contributions or otherwise. Yes arrow-rs still has many warts, and is often not particularly pleasant to work on, but there is an active community of people working to make it better in whatever ways they can, and that to me is what open source is all about. For some the governance question is one of legal practicalities, but at least for me it is about having a friendly community working together to deliver great software. |
We have proposed donating |
I have filed an issue for adding GPU support: apache/arrow-rs#1966 The proposed As we think about abstractions it would be good to take this into account. |
Lets move back to upgrading arrow2 version in arrow2 branch. I'm quite interested in that, but I found that some unit tests are currently failing:
I'd like to first upgrade arrow2 version and then fix these failing tests. |
Given the current discussion about arrow/arrow2 unification (see apache/arrow-rs#1176 (comment)) I think this issue may no longer be relevant. If no one objects, I'll plan to close it in the next few days |
Hearing no other comments here, I am closing this one. Feel free to chime in if you disagree or have other comments to add |
The arrow2 branch is currently using
v0.10.0
- I'd like to update it to work with v0.12.0. Is there anything I should know before starting this?I ask as last time I made a PR for the branch, there was an impending rework of that branch, and I'm not sure what the state of things are currently!
The text was updated successfully, but these errors were encountered: