Discussion: relationship / unification of arrow-rs and arrow2 going forward #1176
Comments
My personal hope is that arrow2 can be donated to the ASF as a For many of us who work for large corporations, we have to seek permission to contribute to open source projects. For example, I have permission to contribute to Apache Arrow and its subprojects, but I cannot contribute to arrow2 while it is an independent project (not that I would necessarily be contributing anyway, but others may be in a similar position). I see no reason why the stable arrow crate cannot continue to exist and evolve independently of arrow2, assuming that there are contributors motivated to do so. |
For the record, I would be willing to help maintain |
Thank you @alamb for starting and driving this discussion. Great summary on the current community consensus.
Personally, I think the Apache model works better for relatively slow-moving monolithic projects, while arrow2/parquet2 are fast-evolving projects with a vision to be broken up into even smaller modular crates. @alamb has done exceptional work driving the arrow-rs releases. But seeing how much effort and time it takes, I would consider it an unnecessary overhead for arrow2 at its current stage. @jorgecarleitao was able to react quickly to user feedback and release 3-4 new versions of arrow2 in a week; this is simply not possible with the Apache governance model. That said, I think the Apache voting process is very useful when you need high confidence in the quality of every single release and have a large, diverse set of PMC members who can participate in the voting in a timely manner. But arrow2 still seems pretty far away from this. @andygrove brought up a good point that it might become an issue for large corporations with restrictive open source contribution guidelines. This is the first time I have been made aware of this issue; previously I was under the impression that the software license is all that matters. On the other hand, I am guessing the ASF is not the only governance that's allowed? Perhaps we could help @jorgecarleitao come up with a different compatible governance model for arrow2 until it's ready for the ASF contribution? If Andy wants to contribute to arrow2 now but is blocked by lack of governance, then I would consider this a serious issue that we should address. Otherwise I would optimize for iteration velocity over governance until it becomes a real problem. In short, from what I have seen so far, the upside of adopting the Apache governance model is to unblock potential contributions from big corporations. The downside is it will slow down our iteration process and potentially even disincentivize @jorgecarleitao from actively working on the project.
Reading from his past emails, I get the feeling that he did try very hard to pass the IP clearance and donate arrow2 to the ASF last year, but got frustrated by the bureaucracy. I am personally much more concerned about the latter than the former.
IMHO, this is not important as long as it is well communicated to the users, i.e. be explicit that we are special and please treat our 8.x as 0.x until we say otherwise. But Jorge has a strong opinion on this and wants to strictly follow what the rest of the Rust ecosystem does. I also understand where he is coming from and respect his stance on this.
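For readers wondering why the 0.x vs 8.x distinction matters to downstream crates, here is a minimal sketch of Cargo's default (caret) compatibility rule. The `Version` struct and `caret_compatible` function are hypothetical names written for this illustration; they are not part of Cargo or of either Arrow crate, and pre-release tags are ignored.

```rust
// Hypothetical, simplified version triple (no pre-release or build metadata).
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub struct Version {
    pub major: u64,
    pub minor: u64,
    pub patch: u64,
}

/// Returns true if `candidate` satisfies the caret requirement `^req`.
/// With major > 0 only a major bump is breaking; with 0.x a minor bump
/// is breaking; with 0.0.x every release is treated as breaking.
pub fn caret_compatible(req: Version, candidate: Version) -> bool {
    // A candidate older than the requirement never matches.
    if candidate < req {
        return false;
    }
    if req.major > 0 {
        candidate.major == req.major
    } else if req.minor > 0 {
        candidate.major == 0 && candidate.minor == req.minor
    } else {
        candidate == req
    }
}
```

Under this rule a crate published as 8.x promises that 8.1 is a drop-in upgrade from 8.0, whereas 0.8 to 0.9 is allowed to break, which is roughly what "treat our 8.x as 0.x" would ask users to remember by hand.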
I agree with @andygrove on this. As long as there is community interest in this, we should probably still open arrow-rs up for contributions. This is not the result I want to see, but I have a feeling that this is likely what is going to happen :(
I think this is certainly doable, but then I stand by my previous comment that it won't be a good use of our time unless there are fundamental design tradeoffs in arrow-rs that are not compatible with arrow2's design. Simply replicating the design another project has is not a good reason to start a fork IMHO. I know @tustvold has a fairly strong opinion on this option and is more familiar with the parquet code base than I am, so perhaps he could help shed some light on this.
Just throwing out a random idea here: one potential variant of option 2 is that we use arrow-rs as the place to maintain stable arrow2 branches and let arrow2 iterate as fast as it can without the fear of introducing breaking changes, while the stable branch cherry-picks compatible commits for a specific 0.x release that we want to maintain for X months. This way, we can still direct all contributions back to arrow2. The downside is I don't know how much interest the community has in a stable API, considering we just decided to stop maintaining stable releases for arrow-rs. |
I would just like to get away from this situation where we have two concurrent projects. It is just demoralizing, draining, and to be completely honest it just seems a tad unnecessary. Whilst I do not like the idea of porting stuff across, and yes it would be an annoying use of time, I am willing to contribute to such an effort if it sees an end to this situation. It overcomes what is otherwise a potentially indefinite political and bureaucratic discussion with pure technical brute force. It is very similar in my mind to option 2, simply changing the merge direction. Ultimately I'm going to end up porting code regardless, I would prefer an outcome that allows me to save others the same effort 😄 |
I totally agree with you @tustvold . We are all engineers looking to solve interesting technical problems together, not playing political games after all. |
I'd like to echo @andygrove 's point here. The only reason we're able to contribute to Arrow/Parquet rust implementation is because it's under the governance of Apache. Otherwise, it'd be very hard for us. |
I agree. I agree that the situation is not productive. I am sorry that I caused frustration to people here.
I am also willing to contribute to such an effort. What do you think about something to the effect of:
This could result in the following changes to arrow-rs:
It would also end the arrow-arrow2 split e.g. removing the un-productive discussions around "which is better", and combine development efforts. Some challenges:
|
That would be fantastic, finding a way to unify our efforts would be amazing
Let me take some time to think about the implications of this and how we might do it in an incremental fashion. Whilst I have no particular affection for |
I am also willing to help on this. And if this would go forward consider me as invested as long term maintainer of the project as well. |
Thank you @jorgecarleitao @ritchie46 and @tustvold
I am willing to run the "IP clearance" process to get the arrow2 codebase into the arrow-rs repository. I have done this before for object_store and while it takes some time I think it is worth it in this case. As for the technical plans, the only thing I feel (very) strongly about is that there is a migration path that doesn't involve an "all downstream crates need to rewrite the world" step. Given the above discussions, it sounds like there are already some good thoughts on this matter, so I trust it will be in good hands. I would also be interested in what some of the other recently active contributors / maintainers of arrow-rs such as @viirya @askoa and @iajoiner think of these ideas. FYI @liukun4515 |
This is fantastic news! Thank you @jorgecarleitao and @ritchie46! |
Agree. :D
incremental fashion 👍 Do let me know if we need a small demo of the current arrow2 design to motivate it or something else. One idea is to wrap arrow2 arrays in arrow-rs newtypes
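To make the newtype idea concrete, here is a minimal sketch, assuming a much-simplified array layout. `Arrow2Primitive` is a hypothetical stand-in for an arrow2-style array and `PrimitiveArray` a hypothetical arrow-rs-facing wrapper; neither is the real type from either crate, and plain `Vec`s replace the real buffer and bitmap types.

```rust
// Hypothetical stand-in for an arrow2-style primitive array (not the real type).
#[derive(Debug, Clone, PartialEq)]
pub struct Arrow2Primitive<T> {
    pub values: Vec<T>,
    // `None` means every slot is valid.
    pub validity: Option<Vec<bool>>,
}

// Hypothetical arrow-rs-facing newtype that owns the arrow2-style array.
// The public API lives on the wrapper, so the inner layout can change
// without touching downstream call sites.
#[derive(Debug, Clone, PartialEq)]
pub struct PrimitiveArray<T>(Arrow2Primitive<T>);

impl<T: Copy> PrimitiveArray<T> {
    /// Builds an array from optional values; null slots store `fill`.
    pub fn from_options(items: &[Option<T>], fill: T) -> Self {
        let values = items.iter().map(|v| v.unwrap_or(fill)).collect();
        let validity: Vec<bool> = items.iter().map(|v| v.is_some()).collect();
        // Drop the validity mask entirely when there are no nulls.
        let validity = if validity.iter().all(|b| *b) {
            None
        } else {
            Some(validity)
        };
        PrimitiveArray(Arrow2Primitive { values, validity })
    }

    pub fn len(&self) -> usize {
        self.0.values.len()
    }

    pub fn is_null(&self, i: usize) -> bool {
        match &self.0.validity {
            Some(mask) => !mask[i],
            None => false,
        }
    }

    /// Returns `None` for null slots; panics if `i` is out of bounds.
    pub fn value(&self, i: usize) -> Option<T> {
        if self.is_null(i) {
            None
        } else {
            Some(self.0.values[i])
        }
    }

    /// Zero-cost escape hatch back to the wrapped array.
    pub fn into_inner(self) -> Arrow2Primitive<T> {
        self.0
    }
}
```

The wrapper adds no runtime cost, and `into_inner` gives code that already speaks the inner type a way to opt out of the wrapper.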
Thank you 🙇 - I will certainly help in gathering evidence regarding contributors.
@alamb - I agree - I believe ultimately some rewriting will have to happen, as we have arrow2 and arrow-rs users with two APIs, sometimes with the same name and different signatures. I do think that the way arrow-rs has been rewriting its APIs (i.e. gradually, with deprecation warnings, etc.) is very conducive and productive. 🎩 tip to the team here. |
@alamb Thanks for the tag. Though I can't gauge the benefits of arrow2 v arrow-rs, I agree with the notion that the change should happen incrementally to reduce the impact on downstream users of both projects. Removing I, as usual, continue to pick up issues based on my availability. |
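As an illustration of the gradual approach mentioned above, here is a minimal sketch of keeping an old signature alive behind Rust's `#[deprecated]` attribute while a new one is introduced. The `Int32Array` type, its methods, and the version number are hypothetical and chosen only for this example; they are not the actual arrow-rs or arrow2 API.

```rust
// Hypothetical array type; names and version numbers are illustrative only.
pub struct Int32Array {
    values: Vec<i32>,
}

impl Int32Array {
    pub fn new(values: Vec<i32>) -> Self {
        Self { values }
    }

    /// Old signature: panics when `i` is out of bounds.
    /// Kept for one or more releases so downstream crates still compile,
    /// but every use emits a compiler warning pointing at the replacement.
    #[deprecated(since = "0.2.0", note = "use `get`, which returns an Option")]
    pub fn value(&self, i: usize) -> i32 {
        self.get(i).expect("index out of bounds")
    }

    /// New signature: out-of-bounds access is reported as `None`.
    pub fn get(&self, i: usize) -> Option<i32> {
        self.values.get(i).copied()
    }
}

// Downstream code that has not migrated yet can silence the warning locally.
#[allow(deprecated)]
pub fn legacy_first(array: &Int32Array) -> i32 {
    array.value(0)
}

// Downstream code after migrating to the new signature.
pub fn migrated_first(array: &Int32Array) -> Option<i32> {
    array.get(0)
}
```

Both call sites compile against the same release, so downstream crates can migrate one function at a time instead of in a single breaking upgrade.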
This is perfectly possible in arrow2 as well. The logical types are stored by the |
Thanks @alamb for pinging me. I've read @jorgecarleitao's comments (and previous comments) when they were posted yesterday. I think my feelings come from two aspects. As a downstream crate user of As a contributor / maintainer of |
What do people think about doing something like #1799 and adding a layout enumeration inside of I'm mainly trying to avoid making changes to |
This comment is not about how we merge arrow-rs and arrow2, just trying to outline the overall issues I observed in arrow-rs that resulted in arrow2. It may serve as an inspiration to find a solution that combines both. 1.
|
What did you think of putting the arrow2 arrays inside of ArrayData, this would achieve all of the above without changing the public array APIs, at least initially? Edit: I've had a poke around in DataFusion and confirmed it makes fairly limited of |
Something like that is definitely the intended end state, but we need some way to get there incrementally 😅 I also have a vague hope that the new ArrayData enumeration will allow moving away from trait object downcasting, which is confusing and ergonomically unfortunate |
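To show what moving away from trait object downcasting could look like, here is a minimal sketch comparing the two styles. All types below (`Array`, `Int32Array`, `Utf8Array`, `ArrayData`) are simplified stand-ins written for this example, assuming just two layouts; they are not the real arrow-rs definitions.

```rust
use std::any::Any;

// Trait-object style: the concrete type is erased, so callers must downcast.
pub trait Array {
    fn as_any(&self) -> &dyn Any;
}

pub struct Int32Array(pub Vec<i32>);
pub struct Utf8Array(pub Vec<String>);

impl Array for Int32Array {
    fn as_any(&self) -> &dyn Any {
        self
    }
}

impl Array for Utf8Array {
    fn as_any(&self) -> &dyn Any {
        self
    }
}

/// Sums an integer array; a wrong guess about the type is only found at runtime.
pub fn sum_via_downcast(array: &dyn Array) -> Option<i64> {
    let ints = array.as_any().downcast_ref::<Int32Array>()?;
    Some(ints.0.iter().map(|v| i64::from(*v)).sum())
}

// Enum style: the set of layouts is closed and visible in the type.
pub enum ArrayData {
    Int32(Vec<i32>),
    Utf8(Vec<String>),
}

/// Same operation with a `match`; the compiler checks that every variant is handled.
pub fn sum_via_match(data: &ArrayData) -> Option<i64> {
    match data {
        ArrayData::Int32(values) => Some(values.iter().map(|v| i64::from(*v)).sum()),
        ArrayData::Utf8(_) => None,
    }
}
```

With the enum, adding a new variant makes every non-exhaustive `match` a compile error, whereas a missed `downcast_ref` branch only shows up as a `None` at runtime.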
Got it. 👍
Interesting! That is not something I had thought about and makes a lot of sense. 👍 From my end, your proposal: |
Wonderful, I'll make a start getting the basic abstractions in place next week, and will then create some tickets for the various migration work that will fall out of this so that we can divide and conquer on this. Hopefully by the time the IP clearance completes we will have gotten arrow-rs into a state where the arrow2 arrays can just be dropped in. |
Can these new-types be added to a separate crate and thereby leaving arrow2 as an arrow-core crate? This seems like a good separation of concerns to me and will also ensure parallel compilation and probably less complexity. |
I would expect the arrow2 arrays to largely replace what is currently in arrow-data, effectively arrow-data would become the arrow2 array abstractions. Any other ported functionality, e.g. IPC, would then go into the corresponding crate, e.g. arrow-ipc. I think that is what you are asking for? |
I think so. I think it is a combination of arrow-data and arrow-buffer. Maybe I am missing another one. The IO and compute can definitely be done in subcrates. What I also meant with the newtypes are the |
For those following along
Together these work towards allowing us to slowly deprecate and remove the untyped |
@alamb Glad to see arrow-rs and arrow2 could be merged together, I would like to contribute to help these processes. |
What do people think of migrating the arrow2 arrays to be newtypes around the statically typed ArrayData once ready, i.e. much like we are doing for the arrow-rs arrays? This would have a few advantages:
I don't really see a way to avoid doing something similar to this, without a break-the-world arrow2 release, and I think bringing this forward has some compelling advantages? |
Can we take This currently is a small wrapper around a typed
pub struct PrimitiveArray<T: NativeType> {
    data_type: DataType,
    values: Buffer<T>,
    validity: Option<Bitmap>,
}
As I understand you want to convert it to
struct PrimitiveArray<T: NativeType> {
    array_data: ArrayData,
    phantom_data: PhantomData<T>,
}
enum ArrayData {
    Primitive{..},
    Utf8{..},
    ...
}
This would cost an extra check upon random access of the array as the variant of the |
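For anyone who wants to poke at the concern above, here is a minimal runnable sketch of the enum-plus-`PhantomData` layout, assuming only two primitive variants and plain `Vec`s in place of real buffers. The `NativeType` trait methods `wrap` and `slice` are hypothetical helpers invented for this sketch, not the arrow2 or arrow-rs API.

```rust
use std::marker::PhantomData;

// Simplified untyped-at-the-Rust-level container: the variant records the layout.
pub enum ArrayData {
    Int32(Vec<i32>),
    Float64(Vec<f64>),
}

// Hypothetical trait tying a Rust type to its `ArrayData` variant.
pub trait NativeType: Copy + Sized {
    fn wrap(values: Vec<Self>) -> ArrayData;
    fn slice(data: &ArrayData) -> Option<&[Self]>;
}

impl NativeType for i32 {
    fn wrap(values: Vec<i32>) -> ArrayData {
        ArrayData::Int32(values)
    }
    fn slice(data: &ArrayData) -> Option<&[i32]> {
        match data {
            ArrayData::Int32(v) => Some(v.as_slice()),
            _ => None,
        }
    }
}

impl NativeType for f64 {
    fn wrap(values: Vec<f64>) -> ArrayData {
        ArrayData::Float64(values)
    }
    fn slice(data: &ArrayData) -> Option<&[f64]> {
        match data {
            ArrayData::Float64(v) => Some(v.as_slice()),
            _ => None,
        }
    }
}

pub struct PrimitiveArray<T: NativeType> {
    array_data: ArrayData,
    phantom_data: PhantomData<T>,
}

impl<T: NativeType> PrimitiveArray<T> {
    pub fn new(values: Vec<T>) -> Self {
        Self {
            array_data: T::wrap(values),
            phantom_data: PhantomData,
        }
    }

    /// Random access: the enum variant is checked on every call.
    pub fn value(&self, i: usize) -> T {
        T::slice(&self.array_data).expect("variant matches T by construction")[i]
    }

    /// Bulk access: the variant is checked once, then the slice is indexed freely.
    pub fn values(&self) -> &[T] {
        T::slice(&self.array_data).expect("variant matches T by construction")
    }
}
```

In this sketch `value` pays the variant check per element, while `values` pays it once per array, so hot loops can hoist the check by iterating over the returned slice.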
It won't be untyped it would contain |
Ah, I see. In that case it makes sense 👍 . By the look of it, it seems exactly the same as an arrow2 |
We theoretically could, I was proposing not to for the reasons outlined in #1176 (comment) |
I understand. For |
Perhaps you might like to check out #3769 which contains the remaining abstractions, and let me know what you think there? |
As an update here, I am working on creating a summary of what I think the proposal is in this ticket that won't require reading all the context, and then I will create an "epic" style ticket that breaks the work down into more manageable pieces. I expect to be ready in the next few days |
Here is a summary of what I think the plan is. I used Google Slides to make it easier to comment on diagrams: Here is a copy for anyone who prefers PDF: arrow-rs + arrow2.pdf. Very much looking forward to feedback. Thank you @tustvold and @ritchie46, who helped create this summary. I plan to make the tracking ticket next week. |
I have filed jorgecarleitao/arrow2#1429 with the proposal of how this could work and some alternatives. Please provide your feedback there. Unless I hear otherwise I plan to close this particular issue in a few days and we can continue the discussion on jorgecarleitao/arrow2#1429 |
TLDR: please comment on this ticket if you have opinions about if and/or how the community should unite its efforts on a single Rust implementation of Apache Arrow.
Related mailing list thread: https://lists.apache.org/thread/dsyks2ylbonhs8ngnx6529dzfyfdjjzo
There is active discussion and a PR apache/datafusion#1556 about switching the DataFusion project to use the arrow2 Rust implementation of Arrow from @jorgecarleitao. While this DataFusion PR is not yet ready to merge, if DataFusion were to switch to arrow2, that leaves a question of what will happen with this (arrow-rs) code.
Since many of the PRs, contributors and maintainers of this (arrow-rs) crate are part of the DataFusion community, I believe if DataFusion switches to arrow2, much of the maintenance and extension efforts would follow arrow2.
arrow2 is largely developed by @jorgecarleitao, who is an Apache Arrow PMC member and committer, but the project itself has not been under the Apache Software Foundation’s governance. Additional background can be found on the mailing list archives and past mailing list threads such as this and this.
It is my opinion that the Rust / Arrow / DataFusion community has general consensus on:
arrow2 are more ergonomic
It is not clear to me if there is a consensus on:
0.x vs 1.x or later)
Possible ideas for a way forward:
arrow2, making no changes to arrow-rs. It could be maintained by anyone who wished to contribute,
arrow2 code into the arrow-rs repo, with appropriate IP clearance and adopt that as the officially maintained arrow implementation (*)
Option 2 leaves open the question of “how does arrow2 development move forward” – where would patches be sent, for example? I would hope we can find a way that is compatible with Apache governance, but I don't think we have a specific proposal yet, and it also depends in large part on what @jorgecarleitao is comfortable with.
So, for any users of this crate not also in the DataFusion community, what are your hopes / needs / plans from this crate? How important is the apache governance to you? Please tell us your thoughts!