discuss: Move into the Apache ORC PMC and develop as apache/orc-rust
#120
About the place to move in, we can also consider Arrow or Datafusion, given this repo is deeply involved with Apache Arrow and Datafusion at the API level. Like the fact that |
To be honest I find it a bit surprising to have integration with a query engine in an ORC library. Would it make sense to split the Datafusion-related bits into either their own crate (with all the fun of keeping versions in sync), or move them to Datafusion (like it already does with Parquet)? |
Cross reference: https://lists.apache.org/thread/zrwnhwojf9v5c58hov8hcnpt03ftf3ql |
Hi, I wasn't involved in the datafusion side of development, so I'm not familiar with the ORC and Datafusion integration. From my perspective as a passerby, I agree with @progval that it would be better to separate the ORC support component directly into DataFusion, similar to how |
We have three possible options:
|
Chiming in from the Apache ORC community. I'm very excited for the discussion! Sorry that I'm not familiar with rust. For the approach of |
I created a dependency list, and I believe it meets the requirements of the ASF. Please check out the details here:
0BSD (1): adler@1.0.2
I believe it should be simple since it's just a |
I find a similar question about the relationship between However if we are going to implement features that are not a strong demand from Datafusion side (like ORC writer apache/orc#1507) or integrate it with other consumers (like Databend databendlabs/databend#8016), having a dedicated repo would both reduce the maintenance burden of Datafusion and make the lib itself easier to use. I agree with the opinion of separating this into two parts. The ORC format resides in a dedicated repo like |
I agree with @waynexia and @progval that the following split makes a lot of sense to me
|
FWIW I think this was partly an artifact of history:
Given the current state of the code, I think it would be plausible to split parquet out of arrow-rs, but I also think unless there is some substantially larger group of maintainers that aren't also maintainers of arrow-rs it is likely easier to leave it there |
cc @dongjoon-hyun @guiyanakuang @williamhyun @omalley from Apache ORC PMC |
Agreed. I have thought about this before but haven't taken any action yet. I mean, it looks appealing to have I'm starting this thread because I believe it's beneficial for orc-rs to build a community by developing at upstream, but it doesn't seem applicable to parquet-rs at the moment. |
Previously we discussed splitting parquet-cpp out of arrow-c++. However, the dependency would be weird since there are:
|
I believe the situation is different in |
Thank you for the discussion. It looks like we can move forward! I think we can:
|
I'll start preparing a PR to split the current repo. Do you have something like guidance for IP clearance? I have attended it before but have not prepared one. |
Thanks!
I think we can follow https://incubator.apache.org/ip-clearance/. Here's an example from apache/arrow-rs#2096. We can reach out to @alamb if we encounter any problems.
Hi @progval @klangner, as part of the IP clearance process, could you please submit an ICLA (Individual Contributor Licence Agreement) following the instructions at https://www.apache.org/licenses/contributor-agreements.html if you do not already have one on file? Thanks in advance for helping with this! If you have already filed one, please let me know the email address associated with your account.
I would like to chime in with my thoughts. I do apologize for being inactive, and have been meaning to pick up the work I left off on this repository (specifically the basic write functionality). The way I see it, the primary focus of this repository is to serve as an integration with DataFusion to allow querying ORC files. Naturally this required first implementing a layer to read ORC files to Arrow, before then being able to integrate into DataFusion itself (similar to how there is parquet-rs, then the actual parquet integration code in DataFusion). I can see the merit of splitting up this repository, but perhaps it is still too early to do so? One benefit of having both the integration with Arrow and the integration with DataFusion in a single repository is that it allows easier development, as these interfaces will be interacting with each other. Splitting across different repositories might make it harder to experiment with the interface for each respective integration, which can slow down development. Furthermore, I don't think there were any immediate plans to develop a native ORC interface; that is, being able to read ORC in Rust without reading it to Arrow (similar to how parquet-rs has a low level column reader/writer API). From my point of view then, it might seem odd to donate a primarily Arrow <-> ORC interface library to ORC.
I think I have already signed it some time ago while doing some other work. |
Thank you very much for your contribution!
From my perspective (as a committer on some Apache projects), it's already late for us to do so. Developing at upstream can create a solid foundation for our entire community to build upon, making it easier for those interested in using ORC in Rust to find this project. Additionally, we can garner more support from the ORC community. Building a strong community is the key to our success. For example, we started iceberg-rust as a very basic project that could only read tables, but it has now grown to 53 contributors with full catalog support. By donating this to ORC, I expect to build a community around it, similar to what we've done with iceberg-rust. Therefore, instead of waiting for our project to mature and gain full support, I prefer to start and attract more people to join now. I believe it's fine for us to use the existing
I agree, but it depends on the community's feature requests. I would be happy to work with the community if someone wants to collaborate on this. |
Yes -- I think one potential benefit to splitting out orc-rs would be that others who are not using it in the context of DataFusion might be more willing to help with the development. I do not know how likely that is at this point, though |
I have three such cases on my tables:
|
paimon-rust may also need native ORC support, but not DataFusion.
This case is interesting since paimon-rust will need datafusion but not require orc with datafusion. Paimon requires orc to read data but provides datafusion integration on its own. |
Is there still the parquet-rust project? |
I do not know what https://parquet.apache.org/docs/contribution-guidelines/sub-projects/ has a list of open source rust implementations
|
Hi, this project does not belong to your employer (please correct me if I'm wrong). This donation will be sent from |
Hi, @alamb. This reminds me that we should establish the CLA for all projects in the |
I think the idea with datafusion-contrib is to minimize process overhead (such as apache CLAs) and mostly serve as a very disparate set of crates. As they mature, we can then apply more process (as we are doing in this case). The rationale is that many of the crates in datafusion-contrib will likely never get to the stage where they would be donated to the Apache foundation and thus any up-front cost to prepare for that is wasted effort (and thus reduces contributions).
Understood, thank you. This design makes sense to me. |
I see mention of not needing the DataFusion integration code as motivation, but could this be addressed by splitting the current project to have two subcrates, one for pure Arrow-ORC and the other for DataFusion integration? I wanted to do this initially but kept DataFusion as a feature to make it easier to develop with, especially since the DataFusion integration code is currently quite small (though I guess the dependency footprint isn't 😅 ) |
Hi, @progval. Sorry for the interruption. I wanted to check if it works well. |
https://www.apache.org/licenses/contributor-agreements.html says CCLAs are "For a corporation that assigns employees to work on an Apache project", and I was an employee assigned to work on the project. Either way, I need my ex-employer's permission for the ICLA
My ex-employer's staff came back from summer vacation this week, and they are about to start processing my request. Current employer won't be an issue. Sorry for the delay. |
I just submitted my ICLA, and a CCLA from my ex-employer who owns all my past contributions.
|
Wow, really great! cc @waynexia, are you still interested in working on this? The action items are:
Please let me know if you need a hand.
cc @alamb, would you like to help create |
That's great news!
I'm resuming the IP clearance procedure, and will post any future problems to this thread. Edit: as well as the code split work.
https://github.com/datafusion-contrib/orc-rs is set up with @Xuanwo and @waynexia as admins.
IP Clearance file is updated to https://svn.apache.org/repos/asf/incubator/public/trunk/content/ip-clearance/orc-rs.xml |
Cross-referencing the thread from dev@orc.a.o https://lists.apache.org/thread/l6b0hsq29rr6to96tqmjpxt2mwz4nzbc |
Hi @alamb, apologies for my mistake. The repository should be named I'm going to rename it now, just FYI. |
Donation PR: datafusion-contrib/orc-rust#1 |
Merged! We can remove duplicate code in this repo and transfer issues to new repositories now. |
Great 🎉 There is a temporary branch
Context: from the above discussion this repo will only focus on DataFusion-ORC data source integration in the future. |
I personally feel that it's better to send a PR to remove donated code. Having a repo where the main branch is not |
Hi, @waynexia, I believe we are ok to implement this change. |
Thanks for reminding 🙈 I'll file a PR to the current main to remove ORC implementation and use the released upstream instead tonight. |
Progress update: the entire process is almost done, unless I have missed something (code split, IP clearance, transferring issues & tags, etc.). The one remaining step is waiting for the ORC PMC to accept https://github.com/datafusion-contrib/orc-rust
Thank you! cc @wgtmac, would you like to start a VOTE for this? |
Thanks for the heads up! @Xuanwo Could you please provide the list of committers that will join the Apache ORC PMC? I will include this in the vote as well. |
Hello, everyone. I am initiating this discussion to explore the possibility of moving into the Apache ORC PMC and developing apache/orc-rust. By developing apache/orc-rust, we will establish this implementation as the official Rust version of ORC, thereby creating a larger and more cohesive community for those interested in a Rust ORC implementation. This will make it much easier for us to build a community around this project.
What are your thoughts? I plan to discuss this with the ORC community if contributors are satisfied with it.
cc @Jefffrey @WenyXu @progval @waynexia @klangner @alamb @v0y4g3r @youngsofun @harveyyue