Google Summer of Code - Ideas and Coordination #1032
Comments
@Spaarsh @sidshehria I think this is a good place to coordinate specific ideas. I'm also seeking out an additional person who would be willing to mentor. We might also consider the |
@timsaucer Can we start working on the issues listed above right away? |
Yes, of course. We're always interested in contribution to our open issues! |
@timsaucer Some key areas that could make a big impact: Would love to hear the community’s thoughts on what improvements would be most valuable! |
These sound like good goals. For the last one, enhanced documentation, we can correlate that to issue #842 and likely come up with a good plan of action. For the first three, did you have concrete ideas either about solutions or more specifics of the problem? |
@timsaucer Yes, I have some solutions in mind. Kindly review them:
1. Higher-Level Abstractions:
2. Better Integration with Pandas/Polars:
3. Performance Optimizations in the FFI Layer:
|
For the high-level abstractions, I believe these are already met. The DataFrame API is available and widely used (in fact, it's the only way I personally use it). The common operations section of the online documentation has a handful of sub-pages that describe usage of the API, and the API reference covers it as well. DataFusion does already use a lazy evaluation mode.
For the integration with Pandas and Polars, support for this exists and is described in the data sources page.
For the efficient batch processing leveraging Arrow's memory format, that is how DataFusion operates currently.
For the PyO3 interface, I'm not familiar with what optimizations you have in mind to reduce overhead. I'd be curious where you think we have issues currently. I'd also love to hear if you have ideas about optimizing the data movement between Python and Rust. This is a difficult problem, but we do already leverage the pyarrow FFI interface to avoid many of the data translation inefficiencies.
Parallel execution is also already supported, but there are additional efforts like |
@timsaucer I now better understand the DataFrame API, the lazy evaluation mode, and the Pandas/Polars integration. I will study the common operations documentation and the data sources page more closely to understand the current implementation in detail. To reduce PyO3 overhead, I will look into:
For parallel execution and distributed processing, I'll look into datafusion-ray and ballista to understand their current development and potential contribution areas. I would love any pointers on known performance pain points in the PyO3 interface that would be valuable to address. Thanks again for the guidance! |
Overview
Multiple people have expressed interest in working with DataFusion Python for a Google Summer of Code project. We are excited to have this level of interest and we always welcome contributions from the community.
The goal of this issue is to have a coordination point for people interested in working on and mentoring this project for GSoC. We would like to collect specific ideas for projects. This is to help those applicants pick something interesting that is achievable in the time frame of GSoC.
This is a subproject under the greater Apache DataFusion GSoC: apache/datafusion#14478
Related Issues
These issues have all been identified as candidates for inclusion in a GSoC project. Feel free to suggest others. Ideally we would combine some of these issues into a larger unified goal.
col
#754
Additionally, I know there is interest in working on integrations with Iceberg and other Table Providers. That work is not directly within this repository, but it does fall within the purview of the GSoC project in my opinion.