Google Summer of Code - Ideas and Coordination #1032

Open · timsaucer opened this issue Feb 20, 2025 · 8 comments

Comments

@timsaucer
Contributor

timsaucer commented Feb 20, 2025

Overview

Multiple people have expressed interest in working with DataFusion Python for a Google Summer of Code project. We are excited to have this level of interest and we always welcome contributions from the community.

The goal of this issue is to serve as a coordination point for people interested in working on or mentoring this project for GSoC. We would like to collect specific project ideas to help applicants pick something interesting that is achievable within the GSoC time frame.

This is a subproject under the greater Apache DataFusion GSoC: apache/datafusion#14478

Related Issues

These issues have all been identified as candidates for inclusion in a GSoC project. Feel free to suggest others. Ideally we would combine some of these issues into a larger unified goal.

Additionally, I know there is interest in working on integrations with Iceberg and other Table Providers. That work is not directly within this repository, but it does fall within the purview of the GSoC project in my opinion.

@timsaucer
Contributor Author

@Spaarsh @sidshehria I think this is a good place to coordinate specific ideas. I'm also seeking out an additional person who would be willing to mentor. We might also consider the datafusion-ray project, which is tightly coupled to datafusion-python.

@sidshehria

@timsaucer Can we start resolving the issues listed above right now?

@timsaucer
Contributor Author

Yes, of course. We're always interested in contributions to our open issues!

@sidshehria

@timsaucer
I believe improving Python bindings in Apache DataFusion would be a great step forward in making it more accessible to data engineers and analysts. Expanding the Python API surface and improving interoperability with libraries like Pandas and Polars would significantly enhance usability.

Some key areas that could make a big impact:
✅ Higher-level abstractions to make it more intuitive for end users.
✅ Better integration with Pandas/Polars to streamline workflows.
✅ Performance optimizations in the FFI layer for seamless execution.
✅ Enhanced documentation and examples to drive adoption.

Would love to hear the community’s thoughts on what improvements would be most valuable!

@timsaucer
Contributor Author

These sound like good goals. For the last one, enhanced documentation, we can correlate that to issue #842 and likely come up with a good plan of action.

For the first three, did you have concrete ideas either about solutions or more specifics of the problem?

@sidshehria

@timsaucer Yes, I have some possible solutions in mind. Kindly review them:

1. Higher-Level Abstractions:

  • Introduce a DataFrame-like API that feels more intuitive, similar to Pandas/Polars.
  • Expose simplified query execution methods, reducing the need for manual SQL queries.
  • Provide a lazy evaluation mode to optimize performance in large-scale data operations.

2. Better Integration with Pandas/Polars:

  • Implement direct conversion utilities between DataFusion and Pandas/Polars DataFrames.
  • Improve data type compatibility to ensure smooth interoperability.
  • Support efficient batch processing, leveraging Arrow’s memory format.

3. Performance Optimizations in the FFI Layer:

  • Reduce overhead in Python-Rust interop using PyO3/maturin optimizations.
  • Optimize data movement between Python and Rust to minimize serialization costs.
  • Explore parallel execution to enhance computation speed for large datasets.

@timsaucer
Contributor Author

For the high-level abstractions, I believe these are already met. The DataFrame API is available and widely used (in fact, it's the only way I personally use it). The common operations section of the online documentation has a handful of sub-pages that describe usage of the API, as does the API reference.

DataFusion already uses lazy evaluation.
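
For anyone following along, here is a minimal sketch of what that looks like today (assuming a recent datafusion release from PyPI; example.csv and the column names are hypothetical, just for illustration):

```python
from datafusion import SessionContext, col, lit

# Create a session and load a CSV file (example.csv is a hypothetical file).
ctx = SessionContext()
df = ctx.read_csv("example.csv")

# These calls only build a logical plan; nothing executes yet (lazy evaluation).
df = df.filter(col("amount") > lit(100)).select(col("id"), col("amount"))

# Execution happens only when results are requested, e.g. show() or collect().
df.show()
```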

For the integration with Pandas and Polars, support for this exists and is described in the data sources page.
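
To make that existing interop concrete, here is a small sketch of round-tripping through Pandas and Polars (the from_pandas / to_pandas / to_polars helpers below reflect my understanding of recent datafusion-python releases; the data is made up for illustration):

```python
import pandas as pd
from datafusion import SessionContext

ctx = SessionContext()

# Hypothetical in-memory data, purely for illustration.
pdf = pd.DataFrame({"id": [1, 2, 3], "amount": [50, 150, 300]})

# Pandas -> DataFusion: build a DataFrame directly from a Pandas frame.
df = ctx.from_pandas(pdf, name="orders")

# DataFusion -> Pandas / Polars once the query is defined.
result_pandas = df.to_pandas()
result_polars = df.to_polars()  # requires polars to be installed
```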

As for efficient batch processing leveraging Arrow's memory format, that is how DataFusion already operates.

For the PyO3 interface, I'm not familiar with what optimizations you have in mind to reduce overhead. I'd be curious where you think we have issues currently. I'd also love to hear if you have ideas about optimizing the data movement between Python and Rust. This is a difficult problem, but we do already leverage the pyarrow FFI interface to avoid many of the data translation inefficiencies.
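
As a concrete illustration of the existing Arrow-based path, data that is already in pyarrow record batches can be registered without any per-row conversion (a sketch based on my reading of the current API; the table name and data are hypothetical):

```python
import pyarrow as pa
from datafusion import SessionContext

ctx = SessionContext()

# Data already in Arrow format stays in Arrow format across the FFI boundary.
batch = pa.RecordBatch.from_pydict({"id": [1, 2, 3], "amount": [50, 150, 300]})

# register_record_batches takes a list of partitions, each a list of batches.
ctx.register_record_batches("orders", [[batch]])

df = ctx.sql("SELECT id, amount FROM orders WHERE amount > 100")

# Results come back as Arrow record batches as well.
batches = df.collect()
```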

Parallel execution is also already supported, but there are additional efforts like datafusion-ray and ballista where we push the envelope much further by going into distributed processing. Those are under heavy/active development right now and also a very good place to make contributions.
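
For local parallelism specifically, here is a quick sketch of what tuning looks like today (assuming SessionConfig.with_target_partitions behaves as I recall; the input file is hypothetical):

```python
from datafusion import SessionConfig, SessionContext

# Ask DataFusion to plan for more parallel partitions on a single machine.
config = SessionConfig().with_target_partitions(8)
ctx = SessionContext(config)

df = ctx.read_parquet("example.parquet")  # hypothetical input file
df.show()
```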

@sidshehria

@timsaucer
Thanks for the clarity!

I now better understand the explanation of the DataFrame API, lazy evaluation, and Pandas/Polars integration. I will refer to the common operations documentation and the data sources page more extensively to grasp the current implementation in detail.

To optimize PyO3 overhead, I will look into:

  1. Profiling the FFI interface to understand Python-Rust data movement bottlenecks (a rough timing sketch follows after this list).
  2. Researching zero-copy data transfer options to reduce overhead further.
  3. Checking if alternative serialization methods can improve efficiency over pyarrow's current approach.
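
To make item 1 a bit more concrete, this is the kind of very rough timing harness I have in mind as a starting point (plain Python; nothing DataFusion-specific is assumed beyond read_parquet and to_pandas, and the file name is hypothetical):

```python
import time
from datafusion import SessionContext

ctx = SessionContext()
df = ctx.read_parquet("example.parquet")  # hypothetical input file

# Time the Rust -> Python materialization step to see where the cost sits.
start = time.perf_counter()
pdf = df.to_pandas()
elapsed = time.perf_counter() - start

print(f"to_pandas: {elapsed:.3f}s for {len(pdf)} rows")
```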

For parallel execution and distributed processing, I'll look into datafusion-ray and ballista to understand their current development and potential contribution areas.

Would love any pointers on known performance pain points in the PyO3 interface that would be valuable to address!

Thanks again for the guidance!
