Google Summer of Code - Ideas and Coordination #1032

Open · timsaucer opened this issue Feb 20, 2025 · 8 comments

Comments

@timsaucer
Contributor

timsaucer commented Feb 20, 2025

Overview

Multiple people have expressed interest in working with DataFusion Python for a Google Summer of Code project. We are excited to have this level of interest and we always welcome contributions from the community.

The goal of this issue is to serve as a coordination point for people interested in working on or mentoring this project for GSoC. We would like to collect specific project ideas to help applicants pick something interesting that is achievable within the GSoC time frame.

This is a subproject under the greater Apache DataFusion GSoC: apache/datafusion#14478

Related Issues

These issues have all been identified as candidates for inclusion in a GSoC project. Feel free to suggest others. Ideally we would combine some of these issues into a larger unified goal.

Additionally, I know there is interest in working on integrations with Iceberg and other Table Providers. That work is not directly within this repository, but it does fall within the purview of the GSoC project in my opinion.

@timsaucer
Contributor Author

@Spaarsh @sidshehria I think this is a good place to coordinate specific ideas. I'm also seeking out an additional person who would be willing to mentor. We might also consider the datafusion-ray project, which is tightly coupled to datafusion-python.

@sidshehria

@timsaucer Can we start resolving the issues listed above right now?

@timsaucer
Contributor Author

Yes, of course. We're always interested in contributions to our open issues!

@sidshehria

@timsaucer
I believe improving Python bindings in Apache DataFusion would be a great step forward in making it more accessible to data engineers and analysts. Expanding the Python API surface and improving interoperability with libraries like Pandas and Polars would significantly enhance usability.

Some key areas that could make a big impact:
✅ Higher-level abstractions to make it more intuitive for end users.
✅ Better integration with Pandas/Polars to streamline workflows.
✅ Performance optimizations in the FFI layer for seamless execution.
✅ Enhanced documentation and examples to drive adoption.

Would love to hear the community’s thoughts on what improvements would be most valuable!

@timsaucer
Contributor Author

These sound like good goals. For the last one, enhanced documentation, we can correlate that to issue #842 and likely come up with a good plan of action.

For the first three, did you have concrete ideas either about solutions or more specifics of the problem?

@sidshehria

@timsaucer Yes, I have some possible solutions in mind. Kindly review them:

1. Higher-Level Abstractions:

  • Introduce a DataFrame-like API that feels more intuitive, similar to Pandas/Polars.
  • Expose simplified query execution methods, reducing the need for manual SQL queries.
  • Provide a lazy evaluation mode to optimize performance in large-scale data operations.

2. Better Integration with Pandas/Polars:

  • Implement direct conversion utilities between DataFusion and Pandas/Polars DataFrames.
  • Improve data type compatibility to ensure smooth interoperability.
  • Support efficient batch processing, leveraging Arrow’s memory format.

3. Performance Optimizations in the FFI Layer:

  • Reduce overhead in Python-Rust interop using PyO3/maturin optimizations.
  • Optimize data movement between Python and Rust to minimize serialization costs.
  • Explore parallel execution to enhance computation speed for large datasets.

@timsaucer
Contributor Author

For the high-level abstractions, I believe these are already met. The DataFrame API is available and widely used (in fact, it's the only way I personally use it). The common operations section of the online documentation has a handful of sub-pages that describe usage of the API, as does the API reference.

DataFusion already uses lazy evaluation.
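
For anyone following along, here is a minimal sketch of what that looks like today (assuming a recent datafusion release from PyPI; example.csv and the column names are hypothetical, just for illustration):

```python
from datafusion import SessionContext, col, lit

# Create a session and load a CSV file (example.csv is a hypothetical file).
ctx = SessionContext()
df = ctx.read_csv("example.csv")

# These calls only build a logical plan; nothing executes yet (lazy evaluation).
df = df.filter(col("amount") > lit(100)).select(col("id"), col("amount"))

# Execution happens only when results are requested, e.g. show() or collect().
df.show()
```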

For the integration with Pandas and Polars, support for this exists and is described in the data sources page.
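
To make that existing interop concrete, here is a small sketch of round-tripping through Pandas and Polars (the from_pandas / to_pandas / to_polars helpers below reflect my understanding of recent datafusion-python releases; the data is made up for illustration):

```python
import pandas as pd
from datafusion import SessionContext

ctx = SessionContext()

# Hypothetical in-memory data, purely for illustration.
pdf = pd.DataFrame({"id": [1, 2, 3], "amount": [50, 150, 300]})

# Pandas -> DataFusion: build a DataFrame directly from a Pandas frame.
df = ctx.from_pandas(pdf, name="orders")

# DataFusion -> Pandas / Polars once the query is defined.
result_pandas = df.to_pandas()
result_polars = df.to_polars()  # requires polars to be installed
```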

As for efficient batch processing leveraging Arrow's memory format, that is how DataFusion already operates.

For the PyO3 interface, I'm not familiar with what optimizations you have in mind to reduce overhead. I'd be curious where you think we have issues currently. I'd also love to hear if you have ideas about optimizing the data movement between Python and Rust. This is a difficult problem, but we do already leverage the pyarrow FFI interface to avoid many of the data translation inefficiencies.
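
As a concrete illustration of the existing Arrow-based path, data that is already in pyarrow record batches can be registered without any per-row conversion (a sketch based on my reading of the current API; the table name and data are hypothetical):

```python
import pyarrow as pa
from datafusion import SessionContext

ctx = SessionContext()

# Data already in Arrow format stays in Arrow format across the FFI boundary.
batch = pa.RecordBatch.from_pydict({"id": [1, 2, 3], "amount": [50, 150, 300]})

# register_record_batches takes a list of partitions, each a list of batches.
ctx.register_record_batches("orders", [[batch]])

df = ctx.sql("SELECT id, amount FROM orders WHERE amount > 100")

# Results come back as Arrow record batches as well.
batches = df.collect()
```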

Parallel execution is also already supported, but there are additional efforts like datafusion-ray and ballista where we push the envelope much further by going into distributed processing. Those are under heavy/active development right now and also a very good place to make contributions.
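
For local parallelism specifically, here is a quick sketch of what tuning looks like today (assuming SessionConfig.with_target_partitions behaves as I recall; the input file is hypothetical):

```python
from datafusion import SessionConfig, SessionContext

# Ask DataFusion to plan for more parallel partitions on a single machine.
config = SessionConfig().with_target_partitions(8)
ctx = SessionContext(config)

df = ctx.read_parquet("example.parquet")  # hypothetical input file
df.show()
```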

@sidshehria

@timsaucer
Thanks for the clarity!

I now better understand the explanation of the DataFrame API, lazy evaluation, and Pandas/Polars integration. I will refer to the common operations documentation and the data sources page more extensively to grasp the current implementation in detail.

To optimize PyO3 overhead, I will look into:

  1. Profiling the FFI interface to understand Python-Rust data movement bottlenecks (a rough timing sketch follows after this list).
  2. Researching zero-copy data transfer options to reduce overhead further.
  3. Checking if alternative serialization methods can improve efficiency over pyarrow's current approach.
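
To make item 1 a bit more concrete, this is the kind of very rough timing harness I have in mind as a starting point (plain Python; nothing DataFusion-specific is assumed beyond read_parquet and to_pandas, and the file name is hypothetical):

```python
import time
from datafusion import SessionContext

ctx = SessionContext()
df = ctx.read_parquet("example.parquet")  # hypothetical input file

# Time the Rust -> Python materialization step to see where the cost sits.
start = time.perf_counter()
pdf = df.to_pandas()
elapsed = time.perf_counter() - start

print(f"to_pandas: {elapsed:.3f}s for {len(pdf)} rows")
```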

For parallel execution and distributed processing, I'll look into datafusion-ray and ballista to understand their current development and potential contribution areas.

Would love any pointers on known performance pain points in the PyO3 interface that would be valuable to address!

Thanks again for the guidance!
