
Conversation

@zhengruifeng
Contributor

@zhengruifeng zhengruifeng commented Dec 11, 2025

What changes were proposed in this pull request?

Add a config `spark.sql.execution.pandas.backend`; when it is set to `pyarrow`, pyarrow-backed pandas Series are provided to UDF execution (see the usage sketch below).
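A minimal usage sketch, assuming the config lands as proposed in this PR (the config name comes from the PR description; everything else is standard PySpark):

```python
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

spark = SparkSession.builder.getOrCreate()
# Proposed in this PR; not available in released Spark versions.
spark.conf.set("spark.sql.execution.pandas.backend", "pyarrow")

@pandas_udf("long")
def plus_one(s: pd.Series) -> pd.Series:
    # With the pyarrow backend, s would arrive as a pyarrow-backed
    # Series (ArrowDtype) instead of the usual numpy-backed one.
    return s + 1

spark.range(5).select(plus_one("id")).show()
```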

The execution path has three stages (see the standalone sketch after this list):

1. Pre-execution arrow-to-pandas conversion: the data is converted to pyarrow-backed pandas data, which is zero copy.
2. UDF execution: if the operation is compatible with pyarrow (e.g. `ser -> ser + 1`), the computation result is also pyarrow-backed; otherwise it falls back to a numpy-backed result (e.g. `ser -> ser.apply(...)` always generates a numpy-backed Series).
3. Post-execution pandas-to-arrow conversion: if the computation result is a pyarrow-backed instance, the conversion is zero copy.
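The same mechanics can be observed with plain pandas + pyarrow, outside Spark. The `types_mapper=pd.ArrowDtype` route below is one standard way to get pyarrow-backed pandas data; whether the PR uses exactly this path internally is an assumption:

```python
# Standalone illustration of the three steps, using only pandas and pyarrow.
import pandas as pd
import pyarrow as pa

# 1. arrow -> pandas with an ArrowDtype mapper: wraps the arrow buffers
#    instead of copying them into numpy arrays.
tbl = pa.table({"v": pa.array([1, 2, 3], type=pa.int64())})
ser = tbl.to_pandas(types_mapper=pd.ArrowDtype)["v"]
print(ser.dtype)                         # int64[pyarrow]

# 2a. An arrow-compatible operation keeps the pyarrow backend.
print((ser + 1).dtype)                   # int64[pyarrow]

# 2b. Series.apply goes element-wise through Python objects, so the
#     result falls back to a numpy-backed Series.
print(ser.apply(lambda x: x + 1).dtype)  # int64 (numpy-backed)

# 3. pandas -> arrow: a pyarrow-backed column can hand its buffers back
#    to arrow without a copy.
out = pa.Table.from_pandas(ser.to_frame("v"))
print(out.column("v").type)              # int64
```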

Why are the changes needed?

Pandas is moving toward a pyarrow backend; the coming 3.0 release starts using an arrow-backed string type by default. See https://pandas.pydata.org/docs/dev/whatsnew/v3.0.0.html#dedicated-string-data-type-by-default:

Starting with pandas 3.0, a dedicated string data type is enabled by default (backed by PyArrow under the hood, if installed, otherwise falling back to being backed by NumPy object-dtype).
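For concreteness, a quick dtype check (behavior as described in the whatsnew entry above; the exact dtype repr under pandas 3.0 is an assumption):

```python
import pandas as pd

ser = pd.Series(["a", "b", "c"])
# pandas < 3.0: object dtype
# pandas >= 3.0: the new dedicated string dtype, pyarrow-backed if
# pyarrow is installed, otherwise a numpy object-dtype fallback
print(ser.dtype)
```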

Does this PR introduce any user-facing change?

The config is disabled by default, but the behavior change when it is enabled is unknown right now.

How was this patch tested?

CI.

Was this patch authored or co-authored using generative AI tooling?

No.

@zhengruifeng zhengruifeng changed the title from [WIP] Zero Copy Pandas UDF to [WIP][PYTHON] Zero Copy Pandas UDF on Dec 11, 2025
@holdenk
Contributor

holdenk commented Dec 11, 2025

"config being able by default" does not mean no user facing change imho.

@zhengruifeng
Contributor Author

@holdenk Thanks, updated the description.
This PR is just for investigation; the behavior change is unknown right now.

