[SPARK-54316][CORE][PYTHON][SQL] Consolidate GroupPandasIterUDFSerializer with GroupPandasUDFSerializer#53043

Closed
Yicong-Huang wants to merge 35 commits into apache:master from
Yicong-Huang:SPARK-54316/refactor/consolidate-pandas-iter-serializer

@Yicong-Huang Yicong-Huang commented Nov 13, 2025

What changes were proposed in this pull request?

This PR consolidates GroupPandasUDFSerializer to support both SQL_GROUPED_MAP_PANDAS_UDF and SQL_GROUPED_MAP_PANDAS_ITER_UDF, aligning with the design pattern used by GroupArrowUDFSerializer.

Why are the changes needed?

When the Iterator[pandas.DataFrame] API was added to groupBy().applyInPandas() in SPARK-53614 (#52716), a new GroupPandasIterUDFSerializer class was created. However, this class is nearly identical to GroupPandasUDFSerializer: the two differ only in whether batches are processed lazily (iterator mode) or all at once (regular mode).
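
The lazy-versus-all-at-once distinction can be illustrated with a minimal, Spark-free sketch. Everything in it is hypothetical: `load_group`, `apply_regular`, `apply_iter`, and `process_group` are illustrative names, and plain lists of integers stand in for Arrow batches. It is not the actual GroupPandasUDFSerializer code; it only shows how one code path could serve both modes when the loading step is shared.

```python
from typing import Callable, Iterable, Iterator, List

# Assumption: a "batch" is modelled as a plain list of ints, not an Arrow batch.
Batch = List[int]


def load_group(batches: Iterable[Batch]) -> Iterator[Batch]:
    """Shared step: yield the batches of one group lazily, one at a time."""
    for batch in batches:
        yield batch


def apply_regular(func: Callable, batches: Iterator[Batch]) -> Iterator:
    """Regular mode: materialize the whole group, then call func once."""
    whole_group = [row for batch in batches for row in batch]
    yield func(whole_group)


def apply_iter(func: Callable, batches: Iterator[Batch]) -> Iterator:
    """Iterator mode: pass the lazy batch iterator to func and stream its results."""
    yield from func(batches)


def process_group(func: Callable, raw_batches: Iterable[Batch], iterator_mode: bool) -> Iterator:
    """Single entry point: only the final step depends on the mode."""
    batches = load_group(raw_batches)
    if iterator_mode:
        return apply_iter(func, batches)
    return apply_regular(func, batches)


raw = [[1, 2], [3], [4, 5, 6]]

# Regular mode sees all rows of the group at once.
regular_out = list(process_group(lambda rows: sum(rows), raw, iterator_mode=False))

# Iterator mode sees one batch at a time and may emit one result per batch.
iter_out = list(process_group(lambda it: (sum(b) for b in it), raw, iterator_mode=True))
```

In this sketch both modes return an iterator, so a caller can consume either one the same way.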

Does this PR introduce any user-facing change?

No, this is an internal refactoring that maintains backward compatibility. The API behavior remains the same from the user's perspective.

How was this patch tested?

Existing test cases.

Was this patch authored or co-authored using generative AI tooling?

Co-Generated-by: Cursor with Claude 4.5 Sonnet

@Yicong-Huang Yicong-Huang force-pushed the SPARK-54316/refactor/consolidate-pandas-iter-serializer branch from 373426a to f8f7201 Compare November 19, 2025 01:02
@Yicong-Huang Yicong-Huang force-pushed the SPARK-54316/refactor/consolidate-pandas-iter-serializer branch from f8f7201 to 0cc9432 Compare November 20, 2025 00:45
@Yicong-Huang Yicong-Huang changed the title [SPARK-54316][PYTHON] Consolidate GroupPandasIterUDFSerializer with GroupPandasUDFSerializer [SPARK-54316][PYTHON] Align Pandas-related serializers with Arrow-related serializers Nov 24, 2025
@Yicong-Huang Yicong-Huang changed the title [SPARK-54316][PYTHON] Align Pandas-related serializers with Arrow-related serializers [SPARK-54316][CORE][PYTHON][SQL] Align Pandas-related serializers with Arrow-related serializers Nov 25, 2025
batches_gen,
arrow_type,
) in iterator: # tuple constructed in wrap_grouped_*_pandas_udf
# yields df for single UDF or [(df1, type1), (df2, type2), ...] for multiple UDFs
Contributor

Is this serializer dedicated to SQL_GROUPED_MAP_PANDAS_UDF and SQL_GROUPED_MAP_PANDAS_ITER_UDF?
I think they don't support multiple UDFs.

Contributor Author

OK, I mistakenly thought they should support multiple UDFs, which made the implementation more complex. I have removed this assumption and simplified the code.

return f(keys, value_series_gen)

elif eval_type in (
PythonEvalType.SQL_GROUPED_AGG_PANDAS_UDF,
Contributor

Can we exclude the changes for SQL_GROUPED_AGG_PANDAS_UDF and SQL_WINDOW_AGG_PANDAS_UDF to make the PR cleaner?
We can do that in a separate PR.

Contributor Author

removed!

@Yicong-Huang Yicong-Huang changed the title [SPARK-54316][CORE][PYTHON][SQL] Align Pandas-related serializers with Arrow-related serializers [SPARK-54316][CORE][PYTHON][SQL] Consolidate GroupPandasIterUDFSerializer with GroupPandasUDFSerializer Nov 26, 2025
@Yicong-Huang Yicong-Huang changed the title [SPARK-54316][CORE][PYTHON][SQL] Consolidate GroupPandasIterUDFSerializer with GroupPandasUDFSerializer [WIP][SPARK-54316][CORE][PYTHON][SQL] Consolidate GroupPandasIterUDFSerializer with GroupPandasUDFSerializer Nov 26, 2025

elif eval_type in (
PythonEvalType.SQL_GROUPED_AGG_PANDAS_UDF,
PythonEvalType.SQL_WINDOW_AGG_PANDAS_UDF,
Contributor

@zhengruifeng zhengruifeng Nov 26, 2025

Why do we need a new mapper for SQL_GROUPED_AGG_PANDAS_UDF and SQL_WINDOW_AGG_PANDAS_UDF?

Contributor Author

GroupPandasUDFSerializer now returns an iterator. The mappers for SQL_GROUPED_AGG_PANDAS_UDF and SQL_WINDOW_AGG_PANDAS_UDF expect a list (so they can use a[0] to access the column), so the iterator returned from GroupPandasUDFSerializer has to be converted to a list in those mappers.
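
A small sketch of the mismatch described in this comment, using only plain Python. The names `serializer_output` and `mapper` are hypothetical stand-ins, and strings stand in for columns; the point is only that a generator cannot be indexed with `a[0]` until it is materialized as a list.

```python
from typing import Iterator, List


def serializer_output() -> Iterator[str]:
    """Hypothetical stand-in for a serializer that now yields columns lazily."""
    yield "col_a"
    yield "col_b"


def mapper(a: List[str]) -> str:
    """Hypothetical stand-in for a mapper that indexes its input with a[0]."""
    return a[0]


# Passing the generator directly fails, because generators are not subscriptable.
try:
    mapper(serializer_output())
    subscript_failed = False
except TypeError:
    subscript_failed = True

# Converting to a list first restores positional access.
first = mapper(list(serializer_output()))
```

The conversion consumes the whole iterator, so it gives up laziness for that call; that trade-off is acceptable only where the mapper needs all columns anyway.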

Contributor Author

Now that #53239 has been merged, we no longer need to change the mappers for SQL_GROUPED_AGG_PANDAS_UDF and SQL_WINDOW_AGG_PANDAS_UDF.

@Yicong-Huang Yicong-Huang changed the title [WIP][SPARK-54316][CORE][PYTHON][SQL] Consolidate GroupPandasIterUDFSerializer with GroupPandasUDFSerializer [SPARK-54316][CORE][PYTHON][SQL] Consolidate GroupPandasIterUDFSerializer with GroupPandasUDFSerializer Dec 3, 2025
@zhengruifeng
Contributor

merged to master
