[SPARK-54316][CORE][PYTHON][SQL] Consolidate `GroupPandasIterUDFSerializer` with `GroupPandasUDFSerializer` by Yicong-Huang · Pull Request #53043 · apache/spark

Yicong-Huang · 2025-11-13T19:57:56Z

What changes were proposed in this pull request?

This PR consolidates GroupPandasIterUDFSerializer into GroupPandasUDFSerializer so that one serializer supports both SQL_GROUPED_MAP_PANDAS_UDF and SQL_GROUPED_MAP_PANDAS_ITER_UDF, aligning with the design pattern used by GroupArrowUDFSerializer.

Why are the changes needed?

When the Iterator[pandas.DataFrame] API was added to groupBy().applyInPandas() in SPARK-53614 (#52716), a new GroupPandasIterUDFSerializer class was created. However, this class is nearly identical to GroupPandasUDFSerializer, differing only in whether batches are processed lazily (iterator mode) or all at once (regular mode).
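
To illustrate the lazy versus all-at-once distinction, here is a minimal sketch in plain Python. It is not the PySpark implementation: `load_group`, `produce_batches`, and the `iterator_mode` flag are hypothetical names, and plain lists stand in for Arrow batches and pandas DataFrames.

```python
from typing import Iterable, Iterator, List

Batch = List[int]  # stand-in for one Arrow record batch / pandas.DataFrame


def load_group(batches: Iterable[Batch], iterator_mode: bool):
    """Hypothetical helper: prepare one group's data for the UDF."""
    if iterator_mode:
        # Iterator mode: hand the batches over lazily, one at a time.
        return iter(batches)
    # Regular mode: read every batch and concatenate into a single object.
    merged: Batch = []
    for batch in batches:
        merged.extend(batch)
    return merged


def produce_batches(log: List[str]) -> Iterator[Batch]:
    """Generator that records when each batch is actually read."""
    for i in range(3):
        log.append(f"read batch {i}")
        yield [i * 10, i * 10 + 1]


# Regular mode consumes all batches up front.
regular_log: List[str] = []
regular_input = load_group(produce_batches(regular_log), iterator_mode=False)

# Iterator mode reads nothing until the consumer asks for a batch.
lazy_log: List[str] = []
lazy_input = load_group(produce_batches(lazy_log), iterator_mode=True)
reads_before_consumption = len(lazy_log)
first_batch = next(lazy_input)
```

Because the two paths differ only in that final step, a single function with a mode flag can cover both, which is the general shape of the consolidation proposed here.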

Does this PR introduce any user-facing change?

No, this is an internal refactoring that maintains backward compatibility. The API behavior remains the same from the user's perspective.

How was this patch tested?

Existing test cases.

Was this patch authored or co-authored using generative AI tooling?

Co-Generated-by: Cursor with Claude 4.5 Sonnet

python/pyspark/sql/pandas/serializers.py

zhengruifeng · 2025-11-25T02:56:09Z

python/pyspark/sql/pandas/serializers.py

+                batches_gen,
+                arrow_type,
+            ) in iterator:  # tuple constructed in wrap_grouped_*_pandas_udf
+                # yields df for single UDF or [(df1, type1), (df2, type2), ...] for multiple UDFs


Is this serializer dedicated to SQL_GROUPED_MAP_PANDAS_UDF and SQL_GROUPED_MAP_PANDAS_ITER_UDF?
I don't think they support multiple UDFs.

OK, I mistakenly assumed they needed to support multiple UDFs, which made the implementation more complex than necessary. I have removed that assumption and simplified the code.

zhengruifeng · 2025-11-25T02:57:05Z

python/pyspark/worker.py

            return f(keys, value_series_gen)

+    elif eval_type in (
+        PythonEvalType.SQL_GROUPED_AGG_PANDAS_UDF,


Can we exclude the changes for SQL_GROUPED_AGG_PANDAS_UDF and SQL_WINDOW_AGG_PANDAS_UDF to make the PR cleaner?
We can do that in a separate PR.

zhengruifeng · 2025-11-26T03:49:36Z

python/pyspark/worker.py


+    elif eval_type in (
+        PythonEvalType.SQL_GROUPED_AGG_PANDAS_UDF,
+        PythonEvalType.SQL_WINDOW_AGG_PANDAS_UDF,


Why do we need a new mapper for SQL_GROUPED_AGG_PANDAS_UDF and SQL_WINDOW_AGG_PANDAS_UDF?

GroupPandasUDFSerializer now returns an iterator. The mappers for SQL_GROUPED_AGG_PANDAS_UDF and SQL_WINDOW_AGG_PANDAS_UDF expect a list (so they can use a[0] to access the column), so the iterator returned from GroupPandasUDFSerializer has to be converted to a list in those mappers.

Now that #53239 has been merged, we no longer need to change the mappers for SQL_GROUPED_AGG_PANDAS_UDF and SQL_WINDOW_AGG_PANDAS_UDF.
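
The list-versus-iterator mismatch discussed in this thread can be reproduced in plain Python. This is a hedged sketch only: `series_iter` and `first_column_mean` are hypothetical stand-ins, not functions from `worker.py` or `serializers.py`.

```python
from typing import Iterator, List


def series_iter() -> Iterator[List[float]]:
    """Stand-in for a serializer that yields one column at a time."""
    yield [1.0, 2.0, 3.0]
    yield [4.0, 5.0, 6.0]


def first_column_mean(a) -> float:
    """Stand-in for a mapper that indexes its input with a[0]."""
    col = a[0]
    return sum(col) / len(col)


try:
    first_column_mean(series_iter())
    indexing_failed = False
except TypeError:
    # Generators do not support subscripting, so a[0] raises TypeError.
    indexing_failed = True

# Materializing the iterator into a list makes a[0] valid again.
result = first_column_mean(list(series_iter()))
```

Converting with `list(...)` restores indexing but gives up laziness, which is acceptable only where the consumer needs all columns at once anyway.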

zhengruifeng · 2025-12-04T01:16:29Z

merged to master

github-actions bot added SQL CORE PYTHON labels Nov 13, 2025

zhengruifeng reviewed Nov 14, 2025

Yicong-Huang requested a review from zhengruifeng November 14, 2025 22:11

Yicong-Huang force-pushed the SPARK-54316/refactor/consolidate-pandas-iter-serializer branch from 373426a to f8f7201 Compare November 19, 2025 01:02

Yicong-Huang added 8 commits November 19, 2025 11:17

refactor: merge GroupPandasIterUDFSerializer to GroupPandasUDFSerializer

d1e4f67

fix: format

6895207

fix: format

c6ad502

feat: redesign the wrapper and serializer

35ce001

fix: format

37bcda4

fix: handle comments

f216df3

chore: clean up

6410866

fix: change serializer input to list

0cc9432

Yicong-Huang force-pushed the SPARK-54316/refactor/consolidate-pandas-iter-serializer branch from f8f7201 to 0cc9432 Compare November 20, 2025 00:45

Yicong-Huang added 7 commits November 20, 2025 12:29

fix: align with GroupArrowUDFSerializer

b3dfd06

fix: format

c7317cf

fix: wrong indentation

5eab53c

fix: move order

3a45f53

fix: use two serializers

91fb21b

fix: format

89faad8

fix: format

ccaa9b4

Yicong-Huang changed the title ~~[SPARK-54316][PYTHON] Consolidate GroupPandasIterUDFSerializer with GroupPandasUDFSerializer~~ [SPARK-54316][PYTHON] Align Pandas-related serializers with Arrow-related serializers Nov 24, 2025

fix: format

cf96d3b

Yicong-Huang changed the title ~~[SPARK-54316][PYTHON] Align Pandas-related serializers with Arrow-related serializers~~ [SPARK-54316][CORE][PYTHON][SQL] Align Pandas-related serializers with Arrow-related serializers Nov 25, 2025

zhengruifeng requested changes Nov 25, 2025

Yicong-Huang added 3 commits November 25, 2025 15:20

fix: remove changes with multi UDF, SQL_GROUPED_MAP_PANDAS_UDF and SQ…

178fbce

…L_GROUPED_MAP_PANDAS_ITER_UDF

fix: remove ArrowStreamAggPandasUDFSerializer

57fe551

fix: remove mapper block

ff64faa

Yicong-Huang changed the title ~~[SPARK-54316][CORE][PYTHON][SQL] Align Pandas-related serializers with Arrow-related serializers~~ [SPARK-54316][CORE][PYTHON][SQL] Consolidate GroupPandasIterUDFSerializer with GroupPandasUDFSerializer Nov 26, 2025

Yicong-Huang changed the title ~~[SPARK-54316][CORE][PYTHON][SQL] Consolidate GroupPandasIterUDFSerializer with GroupPandasUDFSerializer~~ [WIP][SPARK-54316][CORE][PYTHON][SQL] Consolidate GroupPandasIterUDFSerializer with GroupPandasUDFSerializer Nov 26, 2025

Yicong-Huang added 2 commits November 25, 2025 17:18

fix: another implementation to seperate list and iterator expectation

ea9dac1

Merge branch 'apache:master' into SPARK-54316/refactor/consolidate-pa…

1bce586

…ndas-iter-serializer

zhengruifeng reviewed Nov 26, 2025

Yicong-Huang added 14 commits November 30, 2025 10:50

Merge branch 'master' into SPARK-54316/refactor/consolidate-pandas-it…

2614bc3

…er-serializer

fix: simplify GroupPandasUDFSerializer

8e4173d

fix: simplify serialzier use baseclass

ce9fe22

fix: simplify wrappers

edace86

fix: comments

484962a

fix: revert unrelated changes

11cdc2b

refactor: consolidate GroupPandasUDFSerializer to support both SQL_GR…

22890b1

…OUPED_MAP_PANDAS_UDF and SQL_GROUPED_MAP_PANDAS_ITER_UDF

fix: simplify

e89139f

fix: batch concatination and key type

81b3ab8

fix: format

86e71cf

fix: simplify and format

172ca80

refactor: consolidate GroupPandasUDFSerializer to support both SQL_GR…

86e73b8

…OUPED_MAP_PANDAS_UDF and SQL_GROUPED_MAP_PANDAS_ITER_UDF

Merge branch 'master' into SPARK-54316/refactor/consolidate-pandas-it…

435e916

…er-serializer # Conflicts: # python/pyspark/sql/pandas/serializers.py # python/pyspark/worker.py

fix: missing import

b33c44e

Yicong-Huang requested a review from zhengruifeng December 3, 2025 09:36

Yicong-Huang changed the title ~~[WIP][SPARK-54316][CORE][PYTHON][SQL] Consolidate GroupPandasIterUDFSerializer with GroupPandasUDFSerializer~~ [SPARK-54316][CORE][PYTHON][SQL] Consolidate GroupPandasIterUDFSerializer with GroupPandasUDFSerializer Dec 3, 2025

zhengruifeng approved these changes Dec 3, 2025

zhengruifeng closed this in 4d3e4d6 Dec 4, 2025

Yicong-Huang wants to merge 35 commits into apache:master from Yicong-Huang:SPARK-54316/refactor/consolidate-pandas-iter-serializer


2 participants
