Conversation

@Yicong-Huang Yicong-Huang commented Oct 23, 2025

What changes were proposed in this pull request?

This PR adds support for the Iterator[pandas.DataFrame] API in groupBy().applyInPandas(), enabling batch-by-batch processing of grouped data for improved memory efficiency and scalability.

Key Changes:

  1. New PythonEvalType: Added SQL_GROUPED_MAP_PANDAS_ITER_UDF (216) to distinguish iterator-based UDFs from standard grouped map UDFs

  2. Type Inference: Implemented automatic detection of iterator signatures (see the sketch after this list):

    • Iterator[pd.DataFrame] -> Iterator[pd.DataFrame]
    • Tuple[Any, ...], Iterator[pd.DataFrame] -> Iterator[pd.DataFrame]
  3. Streaming Serialization: Created GroupPandasIterUDFSerializer that streams results without materializing all DataFrames in memory

  4. Configuration Change: Updated FlatMapGroupsInPandasExec, which was hardcoding pythonEvalType = 201 instead of extracting it from the UDF expression (mirroring the fix in FlatMapGroupsInArrowExec)
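For illustration, here is a minimal sketch of how iterator-signature detection could work using typing.get_type_hints. The helper names and exact rules below are assumptions for exposition; the PR's actual entry point is pyspark.sql.pandas.typehints.infer_group_pandas_eval_type_from_func, whose logic may differ:

# Hypothetical sketch; not the PR's actual implementation.
import collections.abc
import typing
from typing import Any, Iterator, Tuple

import pandas as pd


def _is_df_iterator(hint: Any) -> bool:
    # True for an Iterator[pd.DataFrame] annotation.
    if typing.get_origin(hint) is not collections.abc.Iterator:
        return False
    args = typing.get_args(hint)
    return len(args) == 1 and args[0] is pd.DataFrame


def takes_dataframe_iterator(func) -> bool:
    # True when func is Iterator[pd.DataFrame] -> Iterator[pd.DataFrame],
    # optionally with a leading Tuple[Any, ...] grouping key.
    hints = typing.get_type_hints(func)
    ret = hints.pop("return", None)
    params = list(hints.values())
    if not _is_df_iterator(ret):
        return False
    if len(params) == 1:
        return _is_df_iterator(params[0])
    if len(params) == 2:
        return typing.get_origin(params[0]) is tuple and _is_df_iterator(params[1])
    return False


def f(batches: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
    yield from batches


def g(key: Tuple[Any, ...], batches: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
    yield from batches


assert takes_dataframe_iterator(f) and takes_dataframe_iterator(g)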

Why are the changes needed?

The existing applyInPandas() API loads entire groups into memory as single DataFrames. For large groups, this can cause OOM errors. The iterator API allows:

  • Memory Efficiency: Process data batch-by-batch instead of materializing entire groups
  • Scalability: Handle arbitrarily large groups that don't fit in memory
  • Consistency: Mirrors the existing applyInArrow() iterator API design

Does this PR introduce any user-facing changes?

Yes, this PR adds a new API variant for applyInPandas():

Before (existing API, still supported):

import pandas as pd

def normalize(pdf: pd.DataFrame) -> pd.DataFrame:
    return pdf.assign(v=(pdf.v - pdf.v.mean()) / pdf.v.std())

df.groupBy("id").applyInPandas(normalize, schema="id long, v double")

After (new iterator API):

from typing import Iterator

import pandas as pd

def normalize(batches: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
    # Process data batch-by-batch
    for batch in batches:
        yield batch.assign(v=(batch.v - batch.v.mean()) / batch.v.std())

df.groupBy("id").applyInPandas(normalize, schema="id long, v double")

With Grouping Keys:

from typing import Any, Iterator, Tuple

import pandas as pd

def sum_by_key(key: Tuple[Any, ...], batches: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
    total = 0
    for batch in batches:
        total += batch['v'].sum()
    yield pd.DataFrame({"id": [key[0]], "total": [total]})

df.groupBy("id").applyInPandas(sum_by_key, schema="id long, total double")

Backward Compatibility: The existing DataFrame-to-DataFrame API is fully preserved and continues to work without changes.

How was this patch tested?

  • Added test_apply_in_pandas_iterator_basic - Basic functionality test
  • Added test_apply_in_pandas_iterator_with_keys - Test with grouping keys
  • Added test_apply_in_pandas_iterator_batch_slicing - Pressure test with 10M rows, 20 columns
  • Added test_apply_in_pandas_iterator_with_keys_batch_slicing - Pressure test with keys

Was this patch authored or co-authored using generative AI tooling?

Yes, the tests were generated by Cursor.

@Yicong-Huang Yicong-Huang changed the title [WIP][SPARK-53614] Add applyInPandas [WIP][SPARK-53614] Add Iterator[pandas.DataFrame] support to applyInPandas Oct 27, 2025
@Yicong-Huang Yicong-Huang changed the title [WIP][SPARK-53614] Add Iterator[pandas.DataFrame] support to applyInPandas [SPARK-53614] Add Iterator[pandas.DataFrame] support to applyInPandas Oct 27, 2025
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.pandas.typehints import infer_group_pandas_eval_type_from_func
from pyspark.sql.pandas.functions import PythonEvalType
import warnings
Contributor

No need to re-import PythonEvalType and warnings here.


if dataframes_in_group == 1:
    # Read all Arrow batches for this group first (must read from stream synchronously)
    batches = list(ArrowStreamSerializer.load_stream(self, stream))
Contributor

I think we cannot load all batches here; the iterator API is designed to avoid loading all of a group's batches at once so that it can mitigate OOM.

you can refer to

batch_iter = process_group(ArrowStreamSerializer.load_stream(self, stream))
yield batch_iter
# Make sure the batches are fully iterated before getting the next group
for _ in batch_iter:
    pass
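Expanded into a full load_stream, the pattern above might look like the following. This is a sketch only: _read_group_start is an assumed helper for the group-framing protocol, not an actual PySpark API, and process_group here just converts Arrow batches to pandas lazily:

import pyarrow as pa

def load_stream(self, stream):
    # Sketch: yield one lazy pandas-batch iterator per group instead of
    # materializing list(...) per group, so a large group never sits in memory.

    def process_group(arrow_batches):
        # Convert each Arrow batch to pandas one at a time.
        for batch in arrow_batches:
            yield pa.Table.from_batches([batch]).to_pandas()

    while self._read_group_start(stream):  # assumed helper: False at end of input
        batch_iter = process_group(ArrowStreamSerializer.load_stream(self, stream))
        yield batch_iter
        # Drain whatever the UDF left unconsumed before starting the next group.
        for _ in batch_iter:
            pass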

Contributor

I think a better way is to update GroupPandasUDFSerializer to output the iterator,
and adjust the function wrapper of

            PythonEvalType.SQL_GROUPED_MAP_PANDAS_UDF,
            PythonEvalType.SQL_GROUPED_AGG_PANDAS_UDF,
            PythonEvalType.SQL_WINDOW_AGG_PANDAS_UDF,

but of course, we can start with a new serializer and deduplicate the code later.
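For illustration, one way the deduplication could eventually look: adapt the legacy whole-DataFrame UDF at the wrapper level so both shapes flow through one iterator-based serializer. A rough sketch, where wrap_grouped_map is a hypothetical name rather than Spark's actual wrapper:

from typing import Iterator

import pandas as pd


def wrap_grouped_map(func, takes_iterator):
    # Hypothetical adapter: normalize both UDF shapes to
    # Iterator[pd.DataFrame] -> Iterator[pd.DataFrame].
    if takes_iterator:
        return func

    def adapted(batches):
        # Legacy shape: materialize the group's batches and call func once.
        yield func(pd.concat(list(batches), ignore_index=True))

    return adapted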

    timezone, safecheck, _assign_cols_by_name, int_to_decimal_coercion_enabled
)
elif eval_type == PythonEvalType.SQL_GROUPED_MAP_PANDAS_ITER_UDF:
    from pyspark.sql.pandas.serializers import GroupPandasIterUDFSerializer
Contributor

let's put the import here

from pyspark.sql.pandas.serializers import (
    ArrowStreamPandasUDFSerializer,
    ArrowStreamPandasUDTFSerializer,
    GroupPandasUDFSerializer,
    GroupPandasIterUDFSerializer,
    GroupArrowUDFSerializer,
    CogroupArrowUDFSerializer,
    CogroupPandasUDFSerializer,
    ArrowStreamUDFSerializer,
    ApplyInPandasWithStateSerializer,
    TransformWithStateInPandasSerializer,
    TransformWithStateInPandasInitStateSerializer,
    TransformWithStateInPySparkRowSerializer,
    TransformWithStateInPySparkRowInitStateSerializer,
    ArrowStreamArrowUDFSerializer,
    ArrowStreamAggArrowUDFSerializer,
    ArrowBatchUDFSerializer,
    ArrowStreamUDTFSerializer,
    ArrowStreamArrowUDTFSerializer,
)
