[SPARK-40434][SS][PYTHON] Implement applyInPandasWithState in PySpark #37893
Conversation
sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala
Outdated
...rc/main/scala/org/apache/spark/sql/execution/python/FlatMapGroupsInPandasWithStateExec.scala
Outdated
Force-pushed from f4435a2 to 0000994
cc. @viirya @HyukjinKwon And please help find more reviewers. I'm not sure who would be available to review a PR touching both PySpark and Structured Streaming, but knowing one area and reviewing that area would still be helpful.

Will take a close look next Monday in KST.
PandasGroupedMapFunctionWithState = Callable[
    [Any, Iterable[DataFrameLike], GroupStateImpl], Iterable[DataFrameLike]
]
Can the type be GroupState without the 'Impl'? It looks bad in a public API.
Either we can split out the interface and the implementation, or just change the name. I'm fine with either direction.
cc. @HyukjinKwon What would be the best practice for such a case?
I am fine either way too. Users aren't able to create this instance directly anyway.
One concern is the case where we happen to have a different implementation of GroupState in the far future. But the type is dynamic anyway, so I don't worry too much.
Thanks, I just renamed GroupStateImpl to GroupState. Once we find it necessary, we can keep the same name as an interface and move the implementation out (I guess this is what @HyukjinKwon meant by the type being dynamic, but please let me know if I missed something).
will be input data -> state timeout. When the function is invoked for state timeout, there
will be no data being presented.
The function should takes parameters (key, Iterator[`pandas.DataFrame`], state) and
The function takes parameters ... and returns Iterator[...]
This follows the existing method doc in applyInPandas.
The "function" here refers to user function end users will provide, not a function Spark provides as public API, so using should here does not seem to be wrong. The mood is something like "you should construct an user function blabla...". "s" should be removed though.
For each group, all columns are passed together as `pandas.DataFrame` to the user-function,
and the returned `pandas.DataFrame` across all invocations are combined as a
:class:`DataFrame`. Note that the user function should loop through and process all
'Note that the user function should loop through and process all elements in the iterator. The user function should not make a guess of the number of elements in the iterator.'
- Why? This sounds like the user must process all iterator entries or otherwise something bad would happen. I would reword this to indicate that the grouped data could be split into multiple entries -
'Note that the group data may be split as multiple Iterator records and the user function should not assume that it receives a single record.'
I would still suggest we have a design discussion about splitting groups unnecessarily, as I believe we should not do this.
Note that the user function should loop through and process all elements in the iterator.
Why? This sounds like the user must process all iterator entries or otherwise something bad would happen. I would reword this to indicate that the grouped data could be split into multiple entries -
I agree this is too conservative, and we can remove that once we confirm there is technically no issue. I don't think we have ever had such a test even for the existing flatMapGroupsWithState, so we don't clearly know what happens if we pull only part of the data from a group.
The user function should not make a guess of the number of elements in the iterator.
I would still suggest we have a design discussion about splitting groups unnecessarily, as I believe we should not do this.
I think there is room for discussion on how to split a group, keeping in mind that we also bin-pack for performance, but I really doubt this has to be an interface contract. For the former, it's not a first-class concern and we shouldn't block this PR on it. For the latter, I really want to see the real use case which leverages such an interface contract, and how much harder it would be to implement the same thing if we do not guarantee that contract.
A stricter interface contract can be loosened without breaking anything, whereas a looser interface contract can never be made stricter without breaking compatibility. Why not go with the conservative option until we are very clear there is a concrete use case?
The `stateStructType` should be :class:`StructType` describing the schema of user-defined
state. The value of state will be presented as a tuple, as well as the update should be
performed with the tuple. User defined types e.g. native Python class types are not
supported. Alternatively, you can pickle the data and produce the data as BinaryType, but
'Alternatively, you can pickle the data ...' - instead say
'For such cases, the user should pickle the data into BinaryType. Note that this approach may be sensitive to backwards and forwards compatibility issues of Python pickle, and Spark cannot guarantee compatibility.'
though I think you could drop the note, as that is orthogonal to Spark.
Let's simply just remove the suggestion.
it is tied to the backward and forward compatibility of pickle in Python, and Spark itself
does not guarantee the compatibility.
The length of each element in both input and returned value, `pandas.DataFrame`, can be
'The size of each DataFrame in both the input and output ...'
'The number of DataFrames in both the input and output can also be arbitrary.'
schema if specified as strings, or match the field data types by position if not strings,
e.g. integer indices.
The `stateStructType` should be :class:`StructType` describing the schema of user-defined
... describing the schema of the user-defined state. The value of the state will be presented as a tuple and the update should be performed with a tuple.
The `stateStructType` should be :class:`StructType` describing the schema of user-defined
state. The value of state will be presented as a tuple, as well as the update should be
performed with the tuple. User defined types e.g. native Python class types are not
Non-StructType types, e.g. user-defined or native Python types, are not supported.
It's a bit tricky - native Python types include int, float, str, ... and of course those are supported. Probably the clearer definition is "Python types are supported as long as the default encoder can convert them to the Spark SQL type". I'm not sure we have clear documentation describing the compatibility matrix.
cc. @HyukjinKwon could you please help us make this clear?
I think we can just say that "the corresponding Python types for :class:DataType are supported". Documented here https://spark.apache.org/docs/latest/sql-ref-datatypes.html (click python tab)
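For illustration, a minimal sketch of what that mapping looks like in practice (the field names here are hypothetical, not from the PR): the state schema is an ordinary StructType, and the state value is read and updated as a tuple whose elements use the Python types that correspond to each field's DataType.

```python
# Illustrative only: a state schema using types whose default Python mapping is documented
# in the SQL data types reference (LongType <-> int, StringType <-> str).
from pyspark.sql.types import StructType, StructField, LongType, StringType

state_schema = StructType([
    StructField("count", LongType()),    # stored/read as a Python int
    StructField("label", StringType()),  # stored/read as a Python str
])

# Inside the user function, the state value is handled as a plain tuple in field order,
# e.g. state.update((3, "a")) -- matching the "presented as a tuple" wording above.
```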
func : function
    a Python native function to be called on every group. It should takes parameters
    (key, Iterator[`pandas.DataFrame`], state) and returns Iterator[`pandas.DataFrame`].
    Note that the type of key is tuple, and the type of state is
Note that the type of the key is tuple and the type of the state is ...
    :class:`pyspark.sql.streaming.state.GroupStateImpl`.
outputStructType : :class:`pyspark.sql.types.DataType` or str
    the type of the output records. The value can be either a
    :class:`pyspark.sql.types.DataType` object or a DDL-formatted type string.
can you provide an example here of the string?
In the doc or here? All other PySpark method docs do not have an example of this string.
Maybe we could have examples like other APIs do and provide a DDL-formatted type string there to compensate.
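For example (reusing the schemas from the doctest example elsewhere in this PR; purely illustrative), these are the two equivalent ways to specify such a type:

```python
from pyspark.sql.types import StructType, StructField, StringType, LongType

# As DataType objects
output_type = StructType([
    StructField("id", StringType()),
    StructField("countAsString", StringType()),
])
state_type = StructType([StructField("count", LongType())])

# As DDL-formatted type strings (equivalent)
output_type_ddl = "id STRING, countAsString STRING"
state_type_ddl = "count LONG"
```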
    :class:`pyspark.sql.types.DataType` object or a DDL-formatted type string.
stateStructType : :class:`pyspark.sql.types.DataType` or str
    the type of the user-defined state. The value can be either a
    :class:`pyspark.sql.types.DataType` object or a DDL-formatted type string.
same - can you provide an example of the string
same.
HyukjinKwon left a comment
Implementation-wise, looks pretty good
...lyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/UnsupportedOperationChecker.scala
Outdated
| "configured value.") | ||
| .version("3.4.0") | ||
| .bytesConf(ByteUnit.BYTE) | ||
| .createWithDefaultString("64MB") |
I think we should have a general configuration for this later that applies to all Arrow batches (SPARK-23258). I think we should reuse spark.sql.execution.arrow.maxRecordsPerBatch for the time being.
Batching has multiple purposes - here we do it for scalability, meaning it'd be closer to the purpose if we could batch by size rather than by the number of rows. I'm OK with changing the condition for cutting an Arrow batch to the number of rows, as it's configurable and users can adjust it to a smaller value if they hit memory issues in any way.
cc. @alex-balikov Does this make sense to you?
Ah, SPARK-23258 is about restricting an Arrow record batch by size, which seems similar to what we propose in this PR. It's still questionable whether we calculate the size on every addition of a row (accurate, but would be very bad for performance) or do sampling as we do here (cannot be accurate, and the error might be non-trivial with variable-length columns).
I agree that expressing the limit in terms of bytes is more meaningful than records. However, we would need to estimate the byte size efficiently. Specifically, here I would rename 'softLimitSizePerBatch' by removing 'soft' (we can clarify that in the comment) and also include 'Bytes': 'batchSizeLimitBytes'. I also wonder whether we should keep the property specific to applyInPandasWithState or make it general - i.e. remove the applyInPandasWithState scoping even if we do not support this limit elsewhere initially; it seems generally meaningful and we can follow up fixing the other places as a bug.
(Closing the loop) We decided to simply use the number of rows as the condition for constructing an Arrow RecordBatch. This removes all the new configs introduced here and also reduces a lot of complexity.
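A minimal sketch of what cutting batches purely by row count looks like, assuming a generic iterator of rows and a row limit such as spark.sql.execution.arrow.maxRecordsPerBatch (a conceptual illustration, not Spark's actual code):

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")

def chunk_by_rows(rows: Iterable[T], max_records_per_batch: int = 10000) -> Iterator[List[T]]:
    """Yield chunks of at most max_records_per_batch rows; each chunk would become one RecordBatch."""
    batch: List[T] = []
    for row in rows:
        batch.append(row)
        if len(batch) >= max_records_per_batch:
            yield batch
            batch = []
    if batch:
        yield batch
```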
| "complete the ArrowRecordBatch, which may hurt both throughput and latency.") | ||
| .version("3.4.0") | ||
| .timeConf(TimeUnit.MILLISECONDS) | ||
| .createWithDefaultString("100ms") |
For this, can we just leverage spark.sql.execution.pandas.udf.buffer.size (which the feature this PR adds already respects) if the flush time matters? That configuration exists for this purpose.
I'm not 100% clear on how spark.sql.execution.pandas.udf.buffer.size works. The current logic won't work if this config can split an Arrow record batch further into multiple batches, as we rely on the offset and the number of rows to split ranges of data out of the overall Arrow record batch. It relies on the fact that our logic has full control over constructing the Arrow record batch.
This config is about having two different conditions for closing the Arrow record batch: 1) size and 2) time spent batching.
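Conceptually (a sketch under assumed names, not the actual worker code), the two conditions look like this:

```python
import time

def should_flush(num_pending_rows: int, pending_bytes: int, opened_at: float,
                 max_bytes: int = 64 * 1024 * 1024, max_open_ms: int = 100) -> bool:
    """Close the pending Arrow batch when it exceeds a size limit OR has been open too long."""
    if num_pending_rows == 0:
        return False
    if pending_bytes >= max_bytes:                        # 1) size-based condition
        return True
    elapsed_ms = (time.monotonic() - opened_at) * 1000.0
    return elapsed_ms >= max_open_ms                      # 2) time-based condition
```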
...rc/main/scala/org/apache/spark/sql/execution/python/ApplyInPandasWithStatePythonRunner.scala
Outdated
...rc/main/scala/org/apache/spark/sql/execution/python/FlatMapGroupsInPandasWithStateExec.scala
…FlatMapGroupsInPandasWithStateExec.scala Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
…ysis/UnsupportedOperationChecker.scala Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
…/python/FlatMapGroupsInPandasWithStateExec.scala" This reverts commit b93e488.
)
# the number of columns of result have to match the return type
# but it is fine for result to have no columns at all if it is empty
if not (
if not ... ?
This is borrowed from the function above - I think we used `if not` here because it's more intuitive to think of the "valid" case and apply "not" to reverse it, rather than manually converting the conditions into their contrapositive.
ah, nevermind, I just misread the code.
python/pyspark/worker.py
Outdated
elif eval_type == PythonEvalType.SQL_GROUPED_MAP_PANDAS_UDF_WITH_STATE:
    soft_limit_bytes_per_batch = runner_conf.get(
        "spark.sql.execution.applyInPandasWithState.softLimitSizePerBatch",
        (64 * 1024 * 1024),
can the default value be defined in some more prominent place? Also the property names.
python/pyspark/worker.py
Outdated
    ser = CogroupUDFSerializer(timezone, safecheck, assign_cols_by_name)
elif eval_type == PythonEvalType.SQL_GROUPED_MAP_PANDAS_UDF_WITH_STATE:
    soft_limit_bytes_per_batch = runner_conf.get(
        "spark.sql.execution.applyInPandasWithState.softLimitSizePerBatch",
I do not think 'soft' is necessary in the parameter name. Leave that for the comment describing that this is a soft limit.
python/pyspark/worker.py
Outdated
soft_limit_bytes_per_batch = int(soft_limit_bytes_per_batch)

min_data_count_for_sample = runner_conf.get(
    "spark.sql.execution.applyInPandasWithState.minDataCountForSample", 100
similar comment about the property names and default values here and everywhere else - can they be defined in a more prominent place
python/pyspark/worker.py
Outdated
min_data_count_for_sample = int(min_data_count_for_sample)

soft_timeout_millis_purge_batch = runner_conf.get(
    "spark.sql.execution.applyInPandasWithState.softTimeoutPurgeBatch", 100
same
val aggsInQuery = collectStreamingAggregates(plan)

if (aggsInQuery.isEmpty) {
  // applyInPandasWithState without aggregation: operation's output mode must
Why do we even have an operation output mode? We are defining a new API - can we just drop this parameter from the API if we are going to enforce that it matches the query output mode?
Now I can imagine a case where the current requirement of providing a separate output mode prevents unintentional behavior:
- They implemented the user function for flatMapGroupsWithState with append mode.
- They ran the query with append mode.
- After that, they changed the output mode of the query to update mode for some reason.
- The user function was not changed to account for the switch to update mode.
As of now we do not allow such a query to run, but we would allow it if we dropped the parameter.
PS. I'm not a believer that end users can implement their user function correctly based on the output mode, but that is a fundamental API design issue of the original flatMapGroupsWithState, which is a separate matter.
I tried to add method-level docs as much as possible, except where I think they're unnecessary (I might still have missed some pieces). I don't go with the approach of trying to explain all of the parameters with types though, for these reasons:
In both languages, we strongly encourage having method docs and parameter explanations for public APIs. Here we technically add only one public method.
timeout configuration for groups that do not receive data for a while. valid values
are defined in :class:`pyspark.sql.streaming.state.GroupStateTimeout`.
# TODO: Examples
This is something I still need to do - let me come up with some examples. I guess we probably can't run automated tests from the example section though.
I just added a simple example - let me come up with full example code in the examples directory. I'll file a new JIRA ticket for this.
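Until that full example lands, here is a hedged sketch of what end-to-end usage might look like. The source/sink setup, the `df` streaming DataFrame, and the assumption that GroupState exposes exists/get/update as in the Scala API are illustrative, not taken verbatim from this PR.

```python
import pandas as pd
from pyspark.sql.streaming.state import GroupStateTimeout

def count_fn(key, pdf_iter, state):
    # Start from the previously stored count, if any (assumes Scala-like exists/get).
    total = state.get[0] if state.exists else 0
    for pdf in pdf_iter:
        total += len(pdf)
    state.update((total,))
    yield pd.DataFrame({"id": [key[0]], "countAsString": [str(total)]})

# `df` is assumed to be a streaming DataFrame with an "id" column.
query = (
    df.groupBy("id")
    .applyInPandasWithState(
        count_fn,
        outputStructType="id STRING, countAsString STRING",
        stateStructType="count LONG",
        outputMode="Update",
        timeoutConf=GroupStateTimeout.NoTimeout,
    )
    .writeStream.outputMode("update")
    .format("console")
    .start()
)
```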
 * @param eventTimeWatermark event time watermark for the current batch
 * @param child logical plan of the underlying data
 */
case class FlatMapGroupsInPandasWithStateExec(
I wonder if this can be merged with the regular FlatMapGroupsWithStateExec. Maybe as a followup cleanup.
We always have a separate exec implementation for Scala/Java vs. Python since the constructor parameters are different. (We leverage case classes for logical/physical plans, so a difference in constructor parameters warrants a new class.) So this is intentional. As a compromise, we did the refactor to have FlatMapGroupsWithStateExecBase as a base class.
val shouldWriteState = newGroupState.isUpdated || newGroupState.isRemoved ||
  hasTimeoutChanged

if (shouldWriteState) {
What happens if
newGroupState.isRemoved && newGroupState.getTimeoutTimestampMs.isPresent()
- basically, if the state was removed but there is still a timeout set? Will you keep the user state object around until the timeout fires?
basically, if the state was removed but there is still a timeout set? Will you keep the user state object around until the timeout fires?
I don't 100% understand the intention of the original codebase, but it seems so.
Here, removal of state means removal of the "value object" of the state. We don't allow users to set "null" as the value object, hence removing the state is the only way to clear the value object. Meanwhile, we still seem to allow setting a timeout on a state whose value object is undefined.
The status of the state would be the same as when you start with a new state and only set the timeout without setting the value object. Given that we allow this, the above case probably has to be allowed as well.
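For illustration, a sketch of the situation being discussed from the user-function side, assuming a processing-time timeout is configured and that the function/column names are illustrative:

```python
import pandas as pd

def clear_fn(key, pdf_iter, state):
    for pdf in pdf_iter:
        pass                          # consume the group's input
    state.remove()                    # clears only the value object of the state
    state.setTimeoutDuration(30000)   # a timeout may still be set on a state with no value
    yield pd.DataFrame({"id": [key[0]], "status": ["cleared"]})
```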
//
// ArrowStreamWriter supports only single VectorSchemaRoot, which means all Arrow RecordBatches
// being sent out from ArrowStreamWriter should have same schema. That said, we have to construct
// "an" Arrow schema to contain both types of data, and also construct Arrow RecordBatches to
to contain both data and state, and also construct ArrowBatches to contain both data and state.
}

object ApplyInPandasWithStateWriter {
  val STATE_METADATA_SCHEMA: StructType = StructType(
please comment on the semantics of each column. Specifically isLastChunk is not obvious but important for the operation of the protocol.
Done. I additionally explained why the state metadata includes the chunk metadata as well.
@HyukjinKwon @alex-balikov
state will be saved across invocations.
The function should take parameters (key, Iterator[`pandas.DataFrame`], state) and
returns another Iterator[`pandas.DataFrame`]. The grouping key(s) will be passed as a tuple
return another ...
3.A. Extract the data out from entire data via the information of data range.
3.B. Construct a new state instance if the state information is the first occurrence
     for the current grouping key.
3.C. Leverage existing new state instance if the state instance is already available
Leverage the existing state instance if it is already available for the current grouping key...
python/pyspark/worker.py
Outdated
# the number of columns of result have to match the return type
# but it is fine for result to have no columns at all if it is empty
if not (
    len(result.columns) == len(return_type) or len(result.columns) == 0 and result.empty
Maybe it is just me, but I would suggest adding parentheses so we do not rely on and/or precedence.
No, it's not just you. I planned to but forgot. Thanks for the pointer.
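For reference, the parenthesized form being suggested (same truth table as the original check, written as a small standalone helper here rather than the inline worker code):

```python
def result_shape_is_valid(result, return_type) -> bool:
    # Either the result has exactly the expected number of columns, or it is an
    # empty DataFrame with no columns at all.
    return (
        len(result.columns) == len(return_type)
        or (len(result.columns) == 0 and result.empty)
    )
```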
https://github.com/HeartSaVioR/spark/actions/runs/3098349789/jobs/5019498380
None of the test failures are related to the change in this PR. Since we updated the PR again via dd7a655, let's see the new build.
HyukjinKwon left a comment
LGTM otherwise
For a streaming Dataset, the function will be invoked first for all input groups and then
for all timed out states where the input data is set to be empty. Updates to each group's
state will be saved across invocations.
Suggested change:
- For a streaming Dataset, the function will be invoked first for all input groups and then
- for all timed out states where the input data is set to be empty. Updates to each group's
- state will be saved across invocations.
+ For a streaming :class:`DataFrame`, the function will be invoked first for all input groups
+ and then for all timed out states where the input data is set to be empty. Updates to
+ each group's state will be saved across invocations.
user-defined state. The value of the state will be presented as a tuple, as well as the
update should be performed with the tuple. The corresponding Python types for
:class:DataType are supported. Please refer to the page
https://spark.apache.org/docs/latest/sql-ref-datatypes.html (python tab).
Suggested change:
- https://spark.apache.org/docs/latest/sql-ref-datatypes.html (python tab).
+ https://spark.apache.org/docs/latest/sql-ref-datatypes.html (Python tab).
The size of each DataFrame in both the input and output can be arbitrary. The number of
DataFrames in both the input and output can also be arbitrary.
Suggested change:
- The size of each DataFrame in both the input and output can be arbitrary. The number of
- DataFrames in both the input and output can also be arbitrary.
+ The size of each `pandas.DataFrame` in both the input and output can be arbitrary.
+ The number of DataFrames in both the input and output can also be arbitrary.
I think we can extract some notes from the description to Notes section. But no biggie.
... for pdf in pdf_iter:
...     total_len += len(pdf)
... state.update((total_len,))
... yield pd.DataFrame({"id": [key[0]], "countAsString": [str(total_len)]})
Suggested change:
  ... yield pd.DataFrame({"id": [key[0]], "countAsString": [str(total_len)]})
+ ...
count += 1
}

def sizeInBytes(): Int = {
I think we don't need sizeInBytes and getSizeInBytes anymore
My comments are just nits. I will merge this in first to move forward. Merged to master.
Thanks @HyukjinKwon and @alex-balikov for the thoughtful reviews and the merge!
…in PySpark

### What changes were proposed in this pull request?
This PR adds the test suites for #37893, applyInPandasWithState. The new test suite mostly ports E2E test cases from the existing [flatMapGroupsWithState](https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/streaming/FlatMapGroupsWithStateSuite.scala) suite.

### Why are the changes needed?
Tests were intentionally left out of #37893 to reduce the size of the change, and this PR fills the gap.

### Does this PR introduce _any_ user-facing change?
No, test only.

### How was this patch tested?
New test suites.

Closes #37894 from HeartSaVioR/SPARK-40435-on-top-of-SPARK-40434-SPARK-40433-SPARK-40432.

Lead-authored-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
Co-authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
### What changes were proposed in this pull request?
This PR addresses the review comments from the last round of review from HyukjinKwon in #37893.

### Why are the changes needed?
Better documentation and removing unnecessary code.

### Does this PR introduce _any_ user-facing change?
Slight documentation change.

### How was this patch tested?
N/A

Closes #37964 from HeartSaVioR/SPARK-40434-followup.

Authored-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
What changes were proposed in this pull request?
This PR proposes to introduce a new API, applyInPandasWithState, in PySpark, which provides the functionality to perform arbitrary stateful processing in Structured Streaming. This will be a pair API with applyInPandas: applyInPandas in PySpark covers the use case of flatMapGroups in the Scala/Java API, and applyInPandasWithState in PySpark covers the use case of flatMapGroupsWithState in the Scala/Java API.
The signature of the API follows:
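The code block for the signature did not survive here; reconstructed from the parameters discussed in the review above, the shape of the API is roughly as follows (a sketch, not a verbatim copy of the diff):

```python
def applyInPandasWithState(
    self,
    func,              # (key, Iterator[pandas.DataFrame], state) -> Iterator[pandas.DataFrame]
    outputStructType,  # pyspark.sql.types.DataType or DDL-formatted type string
    stateStructType,   # pyspark.sql.types.DataType or DDL-formatted type string
    outputMode,        # output mode of the operation
    timeoutConf,       # a value defined in pyspark.sql.streaming.state.GroupStateTimeout
) -> "DataFrame":
    ...
```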
and the signature of the user function follows:
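Likewise, a sketch of the user function shape (the original code block is not reproduced here):

```python
from typing import Any, Iterator, Tuple

import pandas as pd
from pyspark.sql.streaming.state import GroupState

def user_func(
    key: Tuple[Any, ...],            # grouping key, passed as a tuple
    pdfs: Iterator[pd.DataFrame],    # input data for the group, possibly split into chunks
    state: GroupState,               # per-group state handle
) -> Iterator[pd.DataFrame]:
    ...
```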
(Please refer to the code diff for the function doc of the new function.)
Major design choices which differ from existing APIs:
This is based on the nature of the Python language - it is duck-typed, and type definitions are just hints. We don't have a typed API implementation for PySpark DataFrame.
This leads us to design the API to be untyped, meaning all types for (input, state, output) should be Row-compatible. While we don't require end users to deal with Row directly, the model they use for state and output must be convertible to Row with the default encoder. If they want a Python type for state which is not Row-compatible (e.g. a custom class), they need to pickle it and use BinaryType to store it. This requires end users to specify the types of state and output via Spark SQL schemas in the method.
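A sketch of that pickle-and-BinaryType route, assuming a hypothetical custom class (and noting, as the review above does, that this ties the state to pickle's compatibility across Python versions):

```python
import pickle

class MyCustomState:                       # hypothetical non-Row-compatible Python class
    def __init__(self, counts=None):
        self.counts = counts or {}

state_schema_ddl = "blob BINARY"           # single BinaryType column holding the pickled object

def read_custom_state(state):
    # state.get returns the stored tuple; its single field holds the pickled bytes.
    return pickle.loads(state.get[0]) if state.exists else MyCustomState()

def write_custom_state(state, obj):
    state.update((pickle.dumps(obj),))     # update still takes a tuple, matching the schema
```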
Note that this helps ensure compatibility for state data across Spark versions, as long as the encoders for 1) Python type -> Python Row and 2) Python Row -> UnsafeRow are not changed. We won't change the underlying data layout of UnsafeRow, as that would break all existing stateful queries.
We decided to follow the user experience applyInPandas provides, for both consistency and performance (Arrow batching, vectorization, etc.). This leads us to design the user function to leverage pandas DataFrame rather than an iterator of rows. While this makes the UX inconsistent with the Scala/Java API, we don't think it will be a problem since pandas is considered the de-facto standard for Python data scientists.
There is a known limitation of applyInPandas: scalability. It basically requires the data for a specific group to fit into memory. During the design phase of the new API, we decided to address scalability rather than inherit the limitation.
To address scalability, we tweak the user function to receive an iterator (generator) of pandas DataFrames instead of a single pandas DataFrame, and also to return an iterator (generator) of pandas DataFrames. We think this does not hurt the UX too much, as for-each and yield are enough to deal with the iterator.
From the implementation perspective, we split the data for a specific group into multiple chunks, where each chunk is stored and sent as "an" Arrow RecordBatch and then finally materialized as "a" pandas DataFrame. This way, as long as end users don't materialize many pandas DataFrames from the iterator at the same time, only one chunk is materialized in memory, which is scalable. Similar logic applies to the output of the user function, so it is scalable as well.
Given that the API is mainly used for streaming workloads, it is quite likely that the volume of data for a specific group is not large enough to leverage the benefit of Arrow columnar batching, which would hurt performance. To address this, we also do the opposite of what we do for scalability: bin-packing. That is, an Arrow RecordBatch can contain data for multiple groups, as well as part of the data for a specific group. This addresses both concerns together, scalability and performance.
Note that we are not implementing all of the features the Scala/Java API provides in this initial phase; e.g. support for batch queries and support for initial state are left as TODOs.
Why are the changes needed?
PySpark users don't have a way to perform arbitrary stateful processing in Structured Streaming and are forced to use either Java or Scala, which is unacceptable for users in many cases. This PR enables PySpark users to do this without moving to the Java/Scala world.
Does this PR introduce any user-facing change?
Yes. We are exposing a new public API in PySpark which performs arbitrary stateful processing.
How was this patch tested?
N/A. We will make sure test suites are constructed in an E2E manner under SPARK-40431 - #37894.