Conversation

@bogao007 (Contributor) commented Jun 27, 2024:

What changes were proposed in this pull request?

  • Base implementation for Python State V2
  • Implemented ValueState

Below we specifically highlight some key files/components for this change:

  • Python
    • group_ops.py: defines the transformWithStateInPandas function and its UDF.
    • serializer.py: defines how we load and dump Arrow streams of data rows between the JVM and the Python process.
    • stateful_processor.py: defines the StatefulProcessorHandle, ValueState functionality, and the StatefulProcessor interface.
    • state_api_client.py and value_state_client.py: contain the logic to send API requests in protobuf format to the server (JVM).
  • Scala
    • TransformWithStateInPandasExec: physical operator for TransformWithStateInPandas.
    • TransformWithStateInPandasPythonRunner: Python runner that launches the Python worker which executes the UDF.
    • TransformWithStateInPandasStateServer: class that handles state requests in protobuf format from the Python side.

Why are the changes needed?

Support the Python State V2 API.

Does this PR introduce any user-facing change?

Yes

How was this patch tested?

Added unit tests.
Ran a local integration test with the command below:

import pandas as pd
from pyspark.sql import Row
from pyspark.sql.streaming import StatefulProcessor, StatefulProcessorHandle
from pyspark.sql.types import StructType, StructField, LongType, StringType
from typing import Iterator
spark.conf.set("spark.sql.streaming.stateStore.providerClass","org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider")
spark.conf.set("spark.sql.shuffle.partitions","1")
output_schema = StructType([
    StructField("value", LongType(), True)
])
state_schema = StructType([
    StructField("id", LongType(), True),
    StructField("value", StringType(), True),
    StructField("comment", StringType(), True)
])

class SimpleStatefulProcessor(StatefulProcessor):
  def init(self, handle: StatefulProcessorHandle) -> None:
    self.value_state = handle.getValueState("testValueState", state_schema)
  def handleInputRows(self, key, rows) -> Iterator[pd.DataFrame]:
    self.value_state.update((1,"test_value","comment"))
    exists = self.value_state.exists()
    print(f"value state exists: {exists}")
    value = self.value_state.get()
    print(f"get value: {value}")
    print("clearing value state")
    self.value_state.clear()
    print("value state cleared")
    return rows
  def close(self) -> None:
    pass

q = (
    spark.readStream.format("rate")
    .option("rowsPerSecond", "1")
    .option("numPartitions", "1")
    .load()
    .groupBy("value")
    .transformWithStateInPandas(
        stateful_processor=SimpleStatefulProcessor(),
        outputStructType=output_schema,
        outputMode="Update",
        timeMode="None",
    )
    .writeStream.format("console")
    .option("checkpointLocation", "/tmp/streaming/temp_ckp")
    .outputMode("update")
    .start()
)

Verified from the logs that the value state methods work as expected for key 11:

value state exists: True
get value:    id       value  comment
0   1  test_value  comment
clearing value state
value state cleared

Was this patch authored or co-authored using generative AI tooling?

No

@HyukjinKwon (Member):

Mind filing a JIRA?

@bogao007 (Contributor, author):

> Mind filing a JIRA?

Yeah, will do, thanks!

@bogao007 bogao007 changed the title State V2 base implementation and ValueState support [SPARK-48755] State V2 base implementation and ValueState support Jun 28, 2024
@sahnib (Contributor) left a comment:

Thanks for making these changes. Reviewed the Python bits, still reviewing Scala bits.

@sahnib (Contributor) left a comment:

Thanks for making the changes. Left some comments after the second pass.

Comment on lines 435 to 445
>>> class SimpleStatefulProcessor(StatefulProcessor):
...     def init(self, handle: StatefulProcessorHandle) -> None:
...         self.value_state = handle.getValueState("testValueState", state_schema)
...     def handleInputRows(self, key, rows) -> Iterator[pd.DataFrame]:
...         self.value_state.update("test_value")
...         exists = self.value_state.exists()
...         value = self.value_state.get()
...         self.value_state.clear()
...         return rows
...     def close(self) -> None:
...         pass

[nit] It might be more useful to provide a running count example, where we store values above a specified threshold in the state (to keep track of violations). [something like processing temperature sensor values in a stream]
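For reference, a minimal sketch of the kind of example suggested here, assuming the getValueState/exists/get/update API from this PR and that the state value reads back as a Row; the class name, threshold, and column names are illustrative only:

```python
import pandas as pd
from typing import Iterator
from pyspark.sql.streaming import StatefulProcessor, StatefulProcessorHandle
from pyspark.sql.types import StructType, StructField, IntegerType

class TemperatureMonitorProcessor(StatefulProcessor):
    THRESHOLD = 100  # hypothetical violation threshold

    def init(self, handle: StatefulProcessorHandle) -> None:
        state_schema = StructType([StructField("violations", IntegerType(), True)])
        self.num_violations_state = handle.getValueState("numViolations", state_schema)

    def handleInputRows(self, key, rows) -> Iterator[pd.DataFrame]:
        # Read back the running count; the state value is a Row, so index it.
        existing = self.num_violations_state.get()[0] if self.num_violations_state.exists() else 0
        new_violations = 0
        for pdf in rows:
            # Count readings in this batch that exceed the threshold.
            new_violations += int((pdf["temperature"] > self.THRESHOLD).sum())
        total = existing + new_violations
        self.num_violations_state.update((total,))
        yield pd.DataFrame({"id": key, "count": [total]})

    def close(self) -> None:
        pass
```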

Comment on lines 1169 to 1171
In addition, this function further groups the return of `gen_data_and_state` by the state
instance (same semantic as grouping by grouping key) and produces an iterator of data
chunks for each group, so that the caller can lazily materialize the data chunk.

It seems like this documentation is referring to the ApplyInPandasWithState serializer, which transfers both state and data.


def generate_data_batches(batches):
    for batch in batches:
        data_pandas = [self.arrow_to_pandas(c) for c in pa.Table.from_batches([batch]).itercolumns()]

Not sure if this is a common pattern in Python, but this line is a little hard to read.
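One possible way to unpack the flagged one-liner without changing behavior (a sketch assuming pyarrow is imported as pa and self.arrow_to_pandas is in scope, as in the surrounding serializer):

```python
def generate_data_batches(batches):
    for batch in batches:
        # Wrap the single record batch in a table so its columns can be walked.
        table = pa.Table.from_batches([batch])
        data_pandas = [self.arrow_to_pandas(column) for column in table.itercolumns()]
```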


self.assertEqual(q.name, "this_query")
self.assertTrue(q.isActive)
q.processAllAvailable()
Should we include q.awaitTermination()?

+1 Shall we ensure the query to be stopped instead of relying on other test to stop leaking query?
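One way to guarantee cleanup regardless of assertion failures; a sketch assuming q is the query started earlier in the test:

```python
try:
    self.assertEqual(q.name, "this_query")
    self.assertTrue(q.isActive)
    q.processAllAvailable()
    q.awaitTermination(10)  # wait up to 10 seconds, as suggested above
finally:
    q.stop()  # always stop, so the query does not leak into other tests
```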


package pyspark.sql.streaming;

message StateRequest {

Is it possible to add some high-level comments here or in some other Python file?

@HyukjinKwon (Member) left a comment:

Looks fine at high level

@bogao007 (Contributor, author):

> Looks fine at high level

Thanks @HyukjinKwon! I addressed your comments, could you help take another look?

@HyukjinKwon (Member):

I defer to @HeartSaVioR. I don't have any high-level concerns.

@HeartSaVioR HeartSaVioR changed the title [SPARK-48755][SS] State V2 base implementation and ValueState support [SPARK-48755][SS][PYTHON] transformWithState pyspark base implementation and ValueState support Aug 12, 2024
@HeartSaVioR (Contributor):

Let's call this out as transformWithState explicitly, now that we have finalized the name of the API.

@HeartSaVioR (Contributor) left a comment:

9/31 files reviewed (probably several of the remaining files are auto-generated) - will continue tomorrow. Please leave a file-level comment for auto-generated files.

newChild: LogicalPlan): FlatMapGroupsInPandasWithState = copy(child = newChild)
}

object TransformWithStateInPandas {

Any reason we can't just use the generated constructor of the case class? The params here are exactly the same as the constructor params in the case class.

@bogao007 (Contributor, author):

Yeah, removed it since it's redundant, thanks for catching this!

}
}

case class TransformWithStateInPandas(

nit: shall we add a short description as class doc while we are here?

@bogao007 (Contributor, author):

done


val outputIterator = executePython(data, output, runner)

CompletionIterator[InternalRow, Iterator[InternalRow]](outputIterator, {

Where do we count numOutputRows in this node?

@HeartSaVioR (Contributor) left a comment:

27/31 files reviewed - I'll continue with the remaining 4 files, hopefully by today (or early tomorrow).

private val sqlConf = SQLConf.get
private val arrowMaxRecordsPerBatch = sqlConf.arrowMaxRecordsPerBatch

private var stateSocketSocketPort: Int = 0

nit: Probably one of Socket should be Server?

@bogao007 (Contributor, author):

good catch!

* This class is used to handle the state requests from the Python side. It runs on a separate
* thread spawned by TransformWithStateInPandasStateRunner per task. It opens a dedicated socket
* to process/transfer state related info which is shut down when task finishes or there's an error
* on opening the socket. It run It processes following state requests and return responses to the

nit: It run It processes?

* - Requests for managing state variables (e.g. valueState).
*/
class TransformWithStateInPandasStateServer(
private val stateServerSocket: ServerSocket,

nit: in many cases, having a private val as a constructor param in a class is redundant.

private val valueStateMapForTest: mutable.HashMap[String, ValueState[Row]] = null)
extends Runnable with Logging {
private var inputStream: DataInputStream = _
private var outputStream: DataOutputStream = outputStreamForTest

outputStreamForTest <= is this really used? We always assign the output stream from run()


OK so we do not call run() when testing...

elif eval_type == PythonEvalType.SQL_GROUPED_MAP_PANDAS_UDF_WITH_STATE:
return args_offsets, wrap_grouped_map_pandas_udf_with_state(func, return_type)
elif eval_type == PythonEvalType.SQL_TRANSFORM_WITH_STATE_PANDAS_UDF:
argspec = inspect.getfullargspec(chained_func) # signature was lost when wrapping it

doesn't seem to be used anywhere, blindly copied?

...         count = 0
...         exists = self.num_violations_state.exists()
...         if exists:
...             existing_violations_pdf = self.num_violations_state.get()

What's the expectation of the type of this state "value"? From the variable name pdf and also the way we get the number, I suspect this to be a pandas DataFrame, while the right type should be Row.
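A small illustration of the distinction being drawn, assuming the state value comes back as a Row (the field name is hypothetical):

```python
row = self.num_violations_state.get()  # e.g. Row(violations=3), not a pandas DataFrame
existing_violations = row[0] if row is not None else 0
# A Row is indexed by position or field name (row.violations); pandas-style
# accessors like .count().get(...) would only apply to a DataFrame.
```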

...             new_violations += violations_pdf.count().get('temperature')
...         updated_violations = new_violations + existing_violations
...         self.num_violations_state.update((updated_violations,))
...         yield pd.DataFrame({'id': key, 'count': count})

I guess the intent is to produce the number of violations instead of the number of inputs, but this code doesn't follow that explanation.

+---+-----+
| id|count|
+---+-----+
| 0| 2|

Isn't the desired output (0, 1), (1, 1)?


def dump_stream(self, iterator, stream):
    """
    Read through an iterator of (iterator of pandas DataFram), serialize them to Arrow

nit: DataFrame

@HeartSaVioR (Contributor) left a comment:

First pass.


class ValueState:
    """
    Class used for arbitrary stateful operations with the v2 API to capture single value state.

We should not refer to transformWithState as the v2 API, as only a few people would know what v2 is. Please call it by its name.

"""
return self._value_state_client.exists(self._state_name)

def get(self) -> Any:

Again, we expect a Row as the state value, not a pandas DataFrame. Please let me know if you are proposing pandas DataFrame as a better fit for more state types.


class StatefulProcessorHandle:
    """
    Represents the operation handle provided to the stateful processor used in the arbitrary state

nit: transformWithState

def remove_implicit_key(self) -> None:
    import pyspark.sql.streaming.StateMessage_pb2 as stateMessage

    print("calling remove_implicit_key on python side")

debugging purpose, or intentionally left for future debug context?

if status == 0:
    self.handle_state = state
else:
    raise PySparkRuntimeError(f"Error setting handle state: " f"{response_message[1]}")

I see we just map all errors here to PySparkRuntimeError with an error message (no classification) - shall we revisit the Scala codebase and ensure we give the same error class for the same error?


Also, there are internal requests vs. user-side requests. For example, I don't expect users to call set_implicit_key by themselves (so errors from it are internal errors), but I do expect users to call get_value_state (so errors could be either user-facing or internal). The classification of error classes has to be different for these cases.
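A hedged sketch of what such classification could look like, mirroring the errorClass keyword that appears in a later hunk of this diff; the error-class name here is hypothetical:

```python
from pyspark.errors import PySparkRuntimeError

# Internal error: users are not expected to call set_implicit_key themselves.
raise PySparkRuntimeError(
    errorClass="INTERNAL_STATE_SERVER_ERROR",  # hypothetical internal error class
    messageParameters={"msg": response_message[1]},
)
```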

df = self.spark.readStream.format("text").option("maxFilesPerTrigger", 1).load(input_path)
df_split = df.withColumn("split_values", split(df["value"], ","))
df_split = df_split.select(
    df_split.split_values.getItem(0).alias("id"),

Would adding cast here instead of having withColumn in L84 work?
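A sketch of the suggestion, assuming the later withColumn only exists to cast the id column; the target type and second column are assumptions:

```python
df_split = df.withColumn("split_values", split(df["value"], ","))
df_final = df_split.select(
    # Cast while selecting instead of a separate withColumn(...cast...) later.
    df_split.split_values.getItem(0).cast("int").alias("id"),
    df_split.split_values.getItem(1).alias("temperature"),  # hypothetical second field
)
```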




self._test_transform_with_state_in_pandas_basic(SimpleStatefulProcessor(), check_results)

def test_transform_with_state_in_pandas_sad_cases(self):

nit: shall we be a bit more explicit about what the bad case is? The method name is the test name.

)

def test_transform_with_state_in_pandas_query_restarts(self):
input_path = tempfile.mkdtemp()

Since we are using three different sub-directories, shall we call this out as root_path and create a subdirectory input explicitly?

existing_violations = 0
for pdf in rows:
    pdf_count = pdf.count()
    count += pdf_count.get("temperature")

Same as in the API doc example - any reason we count the inputs and count the number of violations separately?

@HeartSaVioR (Contributor) left a comment:

Only minor comments, which could also be deferred to TODO JIRA ticket(s).

if (valueStates(stateName)._1.exists()) {
  sendResponse(0)
} else {
  sendResponse(1, s"state $stateName doesn't exist")

How do we distinguish the case of "no value state is defined for the state variable name" vs "the value state is defined but does not have a value yet" if we use the same status code?

  sendResponse(1, s"state $stateName doesn't exist")
}
val valueRow = PythonSQLUtils.toJVMRow(byteArray, valueStateTuple._2, valueStateTuple._3)
valueStates(stateName)._1.update(valueRow)

nit: valueStateTuple


private def sendResponse(status: Int, errorMessage: String = null): Unit = {
private def sendResponse(
status: Int,

nit: 2 more spaces (while we are here)

def get(self) -> Any:
import pandas as pd

def get(self) -> Row:

nit: Optional[Row]?

status = response_message[0]
if status != 0:
raise PySparkRuntimeError(f"Error initializing value state: " f"{response_message[1]}")
raise PySparkRuntimeError(

Shall we give a better error class, as it's a user-facing error? You can revert this and file a JIRA ticket for it as well to defer the change.


I'd expect a dedicated error class here; if the Scala version of the implementation uses an error class, then use the same one, otherwise define a new one.

return True
elif status == 1:
# server returns 1 if the state does not exist
elif status == 1 and "doesn't exist" in response_message[1]:

I'd recommend using a different status code instead of parsing. Please consider changes relying on string matching/hardcoding to be unacceptable except for specific needs.
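A sketch of the recommendation: branch on a dedicated status code rather than substring-matching the message (the code value is an assumption, and PySparkRuntimeError is imported as elsewhere in this file):

```python
STATUS_OK = 0
STATUS_STATE_DOES_NOT_EXIST = 2  # hypothetical dedicated code sent by the server

if status == STATUS_OK:
    return True
elif status == STATUS_STATE_DOES_NOT_EXIST:
    return False
else:
    raise PySparkRuntimeError(f"Error checking value state exists: {response_message[1]}")
```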

else:
raise PySparkRuntimeError(
f"Error checking value state exists: " f"{response_message[1]}"
errorClass="CALL_BEFORE_INITIALIZE",

ditto, explicitly define a dedicated error class

return row
else:
raise PySparkRuntimeError(f"Error getting value state: {response_message[1]}")
raise PySparkRuntimeError(

ditto, probably the same error class with above

status = response_message[0]
if status != 0:
raise PySparkRuntimeError(f"Error updating value state: " f"{response_message[1]}")
raise PySparkRuntimeError(

ditto, same error class as above

status = response_message[0]
if status != 0:
raise PySparkRuntimeError(f"Error clearing value state: " f"{response_message[1]}")
raise PySparkRuntimeError(

ditto, same error class as above

@HeartSaVioR (Contributor) left a comment:

+1

Please leave a comment listing all JIRA tickets for TODOs, for record/reference.

@HeartSaVioR (Contributor):

I'm going to merge, as we have TODO tickets and everything else looks OK.

Thanks! Merging to master.

@bogao007 (Contributor, author):

> +1
>
> Please leave a comment listing all JIRA tickets for TODOs, for record/reference.

Thanks a lot @HeartSaVioR! Here are the TODOs related to this PR:
https://issues.apache.org/jira/browse/SPARK-49233
https://issues.apache.org/jira/browse/SPARK-49100
https://issues.apache.org/jira/browse/SPARK-49212

LuciferYang added a commit that referenced this pull request Nov 1, 2024
… module

### What changes were proposed in this pull request?
This PR makes the following changes to the `maven-shade-plugin` rules for the `sql/core` module:

1. To avoid being influenced by the parent `pom.xml`, use `combine.self = "override"` in the `<configuration>` of the `maven-shade-plugin` for the `sql/core` module. Before this configuration was added, the relocation result was incorrect, and `protobuf-java` was not relocated. We can unzip the packaging result to confirm this issue.

We can use IntelliJ's "Show Effective POM" feature to view the result of this parameter; the result is equivalent to the effective POM printed in the log when the --debug flag is added during Maven compilation:

**Before**

<img width="828" alt="image" src="https://github.com/user-attachments/assets/0bce810f-57e9-4a50-9fa2-b6063e040a29">

We can see that an unexpected
```
<includes>
  <include>org.eclipse.jetty.**</include>
</includes>
```
 has been added to the relocation rule.

**After**

<img width="787" alt="image" src="https://github.com/user-attachments/assets/0fab3422-2da7-4b8f-bd7f-9357fcdc39c2">

We can see that the extra `<includes>` in the relocation rule is no longer present.

2. Before SPARK-48755 | #47133 overwrote the `maven-shade-plugin` rules for `sql/core`, it inherited the rules from the parent `pom.xml` and shaded `org.spark-project.spark:unused`. This behavior changed after SPARK-48755, so this PR restores it.

3. The relocation rules for Guava should be retained and follow the configuration in the parent `pom.xml`, which relocates `com.google.common` to `${spark.shade.packageName}.guava`. This PR restores this configuration.

4. For `protobuf-java`, which is under the `com.google.protobuf` package, the already shaded `protobuf-java` in the `core` module can be reused instead of shading it again in the `sql/core` module. Therefore, this PR only configures the corresponding relocation rule for it: `com.google.protobuf` -> `${spark.shade.packageName}.spark_core.protobuf`.

5. Regarding the `ServicesResourceTransformer` configuration: it is used to merge `META-INF/services` resources. This is not needed for Guava and `protobuf-java`, so this PR removes it.

### Why are the changes needed?
Fix the shade and relocation rules of the `sql/core` module.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
- Pass GitHub Actions
- Manually inspect the packaging result: Extract `spark-sql_2.13-4.0.0-SNAPSHOT.jar` to a separate directory, then execute `grep "org.sparkproject.guava" -R *` and `grep "org.sparkproject.spark_core.protobuf" -R *` to confirm the successful relocation.
- Maven test passed: https://github.com/LuciferYang/spark/runs/32278520082

<img width="960" alt="image" src="https://github.com/user-attachments/assets/5435b2ff-3785-4413-83d9-190c16c6ba75">

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #48675 from LuciferYang/sql-core-shade.

Lead-authored-by: yangjie01 <yangjie01@baidu.com>
Co-authored-by: YangJie <yangjie01@baidu.com>
Signed-off-by: yangjie01 <yangjie01@baidu.com>
HyukjinKwon pushed a commit that referenced this pull request Nov 14, 2024
…ithState`

### What changes were proposed in this pull request?

This is a follow-up for #47133 to add the missing API reference docs.

### Why are the changes needed?

Provide proper API ref doc for `transformWithState`

### Does this PR introduce _any_ user-facing change?

No API changes; only the user-facing API reference docs will now include the new API.

### How was this patch tested?

The existing doc build in CI should pass

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #48840 from itholic/SPARK-48755-followup.

Authored-by: Haejoon Lee <haejoon.lee@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>