[SPARK-49899][PYTHON][SS] Support deleteIfExists for TransformWithStateInPandas #48373

bogao007 · 2024-10-07T18:48:45Z

What changes were proposed in this pull request?

Support deleteIfExists for TransformWithStateInPandas.
Added close() support for StatefulProcessor.

Why are the changes needed?

Bring TransformWithStateInPandas to parity with the functionality already supported in TransformWithState.

Does this PR introduce any user-facing change?

Yes

How was this patch tested?

New unit test.

Was this patch authored or co-authored using generative AI tooling?

No

Pshak-20000 · 2024-10-24T11:50:38Z

Hi,

I want to contribute to the project and can help out. Please let me know what to do!

Thanks!

jingz-db · 2024-10-28T18:28:42Z

python/pyspark/sql/pandas/group_ops.py

+            try:
+                yield result
+            finally:
+                statefulProcessor.close()


Shall we set handle state to CLOSE here?

I just realized that this is actually called after processing each grouping key rather than after all keys for a microbatch have been processed. I'll need to revisit this to see if there's a good way to handle it (right now I cannot think of a good way to detect whether the current key is the last one to process). If it's not a quick fix, we can probably exclude it for now and fix it in a follow-up PR. cc @HeartSaVioR

Maybe we could try injecting a dummy row at the end of the iterator in writeNextInputToArrowStream indicating all the keys have been processed, but I'll need to do some experiments first.

I don't think the current interface gives you that information; we'll probably need another control message to send the signal from the JVM to Python (UDF). I agree this may take time, but we should probably mark it as a blocker so that we address it before the release.
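To make the concern above concrete, here is a minimal, Spark-free sketch. It assumes the UDF is a generator created once per grouping key, as the reviewed snippet suggests; `RecordingProcessor`, `per_key_udf`, and the two `run_batch_*` helpers are hypothetical names used only for illustration, not PySpark APIs.

```python
from typing import Dict, Iterator, List, Tuple


class RecordingProcessor:
    """Hypothetical stand-in for a stateful processor; it only counts close() calls."""

    def __init__(self) -> None:
        self.close_calls = 0

    def handle_rows(self, key: str, rows: List[int]) -> int:
        # Toy per-key computation.
        return sum(rows)

    def close(self) -> None:
        self.close_calls += 1


def per_key_udf(
    processor: RecordingProcessor, key: str, rows: List[int]
) -> Iterator[Tuple[str, int]]:
    # Same shape as the reviewed snippet: try/finally around the yield,
    # inside a generator that is created once per grouping key.
    try:
        yield key, processor.handle_rows(key, rows)
    finally:
        processor.close()


def run_batch_closing_per_key(
    processor: RecordingProcessor, batch: Dict[str, List[int]]
) -> List[Tuple[str, int]]:
    out: List[Tuple[str, int]] = []
    for key, rows in batch.items():
        # Exhausting each per-key generator runs its finally block.
        out.extend(per_key_udf(processor, key, rows))
    return out


def run_batch_closing_once(
    processor: RecordingProcessor, batch: Dict[str, List[int]]
) -> List[Tuple[str, int]]:
    out: List[Tuple[str, int]] = []
    try:
        for key, rows in batch.items():
            out.append((key, processor.handle_rows(key, rows)))
    finally:
        # The intended behavior: close only after every key has been processed.
        processor.close()
    return out


batch = {"a": [1, 2], "b": [3]}
per_key = RecordingProcessor()
once = RecordingProcessor()
results_per_key = run_batch_closing_per_key(per_key, batch)
results_once = run_batch_closing_once(once, batch)
```

With two keys in the batch, the per-key variant calls `close()` twice while the second variant calls it once, even though both produce the same results. The second variant only works here because the toy driver can see the whole batch; the per-key UDF in the discussion cannot.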

jingz-db · 2024-10-28T18:34:25Z

...re/src/main/scala/org/apache/spark/sql/execution/python/TransformWithStateInPandasExec.scala

@@ -106,6 +106,30 @@ case class TransformWithStateInPandasExec(
    List.empty
  }

+    // operator specific metrics
+  override def customStatefulOperatorMetrics: Seq[StatefulOperatorCustomMetric] = {


Shall we have some simple tests around custom metrics to ensure this works for Python?

Added a Python test to verify.

bogao007 · 2024-10-30T22:13:36Z

python/pyspark/worker.py

@@ -1890,6 +1890,14 @@ def process():
            try:
                serializer.dump_stream(out_iter, outfile)
            finally:
+                # Sending a signal to TransformWithState UDF to perform proper cleanup steps.


@HeartSaVioR @jingz-db I made a change to properly call close() and the other cleanup steps. Could you take a look and see if this change makes sense? The change is mainly in this file and group_ops.py. I've verified with both a manual test and the existing unit tests that this change works as expected. I'll fix the merge conflict later; I just wanted to get some early feedback on this specific change. Thanks!

Looks OK to me; I assume you've confirmed that the close method is called.

Yep, confirmed.

LGTM! Thanks for making this change.
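One way such a cleanup signal could look is sketched below. This is an illustration under stated assumptions, not the code in worker.py or group_ops.py: it assumes the driver loop invokes the UDF one final time with a sentinel key of `None` after all real keys are done, and `CountingProcessor`, `make_udf`, and `process` are hypothetical names.

```python
from typing import Callable, Iterable, Iterator, List, Optional, Tuple


class CountingProcessor:
    """Hypothetical processor stand-in that records how often close() runs."""

    def __init__(self) -> None:
        self.close_calls = 0

    def close(self) -> None:
        self.close_calls += 1


Udf = Callable[[Optional[str], Iterable[int]], Iterator[Tuple[str, int]]]


def make_udf(processor: CountingProcessor) -> Udf:
    def udf(key: Optional[str], rows: Iterable[int]) -> Iterator[Tuple[str, int]]:
        if key is None:
            # Assumed end-of-input signal: no data to process, only cleanup.
            processor.close()
            return
        yield key, sum(rows)

    return udf


def process(udf: Udf, batch: List[Tuple[str, List[int]]]) -> List[Tuple[str, int]]:
    out: List[Tuple[str, int]] = []
    try:
        for key, rows in batch:
            out.extend(udf(key, rows))
    finally:
        # udf is a generator function, so its body does not run until the
        # returned iterator is consumed; materialize it to trigger cleanup.
        for _ in udf(None, []):
            pass
    return out


processor = CountingProcessor()
results = process(make_udf(processor), [("a", [1, 2]), ("b", [3])])
```

In this sketch `close()` runs exactly once per call to `process`, regardless of how many keys the batch contains, and it still runs if processing a key raises.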

jingz-db

LGTM! Also, in my last PR I moved the proto-generated py files under a new directory for better project management. To resolve the merge conflicts, regenerate the py files under the new directory with:

    protoc --proto_path=sql/core/src/main/protobuf/org/apache/spark/sql/execution/streaming --python_out=python/pyspark/sql/streaming/proto --pyi_out=python/pyspark/sql/streaming/proto sql/core/src/main/protobuf/org/apache/spark/sql/execution/streaming/StateMessage.proto

bogao007 · 2024-10-31T22:41:45Z

I'll give it a try, thanks @jingz-db!

HeartSaVioR

First pass. Looks great overall. One major comment: please separate the metrics change out.

HeartSaVioR · 2024-11-04T06:19:00Z

sql/core/src/main/java/org/apache/spark/sql/execution/streaming/state/StateMessage.java

Friendly reminder: we can remove this now.

HeartSaVioR · 2024-11-04T06:26:27Z

...re/src/main/scala/org/apache/spark/sql/execution/python/TransformWithStateInPandasExec.scala

@@ -154,6 +178,7 @@ case class TransformWithStateInPandasExec(
          // by the upstream (consumer) operators in addition to the processing in this operator.
          allUpdatesTimeMs += NANOSECONDS.toMillis(System.nanoTime - updatesStartTimeNs)
          commitTimeMs += timeTakenMs {
+            processorHandle.doTtlCleanup()


Is this a fix for an existing bug? If so, please file a new JIRA ticket and submit a new PR. Let's not mix different things together.

HeartSaVioR · 2024-11-04T06:38:49Z

...re/src/main/scala/org/apache/spark/sql/execution/python/TransformWithStateInPandasExec.scala

@@ -106,6 +106,30 @@ case class TransformWithStateInPandasExec(
    List.empty
  }

+    // operator specific metrics


Same here: shall we move the metrics change (and its test) out to a separate JIRA ticket and a corresponding PR?

HeartSaVioR · 2024-11-04T06:42:28Z

python/pyspark/sql/tests/pandas/test_pandas_transform_with_state.py

@@ -194,6 +195,10 @@ def test_transform_with_state_in_pandas_query_restarts(self):
        q.awaitTermination(10)
        self.assertTrue(q.exception() is None)

+        # Verify custom metrics. We created 2 value states in this test case and deleted 1 of them.


Ditto, let's move this out.

bogao007 · 2024-11-08T00:28:27Z

@HeartSaVioR I've addressed your comments. Could you take another look? Thanks!

bogao007 · 2024-11-08T02:37:08Z

Created a new ticket, https://issues.apache.org/jira/browse/SPARK-50270, to track the metrics change.

HeartSaVioR

+1

HeartSaVioR · 2024-11-08T08:57:03Z

Thanks! Merging to master.

support deleteIfExists

5a2f81d

github-actions bot added SQL STRUCTURED STREAMING PYTHON labels Oct 7, 2024

bogao007 added 2 commits October 14, 2024 11:57

added close

6f8a428

cleanup ttl

261b6b9

bogao007 added 5 commits October 24, 2024 15:04

Merge branch 'master' into delete-if-exists

25d7335

Added custom metrics, added more tests

90061ea

lint

18845a6

lint

f105004

Merge branch 'master' into delete-if-exists

3bbde36

jingz-db reviewed Oct 28, 2024

bogao007 added 2 commits October 28, 2024 17:58

Added custom metrics test

a9a1d14

Properly invoke close() and other cleanup steps

3601681

github-actions bot added the CORE label Oct 30, 2024

fix

fe42b91

bogao007 commented Oct 30, 2024

jingz-db approved these changes Oct 31, 2024

HeartSaVioR reviewed Nov 4, 2024

Merge branch 'master' into delete-if-exists

2044047

HeartSaVioR approved these changes Nov 8, 2024

HeartSaVioR closed this in e4638c8 Nov 8, 2024
