@peterpashkin peterpashkin commented Apr 16, 2025

What changes were proposed in this pull request?

An AnalyzePlanRequest for a schema should not trigger execution of the logical plan. Currently, when an AnalyzePlanRequest carries a command that is executed eagerly, the Dataset.ofRows(logicalPlan) call executes the underlying command. We do not want this to happen during AnalyzePlan. Instead, we construct the QueryExecution for the LogicalPlan with CommandExecutionMode.SKIP and return the resulting schema from it.
https://issues.apache.org/jira/browse/SPARK-51818
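
The idea of skipping command execution during analysis can be illustrated with a small toy model. This is only a sketch in plain Python, not Spark's actual code: `DropTable`, `analyze_schema`, and the `catalog` set are hypothetical names made up for the illustration, and the enum models only two modes, borrowing the names ALL and SKIP.

```python
from enum import Enum, auto


class CommandExecutionMode(Enum):
    """Toy enum; only the two modes needed for this sketch."""
    ALL = auto()   # run commands eagerly while resolving the plan
    SKIP = auto()  # analyze only, never run the command


class DropTable:
    """Hypothetical stand-in for a plan that wraps an eager command."""

    def __init__(self, name):
        self.name = name

    def schema(self):
        # This toy command produces no output columns.
        return []

    def run(self, catalog):
        # The side effect we want to avoid during analysis.
        catalog.discard(self.name)


def analyze_schema(plan, catalog, mode):
    # Only ALL triggers the side effect; SKIP stops after analysis.
    if mode is CommandExecutionMode.ALL:
        plan.run(catalog)
    return plan.schema()


catalog = {"t1"}
schema = analyze_schema(DropTable("t1"), catalog, CommandExecutionMode.SKIP)
```

With the SKIP mode the schema is still returned, while the toy catalog keeps its table.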

Why are the changes needed?

SQL commands sent via an AnalyzePlanRequest are currently executed eagerly. This PR fixes that.

Does this PR introduce any user-facing change?

When calling .schema on a DataFrame via Spark Connect, the plan stored in the DataFrame is no longer executed, as it was before. Example: spark.newDataFrame(plan: proto.Plan).schema, where plan encodes an eagerly executed SQL command such as DROP TABLE. Previously this call would execute the SQL command; after this change it does not.

How was this patch tested?

Added a test that sends an AnalyzePlanRequest with DROP TABLE and verifies that the table was not dropped.
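
The shape of such a test can be sketched as follows. This is a plain Python illustration under stated assumptions, not the test added in this PR: `FakeConnectServer` and its methods are hypothetical stand-ins for the server, and the table name is invented.

```python
class FakeConnectServer:
    """Hypothetical in-memory stand-in for a Spark Connect server."""

    def __init__(self):
        self.tables = {"test_table"}

    def analyze_schema(self, sql):
        # Analysis path: inspect the command but never apply it.
        return []

    def execute(self, sql):
        # Execution path: only here may DROP TABLE take effect.
        prefix = "DROP TABLE "
        if sql.upper().startswith(prefix):
            self.tables.discard(sql[len(prefix):].strip())


def check_analyze_does_not_drop():
    server = FakeConnectServer()
    # Analyzing a DROP TABLE command must leave the table in place.
    schema = server.analyze_schema("DROP TABLE test_table")
    assert schema == []
    assert "test_table" in server.tables
    # Executing the same command is what actually drops it.
    server.execute("DROP TABLE test_table")
    assert "test_table" not in server.tables
    return True


result = check_analyze_does_not_drop()
```

The second half of the check confirms that the table is droppable at all, so the first assertion cannot pass merely because the command is malformed.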

Was this patch authored or co-authored using generative AI tooling?

No

@peterpashkin peterpashkin changed the title from "Move QueryExecution creation to AnalyzeHandler and don't Execute for AnalyzePlanRequests" to "[SPARK-51818] Move QueryExecution creation to AnalyzeHandler and don't Execute for AnalyzePlanRequests" Apr 16, 2025
@peterpashkin peterpashkin changed the title from "[SPARK-51818] Move QueryExecution creation to AnalyzeHandler and don't Execute for AnalyzePlanRequests" to "[SPARK-51818][CONNECT] Move QueryExecution creation to AnalyzeHandler and don't Execute for AnalyzePlanRequests" Apr 16, 2025
empty

formatting

test if that is good

typo fix

only explain change

isLocal execution

okay only Schema

just work please

do all correct

include

small fix

squash
@peterpashkin peterpashkin force-pushed the peter-pashkin/MoveAnalyzeAndSkipExecution branch from 9a092f3 to 45bf4db Compare April 16, 2025 15:25
Peter Pashkin added 2 commits April 16, 2025 16:06
@HyukjinKwon
Member

cc @vicennial @hvanhovell FYI

Contributor

@vicennial vicennial left a comment

Thanks for the changes!
The fact that analysis calls were executing commands looks like an unfortunate bug that slipped through.

Requests:

  • Please fill in the "User Facing Changes" section. Include a simple example where the behaviour differs (IIUC, spark.sql("<some command>").schema would now have its behaviour corrected)
  • Add a server-side test that verifies that the DF was in fact, not executed

@peterpashkin
Author

Thanks Akhil, I will add the tests and the User Facing Changes section. Actually, spark.sql(...).schema still executes the command, because commands issued through .sql() are executed eagerly. This PR only ensures that sending an AnalyzePlanRequest, for example via session.analyze(plan), does not execute the command.
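
The distinction between the two paths can be sketched with a toy client model. This is an illustration only, assuming made-up `ToySession` and `ToyDataFrame` classes; it is not the real Spark Connect client API.

```python
class ToySession:
    """Hypothetical client model; not the real Spark Connect API."""

    def __init__(self):
        self.tables = {"t"}
        self.executed = []

    def _run(self, command):
        self.executed.append(command)
        if command.startswith("DROP TABLE "):
            self.tables.discard(command.split()[-1])

    def sql(self, command):
        # Commands issued through sql() run eagerly, before any schema call.
        self._run(command)
        return ToyDataFrame(self, command)

    def new_data_frame(self, command):
        # Wraps the plan without running it.
        return ToyDataFrame(self, command)


class ToyDataFrame:
    def __init__(self, session, command):
        self.session = session
        self.command = command

    @property
    def schema(self):
        # Analyze-only request: never calls session._run.
        return []


eager = ToySession()
eager.sql("DROP TABLE t").schema

lazy = ToySession()
lazy.new_data_frame("DROP TABLE t").schema
```

In this model the eager session loses its table as soon as `sql()` is called, regardless of the later `schema` access, while the session that only analyzes the wrapped plan keeps it.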

@peterpashkin peterpashkin requested a review from vicennial April 28, 2025 07:52
Contributor

@vicennial vicennial left a comment

LGTM, thanks for the fix!

@peterpashkin peterpashkin requested a review from hvanhovell May 2, 2025 15:43
@HyukjinKwon
Member

Merged to master.

yhuang-db pushed a commit to yhuang-db/spark that referenced this pull request Jun 9, 2025
… and don't Execute for AnalyzePlanRequests

Closes apache#50605 from peterpashkin/peter-pashkin/MoveAnalyzeAndSkipExecution.

Authored-by: Peter Pashkin <peter.pashkin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>