-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-51818][CONNECT] Move QueryExecution creation to AnalyzeHandler and don't Execute for AnalyzePlanRequests #50605
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
empty formatting test if that is good typo fix only explain change isLocal execution okay only Schema just work please do all correct include small fix squash
9a092f3 to
45bf4db
Compare
|
cc @vicennial @hvanhovell FYI |
vicennial
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the changes!
The fact that analysis calls were executing commands looks like an unfortunate bug that slipped through.
Requests:
- Please fill in the "User Facing Changes" section. Include a simple example where the behaviour differs (IIUC,
spark.sql("<some command>").schemawould now have it's behaviour corrected) - Add a server-side test that verifies that the DF was in fact, not executed
|
Thanks Akhil, will add the tests and User Facing Changes. Actually spark.sql(...).schema is still executing the command because commands with .sql() will get executed eagerly. This PR only fixes sending AnalyzePlanRequests not executing the request, like session.analyze(plan). |
vicennial
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks for the fix!
.../server/src/main/scala/org/apache/spark/sql/connect/service/SparkConnectAnalyzeHandler.scala
Show resolved
Hide resolved
.../server/src/main/scala/org/apache/spark/sql/connect/service/SparkConnectAnalyzeHandler.scala
Outdated
Show resolved
Hide resolved
|
Merged to master. |
… and don't Execute for AnalyzePlanRequests ### What changes were proposed in this pull request? Analyze Plan Requests for Schema should not trigger an Execute on the Logical Plan, currently when sending an AnalyzePlanRequest with a command that gets executed eagerly the Dataset.ofRows(logicalPlan) call executes the underlying command. We do not want this to happen when doing AnalyzePlan. So instead we construct the LogicalPlan with the CommandExecutionMode.SKIP and return the resulting schema that way. https://issues.apache.org/jira/browse/SPARK-51818 ### Why are the changes needed? SQL commands that get sent via an AnalyzePlanRequest get executed eagerly right now, this PR fixes that ### Does this PR introduce _any_ user-facing change? When calling .schema on DataFrame via Spark Connect the plan saved in the DataFrame is not executed anymore, that was the case beforehand. Example: spark.newDataFrame(plan: proto.Plan).schema with plan encoding some SQL command that gets executed eagerly like DROP TABLE the current behavior would execute the SQL command. This will not happen anymore after this change. ### How was this patch tested? Added Test for sending an AnalyzePlanRequest with Drop Table and making sure the table was not dropped ### Was this patch authored or co-authored using generative AI tooling? No Closes apache#50605 from peterpashkin/peter-pashkin/MoveAnalyzeAndSkipExecution. Authored-by: Peter Pashkin <peter.pashkin@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
What changes were proposed in this pull request?
Analyze Plan Requests for Schema should not trigger an Execute on the Logical Plan, currently when sending an AnalyzePlanRequest with a command that gets executed eagerly the Dataset.ofRows(logicalPlan) call executes the underlying command. We do not want this to happen when doing AnalyzePlan. So instead we construct the LogicalPlan with the CommandExecutionMode.SKIP and return the resulting schema that way.
https://issues.apache.org/jira/browse/SPARK-51818
Why are the changes needed?
SQL commands that get sent via an AnalyzePlanRequest get executed eagerly right now, this PR fixes that
Does this PR introduce any user-facing change?
When calling .schema on DataFrame via Spark Connect the plan saved in the DataFrame is not executed anymore, that was the case beforehand. Example: spark.newDataFrame(plan: proto.Plan).schema with plan encoding some SQL command that gets executed eagerly like DROP TABLE the current behavior would execute the SQL command. This will not happen anymore after this change.
How was this patch tested?
Added Test for sending an AnalyzePlanRequest with Drop Table and making sure the table was not dropped
Was this patch authored or co-authored using generative AI tooling?
No