-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-40854] [CONNECT] Use proper JSON encoding until we have Arrow collection. #38300
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
@cloud-fan @amaliujia @HyukjinKwon If you have some cylces, it would be great if you could review. |
|
Can one of the admins verify this patch? |
|
I will create a proper JIRA if the patch makes sense to push. |
|
I will take a look by the end of the day. |
|
Looks reasonable. Only one question: the CSV batch was producing pandas dataframe directly on the python client side. Will this the same for JSON batch or it will give List[Row]? |
If you look at the python changes it simply changes the calling function from Pandas. In both cases we rely on pandas to do the proper deserialization for now. Once we have arrow IPC batches we can / should look into producing the pandas df and the list of rows. For now there is no public facing change. |
.../connect/src/main/scala/org/apache/spark/sql/connect/service/SparkConnectStreamHandler.scala
Show resolved
Hide resolved
.../connect/src/main/scala/org/apache/spark/sql/connect/service/SparkConnectStreamHandler.scala
Show resolved
Hide resolved
.../connect/src/main/scala/org/apache/spark/sql/connect/service/SparkConnectStreamHandler.scala
Show resolved
Hide resolved
.../connect/src/main/scala/org/apache/spark/sql/connect/service/SparkConnectStreamHandler.scala
Outdated
Show resolved
Hide resolved
HyukjinKwon
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am fine w/ this. Probably it'd be great if we can avoid toLocalIterator though.
c3f3329 to
ae5a941
Compare
|
All related tests passed. Merged to master. |
|
Thanks for working on this! |
…ollection ### What changes were proposed in this pull request? This patch provides a temporary implementation of result batches as JSON instead of the 'broken' CSV format that was simply generating unescaped CSV lines. In this implementation we actually leverage the existing Spark functionality to generate JSON and then convert this into result batches for Spark Connect. ### Why are the changes needed? Cleanup ### Does this PR introduce _any_ user-facing change? No / Experimental ### How was this patch tested? E2E tests for the Python Client. Closes apache#38300 from grundprinzip/spark-json. Lead-authored-by: Martin Grund <martin.grund@databricks.com> Co-authored-by: Martin Grund <grundprinzip@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
What changes were proposed in this pull request?
This patch provides a temporary implementation of result batches as JSON instead of the 'broken' CSV format that was simply generating unescaped CSV lines. In this implementation we actually leverage the existing Spark functionality to generate JSON and then convert this into result batches for Spark Connect.
Why are the changes needed?
Cleanup
Does this PR introduce any user-facing change?
No / Experimental
How was this patch tested?
E2E tests for the Python Client.