[SPARK-40854] [CONNECT] Use proper JSON encoding until we have Arrow collection. #38300

grundprinzip · 2022-10-18T20:26:54Z

What changes were proposed in this pull request?

This patch provides a temporary implementation of result batches as JSON instead of the 'broken' CSV format that was simply generating unescaped CSV lines. In this implementation we actually leverage the existing Spark functionality to generate JSON and then convert this into result batches for Spark Connect.
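To illustrate why unescaped CSV lines are fragile, here is a minimal sketch in plain Python. It is not the Spark Connect code; the sample row and the helper names are made up for illustration. A value containing a comma or a newline corrupts a naively joined CSV line, while a JSON-encoded line round-trips it.

```python
import json

# Hypothetical sample row; the second field contains a comma and a newline.
row = {"id": 1, "comment": "hello, world\nsecond line"}


def naive_csv_line(values):
    """Join values with commas without any quoting or escaping."""
    return ",".join(str(v) for v in values)


def json_line(record):
    """Encode one record as a single JSON line; json.dumps escapes the newline."""
    return json.dumps(record)


csv_text = naive_csv_line(row.values())
# Splitting the naive CSV back apart yields too many lines and too many fields.
csv_lines = csv_text.split("\n")
csv_fields = csv_lines[0].split(",")

json_text = json_line(row)
# Decoding the JSON line recovers the original record unchanged.
decoded = json.loads(json_text)
```

The naive CSV output spans two physical lines and splits the comment across fields, whereas the JSON text stays on one line and decodes back to the original record.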

Why are the changes needed?

Cleanup

Does this PR introduce any user-facing change?

No / Experimental

How was this patch tested?

E2E tests for the Python Client.

grundprinzip · 2022-10-19T17:39:41Z

@cloud-fan @amaliujia @HyukjinKwon If you have some cycles, it would be great if you could review.

AmplabJenkins · 2022-10-19T18:38:42Z

Can one of the admins verify this patch?

grundprinzip · 2022-10-19T19:30:16Z

I will create a proper JIRA if the patch makes sense to push.

amaliujia · 2022-10-19T20:11:30Z

I will take a look by the end of the day.

amaliujia · 2022-10-20T05:18:35Z

Looks reasonable.

Only one question: the CSV batch was producing a pandas DataFrame directly on the Python client side. Will this be the same for the JSON batch, or will it give List[Row]?

grundprinzip · 2022-10-20T05:37:28Z

> Looks reasonable.
>
> Only one question: the CSV batch was producing a pandas DataFrame directly on the Python client side. Will this be the same for the JSON batch, or will it give List[Row]?

If you look at the Python changes, they simply change which pandas function is called. In both cases we rely on pandas to do the proper deserialization for now.

Once we have Arrow IPC batches, we can and should look into producing both the pandas DataFrame and the list of rows.

For now there is no public-facing change.
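As a rough picture of what client-side deserialization involves, the sketch below decodes a batch with only the standard library. It assumes the batch payload holds one JSON object per line; that layout, the function name, and the sample payload are illustrative assumptions and not taken from the patch, which relies on pandas for this step.

```python
import json


def decode_json_batch(data: bytes):
    """Decode a batch of newline-delimited JSON objects into a list of dicts."""
    text = data.decode("utf-8")
    # Split on "\n" only, and skip the empty trailing piece after the last row.
    return [json.loads(line) for line in text.split("\n") if line.strip()]


# Hypothetical payload: two rows, one JSON object per line.
batch = b'{"id": 1, "name": "a,b"}\n{"id": 2, "name": "c"}\n'
rows = decode_json_batch(batch)
```

pandas can read the same line-delimited layout through `pandas.read_json(..., lines=True)`, which returns a DataFrame instead of a list of dicts.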

.../connect/src/main/scala/org/apache/spark/sql/connect/service/SparkConnectStreamHandler.scala

HyukjinKwon

I am fine w/ this. Probably it'd be great if we can avoid toLocalIterator though.
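For context on the trade-off: `toLocalIterator` brings rows to the driver one partition at a time, which bounds driver memory but can run a separate job per partition, whereas `collect` fetches everything at once. Either way, the JSON rows still have to be grouped into result batches. The sketch below shows that grouping step in plain Python over any iterator of JSON strings; the function name and the size cap are hypothetical and not taken from the patch.

```python
from typing import Iterable, Iterator, List


def batch_json_rows(rows: Iterable[str], max_bytes: int) -> Iterator[List[str]]:
    """Lazily group JSON row strings into batches of at most max_bytes each.

    A single row larger than max_bytes is emitted as its own batch.
    """
    current: List[str] = []
    current_size = 0
    for row in rows:
        row_size = len(row.encode("utf-8"))
        # Flush the pending batch before adding this row would exceed the cap.
        if current and current_size + row_size > max_bytes:
            yield current
            current, current_size = [], 0
        current.append(row)
        current_size += row_size
    # Emit whatever is left after the input is exhausted.
    if current:
        yield current


# Hypothetical input: five small JSON rows and a tiny cap to force several batches.
rows = ['{"id": %d}' % i for i in range(5)]
batches = list(batch_json_rows(iter(rows), max_bytes=20))
```

Because the function is a generator, it works the same whether the rows come from a fully collected list or from a lazy iterator, and it never holds more than one pending batch in memory.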

HyukjinKwon · 2022-10-21T11:49:11Z

All related tests passed.

Merged to master.

amaliujia · 2022-10-21T18:03:04Z

Thanks for working on this!

…ollection

### What changes were proposed in this pull request?
This patch provides a temporary implementation of result batches as JSON instead of the 'broken' CSV format that was simply generating unescaped CSV lines. In this implementation we actually leverage the existing Spark functionality to generate JSON and then convert this into result batches for Spark Connect.

### Why are the changes needed?
Cleanup

### Does this PR introduce _any_ user-facing change?
No / Experimental

### How was this patch tested?
E2E tests for the Python Client.

Closes apache#38300 from grundprinzip/spark-json.

Lead-authored-by: Martin Grund <martin.grund@databricks.com>
Co-authored-by: Martin Grund <grundprinzip@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>

switching from csv to JSON

37a7e3d

github-actions bot added CONNECT CORE PYTHON SQL labels Oct 18, 2022

format

0489649

grundprinzip marked this pull request as ready for review October 19, 2022 17:39

grundprinzip changed the title ~~[SPARK-XXX] [CONNECT] Use proper JSON encoding until we have Arrow collection.~~ [SPARK-40854] [CONNECT] Use proper JSON encoding until we have Arrow collection. Oct 20, 2022

zhengruifeng reviewed Oct 20, 2022

grundprinzip and others added 2 commits October 20, 2022 21:39

fixing the json handling

7910ba0

Merge branch 'apache:master' into spark-json

ae5a941

HyukjinKwon reviewed Oct 21, 2022

.../connect/src/main/scala/org/apache/spark/sql/connect/service/SparkConnectStreamHandler.scala (outdated)

HyukjinKwon approved these changes Oct 21, 2022

grundprinzip force-pushed the spark-json branch from c3f3329 to ae5a941 Compare October 21, 2022 08:52

iterator

413c6ad

HyukjinKwon closed this in 26e258c Oct 21, 2022

zhengruifeng mentioned this pull request Nov 3, 2022

[SPARK-41005][CONNECT][PYTHON] Arrow-based collect #38468

Closed

5 participants
