@grundprinzip
Contributor

What changes were proposed in this pull request?

This patch provides a temporary implementation of result batches as JSON instead of the 'broken' CSV format that was simply generating unescaped CSV lines. In this implementation we actually leverage the existing Spark functionality to generate JSON and then convert this into result batches for Spark Connect.
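To illustrate why unescaped CSV lines are fragile while JSON is not, here is a minimal sketch using only the Python standard library. The row contents and variable names are made up for illustration and are not taken from the patch; the real server-side encoding is done by Spark, not by this code.

```python
import json

# Hypothetical row; field names and values are illustrative only.
row = {"id": 1, "name": "Doe, Jane", "note": 'said "hi"'}

# Naive CSV: join the values with commas, with no quoting or escaping.
naive_line = ",".join(str(value) for value in row.values())

# Splitting the line again yields too many fields, because the comma
# inside the name is indistinguishable from a delimiter.
naive_fields = naive_line.split(",")

# JSON escapes quotes and keeps delimiters inside string literals,
# so the same row survives a round trip unchanged.
json_line = json.dumps(row)
decoded = json.loads(json_line)

print(len(row), len(naive_fields))
print(decoded == row)
```

The naive line splits into more fields than the row has columns, whereas the JSON round trip returns the original row.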

Why are the changes needed?

Cleanup

Does this PR introduce any user-facing change?

No / Experimental

How was this patch tested?

E2E tests for the Python Client.

@grundprinzip grundprinzip marked this pull request as ready for review October 19, 2022 17:39
@grundprinzip
Contributor Author

@cloud-fan @amaliujia @HyukjinKwon If you have some cycles, it would be great if you could review.

@AmplabJenkins

Can one of the admins verify this patch?

@grundprinzip
Contributor Author

I will create a proper JIRA if the patch makes sense to push.

@amaliujia
Contributor

I will take a look by the end of the day.

@amaliujia
Contributor

Looks reasonable.

Only one question: the CSV batch was producing a pandas DataFrame directly on the Python client side. Will this be the same for the JSON batch, or will it give List[Row]?

@grundprinzip
Contributor Author

> Looks reasonable.
>
> Only one question: the CSV batch was producing a pandas DataFrame directly on the Python client side. Will this be the same for the JSON batch, or will it give List[Row]?

If you look at the Python changes, they simply change which pandas function is called. In both cases we rely on pandas to do the proper deserialization for now.

Once we have Arrow IPC batches we can and should look into producing both the pandas DataFrame and the list of rows.

For now there is no public facing change.
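As a rough sketch of the List[Row] alternative raised in the question above, the snippet below decodes a batch into plain Python rows with the standard library only. It assumes the batch payload is newline-delimited JSON, which this thread does not confirm, and `decode_json_batch` is a hypothetical helper name. The client in this PR relies on pandas for deserialization instead.

```python
import json
from typing import Any, Dict, List


def decode_json_batch(payload: bytes) -> List[Dict[str, Any]]:
    """Decode a newline-delimited JSON payload into a list of row dicts.

    Hypothetical helper: the wire format is an assumption, not the
    format defined by Spark Connect.
    """
    rows = []
    for line in payload.decode("utf-8").split("\n"):
        if line.strip():  # skip empty lines, e.g. after a trailing newline
            rows.append(json.loads(line))
    return rows


# Illustrative payload with two rows; the values are made up.
batch = b'{"id": 1, "name": "Doe, Jane"}\n{"id": 2, "name": "Roe"}\n'
rows = decode_json_batch(batch)
print(rows)
```

Each decoded row keeps its column names, so a list like this could later be turned into a DataFrame or a list of row objects without re-parsing the payload.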

@grundprinzip grundprinzip changed the title from "[SPARK-XXX] [CONNECT] Use proper JSON encoding until we have Arrow collection." to "[SPARK-40854] [CONNECT] Use proper JSON encoding until we have Arrow collection." on Oct 20, 2022
Member

@HyukjinKwon HyukjinKwon left a comment

I am fine w/ this. Probably it'd be great if we can avoid toLocalIterator though.

@HyukjinKwon
Member

All related tests passed.

Merged to master.

@amaliujia
Contributor

Thanks for working on this!

SandishKumarHN pushed a commit to SandishKumarHN/spark that referenced this pull request Dec 12, 2022
…ollection

Closes apache#38300 from grundprinzip/spark-json.

Lead-authored-by: Martin Grund <martin.grund@databricks.com>
Co-authored-by: Martin Grund <grundprinzip@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>