Skip to content

Conversation

@beliefer
Copy link
Contributor

@beliefer beliefer commented Jun 9, 2023

What changes were proposed in this pull request?

SparkConnectStreamHandler treats the proto requests from connect client and send the responses back to connect client. SparkConnectStreamHandler holds a component StreamObserver to send responses. Currently, SparkConnectPlanner also knows StreamObserver passed by method parameters. So the behavior introduces some issues.

  • Visibility

SparkConnectStreamHandler holds the StreamObserver and shouldn't expose it. Now, we expose StreamObserver to SparkConnectPlanner and the latter is public, so every developers could use it.

  • Safety

Based on the visibility issue, this is usually not a secure strategy.

  • Code Design

Because SparkConnectStreamHandler is fully responsible for responding to RPC, which is a division of responsibilities in good programming. The better code design, the easier to extend function.

So I think we should keep the StreamObserver could be accessed only with SparkConnectStreamHandler.
This PR wraps the detail of StreamObserver into SparkConnectStreamHandler, SparkConnectPlanner only need call sendResponse if the response is ready.

This PR want decouple the process handle commands and the other process send responses on server side.

Note: As we disscussed on #41379, this PR doesn't delay to send any response "right now" and expects that it will be returned.

Why are the changes needed?

Decouple handle command and send response on server side.

Does this PR introduce any user-facing change?

'No'.
Just update the inner implementation.

How was this patch tested?

Exists test cases.

@beliefer
Copy link
Contributor Author

beliefer commented Jun 9, 2023

@beliefer
Copy link
Contributor Author

ping @hvanhovell @grundprinzip

@beliefer beliefer force-pushed the SPARK-43879_new2 branch 3 times, most recently from a2c55a7 to defbd2f Compare June 28, 2023 07:19
@beliefer
Copy link
Contributor Author

@grundprinzip Could you take a look? I have rebased many times.

Copy link
Contributor

@grundprinzip grundprinzip left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the overall approach is fine. I don't think this is the most burning refactoring that needs to be done, but it's ok.

However, there are a couple of things that we need to make sure. The APIs of the planner change their visibility and we need to make sure that this is really necessary.

In addition there is the question if the planner can really exist without response handler.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we make this private[connect]?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's OK for me.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a valid use-case where this can be empty? Personally, I don't think that this should ever be none. I understand for testing purposes one might inject a mock, but otherwise the planner cannot exist without the responsehandler

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There exists some test case only need transform proto to LogicalPlan and doesn't depend on responsehandler.

Please refer:

val planner = new SparkConnectPlanner(SessionHolder.forTesting(spark))

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why remove the private? This should at least be package private.

@beliefer
Copy link
Contributor Author

@grundprinzip Thank you for review.

@beliefer
Copy link
Contributor Author

beliefer commented Jun 29, 2023

The APIs of the planner change their visibility and we need to make sure that this is really necessary.

Yes. Because connect exposes the StreamObserver to users is dangerous, the refactor hide it into SparkConnectResponseHandler.
Although it seems a little break change. I think it's worth to protect it.

@beliefer
Copy link
Contributor Author

In addition there is the question if the planner can really exist without response handler.

Spark Connect exists test case uses the planner without response handler. Could we add response handler for it ?

@beliefer beliefer force-pushed the SPARK-43879_new2 branch from 3e56f84 to f66d454 Compare July 4, 2023 02:42
@beliefer
Copy link
Contributor Author

beliefer commented Jul 4, 2023

ping @grundprinzip cc @hvanhovell

@grundprinzip
Copy link
Contributor

Hey @beliefer there are some inflight changes that will conflict heavily with this work - #41315 The changes are needed for the asynchronous query execution. Can we pause this PR for a moment until the other changes are in so that we have a good understanding of how they can work together?

cc @juliuszsompolski

@juliuszsompolski
Copy link
Contributor

juliuszsompolski commented Jul 5, 2023

@beliefer In what I've been working on:

Code Design

I do plan to have SparkConnectStreamHandler be responsible for sending the RPCs. That's not yet ready in my PR, in the current iteration I had the execution thread send the RPCs, but I need to hand it over back to the SparkConnectStreamHandler thread. I agree with you that it's good division of responsibilities, especially since my followup work is that I want to be able to "reconnect" to a query after the initial ExecutePlan disconnects, so I need to have another RPC call to reconnect to the stream.

Visibility

I avoid the need for the refactoring by instead passing my own class implementing the StreamObserver interface that is just going to notify the "real" StreamObserver that there is something to be sent. This way I avoid the need for refactoring, and the StreamObserver is fine as an interface to be passed around.

@github-actions
Copy link

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Oct 14, 2023
@github-actions github-actions bot closed this Oct 15, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants