[SPARK-40857] [CONNECT] Enable configurable GRPC Interceptors. #38320
Conversation
Can one of the admins verify this patch?
LGTM
thanks, merging to master!
 * added to the GRPC server in order of their position in the list. Once the statically compiled
 * interceptors are added, dynamically configured interceptors are added.
 */
object SparkConnectInterceptorRegistry {
Is this an API? We should mark it @Unstable and add a version.
Or it has to be private[service]. Or we should at least note what is supposed to be an API in src/main/scala/org/apache/spark/sql/connect/package.scala.
I thought everything under org.apache.spark.sql.connect.service is private, no?
It is not, unless we document it so. We should probably either document it like https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/package.scala or explicitly make it private/public.
let's document it, just like what catalyst does. cc @amaliujia
Makes sense. Let me check the places that are intentionally private or internal APIs but are documented. I can follow up on it.
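For illustration, here is a minimal sketch of what a catalyst-style package doc could look like for this module; the file, wording, and placement are assumptions and not part of this PR:
```scala
// Hypothetical package doc, modeled on sql/catalyst's package.scala; not part of this PR.
package org.apache.spark.sql

/**
 * Classes under `org.apache.spark.sql.connect`, including the `service` package and
 * `SparkConnectInterceptorRegistry`, are considered internal to Spark Connect and are
 * subject to change between releases without notice.
 */
package object connect {
  // Intentionally empty: exists only to carry the package-level documentation above.
}
```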
    .intConf
    .createWithDefault(15002)

  val CONNECT_GRPC_INTERCEPTOR_CLASSES =
Just realized that this is under apache.spark.sql ... we should either move this module out of sql or, ideally, use StaticSQLConf.
This is in the connect module.
But it's under apache.spark.sql.connect. Should we move it to apache.spark.connect?
Right now, everything in Spark Connect is under apache.spark.sql.connect. Are you proposing an overall package renaming?
Yeah, I am proposing this before it's too late.
Either: if we aim to cover other components too, we should probably rename them before it's too late. For PySpark too, we should probably move it from pyspark.sql.connect.DataFrame -> pyspark.connect.sql.DataFrame.
Or: use StaticSQLConf, since we're in the SQL package. We're doing this in the Hive thrift server, Hive modules, etc.
Then we had better use StaticSQLConf or SQLConf instead of SparkConf.
My point is that it's mixed: it's in the SQL package but the configuration being used is SparkConf, and the configuration name doesn't follow it either.
StaticSQLConf and SQLConf are in the SQL module; it's weird to add Spark Connect configs there...
That's what the Hive thriftserver (separate module) does, and what Avro (separate module) and Kafka (separate module for Structured Streaming) do. The Pandas API on Spark also leverages runtime configurations via SparkSession under the hood instead of SparkConf.
It is weird that a SQL thing uses the SQL package namespace but doesn't use SQLConf.
I'm not sure what the benefit of doing so is. The config definition can live anywhere and we can still use SQLConf to access it, e.g. SQLConf.get.getConf(CONNECT_GRPC_INTERCEPTOR_CLASSES).
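For illustration, a minimal sketch of that access pattern, assuming the value set via `--conf` is visible through the active session's `SQLConf` settings; the empty-string default and the parsing are assumptions:
```scala
import org.apache.spark.sql.internal.SQLConf

// Hedged sketch: read the comma-separated interceptor class list through SQLConf
// rather than touching SparkConf directly. The key name comes from this PR; the
// default value and the splitting/trimming are illustrative assumptions.
val interceptorClassNames: Seq[String] =
  SQLConf.get
    .getConfString("spark.connect.grpc.interceptor.classes", "")
    .split(",")
    .map(_.trim)
    .filter(_.nonEmpty)
    .toSeq
```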
HyukjinKwon left a comment:
LGTM. A couple of comments.
### What changes were proposed in this pull request?
To be able to modify incoming requests for the Spark Connect GRPC service, for example to translate metadata from the HTTP/2 request into values in the proto message, the GRPC service needs to be configured with an interceptor.
This patch adds two ways to configure interceptors for the GRPC service. First, we can now configure interceptors in the `SparkConnectInterceptorRegistry` by adding a value to the `interceptorChain` like in the example below:
```
object SparkConnectInterceptorRegistry {
// Contains the list of configured interceptors.
private lazy val interceptorChain: Seq[InterceptorBuilder] = Seq(
interceptor[LoggingInterceptor](classOf[LoggingInterceptor])
)
// ...
}
```
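For context, a rough sketch of the supporting pieces implied by this snippet; the actual `InterceptorBuilder` type and `interceptor` helper in the registry may differ, and the reflection-based instantiation is an assumption:
```scala
import io.grpc.ServerInterceptor

object InterceptorBuilderSketch {
  // Assumption: a builder is simply a thunk that produces a fresh ServerInterceptor.
  type InterceptorBuilder = () => ServerInterceptor

  // Hypothetical helper mirroring interceptor[LoggingInterceptor](classOf[LoggingInterceptor]):
  // it defers instantiation of the class (via its no-arg constructor) until the chain is built.
  def interceptor[T <: ServerInterceptor](cls: Class[T]): InterceptorBuilder =
    () => cls.getDeclaredConstructor().newInstance()
}
```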
The second way is to configure interceptors using Spark configuration values at startup. For this, a new config key called `spark.connect.grpc.interceptor.classes` has been added. Its value is a comma-separated list of classes that are added as interceptors to the system.
```
./bin/pyspark --conf spark.connect.grpc.interceptor.classes=com.my.important.LoggingInterceptor
```
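For reference, a sketch of how such an entry could be declared with Spark's internal `ConfigBuilder`; the doc text, version, and optional default here are assumptions rather than the PR's exact definition in `Connect.scala`:
```scala
import org.apache.spark.internal.config.ConfigBuilder

object ConnectConfSketch {
  // Hedged sketch of the config declaration; the real entry may differ in doc,
  // version, and default handling.
  val CONNECT_GRPC_INTERCEPTOR_CLASSES =
    ConfigBuilder("spark.connect.grpc.interceptor.classes")
      .doc("Comma-separated list of GRPC interceptor class names added to the Spark Connect server.")
      .version("3.4.0")
      .stringConf
      .createOptional
}
```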
During startup, all of the interceptors are added to the `NettyServerBuilder` in order.
```
// Add all registered interceptors to the server builder.
SparkConnectInterceptorRegistry.chainInterceptors(sb)
```
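For context, a rough sketch of what that chaining could look like; the method signature, helper names, and reflection details below are assumptions rather than the PR's exact code:
```scala
import io.grpc.ServerInterceptor
import io.grpc.netty.NettyServerBuilder

object InterceptorChainSketch {
  // Statically compiled builders are added first, in list order, followed by the classes
  // named in spark.connect.grpc.interceptor.classes, each instantiated via its no-arg
  // constructor. All names and details here are illustrative.
  def chainInterceptors(
      sb: NettyServerBuilder,
      staticBuilders: Seq[() => ServerInterceptor],
      configuredClassNames: Seq[String]): Unit = {
    staticBuilders.foreach(builder => sb.intercept(builder()))
    configuredClassNames.foreach { name =>
      val instance = Class.forName(name).getDeclaredConstructor().newInstance()
      sb.intercept(instance.asInstanceOf[ServerInterceptor])
    }
  }
}
```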
### Why are the changes needed?
Provide a configurable and extensible way to configure interceptors.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Unit Tests
Closes apache#38320 from grundprinzip/SPARK-40857.
Lead-authored-by: Martin Grund <martin.grund@databricks.com>
Co-authored-by: Martin Grund <grundprinzip@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>