
Conversation

@grundprinzip (Contributor)

What changes were proposed in this pull request?

To modify incoming requests to the Spark Connect GRPC service, for example to translate metadata from the HTTP/2 request into values in the proto message, the GRPC service needs to be configured with an interceptor.

This patch adds two ways to configure interceptors for the GRPC service. First, we can now configure interceptors in the `SparkConnectInterceptorRegistry` by adding a value to the `interceptorChain` like in the example below:

```scala
object SparkConnectInterceptorRegistry {

  // Contains the list of configured interceptors.
  private lazy val interceptorChain: Seq[InterceptorBuilder] = Seq(
    interceptor[LoggingInterceptor](classOf[LoggingInterceptor])
  )
  // ...
}
```
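
For context, an entry in this chain is a plain gRPC `ServerInterceptor`. Below is a minimal sketch of what such an interceptor could look like; the class name and logging behavior are illustrative assumptions, not the implementation shipped with this patch.

```scala
import io.grpc.{Metadata, ServerCall, ServerCallHandler, ServerInterceptor}

// Minimal sketch of an interceptor that could be registered in the chain.
// Class name and behavior are illustrative assumptions.
class LoggingInterceptor extends ServerInterceptor {
  override def interceptCall[ReqT, RespT](
      call: ServerCall[ReqT, RespT],
      headers: Metadata,
      next: ServerCallHandler[ReqT, RespT]): ServerCall.Listener[ReqT] = {
    // Inspect or translate HTTP/2 metadata here before the call reaches the service.
    println(s"Intercepted ${call.getMethodDescriptor.getFullMethodName}")
    next.startCall(call, headers)
  }
}
```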

The second way is to configure interceptors using Spark configuration values at startup. For this, a new config key has been added: `spark.connect.grpc.interceptor.classes`. Its value is a comma-separated list of class names that are added as interceptors to the system.

```
./bin/pyspark --conf spark.connect.grpc.interceptor.classes=com.my.important.LoggingInterceptor
```
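
The description doesn't show how the configured class names become instances; a plausible reflection-based sketch (the helper name and error handling are assumptions, not the registry's actual code) would be:

```scala
import io.grpc.ServerInterceptor
import org.apache.spark.SparkConf

// Hypothetical sketch: turn the comma-separated config value into interceptor
// instances via reflection, assuming each class has a zero-argument constructor.
def createConfiguredInterceptors(conf: SparkConf): Seq[ServerInterceptor] = {
  conf.get("spark.connect.grpc.interceptor.classes", "")
    .split(",")
    .map(_.trim)
    .filter(_.nonEmpty)
    .map { name =>
      Class.forName(name)
        .getDeclaredConstructor()
        .newInstance()
        .asInstanceOf[ServerInterceptor]
    }
    .toSeq
}
```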

During startup, all interceptors are added in order to the `NettyServerBuilder`.

```scala
// Add all registered interceptors to the server builder.
SparkConnectInterceptorRegistry.chainInterceptors(sb)
```
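
A minimal sketch of what the chaining step could look like, assuming `sb` is the `NettyServerBuilder` being configured and `allInterceptors` stands in for the combined list of statically registered and config-driven interceptors:

```scala
import io.grpc.ServerInterceptor
import io.grpc.netty.NettyServerBuilder

// Hypothetical sketch: register every interceptor with the builder in list order.
def chainInterceptors(sb: NettyServerBuilder, allInterceptors: Seq[ServerInterceptor]): Unit = {
  allInterceptors.foreach(interceptor => sb.intercept(interceptor))
}
```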

Why are the changes needed?

Provide a configurable and extensible way to register interceptors for the Spark Connect GRPC service.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Unit Tests

@AmplabJenkins

Can one of the admins verify this patch?

@amaliujia (Contributor)

LGTM

@cloud-fan (Contributor)

thanks, merging to master!

@cloud-fan cloud-fan closed this in 4ba7ce2 Oct 24, 2022
```scala
* added to the GRPC server in order of their position in the list. Once the statically compiled
* interceptors are added, dynamically configured interceptors are added.
*/
object SparkConnectInterceptorRegistry {
```
Member

Is this an API? We should mark it @Unstable and add a version.

@HyukjinKwon (Member) Oct 25, 2022

Or it has to be `private[service]`. Or we should at least mention what is supposed to be an API at src/main/scala/org/apache/spark/sql/connect/package.scala.

Contributor

I thought everything under org.apache.spark.sql.connect.service is private, no?

Member

It is not, unless we document it as such. We should probably either document it like https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/package.scala or explicitly make it private/public.

Contributor

let's document it, just like what catalyst does. cc @amaliujia
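
For reference, the catalyst approach mentioned above is a package object whose Scaladoc declares everything in the package internal. A sketch of what the equivalent note for connect could look like (the wording is illustrative, not the actual follow-up change):

```scala
package org.apache.spark.sql

/**
 * All classes in this package are considered an internal API to Spark and are
 * subject to change between minor releases.
 */
package object connect {}
```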

Contributor

Makes sense. Let me check the places that are intentionally private or internal APIs but are documented. I can follow up on it.

```scala
.intConf
.createWithDefault(15002)

val CONNECT_GRPC_INTERCEPTOR_CLASSES =
```
Member

Just realized that this is under apache.spark.sql. We should either move this module out of sql or, ideally, use StaticSQLConf.

Contributor

This is in the connect module.

Member

But it's under apache.spark.sql.connect. Should we move it to apache.spark.connect?

Contributor

Right now, everything in Spark Connect is under apache.spark.sql.connect. Are you proposing an overall package renaming?

Member

Yeah, I am proposing it before it's too late.

Either: if we target covering other components too, we should probably rename the packages before it's too late. For PySpark too, we should probably move pyspark.sql.connect.DataFrame -> pyspark.connect.sql.DataFrame.

Or: use StaticSQLConf since we're in the SQL package. We're doing this in the Hive thrift server, Hive modules, etc.

Member

Then we'd better use StaticSQLConf or SQLConf instead of SparkConf.

@HyukjinKwon (Member) Oct 25, 2022

My point is that it's mixed: it's in the SQL package but the configuration being used is SparkConf, and the configuration name doesn't follow it either.

Contributor

StaticSQLConf and SQLConf are in the SQL module; it's weird to add Spark Connect configs there...

Member

That's what the Hive thriftserver (separate module) does, and what Avro (separate module) and Kafka (separate module for Structured Streaming) do. Pandas API on Spark also leverages runtime configurations via SparkSession under the hood instead of SparkConf.

It's weird that this is a SQL thing that uses the SQL package namespace but doesn't use SQLConf.

Contributor

I'm not sure what the benefit of doing so is. The config definition can be anywhere, and we can still use SQLConf to access it, e.g. `SQLConf.get.getConf(CONNECT_GRPC_INTERCEPTOR_CLASSES)`.
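
For reference, a config entry like the one in the diff above can be declared in the connect module with Spark's internal `ConfigBuilder`; the doc string and version below are assumptions for illustration, not the merged definition:

```scala
import org.apache.spark.internal.config.ConfigBuilder

// Illustrative sketch of a ConfigBuilder-based declaration in the connect module.
// Doc text and version are assumptions, not the merged entry.
val CONNECT_GRPC_INTERCEPTOR_CLASSES =
  ConfigBuilder("spark.connect.grpc.interceptor.classes")
    .doc("Comma-separated list of class names that implement the gRPC ServerInterceptor interface.")
    .version("3.4.0")
    .stringConf
    .createOptional
```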

@HyukjinKwon (Member) left a comment

LGTM. A couple of comments.

SandishKumarHN pushed a commit to SandishKumarHN/spark that referenced this pull request Dec 12, 2022
Closes apache#38320 from grundprinzip/SPARK-40857.

Lead-authored-by: Martin Grund <martin.grund@databricks.com>
Co-authored-by: Martin Grund <grundprinzip@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>