
Commit efd066e

pan3793 authored and dongjoon-hyun committed
[SPARK-54002][DEPLOY] Support integrating BeeLine with Connect JDBC driver
### What changes were proposed in this pull request?

This PR modifies the classpath for `bin/beeline` - excluding `spark-sql-core_*.jar`, `spark-connect_*.jar`, etc., and adding `jars/connect-repl/*.jar` - making it the same as `bin/spark-connect-shell`. The modified classpath looks like:

```
jars/*.jar              - except for spark-sql-core_*.jar, spark-connect_*.jar, etc.
jars/connect-repl/*.jar - including spark-connect-client-jdbc_*.jar
```

Note: BeeLine itself only requires Hive jars and a few third-party utility jars to run, so excluding some `spark-*.jar`s won't break BeeLine's existing capability to connect to the Thrift Server.

To ensure no change in classic Spark behavior, the above changes only take effect for the Spark classic (default) distribution when `SPARK_CONNECT_BEELINE=1` is set explicitly. For convenience, this is enabled by default in the Spark Connect distribution.

### Why are the changes needed?

It's a new feature: it lets users use BeeLine as a SQL CLI to connect to a Spark Connect server.

### Does this PR introduce _any_ user-facing change?

No. For the classic (default) Spark distribution, this feature must be enabled explicitly by setting `SPARK_CONNECT_BEELINE=1`.

### How was this patch tested?

Launch a Connect Server first; in my case, the Connect Server (v4.1.0-preview2) runs at `sc://localhost:15002`. To ensure the changes won't break the Thrift Server use case, also launch a Thrift Server at `thrift://localhost:10000`.

#### Testing for dev mode

Building:

```
$ build/sbt -Phive,hive-thriftserver clean package
```

Without setting `SPARK_CONNECT_BEELINE=1`, it fails as expected with `No known driver to handle "jdbc:sc://localhost:15002"`:

```
$ SPARK_PREPEND_CLASSES=true bin/beeline -u jdbc:sc://localhost:15002
NOTE: SPARK_PREPEND_CLASSES is set, placing locally compiled Spark classes ahead of assembly.
WARNING: Using incubator modules: jdk.incubator.vector
scan complete in 0ms
scan complete in 4ms
No known driver to handle "jdbc:sc://localhost:15002"
Beeline version 2.3.10 by Apache Hive
beeline>
```

With `SPARK_CONNECT_BEELINE=1` set, it works as expected:

```
$ SPARK_PREPEND_CLASSES=true SPARK_CONNECT_BEELINE=1 bin/beeline -u jdbc:sc://localhost:15002
NOTE: SPARK_PREPEND_CLASSES is set, placing locally compiled Spark classes ahead of assembly.
WARNING: Using incubator modules: jdk.incubator.vector
Connecting to jdbc:sc://localhost:15002
Connected to: Apache Spark Connect Server (version 4.1.0-preview2)
Driver: Apache Spark Connect JDBC Driver (version 4.1.0-SNAPSHOT)
Error: Requested transaction isolation level REPEATABLE_READ is not supported (state=,code=0)
Beeline version 2.3.10 by Apache Hive
0: jdbc:sc://localhost:15002> select 'Hello, Spark Connect!', version() as server_version;
+------------------------+-------------------------------------------------+
| Hello, Spark Connect!  |                 server_version                  |
+------------------------+-------------------------------------------------+
| Hello, Spark Connect!  | 4.1.0 c5ff48c                                   |
+------------------------+-------------------------------------------------+
1 row selected (0.476 seconds)
0: jdbc:sc://localhost:15002>
```

Also test with the Thrift Server to ensure no impact on existing functionality. It works as expected both with and without `SPARK_CONNECT_BEELINE=1`:

```
$ SPARK_PREPEND_CLASSES=true [SPARK_CONNECT_BEELINE=1] bin/beeline -u jdbc:hive2://localhost:10000
NOTE: SPARK_PREPEND_CLASSES is set, placing locally compiled Spark classes ahead of assembly.
WARNING: Using incubator modules: jdk.incubator.vector
Connecting to jdbc:hive2://localhost:10000
Connected to: Spark SQL (version 4.1.0-preview2)
Driver: Hive JDBC (version 2.3.10)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Beeline version 2.3.10 by Apache Hive
0: jdbc:hive2://localhost:10000> select 'Hello, Spark Connect!', version() as server_version;
+------------------------+-------------------------------------------------+
| Hello, Spark Connect!  |                 server_version                  |
+------------------------+-------------------------------------------------+
| Hello, Spark Connect!  | 4.1.0 c5ff48c                                   |
+------------------------+-------------------------------------------------+
1 row selected (0.973 seconds)
0: jdbc:hive2://localhost:10000>
```

#### Testing for Spark distribution

```
$ dev/make-distribution.sh --tgz --connect --name SPARK-54002 -Pyarn -Pkubernetes -Phadoop-3 -Phive -Phive-thriftserver
```

##### Spark classic distribution

```
$ tar -xzf spark-4.1.0-SNAPSHOT-bin-SPARK-54002.tgz
$ cd spark-4.1.0-SNAPSHOT-bin-SPARK-54002
$ bin/beeline -u jdbc:sc://localhost:15002 -e "select 'Hello, Spark Connect!', version() as server_version;"
... (negative result, fails with 'No known driver to handle "jdbc:sc://localhost:15002"')
$ SPARK_CONNECT_BEELINE=1 bin/beeline -u jdbc:sc://localhost:15002 -e "select 'Hello, Spark Connect!', version() as server_version;"
... (positive result)
$ bin/beeline -u jdbc:hive2://localhost:10000 -e "select 'Hello, Spark Connect!', version() as server_version;"
... (positive result)
$ SPARK_CONNECT_BEELINE=1 bin/beeline -u jdbc:hive2://localhost:10000 -e "select 'Hello, Spark Connect!', version() as server_version;"
... (positive result)
```

##### Spark connect distribution

```
$ tar -xzf spark-4.1.0-SNAPSHOT-bin-SPARK-54002-connect.tgz
$ cd spark-4.1.0-SNAPSHOT-bin-SPARK-54002-connect
$ bin/beeline -u jdbc:sc://localhost:15002 -e "select 'Hello, Spark Connect!', version() as server_version;"
... (positive result)
$ bin/beeline -u jdbc:hive2://localhost:10000 -e "select 'Hello, Spark Connect!', version() as server_version;"
... (positive result)
```

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #52706 from pan3793/SPARK-54002.

Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
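A quick way to inspect the classpath change described above is the launcher's `SPARK_PRINT_LAUNCH_COMMAND` debug switch, which makes `spark-class`-based scripts print the final `java` command (including `-cp`) before running it. This is a hedged sketch for verification, not part of the patch itself:

```
# With the flag on, the printed -cp should contain jars/connect-repl/*
# and omit spark-sql-core_*.jar, spark-connect_*.jar, etc.
$ SPARK_PRINT_LAUNCH_COMMAND=1 SPARK_CONNECT_BEELINE=1 bin/beeline --help 2>&1 | head -n 1
```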
1 parent 665a428 · commit efd066e

File tree

3 files changed: +17, -6 lines


dev/make-distribution.sh

Lines changed: 2 additions & 0 deletions

```diff
@@ -322,9 +322,11 @@ if [ "$MAKE_TGZ" == "true" ]; then
   rm -rf "$TARDIR"
   cp -r "$DISTDIR" "$TARDIR"
   # Set the Spark Connect system variable in these scripts to enable it by default.
+  awk 'NR==1{print; print "export SPARK_CONNECT_BEELINE=${SPARK_CONNECT_BEELINE:-1}"; next} {print}' "$TARDIR/bin/beeline" > tmp && cat tmp > "$TARDIR/bin/beeline"
   awk 'NR==1{print; print "export SPARK_CONNECT_MODE=${SPARK_CONNECT_MODE:-1}"; next} {print}' "$TARDIR/bin/pyspark" > tmp && cat tmp > "$TARDIR/bin/pyspark"
   awk 'NR==1{print; print "export SPARK_CONNECT_MODE=${SPARK_CONNECT_MODE:-1}"; next} {print}' "$TARDIR/bin/spark-shell" > tmp && cat tmp > "$TARDIR/bin/spark-shell"
   awk 'NR==1{print; print "export SPARK_CONNECT_MODE=${SPARK_CONNECT_MODE:-1}"; next} {print}' "$TARDIR/bin/spark-submit" > tmp && cat tmp > "$TARDIR/bin/spark-submit"
+  awk 'NR==1{print; print "if [%SPARK_CONNECT_BEELINE%] == [] set SPARK_CONNECT_BEELINE=1"; next} {print}' "$TARDIR/bin/beeline.cmd" > tmp && cat tmp > "$TARDIR/bin/beeline.cmd"
   awk 'NR==1{print; print "if [%SPARK_CONNECT_MODE%] == [] set SPARK_CONNECT_MODE=1"; next} {print}' "$TARDIR/bin/pyspark2.cmd" > tmp && cat tmp > "$TARDIR/bin/pyspark2.cmd"
   awk 'NR==1{print; print "if [%SPARK_CONNECT_MODE%] == [] set SPARK_CONNECT_MODE=1"; next} {print}' "$TARDIR/bin/spark-shell2.cmd" > tmp && cat tmp > "$TARDIR/bin/spark-shell2.cmd"
   awk 'NR==1{print; print "if [%SPARK_CONNECT_MODE%] == [] set SPARK_CONNECT_MODE=1"; next} {print}' "$TARDIR/bin/spark-submit2.cmd" > tmp && cat tmp > "$TARDIR/bin/spark-submit2.cmd"
```
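The `awk 'NR==1{...}'` idiom above splices the `export` in right after the shebang, so scripts in the Connect distribution enable the flag by default, while the `${SPARK_CONNECT_BEELINE:-1}` default still lets users override it (for example by exporting `SPARK_CONNECT_BEELINE=0` first). A minimal standalone illustration; `demo.sh` is a throwaway file invented for this sketch:

```
$ printf '#!/usr/bin/env bash\necho hello\n' > demo.sh
$ awk 'NR==1{print; print "export SPARK_CONNECT_BEELINE=${SPARK_CONNECT_BEELINE:-1}"; next} {print}' demo.sh
#!/usr/bin/env bash
export SPARK_CONNECT_BEELINE=${SPARK_CONNECT_BEELINE:-1}
echo hello
```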

launcher/src/main/java/org/apache/spark/launcher/AbstractCommandBuilder.java

Lines changed: 12 additions & 6 deletions

```diff
@@ -66,6 +66,8 @@ abstract class AbstractCommandBuilder {
    */
   protected boolean isRemote = System.getenv().containsKey("SPARK_REMOTE");
 
+  protected boolean isBeeLine = false;
+
   AbstractCommandBuilder() {
     this.appArgs = new ArrayList<>();
     this.childEnv = new HashMap<>();
@@ -195,6 +197,10 @@ List<String> buildClassPath(String appClassPath) throws IOException {
         if (isRemote && "1".equals(getenv("SPARK_SCALA_SHELL")) && project.equals("sql/core")) {
           continue;
         }
+        if (isBeeLine && "1".equals(getenv("SPARK_CONNECT_BEELINE")) &&
+            project.equals("sql/core")) {
+          continue;
+        }
         // SPARK-49534: The assumption here is that if `spark-hive_xxx.jar` is not in the
         // classpath, then the `-Phive` profile was not used during package, and therefore
         // the Hive-related jars should also not be in the classpath. To avoid failure in
@@ -241,13 +247,13 @@
       }
     }
 
-    if (isRemote) {
+    if (isRemote || (isBeeLine && "1".equals(getenv("SPARK_CONNECT_BEELINE")))) {
       for (File f: new File(jarsDir).listFiles()) {
-        // Exclude Spark Classic SQL and Spark Connect server jars
-        // if we're in Spark Connect Shell. Also exclude Spark SQL API and
-        // Spark Connect Common which Spark Connect client shades.
-        // Then, we add the Spark Connect shell and its dependencies in connect-repl
-        // See also SPARK-48936.
+        // Exclude Spark Classic SQL and Spark Connect server jars if we're in
+        // Spark Connect Shell or BeeLine with Connect JDBC driver. Also exclude
+        // Spark SQL API and Spark Connect Common which Spark Connect client shades.
+        // Then, we add the Spark Connect shell and its dependencies in connect-repl.
+        // See also SPARK-48936, SPARK-54002.
         if (f.isDirectory() && f.getName().equals("connect-repl")) {
           addToClassPath(cp, join(File.separator, f.toString(), "*"));
         } else if (
```
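Note that the exclusion is double-gated: `isBeeLine` is derived from the main class (set in `SparkClassCommandBuilder` below), and `SPARK_CONNECT_BEELINE` must also equal `1`, so a plain Thrift Server BeeLine session keeps its current classpath. A hedged shell check of the gate (jar name taken from the PR description; exact paths vary by build profile):

```
# spark-sql-core should only vanish from the classpath when the flag is set.
$ SPARK_PRINT_LAUNCH_COMMAND=1 SPARK_CONNECT_BEELINE=1 bin/beeline --help 2>&1 \
    | tr ':' '\n' | grep spark-sql-core    # expect no match
$ SPARK_PRINT_LAUNCH_COMMAND=1 bin/beeline --help 2>&1 \
    | tr ':' '\n' | grep spark-sql-core    # expect a match
```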

launcher/src/main/java/org/apache/spark/launcher/SparkClassCommandBuilder.java

Lines changed: 3 additions & 0 deletions

```diff
@@ -38,6 +38,9 @@ class SparkClassCommandBuilder extends AbstractCommandBuilder {
   SparkClassCommandBuilder(String className, List<String> classArgs) {
     this.className = className;
     this.classArgs = classArgs;
+    if ("org.apache.hive.beeline.BeeLine".equals(className)) {
+      this.isBeeLine = true;
+    }
   }
 
   @Override
```
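For context on how this constructor sees the BeeLine class: `bin/beeline` ends up invoking `spark-class` with the Hive BeeLine main class, which is the `className` inspected above. Roughly equivalent to the following (a sketch; the real script also performs environment setup):

```
$ bin/spark-class org.apache.hive.beeline.BeeLine -u jdbc:sc://localhost:15002
```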
