Description
Use case
I'm using PySpark with the ClickHouse JDBC driver. ETL processes in my company use the `spark.jars.packages` config option to download Maven packages - it is very convenient: users don't have to download .jar files manually or hardcode Maven Central URLs, and the repo URL can be changed at any time by altering `spark.jars.ivySettings`.
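For example, a minimal sketch of such a setup (the ivysettings.xml path here is just a placeholder):

```python
from pyspark.sql import SparkSession

# Placeholder path; the file points Ivy at an internal repository mirror
spark = (
    SparkSession.builder
    .appName("spark_app_onetl_demo")
    .config("spark.jars.ivySettings", "/etc/spark/ivysettings.xml")
    .getOrCreate()
)
```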
For v0.3.2 this was as simple as:

```python
from pyspark.sql import SparkSession
spark = (
SparkSession.builder
.appName("spark_app_onetl_demo")
.config("spark.jars.packages", "ru.yandex.clickhouse:clickhouse-jdbc:0.3.2")
.getOrCreate()
)
```

In 0.4.0 (#1134) some dependencies were made optional and moved to the new `http` and `all` classifiers. But Spark cannot use classifiers in package names, so this code breaks:

```python
from pyspark.sql import SparkSession
spark = (
SparkSession.builder
.appName("spark_app_onetl_demo")
.config("spark.jars.packages", "com.clickhouse:clickhouse-jdbc:http:0.4.0")
.getOrCreate()
)
```

This fails with:

```
Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: Provided Maven Coordinates must be in the form 'groupId:artifactId:version'. The coordinate provided is: com.clickhouse:clickhouse-jdbc:all:0.4.0
```
For 0.4.0-0.7.2 I was able to use a workaround: manually listing the same dependencies that the `http` classifier pulls in. For 0.4.0:

```python
from pyspark.sql import SparkSession
spark = (
SparkSession.builder
.appName("spark_app_onetl_demo")
.config("spark.jars.packages", "com.clickhouse:clickhouse-jdbc:0.4.0:com.clickhouse:clickhouse-http-client:0.4.0")
.getOrCreate()
)
```

And for 0.7.2:

```python
spark = (
SparkSession.builder
.appName("spark_app_onetl_demo")
.config("spark.jars.packages", "com.clickhouse:clickhouse-jdbc:0.7.2:com.clickhouse:clickhouse-http-client:0.7.2:org.apache.httpcomponents.client5:httpclient5:5.4.2")
.getOrCreate()
)
```

But this doesn't work with 0.9.2. First of all, org.ow2.asm:asm (a compile-time dependency) cannot be downloaded for some reason (probably the artifact name is incompatible with Ivy2):

```python
from pyspark.sql import SparkSession
spark = (
SparkSession.builder
.appName("spark_app_onetl_demo")
.config("spark.jars.packages", "com.clickhouse:clickhouse-jdbc:0.9.2:com.clickhouse:clickhouse-http-client:0.9.2:org.apache.httpcomponents.client5:httpclient5:5.4.2")
.getOrCreate()
)
```

Ivy cannot download org.ow2.asm:asm:9.7:

```
Ivy Default Cache set to: /home/maxim/.ivy2/cache
The jars for the packages stored in: /home/maxim/.ivy2/jars
com.clickhouse#clickhouse-jdbc added as a dependency
com.clickhouse#clickhouse-http-client added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-3bd0d1ff-033f-4a4a-95e0-6192ced0b746;1.0
confs: [default]
found com.clickhouse#clickhouse-jdbc;0.9.2 in central
found com.clickhouse#clickhouse-client;0.9.2 in central
found org.apache.commons#commons-compress;1.27.1 in central
found commons-io#commons-io;2.16.1 in main
found org.apache.commons#commons-lang3;3.18.0 in local-maven-cache
found com.clickhouse#clickhouse-data;0.9.2 in central
found com.clickhouse#clickhouse-http-client;0.9.2 in central
found org.apache.httpcomponents.client5#httpclient5;5.4.4 in local-maven-cache
found org.apache.httpcomponents.core5#httpcore5;5.3.4 in local-maven-cache
found org.apache.httpcomponents.core5#httpcore5-h2;5.3.4 in local-maven-cache
found com.clickhouse#jdbc-v2;0.9.2 in central
found org.roaringbitmap#RoaringBitmap;0.9.47 in central
found org.ow2.asm#asm;9.7 in local-maven-cache
found com.google.guava#failureaccess;1.0.3 in local-maven-cache
found com.google.guava#listenablefuture;9999.0-empty-to-avoid-conflict-with-guava in local-maven-cache
found org.jspecify#jspecify;1.0.0 in local-maven-cache
found com.google.errorprone#error_prone_annotations;2.36.0 in local-maven-cache
found com.google.j2objc#j2objc-annotations;3.0.0 in local-maven-cache
found org.roaringbitmap#shims;0.9.47 in central
found com.clickhouse#client-v2;0.9.2 in central
found com.google.guava#guava;33.4.6-jre in local-maven-cache
:: resolution report :: resolve 1014ms :: artifacts dl 29ms
---------------------------------------------------------------------
| | modules || artifacts |
| conf | number| search|dwnlded|evicted|| number|dwnlded|
---------------------------------------------------------------------
| default | 21 | 0 | 0 | 0 || 21 | 0 |
---------------------------------------------------------------------
:: problems summary ::
:::: WARNINGS
[NOT FOUND ] org.ow2.asm#asm;9.7!asm.jar (2ms)
==== local-maven-cache: tried
file:///home/maxim/.m2/repository/org/ow2/asm/asm/9.7/asm-9.7.jar
::::::::::::::::::::::::::::::::::::::::::::::
:: FAILED DOWNLOADS ::
:: ^ see resolution messages for details ^ ::
::::::::::::::::::::::::::::::::::::::::::::::
:: org.ow2.asm#asm;9.7!asm.jar
::::::::::::::::::::::::::::::::::::::::::::::
:: USE VERBOSE OR DEBUG MESSAGE LEVEL FOR MORE DETAILS
Exception in thread "main" java.lang.RuntimeException: [download failed: org.ow2.asm#asm;9.7!asm.jar]
at org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:1608)
at org.apache.spark.util.DependencyUtils$.resolveMavenDependencies(DependencyUtils.scala:185)
at org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:334)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:964)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:194)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:217)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:91)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1120)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1129)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
```
Even if this dependency is excluded, the JDBC driver still doesn't work properly:

```python
from pyspark.sql import SparkSession
spark = (
SparkSession.builder
.appName("spark_app_onetl_demo")
.config("spark.jars.packages", "com.clickhouse:clickhouse-jdbc:0.9.2:com.clickhouse:clickhouse-http-client:0.9.2:org.apache.httpcomponents.client5:httpclient5:5.4.2")
.config("spark.jars.excludes", "org.ow2.asm:asm")
.getOrCreate()
)
```

Now the driver fails at runtime with an antlr v4.12 version compatibility issue:

```
: java.lang.ExceptionInInitializerError
at com.clickhouse.jdbc.internal.SqlParser.walkSql(SqlParser.java:34)
at com.clickhouse.jdbc.internal.SqlParser.parsePreparedStatement(SqlParser.java:28)
at com.clickhouse.jdbc.ConnectionImpl.prepareStatement(ConnectionImpl.java:379)
at com.clickhouse.jdbc.ConnectionImpl.prepareStatement(ConnectionImpl.java:171)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.getQueryOutputSchema(JDBCRDD.scala:65)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:58)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation$.getSchema(JDBCRelation.scala:241)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:37)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:346)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:229)
at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:211)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:172)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:566)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.lang.UnsupportedOperationException: java.io.InvalidClassException: org.antlr.v4.runtime.atn.ATN; Could not deserialize ATN with version 4 (expected 3).
at org.antlr.v4.runtime.atn.ATNDeserializer.deserialize(ATNDeserializer.java:187)
at com.clickhouse.jdbc.internal.ClickHouseLexer.<clinit>(ClickHouseLexer.java:2098)
... 26 more
Caused by: java.io.InvalidClassException: org.antlr.v4.runtime.atn.ATN; Could not deserialize ATN with version 4 (expected 3).
... 28 more
```
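For reference, the trace above is triggered by a plain JDBC read; the connection URL and table name below are placeholders:

```python
# Placeholder connection details; any prepared-statement path hits the antlr issue
df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:clickhouse://localhost:8123/default")
    .option("dbtable", "some_table")
    .option("driver", "com.clickhouse.jdbc.ClickHouseDriver")
    .load()
)
```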
#2554 fixes that by shading the antlr package, but only in the `all` classifier, which cannot be used with PySpark's `spark.jars.packages`.
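One interim workaround might be to bypass Ivy entirely and pass the `all` jar by direct URL via `spark.jars`. A sketch, assuming the standard Maven Central layout for classifier artifacts (the `all` jar is a fat jar, so transitive resolution should not be needed, but I haven't verified this):

```python
from pyspark.sql import SparkSession

# Assumed Maven Central URL for the "all" classifier artifact;
# spark.jars accepts http(s) URLs but does no dependency resolution
all_jar = (
    "https://repo1.maven.org/maven2/com/clickhouse/clickhouse-jdbc/"
    "0.9.2/clickhouse-jdbc-0.9.2-all.jar"
)
spark = (
    SparkSession.builder
    .appName("spark_app_onetl_demo")
    .config("spark.jars", all_jar)
    .getOrCreate()
)
```

But this loses the convenience of version resolution and internal repository mirrors that `spark.jars.packages` provides.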
Describe the solution you'd like
Please introduce a new artifact clickhouse-jdbc-shaded:{VERSION}, instead of (or in combination with) the `all` classifier, so that it can be used with `spark.jars.packages`.
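With such an artifact, the original one-liner setup would work again. A sketch (the artifact name is the proposed one and does not exist yet):

```python
from pyspark.sql import SparkSession

# "clickhouse-jdbc-shaded" is the proposed artifact name, not a published one
spark = (
    SparkSession.builder
    .appName("spark_app_onetl_demo")
    .config("spark.jars.packages", "com.clickhouse:clickhouse-jdbc-shaded:0.9.2")
    .getOrCreate()
)
```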
Describe the alternatives you've considered
There have been several attempts since 2017 to add support for group:artifact:classifier:version Maven coordinates to Spark, all without success:
- https://issues.apache.org/jira/browse/SPARK-20075
- https://issues.apache.org/jira/browse/SPARK-22849
- https://issues.apache.org/jira/browse/SPARK-24287
The Spark connector for ClickHouse is not suitable for my use cases either.