
Provide shaded jar without classifier #2625

@dolfinus

Description


Use case

I'm using PySpark with the ClickHouse JDBC driver. ETL processes in my company use the spark.jars.packages config option to download Maven packages: it is very convenient, since users don't have to download .jar files manually or hardcode Maven Central URLs. The repository URL can be changed at any time by altering spark.jars.ivySettings (see the second snippet below).

For v0.3.2 this was as simple as:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("spark_app_onetl_demo")
    .config("spark.jars.packages", "ru.yandex.clickhouse:clickhouse-jdbc:0.3.2")
    .getOrCreate()
)
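The repository swap mentioned above is just one extra config line. A minimal sketch; the settings file path is an assumption, not part of our actual setup:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("spark_app_onetl_demo")
    # Hypothetical path to an Ivy settings XML pointing at a company-internal mirror
    .config("spark.jars.ivySettings", "/etc/spark/ivysettings.xml")
    .config("spark.jars.packages", "ru.yandex.clickhouse:clickhouse-jdbc:0.3.2")
    .getOrCreate()
)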

In 0.4.0 (#1134) some dependencies were made optional and moved to the new http and all classifiers. But Spark does not support classifiers in package coordinates, so this code breaks:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("spark_app_onetl_demo")
    .config("spark.jars.packages", "com.clickhouse:clickhouse-jdbc:http:0.4.0")
    .getOrCreate()
)
Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: Provided Maven Coordinates must be in the form 'groupId:artifactId:version'. The coordinate provided is: com.clickhouse:clickhouse-jdbc:all:0.4.0

For 0.4.0 through 0.7.2 I was able to work around this by explicitly listing the dependencies that the http classifier would otherwise pull in:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("spark_app_onetl_demo")
    .config("spark.jars.packages", "com.clickhouse:clickhouse-jdbc:0.4.0:com.clickhouse:clickhouse-http-client:0.4.0")
    .getOrCreate()
)

spark = (
    SparkSession.builder
    .appName("spark_app_onetl_demo")
    .config("spark.jars.packages", "com.clickhouse:clickhouse-jdbc:0.7.2:com.clickhouse:clickhouse-http-client:0.7.2:org.apache.httpcomponents.client5:httpclient5:5.4.2")
    .getOrCreate()
)

But this doesn't work with 0.9.2. First of all, org.ow2.asm:asm (a compile-time dependency) cannot be downloaded for some reason (probably the artifact name is incompatible with Ivy2):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("spark_app_onetl_demo")
    .config("spark.jars.packages", "com.clickhouse:clickhouse-jdbc:0.9.2:com.clickhouse:clickhouse-http-client:0.9.2:org.apache.httpcomponents.client5:httpclient5:5.4.2")
    .getOrCreate()
)
Cannot download org.ow2.asm:asm:9.7
Ivy Default Cache set to: /home/maxim/.ivy2/cache
The jars for the packages stored in: /home/maxim/.ivy2/jars
com.clickhouse#clickhouse-jdbc added as a dependency
com.clickhouse#clickhouse-http-client added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-3bd0d1ff-033f-4a4a-95e0-6192ced0b746;1.0
        confs: [default]
        found com.clickhouse#clickhouse-jdbc;0.9.2 in central
        found com.clickhouse#clickhouse-client;0.9.2 in central
        found org.apache.commons#commons-compress;1.27.1 in central
        found commons-io#commons-io;2.16.1 in main
        found org.apache.commons#commons-lang3;3.18.0 in local-maven-cache
        found com.clickhouse#clickhouse-data;0.9.2 in central
        found com.clickhouse#clickhouse-http-client;0.9.2 in central
        found org.apache.httpcomponents.client5#httpclient5;5.4.4 in local-maven-cache
        found org.apache.httpcomponents.core5#httpcore5;5.3.4 in local-maven-cache
        found org.apache.httpcomponents.core5#httpcore5-h2;5.3.4 in local-maven-cache
        found com.clickhouse#jdbc-v2;0.9.2 in central
        found org.roaringbitmap#RoaringBitmap;0.9.47 in central
        found org.ow2.asm#asm;9.7 in local-maven-cache
        found com.google.guava#failureaccess;1.0.3 in local-maven-cache
        found com.google.guava#listenablefuture;9999.0-empty-to-avoid-conflict-with-guava in local-maven-cache
        found org.jspecify#jspecify;1.0.0 in local-maven-cache
        found com.google.errorprone#error_prone_annotations;2.36.0 in local-maven-cache
        found com.google.j2objc#j2objc-annotations;3.0.0 in local-maven-cache
        found org.roaringbitmap#shims;0.9.47 in central
        found com.clickhouse#client-v2;0.9.2 in central
        found com.google.guava#guava;33.4.6-jre in local-maven-cache
:: resolution report :: resolve 1014ms :: artifacts dl 29ms
        ---------------------------------------------------------------------
        |                  |            modules            ||   artifacts   |
        |       conf       | number| search|dwnlded|evicted|| number|dwnlded|
        ---------------------------------------------------------------------
        |      default     |   21  |   0   |   0   |   0   ||   21  |   0   |
        ---------------------------------------------------------------------

:: problems summary ::
:::: WARNINGS
                [NOT FOUND  ] org.ow2.asm#asm;9.7!asm.jar (2ms)

        ==== local-maven-cache: tried

          file:///home/maxim/.m2/repository/org/ow2/asm/asm/9.7/asm-9.7.jar

                ::::::::::::::::::::::::::::::::::::::::::::::

                ::              FAILED DOWNLOADS            ::

                :: ^ see resolution messages for details  ^ ::

                ::::::::::::::::::::::::::::::::::::::::::::::

                :: org.ow2.asm#asm;9.7!asm.jar

                ::::::::::::::::::::::::::::::::::::::::::::::



:: USE VERBOSE OR DEBUG MESSAGE LEVEL FOR MORE DETAILS
Exception in thread "main" java.lang.RuntimeException: [download failed: org.ow2.asm#asm;9.7!asm.jar]
        at org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:1608)
        at org.apache.spark.util.DependencyUtils$.resolveMavenDependencies(DependencyUtils.scala:185)
        at org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:334)
        at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:964)
        at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:194)
        at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:217)
        at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:91)
        at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1120)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1129)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

Even if this dependency is excluded, the JDBC driver still doesn't work properly:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("spark_app_onetl_demo")
    .config("spark.jars.packages", "com.clickhouse:clickhouse-jdbc:0.9.2:com.clickhouse:clickhouse-http-client:0.9.2:org.apache.httpcomponents.client5:httpclient5:5.4.2")
    .config("spark.jars.excludes", "org.ow2.asm:asm")
    .getOrCreate()
)
Instead, the driver hits an antlr version compatibility issue, since Spark bundles an older antlr runtime than the 4.12 one the driver's SQL parser was generated with:
: java.lang.ExceptionInInitializerError
        at com.clickhouse.jdbc.internal.SqlParser.walkSql(SqlParser.java:34)
        at com.clickhouse.jdbc.internal.SqlParser.parsePreparedStatement(SqlParser.java:28)
        at com.clickhouse.jdbc.ConnectionImpl.prepareStatement(ConnectionImpl.java:379)
        at com.clickhouse.jdbc.ConnectionImpl.prepareStatement(ConnectionImpl.java:171)
        at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.getQueryOutputSchema(JDBCRDD.scala:65)
        at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:58)
        at org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation$.getSchema(JDBCRelation.scala:241)
        at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:37)
        at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:346)
        at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:229)
        at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:211)
        at scala.Option.getOrElse(Option.scala:189)
        at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
        at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:172)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.base/java.lang.reflect.Method.invoke(Method.java:566)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374)
        at py4j.Gateway.invoke(Gateway.java:282)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
        at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
        at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.lang.UnsupportedOperationException: java.io.InvalidClassException: org.antlr.v4.runtime.atn.ATN; Could not deserialize ATN with version 4 (expected 3).
        at org.antlr.v4.runtime.atn.ATNDeserializer.deserialize(ATNDeserializer.java:187)
        at com.clickhouse.jdbc.internal.ClickHouseLexer.<clinit>(ClickHouseLexer.java:2098)
        ... 26 more
Caused by: java.io.InvalidClassException: org.antlr.v4.runtime.atn.ATN; Could not deserialize ATN with version 4 (expected 3).
        ... 28 more

#2554 fixes that by shading the antlr package, but only in the all classifier, which cannot be used with Spark's spark.jars.packages.
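As far as I can tell, the only way to consume the all classifier from PySpark today is to hardcode a direct jar URL via spark.jars, which is exactly what spark.jars.packages is supposed to avoid. A sketch, with the URL assumed from the standard Maven Central layout:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("spark_app_onetl_demo")
    # Direct download of the shaded "all" jar; URL assumed from the Maven Central layout
    .config("spark.jars", "https://repo1.maven.org/maven2/com/clickhouse/clickhouse-jdbc/0.9.2/clickhouse-jdbc-0.9.2-all.jar")
    .getOrCreate()
)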

Describe the solution you'd like

Please introduce a new artifact, com.clickhouse:clickhouse-jdbc-shaded:{VERSION}, instead of (or in combination with) the all classifier, so that it can be used with spark.jars.packages.
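With such an artifact, the original one-liner would work again. A sketch assuming the proposed (currently non-existent) coordinate:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("spark_app_onetl_demo")
    # Hypothetical artifact requested by this issue; it does not exist yet
    .config("spark.jars.packages", "com.clickhouse:clickhouse-jdbc-shaded:0.9.2")
    .getOrCreate()
)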

Describe the alternatives you've considered

Support for group:artifact:classifier:version Maven coordinates has been proposed to Spark several times since 2017, but none of those attempts succeeded.

The Spark connector for ClickHouse is not suitable for my use cases either.

Additional context
