
Standardize the way to ship new catalogs #1887

@jackye1995


This issue is opened to discuss a standardized way to ship new catalogs, as we plan to ship GlueCatalog and NessieCatalog in the upcoming release, and new catalogs such as JDBC are also in progress.

In the last community sync meeting, we discussed 2 ways:

  1. For each new catalog (if it lives in a new module), add a new runtime module that bundles all additional dependencies.
  2. Add the module directly to the existing runtimes, such as iceberg-spark3-runtime, as long as the jar size increase is reasonable.

For approach 1, building a runtime jar for shared usage is not ideal. It can cause duplicate-class conflicts on the user's classpath, because the AWS client version bundled here may differ from the one in the user's application. It would also add a large number of runtime modules to Iceberg as more catalogs are added, which is not desirable.

For approach 2, I ran some experiments with the current aws module, and the result was not good. When it was added to the spark3 runtime, the jar size grew from 18.9MB to 34.7MB, almost double. I checked all AWS dependencies, and even with all non-AWS dependencies excluded, the added size was still over 10MB. I would expect a very similar situation if support for GCS and Azure is added in the future.

So the best way to go in open source seems to be not to bundle these dependencies into any runtime jar. Instead, a user can start the Spark session by specifying the additional dependencies with the --packages flag:

spark-sql --packages org.apache.iceberg:iceberg-spark3-runtime:0.11.0,org.apache.iceberg:iceberg-aws:0.11.0,software.amazon.awssdk:bundle:2.15.40 \
    --conf spark.sql.catalog.test=org.apache.iceberg.spark.SparkCatalog \
    --conf spark.sql.catalog.test.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog \
    --conf spark.sql.catalog.test.warehouse=s3://some-bucket

By doing so, the user can freely choose the version of the AWS client (2.15.40 in this example). Users who are sensitive to jar size can cherry-pick the AWS client packages they need instead of using the 250MB bundle.
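
To make the cherry-picking option concrete, here is a rough sketch of how the --packages list could be assembled from individual AWS SDK v2 modules instead of the full bundle. The module list (s3, glue, sts) is only an assumption for illustration; the modules actually required depend on which features of iceberg-aws are used and would need to be verified. The script only prints the resulting command; it does not launch Spark.

```shell
#!/usr/bin/env bash
# Sketch: build a --packages list from individually chosen AWS SDK v2 modules.
ICEBERG_VERSION="0.11.0"
AWS_SDK_VERSION="2.15.40"

# ASSUMPTION: illustrative module list only; the real set depends on which
# iceberg-aws features are used and has not been verified here.
AWS_MODULES="s3 glue sts"

# Start with the Iceberg artifacts from the command above.
PACKAGES="org.apache.iceberg:iceberg-spark3-runtime:${ICEBERG_VERSION},org.apache.iceberg:iceberg-aws:${ICEBERG_VERSION}"

# Append one Maven coordinate per selected AWS SDK module.
for module in $AWS_MODULES; do
  PACKAGES="${PACKAGES},software.amazon.awssdk:${module}:${AWS_SDK_VERSION}"
done

# Print the command instead of running it, so the list can be reviewed first.
echo "spark-sql --packages ${PACKAGES}"
```

The --conf options stay the same as in the command above; only the coordinate list changes.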

Any thoughts? @rdblue @rymurr @yyanyy @giovannifumarola
